The Language Modeling Pipeline

Our goal is to build strong base and instruction-tuned language models. This pipeline has many components, all of which are represented in Marin:

Curating raw sources (e.g., HTML, PDF, etc.)
Crawling the web for additional raw sources
Converting raw sources into text
Training quality classifiers
Filtering the raw data with those classifiers
Performing deduplication to produce clean data
Tokenizing the clean data for training
Training a model on the tokenized data
Evaluating the resulting model
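
To make the data-side steps above concrete, here is a minimal sketch of how conversion, filtering, deduplication, and tokenization chain together. It is not Marin's API: every function name is hypothetical, and each stage is a toy stand-in built on the Python standard library (a word-count heuristic instead of a trained quality classifier, exact-match hashing instead of a real deduplication method, whitespace splitting instead of a real tokenizer).

```python
import hashlib
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes from HTML, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Convert a raw HTML source into whitespace-normalized text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def is_high_quality(text, min_words=5):
    """Toy stand-in for a trained quality classifier (assumption, not Marin's)."""
    return len(text.split()) >= min_words


def deduplicate(texts):
    """Drop exact duplicates, keeping the first occurrence of each text."""
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def tokenize(text):
    """Toy whitespace tokenizer; a real pipeline would use a trained tokenizer."""
    return text.lower().split()


def run_pipeline(raw_html_docs):
    """Convert -> filter -> deduplicate -> tokenize, in that order."""
    texts = [html_to_text(doc) for doc in raw_html_docs]
    filtered = [t for t in texts if is_high_quality(t)]
    clean = deduplicate(filtered)
    return [tokenize(t) for t in clean]


if __name__ == "__main__":
    docs = [
        "<p>The quick brown fox jumps over the lazy dog.</p>",
        "<p>The quick brown fox jumps over the lazy dog.</p>",
        "<p>Too short</p>",
    ]
    for tokens in run_pipeline(docs):
        print(tokens)
```

The ordering mirrors the list above: filtering runs before deduplication so that low-quality documents are never hashed, and tokenization only ever sees clean data.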

Currently, we leverage the following open-source tools (thanks to the authors for making them!):

For transforming HTML into text, we use a variety of tools including trafilatura and resiliparse.
For data filtering, we use fastText.
For model training, we use Levanter, a JAX-based framework that's legible, scalable, and reproducible.
For model evaluation, we use lm-evaluation-harness.

Note that the Marin framework is agnostic to these choices and we can support other tools.

Where possible, we use the same data formats as Dolma. Where not possible, we try to use "natural" extensions that stick to the spirit of the format.
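
As a rough illustration of what a Dolma-style record looks like, the sketch below writes and reads gzip-compressed JSON Lines documents. It assumes the commonly documented Dolma document fields (`id`, `text`, `source`, `added`, `created`, `metadata`); the field values, the helper names, and the `metadata` key used here are hypothetical and are not taken from Marin.

```python
import gzip
import json
import os
import tempfile


def write_dolma_style(path, records):
    """Write one JSON object per line to a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_dolma_style(path):
    """Read back every non-empty line as a JSON object."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical example record; all values are made up for illustration.
example_record = {
    "id": "example-0001",
    "text": "Language models are trained on large text corpora.",
    "source": "example-crawl",
    "added": "2024-01-01T00:00:00Z",
    "created": "2023-12-31T00:00:00Z",
    # A "natural" extension would add keys here rather than new top-level fields.
    "metadata": {"url": "https://example.com/page"},
}

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "documents.jsonl.gz")
        write_dolma_style(path, [example_record])
        print(read_dolma_style(path)[0]["id"])
```

Keeping extra information under `metadata` is one way to extend the format without breaking tools that expect only the standard top-level fields.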

The integration test provides a miniature version of all the steps. It should finish in less than 10 minutes and doesn't require a GPU or TPU. To run it:

JAX_TRACEBACK_FILTERING=off PYTHONPATH=. python tests/integration_test.py --prefix var