Marin¶
"I am not afraid of storms, for I am learning how to sail my ship."
– Louisa May Alcott
Marin is an open-source framework for the research and development of foundation models.
A key feature of Marin is reproducibility: every step, from raw data to the final model, is recorded, not just the end result. This includes failed experiments, so the entire research process is transparent.
Marin's primary use case is training language models such as Llama, DeepSeek, and Qwen. This covers the full pipeline: data curation, transformation, filtering, tokenization, training, and evaluation.
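To make the reproducibility idea concrete: one common way to record every step of a pipeline is to give each step a deterministic ID derived from its configuration and its upstream dependencies, so any change upstream is visible downstream. The sketch below is a hypothetical illustration of that idea, not Marin's actual API; the function and step names are invented for the example.

```python
import hashlib
import json

def step_id(name, config, dependencies=()):
    """Derive a deterministic ID for a pipeline step from its name,
    its config, and the IDs of its upstream steps."""
    payload = json.dumps(
        {"name": name, "config": config, "deps": sorted(dependencies)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

# A chain of steps: each ID depends on everything upstream.
raw = step_id("download_raw", {"url": "https://example.com/data"})
filtered = step_id("filter", {"min_len": 128}, [raw])
tokenized = step_id("tokenize", {"tokenizer": "llama3"}, [filtered])

# Changing an upstream config produces a different downstream ID,
# so stale results can never be silently reused.
filtered_alt = step_id("filter", {"min_len": 256}, [raw])
assert step_id("tokenize", {"tokenizer": "llama3"}, [filtered_alt]) != tokenized
```

Because the IDs are pure functions of the inputs, re-running the same experiment reproduces the same IDs, and two researchers running the same config can verify they produced the same artifacts.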
We used Marin to train the first open-source 8B parameter model to outperform Llama 3.1 8B. You can see the training script or read the retrospective.
Documentation Structure¶
Our documentation is organized into the following main sections:
- Tutorials: Step-by-step guides to help you get started with Marin, including installation, basic usage, and local GPU setup
- Explanation: Background information and context about the project
- Experiment Reports: Reports from our experiments
- Developer Guide: Information for developers who want to contribute to Marin
- Technical Reference: Detailed technical information about Marin's architecture and components
These sections are available in the left sidebar (or the hamburger menu).
Get Involved¶
To get started with Marin:
- Install Marin.
- Train a tiny language model using Marin.
- See how to run a much larger DCLM 1B/1x experiment using Marin.
- See a summary of the experiments we've run.
- Participate in the Marin Speedrun competition to try to find the most efficient way to train a language model.
- Try out the Marin Datashop to contribute and create data for your use case.
- Join the Marin Discord to chat with the community.
Get Help¶
If you have any questions or need help, please feel free to reach out to us on Discord or open an issue on GitHub.