Marin¶
"I am not afraid of storms, for I am learning how to sail my ship."
– Louisa May Alcott
Marin is an open-source framework for the research and development of foundation models.
A key feature of Marin is reproducibility: every step, from raw data to the final model, is recorded, not just the end result. This includes failed experiments, so the entire research process is transparent.
Marin's primary use case is training language models such as Llama, DeepSeek, and Qwen. This covers the full pipeline: data curation, transformation, filtering, tokenization, training, and evaluation.
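To make the reproducibility idea concrete: one common way to record every step of a pipeline is to give each step a deterministic ID derived from its configuration and its upstream dependencies, so any change upstream is visible downstream. The sketch below is a hypothetical illustration of that idea, not Marin's actual API; the function and step names are invented for the example.

```python
import hashlib
import json

def step_id(name, config, dependencies=()):
    """Derive a deterministic ID for a pipeline step from its name,
    its config, and the IDs of its upstream steps."""
    payload = json.dumps(
        {"name": name, "config": config, "deps": sorted(dependencies)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

# A chain of steps: each ID depends on everything upstream.
raw = step_id("download_raw", {"url": "https://example.com/data"})
filtered = step_id("filter", {"min_len": 128}, [raw])
tokenized = step_id("tokenize", {"tokenizer": "llama3"}, [filtered])

# Changing an upstream config produces a different downstream ID,
# so stale results can never be silently reused.
filtered_alt = step_id("filter", {"min_len": 256}, [raw])
assert step_id("tokenize", {"tokenizer": "llama3"}, [filtered_alt]) != tokenized
```

Because the IDs are pure functions of the inputs, re-running the same experiment reproduces the same IDs, and two researchers running the same config can verify they produced the same artifacts.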
We used Marin to train the first open-source 8B parameter model to outperform Llama 3.1 8B. You can see the training script or read the retrospective.
Documentation Structure¶
Our documentation is organized into the following main sections:
- Tutorials: Step-by-step guides to help you get started with Marin, including installation, basic usage, and local GPU setup
- Explanation: Background information and context about the project
- Experiment Reports: Reports from our experiments
- Developer Guide: Information for developers who want to contribute to Marin
- Technical Reference: Detailed technical information about Marin's architecture and components
These sections are available in the left sidebar (or the hamburger menu).
Get Involved¶
To get started with Marin:
- Install Marin.
- Train a tiny language model using Marin.
- See how to run a much larger DCLM 1B/1x experiment using Marin.
- See a summary of the experiments we've run.
- Participate in the Marin Speedrun competition to try to find the most efficient way to train a language model.
- Try out the Marin Datashop to contribute and create data for your use case.
- Join the Marin Discord to chat with the community.
Get Help¶
If you have any questions or need help, please feel free to reach out to us on Discord or open an issue on GitHub.