Submitting to the Speedrun Leaderboard¶
Marin Speedrun, inspired by the nanogpt Speedrun, is a benchmark aimed at improving the compute efficiency of language model training by incentivizing researchers to develop new model architectures, optimizers, and training strategies. The idea is to let participants experiment at a smaller scale, where compute is less of a bottleneck, before scaling up.
See the overview of Marin Speedrun for a more detailed explanation. This tutorial assumes you are familiar with the idea of a speedrun and shows you how to submit one to the leaderboard.
Submitting to the Speedrun Leaderboard consists of the following steps:
- Define a configuration for training a model.
- Train it on your hardware; both GPUs and TPUs are supported, and it is possible to create a good speedrun on just one or a few GPUs.
- Submit the configuration and the generated results file to the Marin GitHub repository via a pull request.
- Watch it appear on the leaderboard once merged!
Here is a hello-world submission, and here is the current leaderboard. You can explore the code for different runs here. We will now use the hello-world example to show you how to submit a speedrun to the leaderboard.
Prerequisites¶
Before you get started, you will need the following:
- Basic installation
- Set up your local GPU
- Make sure your Ray GPU cluster is started.
GPU Setup
If you are working on GPUs, make sure you've correctly installed and set up the environment by following the local GPU setup guide, which ensures your environment is properly configured for GPU training.
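The setup guide has the authoritative steps; as a rough sketch, a single-node Ray cluster can be brought up with the standard Ray CLI:

```bash
# Start a single-node Ray "cluster" on the current machine (head node only).
# Your actual setup may differ; defer to the local GPU setup guide.
ray start --head
```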
Creating Your Speedrun Submission¶
- Create a new directory for your run:
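  For example, following the layout of existing submissions in the speedrun directory (the run name is a placeholder; check the speedrun directory for the canonical location):

  ```bash
  mkdir -p experiments/speedrun/your_run_name
  ```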
- Create your training script in this directory. You'll need to specify configs for author/affiliation, the model configuration, and the training configuration (as part of which you should also specify what kind of hardware you're using via `GpuConfig` or `TpuPodConfig`):

  GPU example:

  ```python
  import logging

  from experiments.llama import llama_nano
  from experiments.simple_train_config import SimpleTrainConfig
  from marin.execution.executor import executor_main
  from marin.resources import GpuConfig
  from marin.speedrun.speedrun import Author, SpeedrunConfig, default_speedrun

  logger = logging.getLogger("ray")

  speedrun_config = SpeedrunConfig(
      author=Author(
          name="your name",
          affiliation="your affiliation",
          url="your url",
      ),
      description="your description",
      model_config=llama_nano,
      train_config=SimpleTrainConfig(
          GpuConfig(gpu_count=1, accelerator_type="A100"),
          train_batch_size=32,
          num_train_steps=100,
          learning_rate=3e-3,
          weight_decay=0.1,
          steps_per_eval=500,
      ),
      # tokenized_dataset need not be set here; by default it will be set to a FineWeb-Edu
      # 10B token dataset that we have pre-tokenized and shuffled, available at
      # https://huggingface.co/datasets/marin-community/fineweb-edu-pretokenized-10B
      # tokenized_dataset=fineweb_edu_subcache_10B,
  )

  # Shows your speedrun configuration, model FLOPs, model size, and (training) hardware
  # FLOPs; you can use this before actually kicking off a run to validate your setup.
  speedrun_config.print_run_info()

  if __name__ == "__main__":
      executor_main(steps=default_speedrun("llama_nano_gpu_speedrun", speedrun_config))
  ```
  TPU example:

  ```python
  import logging

  from experiments.llama import llama_nano
  from experiments.simple_train_config import SimpleTrainConfig
  from marin.execution.executor import executor_main
  from marin.resources import TpuPodConfig
  from marin.speedrun.speedrun import Author, SpeedrunConfig, default_speedrun

  logger = logging.getLogger("ray")

  speedrun_config = SpeedrunConfig(
      author=Author(
          name="your name",
          affiliation="your affiliation",
          url="your url",
      ),
      description="Nano model based on Llama architecture.",
      model_config=llama_nano,
      train_config=SimpleTrainConfig(
          TpuPodConfig(tpu_type="v4-8"),
          train_batch_size=512,
          num_train_steps=3000,
          learning_rate=3e-3,
          weight_decay=0.1,
          steps_per_eval=1000,
      ),
      # tokenized_dataset need not be set here; by default it will be set to a FineWeb-Edu
      # 10B token dataset that we have pre-tokenized and shuffled, available at
      # https://huggingface.co/datasets/marin-community/fineweb-edu-pretokenized-10B
      # tokenized_dataset=fineweb_edu_subcache_10B,
  )

  speedrun_config.print_run_info()

  if __name__ == "__main__":
      executor_main(steps=default_speedrun("llama_nano_tpu_speedrun", speedrun_config))
  ```
  Feel free to go through each of the configuration classes' definitions to get an idea of the design space. The training configuration can be either a `SimpleTrainConfig` or a `TrainLmOnPodConfig`. Note that this example uses a single Python file; if your submission is more complex, you can split it across multiple files. To see examples of other speedruns and configurations, check out the speedrun directory. You can also add new optimizers, change learning-rate schedules, play with hyperparameters, and so on.
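  For instance, here is a minimal sketch of varying the training hyperparameters. It reuses only the fields shown in the GPU example above; the values are illustrative, not recommendations:

  ```python
  # Hypothetical variation on the GPU example above: a larger batch, a longer
  # run, and a lower peak learning rate. All field names come from the example.
  train_config = SimpleTrainConfig(
      GpuConfig(gpu_count=1, accelerator_type="A100"),
      train_batch_size=64,
      num_train_steps=2000,
      learning_rate=1e-3,
      weight_decay=0.05,
      steps_per_eval=500,
  )
  ```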
- Before running your speedrun, you may find it helpful to call `speedrun_config.print_run_info()`, which prints the estimated training hardware FLOPs, model FLOPs, model size, and your speedrun configuration. This helps you sanity-check the run and estimate the resources it will need.
- Set the relevant environment variables needed for the run:

  ```bash
  export WANDB_API_KEY=...
  export HF_TOKEN=...
  export MARIN_PREFIX=...
  export WANDB_ENTITY=...
  export WANDB_PROJECT=...
  ```
  - `MARIN_PREFIX` tells Marin where to put artifacts generated during the execution of any steps. For example, training checkpoints will usually be written to `${MARIN_PREFIX}/checkpoints/`. You can set this to any fsspec-recognizable path, e.g. a GCS bucket or a directory on your machine. (For a detailed explanation of `MARIN_PREFIX` and other ways to specify the output path, see Understanding `MARIN_PREFIX` and `--prefix`.)
  - `WANDB_ENTITY` and `WANDB_PROJECT` specify the Weights and Biases entity (team) and project names. You may use your own organization and project identifiers, but we ask that the run link corresponding to your speedrun be publicly accessible when you submit to the leaderboard.
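  For instance, either of the following is a valid prefix (the bucket and directory names are illustrative):

  ```bash
  export MARIN_PREFIX=gs://my-marin-bucket/marin   # a GCS bucket
  export MARIN_PREFIX=$HOME/marin-output           # a local directory
  ```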
- Train your model:
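  A minimal sketch, assuming the training script you created above is saved as `experiments/speedrun/your_run_name/your_run_name.py` (the path and filename are placeholders):

  ```bash
  python experiments/speedrun/your_run_name/your_run_name.py
  ```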
Submitting Your Run¶
We walked through a hello-world example above; if you want to submit your own run, follow these steps:
- Add the resulting `speedrun_results.json` file to your run directory.
- Create a pull request including:
  - Your run directory (training script and results file)
  - A brief explanation of your approach (model architecture, training strategy, optimizations)
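  If you are unsure of the mechanics, here is a hedged sketch of the flow, assuming you work from a fork or branch of the Marin repository (the branch name and paths are illustrative):

  ```bash
  # Commit your run directory on a new branch, then open a pull request
  # against the Marin GitHub repository.
  git checkout -b speedrun/your_run_name
  git add experiments/speedrun/your_run_name/
  git commit -m "Add speedrun: your_run_name"
  git push origin speedrun/your_run_name
  ```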
- Once your pull request is reviewed and merged, the leaderboard is updated and your run will appear on the public leaderboard.
Speedrun Time Benchmarks
To give you a sense of GPU training times: a 50M-parameter model trained on 4× A100 GPUs (80GB) takes about 33 minutes to complete 7,600 steps, processing approximately 1 billion tokens. This should help you estimate the resources needed for your own speedrun experiments.
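For a back-of-envelope estimate from those numbers (a sketch; only the figures quoted above are used):

```python
# Rough throughput implied by the benchmark: ~1B tokens in 33 minutes on 4 GPUs.
tokens = 1_000_000_000
minutes = 33
gpus = 4

tokens_per_sec = tokens / (minutes * 60)  # ~505k tokens/s across all GPUs
print(f"{tokens_per_sec:,.0f} tokens/s total, "
      f"{tokens_per_sec / gpus:,.0f} tokens/s per GPU")
```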
Helpful Resources¶
- To see examples of other speedruns, check out the speedrun directory.
- Check out the leaderboard.
- Find out more about speedrun itself in the Speedrun explanation.