Skip to content

Setting up a Local GPU Environment

This guide will walk you through the steps to set up a local GPU environment for Marin. By "local", we mean a machine that you run jobs on directly, as opposed to dispatching them to a shared cluster via Iris. Similar steps will let you run Marin on a cloud GPU environment under Iris (the Marin team runs production GPU workloads on CoreWeave), but we defer that to a future guide.

Prerequisites

Make sure you've followed the installation guide to do the basic installation.

In addition to the prerequisites from the basic installation, we have one GPU-specific system dependency:

  • NVIDIA driver 580 or newer

We assume you are running Ubuntu 24.04.

NVIDIA driver and runtime

Install an NVIDIA driver that supports CUDA 13. Verify that the driver is at least 580 and that nvidia-smi reports CUDA 13.x:

nvidia-smi

Marin uses JAX as a core library. The gpu extra installs the CUDA 13 JAX runtime, including CUDA, cuDNN, and NCCL Python wheels:

uv sync --extra=gpu

If you install a local CUDA toolkit for custom kernels, use CUDA 13 and keep older CUDA libraries out of LD_LIBRARY_PATH so they do not override the JAX wheel libraries.

See JAX's installation guide for more options.

Tip

If you are using a DGX Spark or similar machine with unified memory, you may need to dramatically reduce the memory that XLA preallocates for itself. You can do this by setting the XLA_PYTHON_CLIENT_MEM_FRACTION variable, to something like 0.5:

export XLA_PYTHON_CLIENT_MEM_FRACTION=0.5
You can also set this in your `.bashrc` or `.zshrc` file.
```bash
echo 'export XLA_PYTHON_CLIENT_MEM_FRACTION=0.5' >> ~/.bashrc
```

For broader JAX/Levanter memory tuning (sharding, checkpointing, offloading), see [Making Things Fit in HBM](../references/hbm-optimization.md).

Running an Experiment

Now you can run an experiment. Let's start by running the tiny model training script (GPU version) experiments/tutorials/train_tiny_model_gpu.py:

export MARIN_PREFIX=local_store
export WANDB_ENTITY=...
uv run python experiments/tutorials/train_tiny_model_gpu.py --prefix local_store

The prefix is the directory where the output will be saved. It can be a local directory or anything fsspec supports, such as s3:// or gs://.

Let's take a look at the script. Whereas the CPU version requests resources=ResourceConfig.with_cpu(), the GPU version requests resources=ResourceConfig.with_gpu(...):

from fray.cluster import ResourceConfig

nano_train_config = SimpleTrainConfig(
    # Here we define the hardware resources we need.
    resources=ResourceConfig.with_gpu("H100", count=8, cpu=32, disk="128G", ram="128G"),
    train_batch_size=256,
    num_train_steps=100,
    learning_rate=6e-4,
    weight_decay=0.1,
)

To scale up, submit to Marin's shared Iris cluster via uv run iris --cluster=marin job run ... (see lib/iris/OPS.md for the CLI reference).