Marin (internal)¶
This documentation is for internal developers.
Prerequisites¶
- Please read the general guidelines in guidelines.md
- Complete the environment setup in installation.md
Setup¶
Behind the scenes, we run an instance of Marin on Google Cloud Platform (GCP).
Ensure that someone (e.g. @dlwh) adds you to the hai-gcp-models project on GCP
as a Marin Dev. Make sure to install gcloud and then run:
gcloud auth login
gcloud auth application-default login
gcloud config set account <your-email-account>
gcloud config set project hai-gcp-models
# [Verification] Should show a [core] entry with `account = <your-email-account` and `project = hai-gcp-models`
gcloud config list
# [Verification] Should not throw a permission error
gcloud storage ls gs://marin-us-central2
make dev_setup
If you don't have permissions for hai-gcp-models or you run into permissions
issues, contact David Hall for help!
Iris Cluster + Job Submission¶
Once authenticated for GCP, all other work happens through our
Iris cluster. Cluster config templates live
under lib/iris/config/:
marin.yaml— production TPU cluster on GCPmarin-dev.yaml— dev TPU cluster on GCP (smaller scale caps)coreweave.yaml— GPU cluster on CoreWeave
Iris uses these configs as the single source of truth for cluster operations -- controller URL, SSH tunneling, and
auth are all derived from the config file you pass with --config. You don't need to manage SSH keys, head-node IPs,
or dashboard ports yourself.
For day-to-day debugging and operating live clusters, see
lib/iris/OPS.md.
Connecting to the Dashboard and Submitting Jobs¶
There are two steps necessary for 1) establishing a connection to the cluster and 2) submitting/monitoring jobs on the
cluster. You will need at least two terminal processes running for the following steps (make sure to activate your
marin Python environment as well):
# [Terminal 1] Open SSH tunnel, print the dashboard URL, and block.
uv run iris --cluster=marin cluster dashboard
# [Browser] Navigate to the URL printed above.
# Clicking a job opens its per-task/log view.
The tunnel lives as long as this process does — keep the terminal open. (If you're familiar with the NLP SLURM
workflow, think of this as your connection to sc.)
To submit jobs, use iris job run. Job IDs are canonical paths of the form /<user>/<job-name>-<timestamp>:
# [Terminal 2] Submit a job and return immediately.
# =>> Will print a line like `Job submitted: /<user>/hello_world-20260420-120000`
uv run iris --cluster=marin job run \
--no-wait --extra marin:tpu --tpu v5litepod-16 \
-- python experiments/tutorials/hello_world.py
# List jobs (filter by --state, --user, --prefix, etc).
uv run iris --cluster=marin job list
# Follow logs (includes child-job task logs by default).
uv run iris --cluster=marin job logs /<user>/<job-name>
# Kill / Stop Job (if necessary / error / bug) -- kills the job and all its child tasks.
uv run iris --cluster=marin job stop /<user>/<job-name>
Notes:
--extra marin:tpuinstalls the Marin TPU deps into the task container; use--extra marin:cpufor CPU-only entrypoints. On CoreWeave,--gpurequests hardware and--extra gpurequests the Python deps; seelib/iris/OPS.mdfor current request names.- Request TPU hardware with
--tpu v5litepod-16(or similar).--reserveonly holds capacity for scheduling; it does not attach accelerator devices to the task container. WANDB_API_KEY,HF_TOKEN,HF_DATASETS_TRUST_REMOTE_CODE, andTOKENIZERS_PARALLELISMare auto-injected from your shell. Pass other env vars with-e KEY VALUE(two positional args — quote$VALUEif it may be unset).- See
iris job run --helpandlib/iris/OPS.mdfor the full flag list (--memory,--job-name, priority bands, etc.).
Forking External Packages¶
See Forking Policy for our requirements on maintaining
forked dependencies under marin-community/.
Precommit¶
./infra/pre-commit.py --all-files --fix