Default Pipeline Steps

Marin provides a set of standard builders for the common stages of an LM experiment: data download, tokenization, mixture assembly, training, and evaluation. Reach for these before writing custom step code.

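The builders return lazy ArtifactStep[T] handles (e.g. ArtifactStep[TokenizedCache] or ArtifactStep[LevanterCheckpoint]) that StepRunner materializes on demand. The exceptions are mixture, which returns an LmDataConfig, and default_key_evals, which returns a list of handles.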

Download

hf_download

hf_download(
    name: str,
    *,
    hf_id: str,
    revision: str,
    version: str,
    urls_glob: Sequence[str] = (),
    pin: str | None = None,
    resources: ResourceConfig | None = None
) -> ArtifactStep[Artifact]

A HuggingFace-Hub dataset download as a raw-data handle.

Wraps marin.datakit.download.huggingface.download_hf into a handle that tokenized (via raw=) or marin.execution.lazy.apply can depend on. urls_glob restricts which files in the repo are fetched (an empty sequence fetches all of them). pin references an existing download at a fixed location instead of re-fetching it.
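The effect of urls_glob can be pictured with a small stand-alone sketch. It assumes shell-style matching as implemented by Python's fnmatch, which the description above does not confirm, and select_files is a hypothetical helper rather than part of Marin.

```python
from fnmatch import fnmatch
from typing import Sequence


def select_files(repo_files: Sequence[str], urls_glob: Sequence[str] = ()) -> list[str]:
    """Hypothetical helper: keep files matching any pattern.

    An empty urls_glob keeps every file, mirroring "empty = all".
    Assumption: patterns use fnmatch semantics, where "*" also matches "/".
    """
    if not urls_glob:
        return list(repo_files)
    return [f for f in repo_files if any(fnmatch(f, pat) for pat in urls_glob)]


# Illustrative file listing; these names are made up for the example.
files = [
    "README.md",
    "data/train-000.parquet",
    "data/train-001.parquet",
    "data/test.jsonl",
]

print(select_files(files))                      # no patterns: everything is kept
print(select_files(files, ["data/*.parquet"]))  # only the parquet shards
```

The real matching rules may differ (for example in how "**" or path separators are treated), so treat the sketch as an illustration of the empty-means-all convention only.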

raw_download

raw_download(
    name: str,
    *,
    fn: Callable[[object], object],
    build_config: Callable[[StepContext], object],
    version: str,
    pin: str | None = None,
    resources: ResourceConfig | None = None
) -> ArtifactStep[Artifact]

A raw-data download as an ArtifactStep[Artifact] that tokenized can depend on.

The generic download builder for a source that is not a HuggingFace-Hub dataset (use hf_download for that): fn(build_config(ctx)) writes the download to ctx.output_path. Returned as a raw marin.execution.artifact.Artifact (not a tokenized cache). pin references an existing download instead of re-fetching it.
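The fn(build_config(ctx)) contract can be sketched without Marin at all. In the sketch below FakeContext stands in for StepContext (only output_path is modelled), and CopyConfig, build_config, fn, and run_step are hypothetical names chosen for illustration; the real StepContext likely carries more than this.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class FakeContext:
    """Hypothetical stand-in for StepContext; only output_path is modelled."""
    output_path: str


@dataclass(frozen=True)
class CopyConfig:
    """Hypothetical config object produced by build_config."""
    records: tuple[str, ...]
    output_path: str


def build_config(ctx: FakeContext) -> CopyConfig:
    # The config records where the step has to write its output.
    return CopyConfig(records=("first doc", "second doc"), output_path=ctx.output_path)


def fn(config: CopyConfig) -> None:
    # The "download": write one JSON line per record under output_path.
    out_dir = Path(config.output_path)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / "data.jsonl", "w", encoding="utf-8") as f:
        for text in config.records:
            f.write(json.dumps({"text": text}) + "\n")


def run_step(ctx: FakeContext) -> None:
    # The composition the description names: fn(build_config(ctx)).
    fn(build_config(ctx))


with tempfile.TemporaryDirectory() as tmp:
    run_step(FakeContext(output_path=tmp))
    print(sorted(p.name for p in Path(tmp).iterdir()))
```

Keeping build_config separate from fn means the configuration can be computed and inspected from the context alone, before any data is fetched.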

Tokenization

tokenized

tokenized(
    name: str,
    *,
    tokenizer: str,
    version: str,
    source: str | None = None,
    paths: Sequence[str] | None = None,
    raw: ArtifactStep[Artifact] | None = None,
    glob: str | None = None,
    validation: bool = False,
    pin: str | None = None,
    text_key: str = "text",
    sample_count: int | None = None,
    tags: Sequence[str] = (),
    resources: ResourceConfig | None = None
) -> ArtifactStep[TokenizedCache]

A tokenized-dataset handle.

Provide exactly one raw input: source (a HuggingFace id org/name or a single raw path), paths (raw globs resolved against the run prefix), or raw + glob (a download handle and a subpath glob within it). validation=True routes the data to the cache's validation split. sample_count caps the number of documents tokenized per shard; it is identity-bearing, so a sampled cache is a different artifact from the full one. pin references already-tokenized data at an existing location instead of recomputing it.
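The exactly-one-input rule can be illustrated with a small check. pick_input_mode is a hypothetical helper, not Marin's validation code; it ignores pin, and the rule that glob is only accepted together with raw is an assumption drawn from the "raw + glob" wording.

```python
from typing import Optional, Sequence


def pick_input_mode(
    source: Optional[str] = None,
    paths: Optional[Sequence[str]] = None,
    raw: Optional[object] = None,
    glob: Optional[str] = None,
) -> str:
    """Hypothetical check: return which input form was supplied, or raise."""
    # Assumption: glob only makes sense as a subpath filter on a raw handle.
    if glob is not None and raw is None:
        raise ValueError("glob is only meaningful together with raw")
    candidates = (("source", source), ("paths", paths), ("raw", raw))
    supplied = [mode for mode, value in candidates if value is not None]
    if len(supplied) != 1:
        raise ValueError(
            f"expected exactly one of source, paths, raw; got {supplied or 'none'}"
        )
    return supplied[0]


print(pick_input_mode(source="org/name"))
print(pick_input_mode(raw=object(), glob="**/*.jsonl"))
```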

pretokenized

pretokenized(
    name: str,
    *,
    repo_id: str,
    tokenizer: str,
    version: str,
    revision: str | None = None,
    pin: str | None = None,
    tags: Sequence[str] = (),
    resources: ResourceConfig | None = None
) -> ArtifactStep[TokenizedCache]

A handle to an already-tokenized Levanter cache hosted on HuggingFace.

build_config(ctx) downloads the HF dataset repo repo_id into ctx.output_path as a Levanter cache; the handle then reads as a TokenizedCache with no re-tokenization. Use it where tokenizing through a tokenized handle would be too slow, e.g. for the fineweb-edu prebuilt subcaches. pin references an already-downloaded cache at an existing location instead of fetching it again.

Mixture assembly

mixture

mixture(
    ctx: StepContext,
    train: Mapping[ArtifactStep[TokenizedCache], float],
    *,
    validation: Sequence[ArtifactStep[TokenizedCache]] = (),
    shuffle: (
        bool | BlockShuffleConfig
    ) = DEFAULT_LM_DATA_SHUFFLE
) -> LmDataConfig

Assemble an LmDataConfig from dataset handles.

train maps each handle to its mixture weight; validation handles are added at weight 0. The component key is the handle's name (two handles sharing a name are rejected). At run time each component is built from its TokenizedCache record (tokenizer/format/path), never from the producing recipe — so adopted and pinned caches work the same as freshly tokenized ones. At fingerprint time (no records yet) the data contribution is the sorted {name@version: weight} map; the tokenizer is determined by the chosen datasets and verified at run time. Call this inside a consumer's build_config and pass the same handles as the step's deps so they materialize first.
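The naming and fingerprint rules can be sketched with plain dataclasses. Handle is a hypothetical stand-in for ArtifactStep[TokenizedCache], the version strings are made up, and the sketch assumes that only the train handles enter the fingerprint map, which the description leaves open.

```python
from dataclasses import dataclass
from typing import Mapping, Sequence


@dataclass(frozen=True)
class Handle:
    """Hypothetical stand-in for ArtifactStep[TokenizedCache]."""
    name: str
    version: str


def component_weights(
    train: Mapping[Handle, float], validation: Sequence[Handle] = ()
) -> dict[str, float]:
    """Key components by handle name; validation handles get weight 0."""
    weights: dict[str, float] = {}
    for handle, weight in list(train.items()) + [(h, 0.0) for h in validation]:
        if handle.name in weights:
            # Two handles sharing a name are rejected.
            raise ValueError(f"duplicate component name: {handle.name!r}")
        weights[handle.name] = weight
    return weights


def fingerprint_contribution(train: Mapping[Handle, float]) -> list[tuple[str, float]]:
    """Sorted (name@version, weight) pairs, available before any records exist."""
    return sorted((f"{h.name}@{h.version}", w) for h, w in train.items())


web = Handle("web", "v1")
code = Handle("code", "v1")
heldout = Handle("heldout", "v1")

print(component_weights({web: 0.7, code: 0.3}, [heldout]))
print(fingerprint_contribution({web: 0.7, code: 0.3}))
```

Sorting the pairs makes the contribution independent of the order in which the handles were listed, so two mixtures with the same datasets and weights fingerprint identically.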

Training

train_lm

train_lm(
    *,
    name: str,
    model: LmConfig,
    optimizer: OptimizerConfig,
    datasets: Mapping[ArtifactStep[TokenizedCache], float],
    batch_size: int,
    seq_len: int,
    num_train_steps: int,
    z_loss_weight: float | None,
    evals: EvalSuite | None,
    resources: ResourceConfig,
    version: str,
    validation: Sequence[ArtifactStep[TokenizedCache]] = (),
    init_from: (
        ArtifactStep[LevanterCheckpoint] | None
    ) = None,
    mp: str = MARIN_PRECISION,
    tensor_parallel_size: int = 1,
    steps_per_eval: int = 1000,
    wandb_project: str = "marin",
    wandb_group: str | None = None,
    run_id: str | None = None,
    tags: Sequence[str] = (),
    env_vars: dict[str, str] | None = None
) -> ArtifactStep[LevanterCheckpoint]

Assemble a language-model training run as an ArtifactStep[LevanterCheckpoint].

The required arguments are the run's identity-bearing decisions; the helper defaults none of them. datasets maps each tokenized-dataset handle to its mixture weight, and validation lists handles to add at weight 0. train_lm assembles the mixture (marin.experiment.data.mixture) internally and derives the step's deps from those handles, so they materialize first and the data config cannot desync from the dependencies. evals=None opts out of harness evals explicitly; there is no implicit default suite.

The remaining parameters are execution choices that do not define the experiment: mp (the standard marin precision, identity-bearing but universal), tensor_parallel_size (model sharding width), eval/checkpoint cadence, tracker metadata, and resources (the TPU the job is dispatched onto — a runtime arg, so it never enters the checkpoint's fingerprint). init_from chains this run onto another checkpoint (it becomes a dep and seeds initialize_from_checkpoint_path).

A mutable (dev) version namespaces the checkpoint per user — its name becomes users/{username}/{name} so concurrent authors of the same experiment do not clobber each other; a fixed (calendar) version keeps the shared name.
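The naming rule reduces to a small function. checkpoint_name is a hypothetical helper: how Marin decides that a version is mutable and where it gets the username are not described here, so both are passed in explicitly as assumptions.

```python
def checkpoint_name(name: str, *, mutable_version: bool, username: str) -> str:
    """Hypothetical helper mirroring the per-user namespacing rule."""
    if mutable_version:
        # Dev versions are namespaced so concurrent authors do not collide.
        return f"users/{username}/{name}"
    # Fixed versions keep the shared name.
    return name


print(checkpoint_name("my-run", mutable_version=True, username="alice"))
print(checkpoint_name("my-run", mutable_version=False, username="alice"))
```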

Evaluation

default_eval

default_eval(
    step: ArtifactStep[LevanterCheckpoint],
    resource_config: ResourceConfig = with_tpu("v4-8"),
    evals: list[EvalTaskConfig] | None = None,
    max_eval_instances: int | None = None,
    apply_chat_template: bool = False,
    discover_latest_checkpoint: bool = True,
) -> ArtifactStep[LevanterEvalResult]

Create an eval artifact by running LM Evaluation Harness on the model checkpoint that step refers to.

Parameters:

step (ArtifactStep[LevanterCheckpoint]) – LevanterCheckpoint handle to evaluate. Wrap a pre-existing checkpoint path with ArtifactStep.adopt(name, version, path, kind=LevanterCheckpoint).

evals (list[EvalTaskConfig] | None, default: None) – List of evals to run. Defaults to CORE_TASKS.

max_eval_instances (int | None, default: None) – Maximum number of evaluation instances to run.

default_key_evals

default_key_evals(
    step: ArtifactStep[LevanterCheckpoint],
    resource_config: ResourceConfig,
    model_name: str | None = None,
    max_eval_instances: int | None = None,
    engine_kwargs: (
        dict | None
    ) = DEFAULT_LM_EVAL_MODEL_KWARGS,
) -> list[ArtifactStep]

Create a list of eval artifacts for the model using LM Evaluation Harness.