Default Pipeline Steps
Marin comes with a set of default pipeline steps that can be used to build experiments.
These steps are defined in experiments.defaults and are intended to be used as building blocks for experiments.
In general, you should reach for the default steps before writing your own.
Downloading
default_download
default_download(
name: str,
hf_dataset_id: str,
revision: str | None = None,
override_output_path: str | None = None,
**kwargs: Any
) -> InputName
Download a HuggingFace dataset and upload it to a specified path with default configuration.
Parameters:
-
name(str) –The name of the Download step. It forms the basis of the output path unless override_output_path is explicitly specified.
-
hf_dataset_id(str) –Hugging Face source. Either $ORG/$DATASET on HF Hub or hf://buckets/....
-
revision(str | None, default:None) –The revision of the dataset to download for Hub datasets. Optional for bucket paths.
-
override_output_path(str | None, default:None) –Optional. The output path for the dataset.
-
**kwargs(Any, default:{}) –Additional keyword arguments that are passed to the download config.
The final output data will reside in '{output_path}/{revision}'.
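The path rule above can be sketched as a small helper. This is only an illustration of the '{output_path}/{revision}' layout: `final_download_path` is a hypothetical name, not part of experiments.defaults, and the sketch assumes a revision is always supplied.

```python
def final_download_path(output_path: str, revision: str) -> str:
    """Hypothetical helper: compose '{output_path}/{revision}'."""
    # Drop trailing slashes so the joined path never contains an empty segment.
    return f"{output_path.rstrip('/')}/{revision}"


# Example values are made up for illustration.
print(final_download_path("raw/my-dataset", "abc1234"))
```

Downstream steps that read the downloaded data would then point at the revision subdirectory rather than at the output path itself.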
Exporting and Uploading
upload_dir_to_hf
upload_dir_to_hf(
input_path: str | InputName | ExecutorStep,
repo_id: str,
repo_type: str = "dataset",
token: str | None = None,
certificate_path: str | None = None,
private: bool = False,
revision: str | None = None,
commit_batch_size: str = "1GiB",
**upload_kwargs: str
) -> ExecutorStep
Uploads a path (possibly a GCS path) to a Hugging Face repo. For local paths, it uses the huggingface_hub.upload_folder function. For GCS (or other fsspec) paths, it streams the files using preupload_lfs_files and/or upload_folder.
Parameters:
-
input_path(str | InputName | ExecutorStep) –path to upload (can be a GCS path)
-
repo_id(str) –the repo id to upload to (e.g. "username/repo_name")
-
repo_type(str, default:'dataset') –the type of repo to upload to (e.g. "dataset", "model", etc.)
-
token(str | None, default:None) –the token to use for authentication (if not provided, it will use the default token)
-
revision(str | None, default:None) –the branch to upload to (if not provided, it will use the default branch)
-
certificate_path(str | None, default:None) –where to store the certificate that we uploaded to HF (needed for executor idempotency). If not provided, a reasonable default will be used. Should be a path relative to the executor prefix.
Returns: ExecutorStep
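The local-versus-remote distinction described above could be approximated as follows. This is a minimal sketch, not the function's actual dispatch logic: `is_local_path` is a hypothetical helper, and treating "no URL scheme or file://" as local is an assumption.

```python
from urllib.parse import urlparse


def is_local_path(path: str) -> bool:
    """Hypothetical check: treat scheme-less and file:// paths as local."""
    scheme = urlparse(path).scheme
    # Anything with another scheme (gs://, s3://, ...) is assumed to be an fsspec path.
    return scheme in ("", "file")


for candidate in ["/tmp/checkpoints", "gs://my-bucket/checkpoints"]:
    kind = "local" if is_local_path(candidate) else "remote (fsspec)"
    print(f"{candidate}: {kind}")
```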
Tokenization
default_tokenize
default_tokenize(
name: str,
dataset: InputName | ExecutorStep | str | HfDatasetSpec,
tokenizer: str,
format: LmDatasetFormatBase = TextLmDatasetFormat(),
*,
sample_count: int | VersionedValue[int] | None = None,
is_validation: bool = False,
levanter_batch_size: int | None = None,
tags: Sequence[str] = (),
resources: ResourceConfig | None = None,
worker_resources: ResourceConfig | None = None
) -> ExecutorStep
Tokenizes a dataset using the specified tokenizer and Levanter's tokenization infrastructure.
Parameters:
-
name(str) –The name of the tokenized dataset. This is used to form the output path for the executor step. tokenized/ will be prepended to the name.
-
dataset(InputName | ExecutorStep | str | HfDatasetSpec) –The dataset to tokenize. This can be an InputName, an ExecutorStep, a string (either a path to the dataset or a HuggingFace dataset ID), or an HfDatasetSpec to specify a dataset with a particular subset name.
-
tokenizer(str) –The HuggingFace tokenizer name. It should match the tokenizer you intend to use in the tokenizer spec for the training run.
-
format(LmDatasetFormatBase, default:TextLmDatasetFormat()) –The format of the dataset. This is used to determine how to tokenize the data.
See Levanter's documentation for more details.
-
sample_count(int | VersionedValue[int] | None, default:None) –Optional limit on the number of samples to tokenize per shard. If None, tokenize everything.
-
is_validation(bool, default:False) –Whether the dataset is a validation set. Has no effect for HF datasets.
-
tags(Sequence[str], default:()) –Tags to attach to the Levanter dataset source for tagged evaluation.
Returns: An ExecutorStep that represents the tokenized dataset.
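Two of the behaviors described above, the tokenized/ name prefix and the per-shard sample_count limit, can be illustrated with plain Python. Both helpers are hypothetical stand-ins written for this sketch; they do not call Marin or Levanter, and the shard contents are invented.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional


def tokenized_step_name(name: str) -> str:
    """Hypothetical helper: prepend 'tokenized/' to the step name."""
    return f"tokenized/{name}"


def limit_shard(samples: Iterable[str], sample_count: Optional[int]) -> Iterator[str]:
    """Hypothetical helper: keep at most sample_count samples from one shard."""
    # islice with a stop of None yields every element, matching "tokenize everything".
    return islice(samples, sample_count)


shards = {"shard-0": ["a", "b", "c"], "shard-1": ["d", "e"]}
# The limit applies to each shard independently, not to the dataset as a whole.
limited = {shard: list(limit_shard(docs, 2)) for shard, docs in shards.items()}
print(tokenized_step_name("my-dataset"), limited)
```

Because the limit is per shard, the total number of tokenized samples grows with the number of shards.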
Training
default_train
default_train(
name: str,
tokenized: (
InputName | ExecutorStep | LMMixtureDatasetConfig
),
model_config: LmConfig,
train_config: SimpleTrainConfig,
tags: Sequence[str] = (),
use_default_validation: bool = True,
eval_harness_tasks: Sequence[
EvalTaskConfig
] = CORE_TASKS,
wandb_name: str | None = None,
wandb_group: str | None = None,
override_output_path: str | None = None,
) -> ExecutorStep
Train a language model using the default configuration.
Parameters:
-
name(str) –The name of the training run. Will form the basis of the output path for the executor step.
-
tokenized(InputName | ExecutorStep | LMMixtureDatasetConfig) –The tokenized data to train on. This can be an InputName, ExecutorStep, or LMMixtureDatasetConfig.
-
model_config(LmConfig) –Levanter LmConfig for the model to train.
-
train_config(SimpleTrainConfig) –SimpleTrainConfig for the training run.
-
tags(Sequence[str], default:()) –Any additional tags to add to the Wandb tracker.
-
use_default_validation(bool, default:True) –Whether to use the default validation sets (currently Paloma).
-
eval_harness_tasks(Sequence[EvalTaskConfig], default:CORE_TASKS) –List of evaluation harness tasks. Defaults to the CORE set of tasks. Pass () or [] to disable evaluation harness tasks.
-
wandb_name(str | None, default:None) –Optional W&B display name for this run. Defaults to W&B's auto-generated name.
-
wandb_group(str | None, default:None) –Optional W&B group to organize related runs (e.g., a sweep). If unset, defaults to $WANDB_GROUP.
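The wandb_group fallback described above can be sketched like this. `resolve_wandb_group` is a hypothetical helper written for illustration; how default_train actually reads the environment variable is not shown in this reference.

```python
import os
from typing import Optional


def resolve_wandb_group(wandb_group: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper: an explicit group wins, otherwise use $WANDB_GROUP."""
    if wandb_group is not None:
        return wandb_group
    # Returns None when the environment variable is not set either.
    return os.environ.get("WANDB_GROUP")


print(resolve_wandb_group("lr-sweep"))
```

Setting WANDB_GROUP once in the shell is a convenient way to group every run launched from that session, while the argument lets a single experiment override it.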
default_sft
default_sft(
name: str,
tokenized: (
InputName | ExecutorStep | LMMixtureDatasetConfig
),
model_config: LlamaConfig,
sft_config: SimpleSFTConfig,
tags: Sequence[str] = (),
) -> ExecutorStep
Creates an ExecutorStep for supervised fine-tuning of a language model.
This function provides a unified interface for both single-dataset SFT and mixture-based SFT with a simplified configuration approach.
Parameters:
-
name(str) –The name of the training run, forms the basis of the output path.
-
tokenized(InputName | ExecutorStep | LMMixtureDatasetConfig) –The tokenized data to train on. For a single dataset, pass an InputName or ExecutorStep for a tokenized dataset. For a mixture, pass an LMMixtureDatasetConfig with multiple datasets.
-
model_config(LlamaConfig) –Levanter LlamaConfig for the model architecture to train.
-
sft_config(SimpleSFTConfig) –Configuration for the SFT training process.
-
tags(Sequence[str], default:()) –Additional tags for WandB logging. Default: ().
Returns:
-
ExecutorStep–An ExecutorStep configured for supervised fine-tuning.
simulated_epoching_train
simulated_epoching_train(
name: str,
tokenized: (
InputName | ExecutorStep | LMMixtureDatasetConfig
),
model_config: LmConfig,
train_config: SimpleTrainConfig,
target_budget: int,
tags: Sequence[str] = (),
use_default_validation: bool = True,
eval_harness_tasks: Sequence[
EvalTaskConfig
] = CORE_TASKS,
) -> ExecutorStep
Simulates the number of epochs seen in a full training run by sub-sampling individual datasets. Otherwise, operates the same as default_train.
Parameters:
-
name(str) –The name of the training run. Will form the basis of the output path for the executor step.
-
tokenized(InputName | ExecutorStep | LMMixtureDatasetConfig) –The tokenized data to train on. This can be an InputName, ExecutorStep, or LMMixtureDatasetConfig.
-
model_config(LmConfig) –Levanter LmConfig for the model to train.
-
train_config(SimpleTrainConfig) –SimpleTrainConfig for the training run.
-
target_budget(int) –Target token budget to simulate.
-
tags(Sequence[str], default:()) –Any additional tags to add to the Wandb tracker.
-
use_default_validation(bool, default:True) –Whether to use the default validation sets (currently Paloma).
-
eval_harness_tasks(Sequence[EvalTaskConfig], default:CORE_TASKS) –List of evaluation harness tasks. Defaults to the CORE set of tasks. Pass () or [] to disable evaluation harness tasks.
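The idea behind simulated epoching can be checked with a little arithmetic. The sketch below assumes each dataset is sub-sampled to the fraction experiment_budget / target_budget, where experiment_budget is the token count of the short run; that rule, the helper names, and all numbers are assumptions made for illustration, not details confirmed by this reference.

```python
from fractions import Fraction


def subsample_fraction(experiment_budget: int, target_budget: int) -> Fraction:
    """Assumed rule: keep experiment_budget / target_budget of each dataset."""
    if not 0 < experiment_budget <= target_budget:
        raise ValueError("experiment_budget must be in (0, target_budget]")
    return Fraction(experiment_budget, target_budget)


def epochs_seen(tokens_trained, dataset_tokens) -> Fraction:
    """Number of passes over a dataset given the tokens drawn from it."""
    return Fraction(tokens_trained) / Fraction(dataset_tokens)


dataset_tokens = 2_000_000_000        # size of one dataset in the mixture
weight = Fraction(1, 10)              # its share of the training mixture
target_budget = 100_000_000_000       # tokens in the full run being simulated
experiment_budget = 10_000_000_000    # tokens in the short run actually trained

# Full run: 10B tokens drawn from a 2B-token dataset.
full_run_epochs = epochs_seen(weight * target_budget, dataset_tokens)

# Short run: 1B tokens drawn from the sub-sampled 200M-token dataset.
frac = subsample_fraction(experiment_budget, target_budget)
short_run_epochs = epochs_seen(weight * experiment_budget, dataset_tokens * frac)

print(full_run_epochs, short_run_epochs)
```

Under this assumption the short run repeats each dataset as many times as the full run would, which is the effect the function description calls simulating the number of epochs.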
Evaluation
default_eval
default_eval(
step: ExecutorStep | InputName | str,
resource_config: ResourceConfig = with_tpu("v4-8"),
evals: list[EvalTaskConfig] | None = None,
max_eval_instances: int | None = None,
apply_chat_template: bool = False,
discover_latest_checkpoint: bool = True,
) -> ExecutorStep
Create an ExecutorStep to evaluate the model using LM Evaluation Harness on a step.
Parameters:
-
step(ExecutorStep | InputName | str) –The step to evaluate.
-
evals(list[EvalTaskConfig], default:None) –List of evals to run. Defaults to the CORE_TASKS set defined in task_configs.py.
-
max_eval_instances(int, default:None) –Maximum number of evaluation instances to run.