Running Evaluations with Marin
This guide shows the current evaluation entrypoints in Marin. For a high-level overview of the evaluation stack, see Evaluation Overview.
Prerequisites
- A trained model checkpoint in Hugging Face format, or an existing ExecutorStep from training.
- Access to the TPU or GPU resources required by the evaluator you choose.
Core APIs
The canonical helpers live in experiments/evals/evals.py:
from experiments.evals.evals import (
    default_eval,
    default_key_evals,
    evaluate_lm_evaluation_harness,
    evaluate_levanter_lm_evaluation_harness,
)
- default_eval runs CORE_TASKS through the Levanter LM evaluation harness by default.
- default_key_evals returns the current "key evals" bundle: one generation step over KEY_GENERATION_TASKS and one multiple-choice step over KEY_MULTIPLE_CHOICE_TASKS.
- evaluate_lm_evaluation_harness is the lower-level helper for custom vLLM-backed LM-eval runs.
- evaluate_levanter_lm_evaluation_harness is the lower-level helper for Levanter-backed evaluation runs.
Task sets are defined in experiments/evals/task_configs.py. The most commonly used ones are:
- CORE_TASKS
- CORE_TASKS_PLUS_MMLU
- KEY_GENERATION_TASKS
- KEY_MULTIPLE_CHOICE_TASKS
1. Run CORE_TASKS
Use default_eval when you want the default multiple-choice evaluation suite:
from fray.cluster import ResourceConfig
from experiments.evals.evals import default_eval
from marin.execution.executor import executor_main
model_path = "gs://marin-us-east5/gcsfuse_mount/perplexity-models/llama-200m"
core_eval_step = default_eval(
    step=model_path,
    resource_config=ResourceConfig.with_tpu("v4-8"),
    # Optional overrides:
    # evals=CORE_TASKS_PLUS_MMLU,
    # max_eval_instances=100,
)
if __name__ == "__main__":
    executor_main(steps=[core_eval_step])
- default_eval accepts a checkpoint path, an ExecutorStep, or an InputName.
- To include MMLU in this path, pass evals=CORE_TASKS_PLUS_MMLU, which is defined in experiments/evals/task_configs.py.
2. Run the Current Key-Evals Bundle
Use default_key_evals for the repository's current key-eval bundle:
from fray.cluster import ResourceConfig
from experiments.evals.evals import default_key_evals
from marin.execution.executor import executor_main
model_path = "gs://marin-us-east5/gcsfuse_mount/perplexity-models/llama-200m"
key_steps = default_key_evals(
    step=model_path,
    resource_config=ResourceConfig.with_tpu("v6e-8"),
    model_name="my_key_evals",
    # max_eval_instances=50,
)
if __name__ == "__main__":
    executor_main(steps=key_steps)
Today, default_key_evals returns two ExecutorSteps:
- A generation run over KEY_GENERATION_TASKS using evaluate_lm_evaluation_harness.
- A multiple-choice run over KEY_MULTIPLE_CHOICE_TASKS using evaluate_levanter_lm_evaluation_harness.
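Because default_eval returns a single step while default_key_evals returns a list, combining both in one script means building one flat list for executor_main. The sketch below shows one way to do that. The helper name flatten_steps is hypothetical (it is not part of Marin), and plain strings stand in for ExecutorStep objects so the example runs without a Marin checkout.

```python
from typing import Any


def flatten_steps(*items: Any) -> list[Any]:
    """Flatten a mix of single steps and lists/tuples of steps into one list.

    Hypothetical helper for illustration; not part of Marin.
    """
    flat: list[Any] = []
    for item in items:
        if isinstance(item, (list, tuple)):
            # e.g. the list returned by default_key_evals
            flat.extend(item)
        else:
            # e.g. the single step returned by default_eval
            flat.append(item)
    return flat


# Strings stand in for ExecutorStep objects so the sketch runs anywhere.
core_eval_step = "core_eval_step"
key_steps = ["key_generation_step", "key_multiple_choice_step"]

all_steps = flatten_steps(core_eval_step, key_steps)
print(all_steps)
```

In a real script the flattened list would be passed along as executor_main(steps=all_steps).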
At the time of writing, KEY_GENERATION_TASKS includes:
- ifeval
- gsm8k_cot
- drop
- humaneval
- bbh_cot_fewshot
- minerva_math
KEY_MULTIPLE_CHOICE_TASKS currently includes:
- mmlu 0-shot
- mmlu 5-shot
- truthfulqa_mc2
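MMLU appears twice in that list because the same task at two few-shot settings counts as two separate entries. The sketch below illustrates the idea with a stand-in dataclass. TaskSpec and label are hypothetical names that mirror only the name and num_fewshot fields of EvalTaskConfig used elsewhere in this guide; the real class may carry more fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical stand-in for EvalTaskConfig (name and num_fewshot only)."""

    name: str
    num_fewshot: int


def label(task: TaskSpec) -> str:
    """Build a readable label such as 'mmlu 5-shot'."""
    return f"{task.name} {task.num_fewshot}-shot"


# The same task name with different few-shot counts yields distinct entries.
mmlu_variants = [TaskSpec("mmlu", 0), TaskSpec("mmlu", 5)]

labels = [label(task) for task in mmlu_variants]
print(labels)  # ['mmlu 0-shot', 'mmlu 5-shot']
```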
3. Build a Custom Eval Step
Use the lower-level helpers when you want a custom task list or evaluator:
from fray.cluster import ResourceConfig
from experiments.evals.evals import evaluate_lm_evaluation_harness
from marin.evaluation.evaluation_config import EvalTaskConfig
from marin.execution.executor import executor_main
custom_tasks = [
    EvalTaskConfig(name="commonsense_qa", num_fewshot=5),
    EvalTaskConfig(name="openbookqa", num_fewshot=0),
]
custom_step = evaluate_lm_evaluation_harness(
    model_name="custom_eval",
    model_path="gs://path/to/model",
    evals=custom_tasks,
    resource_config=ResourceConfig.with_tpu("v4-8"),
    max_eval_instances=200,
)
if __name__ == "__main__":
    executor_main(steps=[custom_step])
Use evaluate_levanter_lm_evaluation_harness instead when you specifically want the Levanter-backed evaluator path used by default_eval.
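A minimal sketch of the Levanter-backed variant follows. It assumes the helper accepts the keyword arguments listed in the parameter reference of this guide (model_name, model_path, evals, resource_config, max_eval_instances); the exact signature has not been checked against evals.py here. The wrapper names drop_unset and levanter_eval_step are hypothetical. The Marin imports are placed inside the function so the module can be loaded without a Marin checkout, but calling levanter_eval_step does require one.

```python
from typing import Any


def drop_unset(**candidates: Any) -> dict[str, Any]:
    """Keep only arguments that were explicitly set, so the helper's own
    defaults apply to everything left as None. Hypothetical helper."""
    return {key: value for key, value in candidates.items() if value is not None}


def levanter_eval_step(model_name, model_path, tasks, tpu_type="v4-8",
                       max_eval_instances=None):
    """Build a Levanter-backed eval step from (task_name, num_fewshot) pairs.

    Assumption: keyword names follow the parameter reference in this guide.
    """
    # Local imports: only resolvable inside a Marin checkout.
    from fray.cluster import ResourceConfig
    from experiments.evals.evals import evaluate_levanter_lm_evaluation_harness
    from marin.evaluation.evaluation_config import EvalTaskConfig

    evals = [EvalTaskConfig(name=name, num_fewshot=shots) for name, shots in tasks]
    return evaluate_levanter_lm_evaluation_harness(
        model_name=model_name,
        model_path=model_path,
        evals=evals,
        resource_config=ResourceConfig.with_tpu(tpu_type),
        **drop_unset(max_eval_instances=max_eval_instances),
    )
```

Inside a Marin checkout, a call such as levanter_eval_step("custom_eval", "gs://path/to/model", [("openbookqa", 0)]) would return a step that can be handed to executor_main.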
4. Run the Repository Example Scripts
The checked-in examples under experiments/evals/ are the safest starting points because they track real repository usage:
uv run python experiments/evals/run_key_evals.py
uv run python experiments/evals/run_base_model_evals.py
uv run python experiments/evals/run_sft_model_evals.py
uv run python experiments/evals/run_on_gpu.py
These scripts launch the requested hardware, load the selected checkpoint or model definition, run the configured eval tasks, and log results to W&B.
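Since each script launches hardware, it can help to print the exact commands before running them. The sketch below builds the same uv run python commands listed above and only executes them when dry_run is turned off. The names build_command and run_examples are hypothetical and not part of the repository.

```python
import subprocess

# Script paths copied from the commands listed in this section.
EXAMPLE_SCRIPTS = [
    "experiments/evals/run_key_evals.py",
    "experiments/evals/run_base_model_evals.py",
    "experiments/evals/run_sft_model_evals.py",
    "experiments/evals/run_on_gpu.py",
]


def build_command(script: str) -> list[str]:
    """Return the argument list for `uv run python <script>`."""
    return ["uv", "run", "python", script]


def run_examples(scripts: list[str], dry_run: bool = True) -> list[list[str]]:
    """Print each command; execute it only when dry_run is False."""
    commands = [build_command(script) for script in scripts]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops at the first script that exits with an error.
            subprocess.run(command, check=True)
    return commands


# Dry run by default: nothing is launched, the commands are only printed.
run_examples(EXAMPLE_SCRIPTS)
```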
Parameter Reference
default_eval
- step: checkpoint path, ExecutorStep, or InputName to evaluate.
- resource_config: hardware configuration for the evaluator.
- evals: optional override for the task list. Defaults to CORE_TASKS.
- max_eval_instances: optional cap on evaluated examples.
- apply_chat_template: whether to apply the model chat template before evaluation.
- discover_latest_checkpoint: whether to resolve the latest checkpoint under the provided path.
default_key_evals
- step: checkpoint path, ExecutorStep, or InputName to evaluate.
- resource_config: hardware configuration for both returned steps.
- model_name: optional override for the logged model name.
- max_eval_instances: optional cap on evaluated examples.
- engine_kwargs: optional vLLM engine overrides for the generation step.
evaluate_lm_evaluation_harness
- model_name: run name for tracking.
- model_path: checkpoint path to evaluate.
- evals: list of EvalTaskConfig entries to run.
- max_eval_instances: optional cap on evaluated examples.
- engine_kwargs: optional vLLM engine overrides.
- resource_config: optional hardware configuration.
- apply_chat_template: whether to apply the chat template before evaluation.
- wandb_tags: optional W&B tags.
- discover_latest_checkpoint: whether to resolve the latest checkpoint under the provided path.
evaluate_levanter_lm_evaluation_harness
- model_name: run name used to construct the executor step.
- model_path: checkpoint path to evaluate.
- evals: list of EvalTaskConfig entries to run.
- resource_config: hardware configuration.
- max_eval_instances: optional cap on evaluated examples.
- apply_chat_template: whether to apply the chat template before evaluation.
- discover_latest_checkpoint: whether to resolve the latest checkpoint under the provided path.
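To give a feel for what resolving "the latest checkpoint under the provided path" can mean, here is a rough sketch that picks the highest-numbered checkpoint directory from a list of names. It is not Marin's implementation. The step-<N> directory naming and the function name latest_checkpoint are assumptions made for illustration, and the behaviour behind discover_latest_checkpoint may differ.

```python
import re
from typing import Optional

# Assumption: checkpoint directories are named "step-<N>".
STEP_PATTERN = re.compile(r"^step-(\d+)$")


def latest_checkpoint(dir_names: list[str]) -> Optional[str]:
    """Return the directory with the largest step number, or None if no
    name matches the assumed pattern."""
    best_name: Optional[str] = None
    best_step = -1
    for name in dir_names:
        match = STEP_PATTERN.match(name)
        if match is None:
            continue  # ignore anything that is not a step directory
        step = int(match.group(1))
        if step > best_step:
            best_step = step
            best_name = name
    return best_name


# Numeric comparison matters: as strings, "step-5000" sorts after "step-20000".
print(latest_checkpoint(["step-1000", "step-20000", "step-5000", "tmp"]))  # step-20000
```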
For deeper dives, see:
- docs/explanations/evaluation.md
- experiments/evals/task_configs.py
- experiments/evals/evals.py