Evaluation Overview¶
This document explains how Marin evaluates models and where to find runnable workflows.
For step-by-step usage, start with:
- Running Evaluations with Marin for multiple-choice, generation, and key eval suites.
- Harbor Framework Integration for Harbor-backed agent and benchmark evaluation.
Evaluation modes¶
Marin supports three primary evaluation paths:
- Multiple-choice tasks: run through lm-evaluation-harness (during training or as standalone eval jobs).
- Generation tasks: run through a vLLM-backed evaluator pipeline.
- Harbor tasks: run through Marin's Harbor integration for containerized agent benchmarks and registry datasets.
Multiple-choice evaluation (LM Evaluation Harness)¶
For multiple-choice tasks, Marin uses a fork of lm-evaluation-harness:
https://github.com/stanford-crfm/lm-evaluation-harness
Key integration points:
- default_train runs in-loop evaluations periodically and logs to W&B.
- default_eval runs standalone harness evaluation after training (or on an existing checkpoint).
Task sets¶
Task sets are configured in task_configs.py.
- CORE_TASKS is the default for in-loop and standalone harness evals.
- CORE_TASKS_PLUS_MMLU extends CORE_TASKS with MMLU.
- You can define custom task lists in task_configs.py and pass them to default_eval.
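As a rough illustration of how one task set can extend another, the sketch below composes task lists from immutable tuples. It does not use Marin's actual classes: TaskSpec, BASE_TASKS, BASE_TASKS_PLUS_MMLU, and extend_tasks are hypothetical stand-ins, and the few-shot counts are placeholders. The real definitions and the exact argument that default_eval expects are in task_configs.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical stand-in for a task entry; not Marin's real config class."""
    name: str
    num_fewshot: int = 0


def extend_tasks(base, *extra):
    """Return a new tuple with `extra` appended, skipping names already present."""
    seen = set()
    merged = []
    for task in (*base, *extra):
        if task.name not in seen:
            seen.add(task.name)
            merged.append(task)
    return tuple(merged)


# Illustrative values only; the real CORE_TASKS contents live in task_configs.py.
BASE_TASKS = (TaskSpec("arc_easy", 10), TaskSpec("hellaswag", 10))
BASE_TASKS_PLUS_MMLU = extend_tasks(BASE_TASKS, TaskSpec("mmlu", 5))

if __name__ == "__main__":
    print([t.name for t in BASE_TASKS_PLUS_MMLU])
```

Keeping task sets as tuples means an extended set never mutates the base set it was built from, so in-loop and standalone evals can safely share the same base list.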
Note
See levanter_lm_eval_evaluator.py for the default evaluator implementation.
Additional evaluators live in lib/marin/src/marin/evaluation/evaluators.
Reported metrics¶
Beyond task accuracy, Marin tracks these multiple-choice metrics:
- Bits per byte (bpb): bpb = -log_prob / (byte_length * ln(2))
- Log probability (logprob): raw log probability of the correct answer.
- Choice log probability (choice_logprob): log_prob_correct - log(sum(exp(log_prob_i)))
- Length-normalized choice probability (choice_prob_norm): exp(log_prob_correct / (byte_length_correct * ln(2))) / sum(exp(log_prob_i / (byte_length_i * ln(2))))
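The metrics above can be sketched in a few lines of plain Python. This is a minimal illustration, not Marin's evaluator code. It assumes log probabilities are natural logs (so dividing by ln(2) converts nats to bits) and that byte lengths are the UTF-8 lengths of each answer choice. The function name mc_metrics and its arguments are hypothetical.

```python
import math

LN2 = math.log(2)


def _softmax_at(values, index):
    """Numerically stable exp(values[index]) / sum(exp(values))."""
    m = max(values)
    denom = sum(math.exp(v - m) for v in values)
    return math.exp(values[index] - m) / denom


def mc_metrics(log_probs, byte_lengths, correct):
    """Compute per-example multiple-choice metrics.

    log_probs:    natural-log probability of each answer choice
    byte_lengths: byte length of each answer choice
    correct:      index of the correct choice
    """
    lp = log_probs[correct]
    # Bits per byte of the correct answer: nats -> bits, then per byte.
    bpb = -lp / (byte_lengths[correct] * LN2)
    # Log of the correct choice's share of probability mass across choices.
    choice_logprob = math.log(_softmax_at(log_probs, correct))
    # Same share, but on per-byte log probabilities expressed in bits.
    scaled = [x / (n * LN2) for x, n in zip(log_probs, byte_lengths)]
    choice_prob_norm = _softmax_at(scaled, correct)
    return {
        "bpb": bpb,
        "logprob": lp,
        "choice_logprob": choice_logprob,
        "choice_prob_norm": choice_prob_norm,
    }


if __name__ == "__main__":
    example = mc_metrics(
        log_probs=[math.log(0.5), math.log(0.25), math.log(0.25)],
        byte_lengths=[1, 1, 1],
        correct=0,
    )
    print(example)
```

Length normalization matters when answer choices differ in size: a long correct answer accumulates more negative log probability than a short distractor, and dividing by byte length keeps it from being penalized for length alone.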
Generation-based evaluation¶
Generation tasks (for example AlpacaEval, HumanEval, GSM8K, and MATH) use a fast inference backend, typically vLLM.
- Task and suite definitions are in task_configs.py.
- A common entrypoint is run_key_evals.py.
- Current generation-eval setup is documented in Running Evaluations with Marin.
In current Marin workflows, generation evals are commonly run with Dockerfile.vllm or as dedicated vLLM jobs submitted through Iris.
Harbor-based evaluation¶
Harbor tasks use evaluate_harbor and the Harbor evaluator integration to run registry datasets in containerized environments.
- Harbor supports agent-style benchmarks such as AIME, Terminal-Bench, SWE-bench Verified, and other registry datasets.
- Marin's Harbor integration supports local Docker and hosted environments such as Daytona, E2B, and Modal.
- Setup, examples, and environment requirements are documented in Harbor Framework Integration.