Evaluation Overview¶

This document explains how Marin evaluates models and where to find runnable workflows.

For step-by-step usage, start with:

- Running Evaluations with Marin for multiple-choice, generation, and key eval suites.
- Harbor Framework Integration for Harbor-backed agent and benchmark evaluation.

Evaluation modes¶

Marin supports three primary evaluation paths:

Multiple-choice tasks: run through lm-evaluation-harness (during training or as standalone eval jobs).
Generation tasks: run through a vLLM-backed evaluator pipeline.
Harbor tasks: run through Marin's Harbor integration for containerized agent benchmarks and registry datasets.

Multiple-choice evaluation (LM Evaluation Harness)¶

For multiple-choice tasks, Marin uses a fork of lm-evaluation-harness: https://github.com/stanford-crfm/lm-evaluation-harness

Key integration points:

- train_lm runs in-loop evaluations periodically and logs to W&B when an EvalSuite is provided.
- default_eval runs standalone harness evaluation after training (or on an existing checkpoint).

Task sets¶

Task sets are configured in task_configs.py.

CORE_TASKS is the default for in-loop and standalone harness evals.
CORE_TASKS_PLUS_MMLU extends CORE_TASKS with MMLU.
You can define custom task lists in task_configs.py and pass them to default_eval.
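As a rough illustration of how one task list can extend another, the sketch below builds lists by composition. Everything in it is an assumption made for illustration: `TaskSpec`, its fields, the `*_stub` names, and the task entries are hypothetical placeholders, not Marin's actual types or the real contents of CORE_TASKS. The actual definitions are in task_configs.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical stand-in for a harness task entry (not Marin's real type)."""

    name: str
    num_fewshot: int = 0


# Placeholder entries for illustration only; Marin's actual CORE_TASKS differs.
core_tasks_stub = (
    TaskSpec("arc_easy", num_fewshot=10),
    TaskSpec("hellaswag", num_fewshot=10),
)

# Mirrors the idea behind CORE_TASKS_PLUS_MMLU: the same entries plus MMLU.
core_plus_mmlu_stub = core_tasks_stub + (TaskSpec("mmlu", num_fewshot=5),)

# A custom list can reuse existing entries and append new ones.
my_tasks = tuple(t for t in core_tasks_stub if t.name != "hellaswag") + (
    TaskSpec("piqa", num_fewshot=10),
)

print([t.name for t in my_tasks])
```

In a real experiment, a list like `my_tasks` would be defined with Marin's own task config type and handed to default_eval; the exact argument name is not shown here because this overview does not specify it.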

Note

See levanter_lm_eval_evaluator.py for the default evaluator implementation. Additional evaluators live in lib/marin/src/marin/evaluation/evaluators.

Reported metrics¶

Beyond task accuracy, Marin tracks these multiple-choice metrics:

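Bits per byte (bpb): bpb = -log_prob / (byte_length * ln(2)), where log_prob is the natural-log probability of the correct answer and dividing by ln(2) converts nats to bits.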
Log probability (logprob): raw log probability of the correct answer.
Choice log probability (choice_logprob): log_prob_correct - log(sum(exp(log_prob_i)))
Length-normalized choice probability (choice_prob_norm): exp(log_prob_correct / (byte_length_correct * ln(2))) / sum(exp(log_prob_i / (byte_length_i * ln(2))))
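The four metrics above can be computed from per-choice log probabilities and byte lengths. The following is a minimal sketch, not Marin's implementation: the function name, argument names, and returned dictionary are assumptions, and the inputs are taken to be natural-log probabilities. It subtracts the maximum before exponentiating, which leaves the results unchanged but avoids floating-point underflow on very negative log probabilities.

```python
import math

LN2 = math.log(2)


def multiple_choice_metrics(log_probs, byte_lengths, correct_idx):
    """Compute the multiple-choice metrics for one example.

    log_probs: natural-log probability of each answer choice (assumed input).
    byte_lengths: UTF-8 byte length of each answer choice.
    correct_idx: index of the correct choice.
    """
    lp_correct = log_probs[correct_idx]

    # Bits per byte: negative log probability, converted from nats to bits,
    # divided by the byte length of the correct answer.
    bpb = -lp_correct / (byte_lengths[correct_idx] * LN2)

    # choice_logprob = log_prob_correct - log(sum(exp(log_prob_i))),
    # evaluated as a stable log-sum-exp.
    m = max(log_probs)
    log_sum_exp = m + math.log(sum(math.exp(lp - m) for lp in log_probs))
    choice_logprob = lp_correct - log_sum_exp

    # choice_prob_norm: softmax over per-byte log probabilities in bits.
    scaled = [lp / (bl * LN2) for lp, bl in zip(log_probs, byte_lengths)]
    s = max(scaled)
    denom = sum(math.exp(x - s) for x in scaled)
    choice_prob_norm = math.exp(scaled[correct_idx] - s) / denom

    return {
        "bpb": bpb,
        "logprob": lp_correct,
        "choice_logprob": choice_logprob,
        "choice_prob_norm": choice_prob_norm,
    }


# Three choices of 4 bytes each; the correct one (index 0) has probability 0.5.
example = multiple_choice_metrics(
    log_probs=[math.log(0.5), math.log(0.25), math.log(0.25)],
    byte_lengths=[4, 4, 4],
    correct_idx=0,
)
print(example)
```

In this example the correct answer costs one bit spread over four bytes, so bpb comes out at about 0.25, and because the three probabilities already sum to 1, choice_logprob equals logprob.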

Generation-based evaluation¶

Generation tasks (for example AlpacaEval, HumanEval, GSM8K, and MATH) use a fast inference backend, typically vLLM.

Task and suite definitions are in task_configs.py.
A common entrypoint is run_key_evals.py.
The current generation-eval setup is documented in Running Evaluations with Marin.

In current Marin workflows, generation evals are commonly run with Dockerfile.vllm or as dedicated vLLM jobs submitted through Iris.

Harbor-based evaluation¶

Harbor tasks use evaluate_harbor and the Harbor evaluator integration to run registry datasets in containerized environments.

Harbor supports agent-style benchmarks such as AIME, Terminal-Bench, SWE-bench Verified, and other registry datasets.
Marin's Harbor integration supports local Docker and hosted environments such as Daytona, E2B, and Modal.
Setup, examples, and environment requirements are documented in Harbor Framework Integration.