
Harbor Framework Integration

This document describes Marin's integration with Harbor, a framework for evaluating and optimizing agents in containerized environments.

Overview

The Harbor integration enables running any Harbor dataset from the Harbor registry without custom adapters. Harbor provides 45+ benchmarks including:

  • AIME (60 math problems from AIME 2024, 2025-I, 2025-II)
  • Terminal-Bench (89 terminal tasks)
  • SWE-bench Verified (500 software engineering tasks)
  • And 40+ more benchmarks for code, reasoning, data science, and more

Key Features

✅ Generic Integration - No custom adapters needed for each benchmark
✅ 45+ Datasets - Access entire Harbor registry with one evaluator
✅ Sandboxed Execution - Tasks run in isolated environments (Docker or cloud workspaces)
✅ Multi-Environment Support - Local Docker, Daytona, E2B, Modal
✅ Agent Flexibility - Use Claude Code, custom agents, or build your own

Quick Start

Prerequisites

Harbor is an optional dependency. The evaluate_harbor() function automatically installs it via the harbor extra.

Alternatively, install manually:

cd lib/marin
uv add --optional harbor "harbor>=0.1.42"

Running AIME Evaluation

from experiments.evals.evals import evaluate_harbor
from fray.cluster import ResourceConfig

# Evaluate AIME@1.0 (60 math problems)
step = evaluate_harbor(
    model_name="anthropic/claude-opus-4-1",
    model_path=None,  # API model
    dataset="aime",
    version="1.0",
    max_eval_instances=5,  # Start with 5 tasks
    agent="claude-code",
    n_concurrent=2,
    env="daytona",  # Use cloud sandboxes to avoid local Docker setup
)

Or use the provided sanity check script:

export ANTHROPIC_API_KEY=your_key_here
export DAYTONA_API_KEY=your_key_here
uv run python experiments/exp_harbor_aime_sanity_check.py --prefix ./runs

# Use local Docker instead of Daytona:
ENV_TYPE=local uv run python experiments/exp_harbor_aime_sanity_check.py --prefix ./runs

Available Datasets

See the full list at harborframework.com/registry.

Popular datasets include:

| Dataset | Version | Tasks | Description |
|---|---|---|---|
| aime | 1.0 | 60 | Competition math problems |
| terminal-bench | 2.0 | 89 | Terminal/bash tasks |
| swebench-verified | 1.0 | 500 | Software engineering bugs |
| ds-1000 | 6.0 | 1000 | Data science problems |
| gpqa-diamond | 1.0 | 198 | Graduate-level science Q&A |
| usaco | 2.0 | 304 | Programming competition problems |
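Because evaluate_harbor() defaults to version="1.0", datasets published under a different version need it passed explicitly. The following is a minimal sketch of one way to keep the two in sync. The `KNOWN_VERSIONS` mapping and the `dataset_kwargs` helper are hypothetical (they are not part of Marin or Harbor), and the versions are simply copied from the table above, so they may be out of date.

```python
# Hypothetical helper: versions are copied from the table above and may lag
# behind the Harbor registry; verify them before relying on this mapping.
KNOWN_VERSIONS = {
    "aime": "1.0",
    "terminal-bench": "2.0",
    "swebench-verified": "1.0",
    "ds-1000": "6.0",
    "gpqa-diamond": "1.0",
    "usaco": "2.0",
}


def dataset_kwargs(dataset: str) -> dict[str, str]:
    """Return the dataset/version keyword arguments for evaluate_harbor()."""
    if dataset not in KNOWN_VERSIONS:
        raise KeyError(f"No pinned version for {dataset!r}; look it up in the registry")
    return {"dataset": dataset, "version": KNOWN_VERSIONS[dataset]}


# Usage: splat the result into the call, e.g.
# evaluate_harbor(model_name=..., model_path=None, **dataset_kwargs("terminal-bench"))
print(dataset_kwargs("terminal-bench"))
```

Failing loudly on an unknown dataset avoids silently falling back to the "1.0" default.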

API Reference

evaluate_harbor()

def evaluate_harbor(
    model_name: str,
    model_path: str | None,
    dataset: str,
    version: str = "1.0",
    max_eval_instances: int | None = None,
    resource_config: ResourceConfig | None = None,
    apply_chat_template: bool = False,
    wandb_tags: list[str] | None = None,
    generation_params: dict | None = None,
    agent: str = "claude-code",
    n_concurrent: int = 4,
    env: str = "local",
) -> ExecutorStep

Parameters:

- model_name: Model identifier (e.g., "anthropic/claude-opus-4-1", "qwen2.5-7b-instruct")
- model_path: Path to model (None for API models, GCS path for custom models)
- dataset: Harbor dataset name from registry
- version: Dataset version (default: "1.0")
- max_eval_instances: Limit number of tasks (None = all tasks)
- agent: Harbor agent type:
    - "claude-code" - Anthropic's Claude Code agent (default)
    - "terminus-2" - Harbor's reference agent
- n_concurrent: Number of parallel trials (default: 4)
- env: Container environment:
    - "local" - Local Docker (default, good for testing)
    - "daytona" - Daytona cloud containers (requires API key)
    - "e2b" - E2B containers (requires API key)
    - "modal" - Modal containers (requires API key)
- wandb_tags: Additional W&B tags
- resource_config: Fray resource configuration (dispatched to Iris or a local backend)

Examples

Terminal-Bench (Terminal Tasks)

step = evaluate_harbor(
    model_name="anthropic/claude-opus-4-1",
    model_path=None,
    dataset="terminal-bench",
    version="2.0",
    max_eval_instances=10,
    agent="claude-code",
    n_concurrent=4,
    env="local",
)

SWE-bench Verified (Software Engineering)

step = evaluate_harbor(
    model_name="anthropic/claude-opus-4-1",
    model_path=None,
    dataset="swebench-verified",
    version="1.0",
    max_eval_instances=50,
    agent="claude-code",
    n_concurrent=8,
    env="daytona",  # Use cloud for better performance
)

Custom Model (Qwen 2.5 Instruct)

step = evaluate_harbor(
    model_name="qwen2.5-7b-instruct",
    model_path="gs://marin-us-central2/models/qwen2.5-7b-instruct",
    dataset="aime",
    version="1.0",
    agent="terminus-2",
    n_concurrent=4,
    env="local",
)

Architecture

HarborEvaluator

The HarborEvaluator class in lib/marin/src/marin/evaluation/evaluators/harbor_evaluator.py provides the integration.

Key design decisions:

1. No adapters needed - Uses Harbor's registry system to load any dataset
2. CLI-based execution - Wraps harbor run command for reliability
3. Standardized results - Parses Harbor's JSON output into Marin format
4. Optional dependency - Harbor installed via --extra harbor only when needed
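To illustrate the CLI-based design, here is a rough sketch of how a wrapper might assemble a harbor run invocation. It is not the actual HarborEvaluator code. The helper names are hypothetical, and the flag names (--dataset, --agent, --model, --n-concurrent, --env) and the name@version dataset syntax are assumptions that should be checked against harbor run --help.

```python
import shutil


def harbor_available() -> bool:
    """Return True if a `harbor` executable is on PATH."""
    return shutil.which("harbor") is not None


def build_harbor_command(
    dataset: str,
    version: str,
    agent: str,
    model_name: str,
    n_concurrent: int = 4,
    env: str = "local",
) -> list[str]:
    """Build an argv list for `harbor run`; flag names are assumptions."""
    cmd = [
        "harbor", "run",
        "--dataset", f"{dataset}@{version}",  # assumed name@version syntax
        "--agent", agent,
        "--model", model_name,
        "--n-concurrent", str(n_concurrent),
    ]
    # Assumption: the local Docker backend is Harbor's default, so --env is
    # only passed for cloud environments.
    if env != "local":
        cmd += ["--env", env]
    return cmd


cmd = build_harbor_command("aime", "1.0", "claude-code", "anthropic/claude-opus-4-1", 2, "daytona")
print(" ".join(cmd))
```

Building an argv list rather than a shell string means the command can be handed to subprocess.run without shell quoting, and harbor_available() gives a cheap check before launching.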

Result Format

Harbor returns results with per-task rewards (0.0 to 1.0):

{
  "trials": {
    "aime_60": {
      "reward": 1.0,
      "correct": true,
      "status": "success",
      "trajectory_length": 15
    },
    "aime_61": {
      "reward": 0.0,
      "correct": false,
      "status": "failed",
      "trajectory_length": 42
    }
  },
  "aggregate": {
    "total_trials": 60,
    "successful_trials": 42,
    "mean_reward": 0.70,
    "accuracy": 0.70
  }
}
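The aggregate block can be recomputed from the per-trial entries, which is a useful sanity check when parsing results. The sketch below assumes the JSON layout shown above and assumes that successful_trials counts trials whose correct field is true. The `summarize_trials` helper is hypothetical and is not Marin's actual parser.

```python
import json


def summarize_trials(results: dict) -> dict:
    """Recompute aggregate metrics from the per-trial entries."""
    trials = results["trials"]
    total = len(trials)
    rewards = [trial["reward"] for trial in trials.values()]
    # Assumption: a trial is "successful" when its `correct` flag is true.
    correct = sum(1 for trial in trials.values() if trial["correct"])
    return {
        "total_trials": total,
        "successful_trials": correct,
        "mean_reward": sum(rewards) / total if total else 0.0,
        "accuracy": correct / total if total else 0.0,
    }


raw = """
{
  "trials": {
    "aime_60": {"reward": 1.0, "correct": true, "status": "success", "trajectory_length": 15},
    "aime_61": {"reward": 0.0, "correct": false, "status": "failed", "trajectory_length": 42}
  }
}
"""
summary = summarize_trials(json.loads(raw))
print(summary)
```

With binary rewards, as in this example, mean_reward and accuracy coincide; they diverge only if a dataset awards partial credit.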

## Environment Variables

Set the variables required by the agent and environment you use:

- `ANTHROPIC_API_KEY` - For Claude Code agent
- `OPENAI_API_KEY` - For OpenAI-based agents
- `DAYTONA_API_KEY` - For Daytona environment
- `E2B_API_KEY` - For E2B environment
- `MODAL_API_KEY` - For Modal environment
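A quick preflight check can catch a missing key before any sandbox is started. The sketch below encodes only the pairings listed above. The `missing_keys` helper and its lookup tables are hypothetical, and agents such as "terminus-2" are left unmapped because the key they need depends on the model provider.

```python
import os
from collections.abc import Mapping

# Pairings taken from the list above; extend as needed for other agents.
AGENT_KEYS = {"claude-code": "ANTHROPIC_API_KEY"}
ENV_KEYS = {
    "daytona": "DAYTONA_API_KEY",
    "e2b": "E2B_API_KEY",
    "modal": "MODAL_API_KEY",
}


def missing_keys(agent: str, env: str, environ: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or empty."""
    required = [AGENT_KEYS.get(agent), ENV_KEYS.get(env)]
    return [name for name in required if name and not environ.get(name)]


# Usage with an explicit mapping; omit the third argument to check os.environ.
print(missing_keys("claude-code", "daytona", {"ANTHROPIC_API_KEY": "sk-example"}))
```

The "local" environment has no entry in ENV_KEYS, so it never contributes a required key.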

## Troubleshooting

### "harbor: command not found"

Harbor should be auto-installed via `pip_dependency_groups=["harbor"]`. If not:
```bash
cd lib/marin
uv sync --extra harbor
```

Docker errors in local environment

Ensure the Docker daemon is running:

docker ps

Slow execution with local Docker

Consider using a cloud environment for better performance:

env="daytona"  # Or "e2b", "modal"

API key errors

Check environment variables:

echo $ANTHROPIC_API_KEY

References