Hardware Resource Configuration

Marin uses Fray for scheduling and resource management. Fray dispatches jobs to Iris on shared clusters and to a local backend for laptop runs. The fray library provides unified resource configuration types that translate into concrete cluster resource requests.

ResourceConfig

The main entry point for resource configuration. Use the static factory methods to create configurations:

from fray.cluster import ResourceConfig

# TPU configuration
tpu_config = ResourceConfig.with_tpu("v4-8")
tpu_multislice = ResourceConfig.with_tpu("v4-8", slice_count=2)

# GPU configuration
gpu_config = ResourceConfig.with_gpu("H100", count=8)
gpu_auto = ResourceConfig.with_gpu()  # auto-detect GPU type

# CPU-only configuration
cpu_config = ResourceConfig.with_cpu()

ResourceConfig dataclass

Resource requirements for a single task/replica.

replicas specifies the gang-scheduled replica count (e.g. the number of TPU slices in multislice training). The same field also exists on JobRequest; when both are set, JobRequest.replicas takes precedence. It is kept on ResourceConfig so that convenience builders such as with_tpu(..., slice_count=4) can carry the replica count alongside the resource spec.
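The precedence rule above can be sketched as follows. This is a minimal illustration, not Fray's implementation: ResourceSketch, JobRequestSketch, and effective_replicas are hypothetical stand-ins, and the sketch assumes that an unset JobRequest.replicas is represented as None.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-ins for illustration; these are not Fray's real classes.
@dataclass(frozen=True)
class ResourceSketch:
    replicas: int = 1


@dataclass(frozen=True)
class JobRequestSketch:
    resources: ResourceSketch
    # Assumption: None means the request did not set a replica count.
    replicas: Optional[int] = None


def effective_replicas(request: JobRequestSketch) -> int:
    """Prefer the request-level replica count, else fall back to the resource spec."""
    if request.replicas is not None:
        return request.replicas
    return request.resources.replicas


four_slices = ResourceSketch(replicas=4)
from_resources = effective_replicas(JobRequestSketch(four_slices))
from_request = effective_replicas(JobRequestSketch(four_slices, replicas=2))
print(from_resources, from_request)  # prints "4 2"
```

In this sketch the resource-level value acts only as a default, which is why a builder can set it without blocking a later override on the request.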

image is an optional override for the task container image. When it is None, the backend uses its cluster-configured default. Set it for jobs that need a custom runtime (e.g. an image with runsc/skopeo for sandboxing untrusted child workloads).

Attributes
cpu: float = 1
ram: str = '4g'
disk: str = '16g'
device: DeviceConfig = field(default_factory=CpuConfig)
preemptible: bool = True
regions: Sequence[str] | None = None
zone: str | None = None
replicas: int = 1
device_alternatives: Sequence[str] | None = None
image: str | None = None
Functions
chip_count
chip_count() -> int

Total accelerator chips across all replicas.
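As a rough illustration of what "total across all replicas" could mean for a TPU job, the sketch below multiplies chips per VM by VMs per slice by the replica count. The exact formula Fray uses is not stated here, so treat this as an assumption; TpuSliceSketch and total_chips are hypothetical names, and the numbers are arbitrary.

```python
from dataclasses import dataclass


# Hypothetical shape of one TPU slice; not a Fray type.
@dataclass(frozen=True)
class TpuSliceSketch:
    chips_per_vm: int
    vm_count: int


def total_chips(slice_shape: TpuSliceSketch, replicas: int) -> int:
    """Assumed formula: chips per VM x VMs per slice x number of replicas."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return slice_shape.chips_per_vm * slice_shape.vm_count * replicas


# Arbitrary example: 4 chips per VM, 2 VMs per slice, 2 slices.
example_total = total_chips(TpuSliceSketch(chips_per_vm=4, vm_count=2), replicas=2)
print(example_total)  # prints 16
```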

device_flops
device_flops(dtype: str = 'bf16') -> float
total_flops
total_flops(dtype: str = 'bf16') -> float
with_tpu staticmethod
with_tpu(
    tpu_type: str | Sequence[str],
    *,
    slice_count: int = 1,
    **kwargs: Any
) -> ResourceConfig

Create a resource config for TPU(s).

When tpu_type is a list, the first entry is canonical (used for chip_count, env_vars, and resource sizing) and the rest are alternatives. All types in a list must share both vm_count and chips_per_vm. A TPU VM is the atomic scheduling unit, so mixing variants with different per-VM chip counts (e.g. v6e-4 + v6e-8) would let the scheduler co-locate two partial-VM jobs onto a VM that cannot actually be shared.
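The list rule can be sketched as a small validation step. This is an illustration under stated assumptions, not the library's code: split_tpu_types, VmShape, and ILLUSTRATIVE_SHAPES are hypothetical, and the shape table is a hand-written example rather than Fray's real topology data.

```python
from typing import NamedTuple, Sequence, Union


class VmShape(NamedTuple):
    vm_count: int
    chips_per_vm: int


# Illustrative table only; real values would come from the library's TPU data.
ILLUSTRATIVE_SHAPES = {
    "v6e-4": VmShape(vm_count=1, chips_per_vm=4),
    "v6e-8": VmShape(vm_count=1, chips_per_vm=8),
    "v5litepod-4": VmShape(vm_count=1, chips_per_vm=4),
}


def split_tpu_types(tpu_type: Union[str, Sequence[str]]) -> tuple[str, list[str]]:
    """Return (canonical, alternatives), rejecting lists with mismatched VM shapes."""
    types = [tpu_type] if isinstance(tpu_type, str) else list(tpu_type)
    if not types:
        raise ValueError("at least one TPU type is required")
    canonical, *alternatives = types
    expected = ILLUSTRATIVE_SHAPES[canonical]
    for alt in alternatives:
        if ILLUSTRATIVE_SHAPES[alt] != expected:
            raise ValueError(
                f"{alt} has shape {ILLUSTRATIVE_SHAPES[alt]}, "
                f"but canonical type {canonical} has {expected}"
            )
    return canonical, alternatives


canonical, alternatives = split_tpu_types(["v6e-4", "v5litepod-4"])
print(canonical, alternatives)
```

In this sketch, passing ["v6e-4", "v6e-8"] raises ValueError because the two entries differ in chips_per_vm, which is the mismatch the paragraph above warns about.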

with_gpu staticmethod
with_gpu(
    gpu_type: str, count: int = 1, **kwargs: Any
) -> ResourceConfig
with_cpu staticmethod
with_cpu(**kwargs: Any) -> ResourceConfig

Device Configurations

These are the underlying device types wrapped by ResourceConfig:

CPU

CpuConfig dataclass

CPU-only device configuration.

Attributes
kind: DeviceKind = 'cpu'
variant: str = 'cpu'
Functions
chip_count
chip_count() -> int
device_flops
device_flops(dtype: str = 'bf16') -> float
default_env_vars
default_env_vars() -> dict[str, str]

GPU

GpuConfig dataclass

GPU device configuration.

Attributes
variant: GpuType
kind: DeviceKind = 'gpu'
count: int = 1
Functions
chip_count
chip_count() -> int
device_flops
device_flops(dtype: str = 'bf16') -> float
total_flops
total_flops(dtype: str = 'bf16') -> float
default_env_vars
default_env_vars() -> dict[str, str]

TPU

TpuConfig dataclass

TPU device configuration.

Parameters:

  • variant (TpuType): TPU accelerator type (e.g., "v5litepod-16", "v4-8")

  • topology (str | None, default: None): Optional topology specification (e.g., "2x2x1")
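A topology string such as "2x2x1" lists the chip-grid dimensions, and the product of those dimensions is the number of chips the topology spans. The sketch below shows one way to parse such a string; topology_dims and topology_chip_count are hypothetical helpers, and how TpuConfig itself interprets the field is not specified here.

```python
import math


def topology_dims(topology: str) -> tuple[int, ...]:
    """Parse a string like "2x2x1" into a tuple of positive integers."""
    # int() raises ValueError on empty or non-numeric parts such as "2x" or "axb".
    dims = tuple(int(part) for part in topology.lower().split("x"))
    if any(d <= 0 for d in dims):
        raise ValueError(f"topology dimensions must be positive: {topology!r}")
    return dims


def topology_chip_count(topology: str) -> int:
    """Number of chips spanned by the topology: the product of its dimensions."""
    return math.prod(topology_dims(topology))


dims = topology_dims("2x2x1")
chips = topology_chip_count("2x2x1")
print(dims, chips)  # prints "(2, 2, 1) 4"
```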

Attributes
variant: TpuType
kind: DeviceKind = 'tpu'
topology: str | None = None
Functions
chip_count
chip_count() -> int

Return the number of chips per VM for this TPU type.

vm_count
vm_count() -> int
device_flops
device_flops(dtype: str = 'bf16') -> float
total_flops
total_flops(dtype: str = 'bf16') -> float
default_env_vars
default_env_vars() -> dict[str, str]