Hardware Resource Configuration¶

Marin uses Fray for scheduling and resource management. Fray dispatches to Iris on shared clusters and to a local backend for laptop runs. The fray library provides unified resource configuration types that translate into concrete cluster resource requests.

ResourceConfig¶

ResourceConfig is the main entry point for resource configuration. Create configurations with its static factory methods:

from fray.cluster import ResourceConfig

# TPU configuration
tpu_config = ResourceConfig.with_tpu("v4-8")
tpu_multislice = ResourceConfig.with_tpu("v4-8", slice_count=2)

# GPU configuration
gpu_config = ResourceConfig.with_gpu("H100", count=8)
single_gpu = ResourceConfig.with_gpu("H100")  # count defaults to 1

# CPU-only configuration
cpu_config = ResourceConfig.with_cpu()

ResourceConfig `dataclass` ¶

Resource requirements for a single task/replica.

replicas specifies the gang-scheduled replica count (e.g. the number of TPU slices for multislice training). The same field also exists on JobRequest; when both are set, JobRequest.replicas takes precedence. The field exists here so that convenience builders like with_tpu(..., slice_count=4) can carry the replica count alongside the resource spec.
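
The precedence rule above can be sketched as a small standalone function. This is not fray code: effective_replicas is a hypothetical helper, and modelling "not set" on the job side as None is an assumption made for illustration.

```python
from __future__ import annotations


def effective_replicas(job_replicas: int | None, resource_replicas: int = 1) -> int:
    """Pick the replica count, preferring the job-level value when it is set."""
    # Assumption: an unset job-level value is represented as None.
    if job_replicas is not None:
        return job_replicas
    # Otherwise fall back to the count carried by the resource spec.
    return resource_replicas


# Job-level value unset: the resource-level count is used.
print(effective_replicas(None, 4))
# Both set: the job-level count wins.
print(effective_replicas(2, 4))
```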

image is an optional override for the task container image. When it is None, the backend uses its cluster-configured default. Set it for jobs that need a custom runtime (e.g. an image with runsc/skopeo for sandboxing untrusted child workloads).

Attributes¶

cpu `class-attribute` `instance-attribute` ¶

cpu: float = 1

ram `class-attribute` `instance-attribute` ¶

ram: str = '4g'

disk `class-attribute` `instance-attribute` ¶

disk: str = '16g'

device `class-attribute` `instance-attribute` ¶

device: DeviceConfig = field(default_factory=CpuConfig)

preemptible `class-attribute` `instance-attribute` ¶

preemptible: bool = True

regions `class-attribute` `instance-attribute` ¶

regions: Sequence[str] | None = None

zone `class-attribute` `instance-attribute` ¶

zone: str | None = None

replicas `class-attribute` `instance-attribute` ¶

replicas: int = 1

device_alternatives `class-attribute` `instance-attribute` ¶

device_alternatives: Sequence[str] | None = None

image `class-attribute` `instance-attribute` ¶

image: str | None = None

Functions¶

chip_count ¶

chip_count() -> int

Total accelerator chips across all replicas.
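
As a rough illustration of "across all replicas", the arithmetic is per-replica chips multiplied by the replica count. The helper below is a hypothetical stand-in, not the library's implementation, and the numbers are made up for the example.

```python
def total_chips(chips_per_replica: int, replicas: int = 1) -> int:
    """Total chips for a gang of identical replicas (illustrative helper)."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return chips_per_replica * replicas


# Hypothetical numbers: two replicas with 4 chips each.
print(total_chips(4, replicas=2))
```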

device_flops ¶

device_flops(dtype: str = 'bf16') -> float

total_flops ¶

total_flops(dtype: str = 'bf16') -> float

scale ¶

scale(
    factor: float | None = None,
    /,
    *,
    cpu: float | None = None,
    ram: float | None = None,
    disk: float | None = None,
) -> ResourceConfig

Return a copy with cpu/ram/disk multiplied by the given factors.

rc.scale(2) multiplies cpu, ram, and disk by 2; rc.scale(0.5) halves all three. Keyword args cpu, ram, and disk are multiplicative factors for individual dimensions (e.g. cpu=0.5 halves CPU); omitted dimensions keep their current value. factor cannot be combined with keyword factors.
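
The semantics described above can be sketched with a stand-in dataclass. SketchResources and _scale_size are hypothetical names, and the handling of size strings (a number followed by a single-letter unit) is an assumption; the real ResourceConfig.scale may parse and format sizes differently.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


def _scale_size(size: str, factor: float) -> str:
    """Multiply a size string such as '4g' by factor, keeping its unit suffix."""
    number, unit = float(size[:-1]), size[-1]
    return f"{number * factor:g}{unit}"


@dataclass(frozen=True)
class SketchResources:
    """Stand-in with the same cpu/ram/disk defaults as the documented fields."""

    cpu: float = 1
    ram: str = "4g"
    disk: str = "16g"

    def scale(self, factor=None, /, *, cpu=None, ram=None, disk=None) -> SketchResources:
        # A positional factor and per-dimension factors are mutually exclusive.
        if factor is not None and any(f is not None for f in (cpu, ram, disk)):
            raise ValueError("factor cannot be combined with keyword factors")
        if factor is not None:
            cpu = ram = disk = factor
        # Omitted dimensions are multiplied by 1, i.e. left unchanged.
        return replace(
            self,
            cpu=self.cpu * (1 if cpu is None else cpu),
            ram=_scale_size(self.ram, 1 if ram is None else ram),
            disk=_scale_size(self.disk, 1 if disk is None else disk),
        )


base = SketchResources()
print(base.scale(2))        # all three dimensions doubled
print(base.scale(cpu=0.5))  # only cpu halved
```

Returning a copy rather than mutating in place keeps the original configuration reusable as a baseline for several scaled variants.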

with_tpu `staticmethod` ¶

with_tpu(
    tpu_type: str | Sequence[str],
    *,
    slice_count: int = 1,
    **kwargs: Any
) -> ResourceConfig

Create a resource config for TPU(s).

When tpu_type is a list, the first entry is canonical (used for chip_count, env_vars, resource sizing) and the rest are alternatives. All types in a list must share both vm_count and chips_per_vm: a TPU VM is the atomic scheduling unit, so mixing variants with different per-VM chip counts (e.g. v6e-4 + v6e-8) would let the scheduler co-locate two partial-VM jobs onto a VM that cannot actually be shared.
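
The shape constraint on a list of TPU types might be checked along these lines. This is a minimal sketch, not fray's validation code: check_tpu_alternatives and EXAMPLE_TPU_SHAPES are hypothetical, the (vm_count, chips_per_vm) values in the table are illustrative, and only the two documented fields are compared.

```python
from __future__ import annotations

from typing import Mapping, Sequence

# Illustrative (vm_count, chips_per_vm) pairs; the real table lives in the library.
EXAMPLE_TPU_SHAPES: dict[str, tuple[int, int]] = {
    "v6e-4": (1, 4),
    "v6e-8": (1, 8),
    "v5litepod-4": (1, 4),
}


def check_tpu_alternatives(
    tpu_types: str | Sequence[str],
    shapes: Mapping[str, tuple[int, int]],
) -> str:
    """Return the canonical (first) type after checking that all shapes match."""
    if isinstance(tpu_types, str):
        tpu_types = [tpu_types]
    if not tpu_types:
        raise ValueError("at least one TPU type is required")
    primary, *alternatives = tpu_types
    expected = shapes[primary]
    for alt in alternatives:
        # Every alternative must match the primary's vm_count and chips_per_vm.
        if shapes[alt] != expected:
            raise ValueError(
                f"{alt} has shape {shapes[alt]}, but {primary} has shape {expected}"
            )
    return primary


print(check_tpu_alternatives(["v6e-4", "v5litepod-4"], EXAMPLE_TPU_SHAPES))
```

With the illustrative table, a list such as ["v6e-4", "v6e-8"] raises ValueError because the per-VM chip counts differ.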

cpu/ram default to DEFAULT_TPU_HOST_FRACTION (50%) of the primary type's per-VM host (see TPU_HOST_RESOURCES), leaving headroom for CPU-task multiplexing while giving training enough host RAM for checkpoint serialization. Pass cpu=/ram= to override.
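
The default sizing can be illustrated with plain arithmetic. In the sketch below, default_host_request is a hypothetical helper and the host size (96 CPUs, 400 GB RAM) is invented for the example; the actual per-VM values come from TPU_HOST_RESOURCES.

```python
from __future__ import annotations

HOST_FRACTION = 0.5  # mirrors the documented 50% default


def default_host_request(
    host_cpu: float,
    host_ram_gb: float,
    fraction: float = HOST_FRACTION,
    *,
    cpu: float | None = None,
    ram_gb: float | None = None,
) -> tuple[float, float]:
    """Return (cpu, ram_gb), using explicit overrides when they are given."""
    requested_cpu = host_cpu * fraction if cpu is None else cpu
    requested_ram = host_ram_gb * fraction if ram_gb is None else ram_gb
    return requested_cpu, requested_ram


# Hypothetical host with 96 CPUs and 400 GB of RAM.
print(default_host_request(96, 400))
# An explicit cpu override replaces only the cpu default.
print(default_host_request(96, 400, cpu=8))
```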

with_gpu `staticmethod` ¶

with_gpu(
    gpu_type: str, count: int = 1, **kwargs: Any
) -> ResourceConfig

with_cpu `staticmethod` ¶

with_cpu(**kwargs: Any) -> ResourceConfig

Device Configurations¶

These are the underlying device types wrapped by ResourceConfig:

CPU¶

CpuConfig `dataclass` ¶

CPU-only device configuration.

Attributes¶

kind `class-attribute` `instance-attribute` ¶

kind: DeviceKind = 'cpu'

variant `class-attribute` `instance-attribute` ¶

variant: str = 'cpu'

Functions¶

chip_count ¶

chip_count() -> int

device_flops ¶

device_flops(dtype: str = 'bf16') -> float

default_env_vars ¶

default_env_vars() -> dict[str, str]

GPU¶

GpuConfig `dataclass` ¶

GPU device configuration.

Attributes¶

variant `instance-attribute` ¶

variant: GpuType

kind `class-attribute` `instance-attribute` ¶

kind: DeviceKind = 'gpu'

count `class-attribute` `instance-attribute` ¶

count: int = 1

Functions¶

chip_count ¶

chip_count() -> int

device_flops ¶

device_flops(dtype: str = 'bf16') -> float

total_flops ¶

total_flops(dtype: str = 'bf16') -> float

default_env_vars ¶

default_env_vars() -> dict[str, str]

TPU¶

TpuConfig `dataclass` ¶

TPU device configuration.

Parameters:

variant (TpuType) – TPU accelerator type (e.g., "v5litepod-16", "v4-8")

topology (str | None, default: None) – Optional topology specification (e.g., "2x2x1")

Attributes¶

variant `instance-attribute` ¶

variant: TpuType

kind `class-attribute` `instance-attribute` ¶

kind: DeviceKind = 'tpu'

topology `class-attribute` `instance-attribute` ¶

topology: str | None = None

Functions¶

chip_count ¶

chip_count() -> int

Return the number of chips per VM for this TPU type.

vm_count ¶

vm_count() -> int

device_flops ¶

device_flops(dtype: str = 'bf16') -> float

total_flops ¶

total_flops(dtype: str = 'bf16') -> float

default_env_vars ¶

default_env_vars() -> dict[str, str]
