Hardware Resource Configuration
Marin uses Fray for scheduling and resource management (dispatching to Iris on shared clusters, or to a local backend for laptop runs). The fray library provides unified resource configuration types that translate to concrete cluster resource requests.
ResourceConfig
The main entry point for resource configuration. Use the static factory methods to create configurations:
    from fray.cluster import ResourceConfig

    # TPU configuration
    tpu_config = ResourceConfig.with_tpu("v4-8")
    tpu_multislice = ResourceConfig.with_tpu("v4-8", slice_count=2)

    # GPU configuration
    gpu_config = ResourceConfig.with_gpu("H100", count=8)
    gpu_auto = ResourceConfig.with_gpu()  # auto-detect GPU type

    # CPU-only configuration
    cpu_config = ResourceConfig.with_cpu()
ResourceConfig (dataclass)
Resource requirements for a single task/replica.
replicas specifies gang-scheduled replica count (e.g. TPU slices for
multislice training). It is also on JobRequest; when both are set,
JobRequest.replicas takes precedence. This field exists here so that
convenience builders like with_tpu(..., slice_count=4) can carry the
replica count alongside the resource spec.
image is an optional override for the task container image. When None,
the backend uses its cluster-configured default. Used for jobs that need
a custom runtime (e.g. an image with runsc/skopeo for sandboxing
untrusted child workloads).
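The precedence and fallback rules described above can be sketched with stand-in dataclasses. This is a minimal illustration, not fray's implementation: `ResourceSpecSketch`, `JobRequestSketch`, `effective_replicas`, and `effective_image` are hypothetical names, "unset" is modeled as `None`, and the default of one replica is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResourceSpecSketch:
    """Hypothetical stand-in for ResourceConfig (only the fields discussed here)."""
    replicas: Optional[int] = None
    image: Optional[str] = None


@dataclass(frozen=True)
class JobRequestSketch:
    """Hypothetical stand-in for JobRequest."""
    resources: ResourceSpecSketch
    replicas: Optional[int] = None


def effective_replicas(job: JobRequestSketch) -> int:
    # The job-level value takes precedence when it is set.
    if job.replicas is not None:
        return job.replicas
    # Otherwise use the count carried on the resource spec.
    if job.resources.replicas is not None:
        return job.resources.replicas
    return 1  # assumed default when neither is set


def effective_image(spec: ResourceSpecSketch, cluster_default: str) -> str:
    # image=None means "use the cluster-configured default image".
    return spec.image if spec.image is not None else cluster_default
```

With these stand-ins, a spec carrying `replicas=4` resolves to 4 replicas on its own and to 2 when the job request also sets `replicas=2`.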
Attributes
device_alternatives (class-attribute, instance-attribute)
Functions
with_tpu (staticmethod)
Create a resource config for TPU(s).
When tpu_type is a list, the first entry is canonical (used for
chip_count, env_vars, resource sizing) and the rest are alternatives.
All types in a list must share both vm_count and chips_per_vm:
a TPU VM is the atomic scheduling unit, so mixing variants with
different per-VM chip counts (e.g. v6e-4 + v6e-8) would let
the scheduler co-locate two partial-VM jobs onto a VM that cannot
actually be shared.
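The shape constraint on a list of TPU types can be illustrated with a small check. This is a sketch under stated assumptions rather than the library's own validation: `TpuShape`, `ILLUSTRATIVE_SHAPES`, and `check_tpu_alternatives` are hypothetical names, and the shape table holds illustrative values that a real implementation would read from its own topology data. The sketch checks only the two fields named above and says nothing about any other compatibility rules.

```python
from typing import NamedTuple, Sequence


class TpuShape(NamedTuple):
    vm_count: int
    chips_per_vm: int


# Illustrative values only (assumption), not fray's topology table.
ILLUSTRATIVE_SHAPES = {
    "v6e-4": TpuShape(vm_count=1, chips_per_vm=4),
    "v6e-8": TpuShape(vm_count=1, chips_per_vm=8),
    "v5p-8": TpuShape(vm_count=1, chips_per_vm=4),
}


def check_tpu_alternatives(tpu_types: Sequence[str]) -> str:
    """Return the canonical (first) type after checking that all shapes match."""
    if not tpu_types:
        raise ValueError("at least one TPU type is required")
    canonical = tpu_types[0]
    expected = ILLUSTRATIVE_SHAPES[canonical]
    for alt in tpu_types[1:]:
        shape = ILLUSTRATIVE_SHAPES[alt]
        # Both vm_count and chips_per_vm must equal the canonical entry's.
        if shape != expected:
            raise ValueError(
                f"{alt} has shape {shape}, but canonical {canonical} has {expected}"
            )
    return canonical
```

Under these illustrative shapes, `["v6e-4", "v5p-8"]` passes and yields `"v6e-4"` as the canonical type, while `["v6e-4", "v6e-8"]` raises `ValueError` because the per-VM chip counts differ.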
Device Configurations
These are the underlying device types wrapped by ResourceConfig: