Scheduling
AI scheduling challenges
AI workloads present scheduling challenges that general-purpose schedulers handle poorly. Training jobs must acquire all of their required resources simultaneously. Long-running jobs have different priority semantics than short batch jobs. GPU fragmentation is a major source of inefficiency in multi-tenant clusters.
Gang scheduling
Distributed training jobs require all worker processes to start simultaneously. A partially started job wastes the resources allocated to it, because the workers that did start sit idle waiting for the rest. Fabric implements gang scheduling with admission control: a job is either fully admitted or not admitted at all.
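The all-or-nothing admission rule can be illustrated with a minimal Python sketch. This is not Fabric's implementation or API; the names GangJob and try_admit, the per-node free-GPU map, and the first-fit placement order are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class GangJob:
    """Hypothetical job description; not a Fabric type."""
    name: str
    workers: int          # number of worker processes in the gang
    gpus_per_worker: int  # GPUs each worker needs on a single node


def try_admit(job: GangJob, free_gpus_by_node: Dict[str, int]) -> Optional[Dict[str, int]]:
    """Return a node -> worker-count placement for the whole gang, or None.

    The free-GPU map is modified only if every worker can be placed.
    """
    placement: Dict[str, int] = {}
    remaining = job.workers
    for node, free in free_gpus_by_node.items():
        if remaining == 0:
            break
        # How many whole workers fit on this node (first-fit, assumed policy).
        fit = min(free // job.gpus_per_worker, remaining)
        if fit > 0:
            placement[node] = fit
            remaining -= fit
    if remaining > 0:
        return None  # all-or-nothing: reserve nothing for a partial gang
    # Commit the reservation only after the full gang has been placed.
    for node, count in placement.items():
        free_gpus_by_node[node] -= count * job.gpus_per_worker
    return placement
```

The key design point in this sketch is that placement is computed first and committed second, so a job that cannot be fully placed leaves the cluster state untouched.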
Backfill scheduling
Backfill scheduling improves cluster utilization by letting lower-priority jobs run on resources that would otherwise sit idle while a high-priority job waits for enough resources to start. Fabric's scheduler implements backfill with configurable time bounds to prevent starvation.
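One common way to bound backfill, sketched below in Python, is to admit a lower-priority job only if it fits in the idle GPUs and its declared maximum runtime ends before the waiting high-priority job is expected to start. This is an assumption about how a time bound could work, not a description of Fabric's scheduler; QueuedJob, max_runtime_s, pick_backfill, and reservation_start_s are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueuedJob:
    """Hypothetical queued job; not a Fabric type."""
    name: str
    gpus: int
    max_runtime_s: int  # declared upper bound on runtime, in seconds


def pick_backfill(
    queue: List[QueuedJob],
    free_gpus: int,
    now_s: int,
    reservation_start_s: int,
) -> List[QueuedJob]:
    """Select jobs that fit in idle GPUs and finish before the reservation.

    `queue` is assumed to be ordered by priority already.
    `reservation_start_s` is the time at which the waiting high-priority
    job is expected to have enough resources to start.
    """
    window = reservation_start_s - now_s
    chosen: List[QueuedJob] = []
    for job in queue:
        fits_space = job.gpus <= free_gpus
        fits_time = job.max_runtime_s <= window
        if fits_space and fits_time:
            chosen.append(job)
            free_gpus -= job.gpus
    return chosen
```

Because a backfilled job in this sketch must end before the reservation begins, it cannot delay the high-priority job, which is what keeps that job from being starved.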
Priority and preemption
Fabric supports multiple priority tiers with configurable preemption policies. Higher-priority workloads can preempt lower-priority workloads when resources are constrained. Preemption policies can be tuned to balance fairness, efficiency, and job completion guarantees.
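As a rough illustration of priority-based preemption, the Python sketch below picks victims from the lowest priority tier upward until the requesting workload's shortfall is covered, and preempts nothing if the shortfall cannot be covered. The policy, the RunningJob type, the select_victims function, and the convention that a larger number means higher priority are all assumptions; Fabric's configurable policies may behave differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunningJob:
    """Hypothetical running job; not a Fabric type."""
    name: str
    priority: int  # larger value = higher priority (assumed convention)
    gpus: int


def select_victims(
    running: List[RunningJob],
    requester_priority: int,
    gpus_needed: int,
    free_gpus: int,
) -> Optional[List[RunningJob]]:
    """Return jobs to preempt, [] if none are needed, or None if impossible."""
    shortfall = gpus_needed - free_gpus
    if shortfall <= 0:
        return []  # enough idle capacity; no preemption required
    # Only strictly lower-priority jobs are eligible, lowest tier first.
    candidates = sorted(
        (j for j in running if j.priority < requester_priority),
        key=lambda j: j.priority,
    )
    victims: List[RunningJob] = []
    for job in candidates:
        if shortfall <= 0:
            break
        victims.append(job)
        shortfall -= job.gpus
    # Preempt nothing if evicting every eligible job still would not suffice.
    return victims if shortfall <= 0 else None
```

Returning None rather than a partial victim list avoids evicting jobs when the requester still could not start, which is one way to weigh efficiency against job completion guarantees.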