Scheduling·June 2026·10 min read

Distributed Scheduling for AI Workloads

Scheduling is the core decision-making system of any compute platform. For AI workloads, the scheduling problem has properties that require different algorithms than those used in general-purpose cluster schedulers.

The gang scheduling requirement

Distributed training jobs are not embarrassingly parallel. Each worker process communicates with every other worker process during gradient synchronization. If any worker fails to start, or if workers start at different times, the job cannot make progress.

This means distributed training jobs must be gang-scheduled — all required resources must be allocated atomically. The scheduler must either commit resources for all workers simultaneously, or not commit at all.

Backfill scheduling

Backfill scheduling addresses cluster fragmentation by allowing smaller jobs to run in spaces that would otherwise be idle while waiting for a large job's resources to become available. The algorithm works by identifying which queued jobs could complete before the high-priority job's resources become available and admitting those jobs ahead of it.

Priority and preemption

In multi-tenant clusters, different teams and workloads have different priority levels. Preemption allows the scheduler to evict lower-priority jobs to make room for higher-priority arrivals. For AI training, preemption is expensive — a preempted job must restart from its last checkpoint, losing potentially hours of compute. Fabric implements configurable preemption policies that allow organizations to tune this tradeoff.

Fairshare scheduling

In shared research clusters, fairness between teams is as important as efficiency. Fairshare scheduling tracks historical resource consumption and biases future allocations toward teams that have consumed less than their fair share — while accounting for the bursty nature of large training runs.

Back to Engineering Blog