Network Fabric
AI network requirements
Distributed training generates intensive communication among all workers (gradient synchronization, parameter broadcasting, and collective reductions). These patterns place different demands on the network than traditional east-west service traffic.
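To give a rough sense of scale, the sketch below estimates how much data each worker sends during one gradient synchronization. It assumes a ring AllReduce, in which each of N workers sends 2(N-1)/N times the payload size. The source does not say which algorithm Fabric uses, so both the algorithm and the model size, gradient width, and worker count are illustrative assumptions.

```python
def ring_allreduce_bytes_per_worker(num_workers: int, payload_bytes: int) -> float:
    """Bytes each worker sends in one ring AllReduce of `payload_bytes`.

    Assumes the standard ring algorithm: a reduce-scatter phase followed by
    an all-gather phase, each sending (N - 1) chunks of size payload / N.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    if num_workers == 1:
        return 0.0  # a single worker has nothing to synchronize
    return 2 * (num_workers - 1) / num_workers * payload_bytes


# Hypothetical workload: 1e9 parameters, 2-byte gradients, 8 workers.
params = 1_000_000_000
bytes_per_gradient = 2
sent = ring_allreduce_bytes_per_worker(8, params * bytes_per_gradient)
print(f"{sent / 1e9:.2f} GB sent per worker per step")  # prints "3.50 GB sent per worker per step"
```

Under these assumptions every worker moves several gigabytes on every training step, and all workers do so at the same moment.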
Collective communication
Fabric provides platform-level support for four collective communication primitives: AllReduce, AllGather, ReduceScatter, and Broadcast. These operations are optimized for the network topology of the underlying hardware cluster to reduce communication time during distributed training.
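For readers unfamiliar with these primitives, the following sketch models what each one computes, using plain Python lists to stand in for per-worker buffers. It is a single-process reference for the semantics only: the function names are hypothetical, this is not Fabric's API, and it says nothing about how the operations are mapped onto the network.

```python
from typing import List

Buffers = List[List[float]]  # buffers[r] is the data held by worker r


def broadcast(buffers: Buffers, root: int = 0) -> Buffers:
    """Every worker ends up with a copy of the root worker's buffer."""
    return [list(buffers[root]) for _ in buffers]


def all_gather(buffers: Buffers) -> Buffers:
    """Every worker ends up with the concatenation of all buffers."""
    gathered = [x for buf in buffers for x in buf]
    return [list(gathered) for _ in buffers]


def reduce_scatter(buffers: Buffers) -> Buffers:
    """Sum buffers element-wise, then give worker r the r-th chunk of the sum."""
    n = len(buffers)
    length = len(buffers[0])
    if length % n != 0:
        raise ValueError("buffer length must be divisible by worker count")
    chunk = length // n
    summed = [sum(buf[i] for buf in buffers) for i in range(length)]
    return [summed[r * chunk:(r + 1) * chunk] for r in range(n)]


def all_reduce(buffers: Buffers) -> Buffers:
    """Every worker ends up with the full element-wise sum.

    Expressed here as ReduceScatter followed by AllGather.
    """
    return all_gather(reduce_scatter(buffers))


grads = [[1, 2, 3, 4], [10, 20, 30, 40]]  # two workers, four gradients each
print(reduce_scatter(grads))  # [[11, 22], [33, 44]]
print(all_reduce(grads))      # [[11, 22, 33, 44], [11, 22, 33, 44]]
```

Writing AllReduce as ReduceScatter plus AllGather mirrors a common decomposition; a real implementation would pick the algorithm based on topology and message size.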
Bandwidth management
Network bandwidth is a shared resource in multi-tenant clusters. Fabric manages it in two ways: bandwidth-aware job placement co-locates workers that communicate with each other, and bandwidth allocation policies prevent training jobs from starving each other of network capacity.
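One way to picture an allocation policy that prevents starvation is max-min fair sharing of a single link. The sketch below is illustrative only: the source does not state which policy Fabric applies, and the function name, job names, and bandwidth figures are all hypothetical.

```python
from typing import Dict


def max_min_fair_share(capacity: float, demands: Dict[str, float]) -> Dict[str, float]:
    """Split `capacity` across jobs using max-min fairness.

    Jobs that ask for less than an equal share get their full demand; the
    capacity they leave unused is redistributed among the remaining jobs.
    """
    allocation = {job: 0.0 for job in demands}
    unsatisfied = dict(demands)
    remaining = float(capacity)

    while unsatisfied and remaining > 1e-9:
        fair_share = remaining / len(unsatisfied)
        # Jobs whose demand fits within the current equal share.
        satisfied = {j: d for j, d in unsatisfied.items() if d <= fair_share}
        if not satisfied:
            # Everyone wants more than the equal share: split evenly and stop.
            for job in unsatisfied:
                allocation[job] = fair_share
            break
        for job, demand in satisfied.items():
            allocation[job] = float(demand)
            remaining -= demand
            del unsatisfied[job]

    return allocation


# Hypothetical link with 100 Gb/s of capacity shared by three training jobs.
shares = max_min_fair_share(100, {"job-a": 20, "job-b": 50, "job-c": 80})
print(shares)  # {'job-a': 20.0, 'job-b': 40.0, 'job-c': 40.0}
```

In this example the small job receives everything it asked for, while the two larger jobs split the remainder evenly, so no job can crowd out the others.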