Observability
Philosophy
Fabric treats observability as a first-class design constraint. Every component emits structured telemetry at every level of the stack. Platform operators should be able to diagnose any issue without modifying the platform or deploying additional tooling.
Metrics
The platform exports structured metrics covering GPU utilization, memory pressure, network throughput, storage I/O, scheduler queue depth, and job lifecycle events. Metrics are emitted in a Prometheus-compatible format, so Prometheus or any ingestion system that accepts the same format can collect them.
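As an illustration of what a Prometheus-compatible format looks like, the sketch below renders a gauge in the Prometheus text exposition format using only the Python standard library. The `Gauge` class, the metric name `fabric_gpu_utilization_ratio`, and the `node` and `gpu` labels are hypothetical, chosen for this example; they are not Fabric's actual metric names or API.

```python
from dataclasses import dataclass, field


@dataclass
class Gauge:
    """One gauge metric with labelled samples (all names are hypothetical)."""
    name: str
    help_text: str
    samples: dict = field(default_factory=dict)  # sorted label tuple -> value

    def set(self, value: float, **labels: str) -> None:
        # Sort labels so the same label set always maps to the same key.
        self.samples[tuple(sorted(labels.items()))] = float(value)

    def render(self) -> str:
        # Text exposition format: HELP and TYPE lines, then one line per sample.
        # Label-value escaping is omitted to keep the sketch short.
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} gauge",
        ]
        for labels, value in self.samples.items():
            label_str = ",".join(f'{k}="{v}"' for k, v in labels)
            selector = f"{{{label_str}}}" if label_str else ""
            lines.append(f"{self.name}{selector} {value}")
        return "\n".join(lines) + "\n"


gpu = Gauge(
    "fabric_gpu_utilization_ratio",
    "Fraction of time the GPU was busy (hypothetical metric).",
)
gpu.set(0.87, node="node-a", gpu="0")
print(gpu.render(), end="")
```

A scraper would typically fetch this text over HTTP. In practice an existing client library would usually replace the hand-rolled rendering shown here.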
Traces
Distributed tracing follows a workload from submission through scheduling, execution, and completion. Because a single trace crosses platform component boundaries, operators can see which stage contributes latency.
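The idea can be sketched with a toy in-process tracer: every span carries the same trace ID, and each stage records its parent span, so the stages of one workload can be stitched back together and compared. The `Tracer` and `Span` classes are hypothetical and stand in for whatever tracing system the platform actually uses; only the four stage names come from the text above.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    start: float
    end: Optional[float] = None

    @property
    def duration(self) -> float:
        # A span that has not finished yet reports zero duration.
        return 0.0 if self.end is None else self.end - self.start


class Tracer:
    """Toy tracer: one trace ID, a stack of open spans, a list of finished ones."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex
        self.finished = []
        self._stack = []

    @contextmanager
    def span(self, name: str):
        # The innermost open span, if any, becomes the parent of the new span.
        parent = self._stack[-1].span_id if self._stack else None
        current = Span(self.trace_id, uuid.uuid4().hex[:16], parent, name,
                       time.monotonic())
        self._stack.append(current)
        try:
            yield current
        finally:
            current.end = time.monotonic()
            self._stack.pop()
            self.finished.append(current)


tracer = Tracer()
with tracer.span("job") as root:
    for stage in ("submission", "scheduling", "execution", "completion"):
        with tracer.span(stage):
            pass  # the real work of each stage would happen here

# Compare child spans to find the stage that took longest.
slowest = max((s for s in tracer.finished if s.parent_id), key=lambda s: s.duration)
print(f"slowest stage: {slowest.name}")
```

In a distributed setting the trace ID and parent span ID would be passed between components with each request rather than held in one process.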
Logs
All platform components emit structured logs with consistent schemas. Log levels, retention, and aggregation are configurable. Platform logs are available for both real-time diagnosis and historical analysis.
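A minimal sketch of schema-consistent structured logging, using Python's standard `logging` module, is shown below. The field names (`ts`, `level`, `logger`, `message`, `component`, `job_id`) and the logger name `fabric.scheduler` are assumptions made for the example, not Fabric's documented schema. The log is written to an in-memory buffer so the result can be inspected; a real component would write to stdout or a log aggregator.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Emit every record as one JSON object with a fixed set of keys."""

    # Hypothetical context fields; always present so the schema stays stable.
    CONTEXT_FIELDS = ("component", "job_id")

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for key in self.CONTEXT_FIELDS:
            # Values passed via `extra=` become attributes on the record.
            entry[key] = getattr(record, key, None)
        return json.dumps(entry)


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

log = logging.getLogger("fabric.scheduler")
log.setLevel(logging.INFO)  # the configurable log level
log.propagate = False       # keep output confined to this handler
log.addHandler(handler)

log.debug("queue scan started", extra={"component": "scheduler"})  # filtered out
log.info("job queued", extra={"component": "scheduler", "job_id": "job-123"})

records = [json.loads(line) for line in stream.getvalue().splitlines()]
print(records[0]["level"], records[0]["message"])
```

Because every record has the same keys, downstream aggregation can filter on `component` or `job_id` without per-component parsing rules.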