Chief Information Officers (CIOs) are increasingly focused on maximizing the benefits of artificial intelligence (AI) investments, which include enhanced productivity, improved customer experience (CX), and the advancement of digital transformation initiatives. Interest in AI infrastructure, particularly in graphics processing units (GPUs) and AI servers, has surged among Gartner clients.
GPU Infrastructure Demand
Between October 2022 and October 2024, inquiries regarding GPUs and AI infrastructure nearly quadrupled year over year. Organizations are assessing various deployment models, including cloud-hosted, on-premises, and hybrid solutions. Some businesses opt for comprehensive “full-stack” AI platforms that bundle GPU, compute, storage, and networking resources, while others prefer to select and integrate components individually. AI workloads impose requirements that differ significantly from those of traditional data center tasks.
Networking Technologies for GPU Clusters
To connect GPUs, several interconnect technologies are available, including Ethernet, InfiniBand, and NVLink. Gartner clients often ask which technology is best for their GPU clusters. Each option presents distinct advantages depending on the use case. Note that these technologies can also be combined, allowing enterprises to scale beyond single-rack deployments.
Despite common misconceptions suggesting that only InfiniBand or proprietary interconnects can deliver optimal performance, Gartner recommends deploying Ethernet for GPU clusters, especially those scaling up to thousands of connections. Ethernet offers reliable performance backed by established enterprise support and a broad supplier ecosystem.
Optimizing Network Deployments for AI Workloads
The prevalent practice for general-purpose computing workloads is a leaf-spine network topology. However, this setup may not be ideal for AI workloads: co-locating AI tasks with traditional workloads on the same fabric can lead to performance degradation due to “noisy neighbor” effects, extending job completion times unnecessarily.
Networking switches are generally a minor cost within the broader AI infrastructure budget, often 15% or less of the total. Because the potential savings are small, reusing existing switches is a false economy that can leave AI workloads with suboptimal performance. Gartner instead suggests implementing dedicated physical switches for GPU connectivity and recommends minimizing the number of physical hops by exploring alternatives to the conventional leaf-spine topology. For GPU clusters under 500 GPUs, one or two switches are ideal, while clusters exceeding 500 GPUs should use a dedicated AI Ethernet fabric, potentially shifting from typical top-of-rack to middle-of-row or modular switching designs.
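To make that sizing guidance concrete, the sketch below maps a GPU count to a switching approach. The one-fabric-port-per-GPU assumption and the 64-port switch radix are our illustrative assumptions, not Gartner figures; real designs should follow the vendor’s validated topologies.

```python
# Rough GPU-fabric sizing sketch based on the guidance above.
# Assumptions (ours, for illustration): one fabric port per GPU and a
# hypothetical 64-port switch radix.

def plan_fabric(gpu_count: int, switch_radix: int = 64) -> str:
    """Suggest a switching approach for a GPU cluster of a given size."""
    if gpu_count <= switch_radix:
        return "single switch: every GPU is one hop from every other"
    if gpu_count < 500:
        return "one or two switches (e.g., middle-of-row) to keep hop count minimal"
    # Beyond ~500 GPUs, a dedicated AI Ethernet fabric is recommended;
    # a two-tier leaf-spine needs roughly this many leaves:
    leaves = -(-gpu_count // (switch_radix // 2))  # half the radix faces GPUs
    return f"dedicated AI fabric: ~{leaves} leaf switches plus a spine tier, or a modular chassis"

for n in (128, 480, 2048):
    print(n, "GPUs ->", plan_fabric(n))
```

The point of the arithmetic is simply that every tier of switching adds hops, so keeping small clusters on one or two switches avoids latency that a full leaf-spine would introduce.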
Enhancing Ethernet Infrastructure
For optimal GPU connectivity, Gartner advocates for the use of dedicated switches that meet certain specifications. These include:
- High-speed interfaces, particularly 400 Gbps access ports and above.
- Support for lossless Ethernet, incorporating advanced congestion-management mechanisms such as Data Center Quantized Congestion Notification (DCQCN); a simplified sketch of its rate-control loop follows this list.
- Capabilities for advanced traffic balancing, including congestion-aware load balancing.
- Remote Direct Memory Access (RDMA)-capable load balancing and packet spraying.
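To illustrate the DCQCN item above: switches mark packets with ECN as queues build, receivers echo those marks back as congestion notification packets (CNPs), and the sending NIC backs off multiplicatively and then recovers. The sketch below is a heavily simplified model of that sender-side loop; the constants and the reduced recovery logic are illustrative, not any vendor’s implementation (real DCQCN runs in the NIC and layers additive and hyper-increase phases on top).

```python
# Simplified DCQCN sender-side rate control, for illustration only.

class DcqcnSender:
    def __init__(self, line_rate_gbps: float = 400.0):
        self.current_rate = line_rate_gbps   # Rc: rate actually used
        self.target_rate = line_rate_gbps    # Rt: rate to recover toward
        self.alpha = 1.0                     # estimate of congestion severity
        self.g = 1 / 16                      # gain for alpha updates

    def on_cnp(self):
        """Receiver echoed an ECN mark via a Congestion Notification Packet."""
        self.target_rate = self.current_rate
        self.current_rate *= 1 - self.alpha / 2   # multiplicative decrease
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_period(self):
        """No CNP for a timer period: decay alpha, recover toward the target."""
        self.alpha *= 1 - self.g
        self.current_rate = (self.current_rate + self.target_rate) / 2

s = DcqcnSender()
s.on_cnp()
print(f"after CNP: {s.current_rate:.0f} Gbps")
for _ in range(5):
    s.on_quiet_period()
print(f"after recovery: {s.current_rate:.0f} Gbps")
```

The practical consequence for switch selection is that the fabric must both mark packets (ECN) and, via priority flow control, avoid drops entirely, since RDMA traffic tolerates loss poorly.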
Advanced Management for AI Networking
Proper software management is crucial for AI networking infrastructure. This includes implementing management tools capable of diagnosing and resolving issues efficiently. Essential features include high-resolution telemetry for troubleshooting and the ability to monitor real-time metrics such as bandwidth usage, packet loss, jitter, latency, and availability, ideally at sub-second intervals.
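As a sketch of what sub-second monitoring looks like in practice, the snippet below computes bandwidth and packet loss from counter deltas on a short polling interval. The `read_port_counters()` function is a hypothetical stand-in for a real telemetry source (such as gNMI streaming telemetry); here it synthesizes counters so the sketch runs on its own.

```python
# Sub-second fabric telemetry polling sketch.
# read_port_counters() is a hypothetical stand-in for a real telemetry
# source (e.g., gNMI streaming); it synthesizes counters for demo purposes.

import random
import time

_counters = {"rx_bytes": 0, "rx_drops": 0}

def read_port_counters(port: str) -> dict:
    """Stand-in: return cumulative byte and drop counters for a port."""
    _counters["rx_bytes"] += random.randint(10**7, 10**8)
    _counters["rx_drops"] += random.choice([0, 0, 0, 1])
    return dict(_counters)

def poll(port: str, interval_s: float = 0.25, samples: int = 4):
    """Derive bandwidth and loss from counter deltas each interval."""
    prev, prev_t = read_port_counters(port), time.monotonic()
    for _ in range(samples):
        time.sleep(interval_s)
        cur, now = read_port_counters(port), time.monotonic()
        dt = now - prev_t
        gbps = (cur["rx_bytes"] - prev["rx_bytes"]) * 8 / dt / 1e9
        drops = cur["rx_drops"] - prev["rx_drops"]
        print(f"{port}: {gbps:.2f} Gbps, {drops} drops over {dt:.2f}s")
        prev, prev_t = cur, now

poll("Ethernet1/1")
```

Note that traditional polling (such as SNMP at one- or five-minute intervals) is far too coarse to catch the transient congestion events that degrade AI job performance, which is why streaming telemetry at sub-second resolution matters.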
Support for Ultra Ethernet and Accelerators
When developing AI fabrics, leaders should consider equipment manufacturers that commit to supporting the Ultra Ethernet Consortium (UEC) and Ultra Accelerator Link (UAL) specifications. The UEC is working to establish industry standards for high-performance Ethernet suited to AI workloads. No standard had been finalized as of February 2025, but Gartner anticipates a proposal by the end of that year. This standardization aims to improve interoperability among suppliers and reduce reliance on proprietary technologies.
Additionally, the related UAL effort aims to standardize high-speed accelerator interconnects to meet bandwidth needs that exceed current Ethernet or InfiniBand capabilities.
Mitigating Implementation Risks
Given the demanding performance requirements of AI workloads, the connection between GPUs and network switches must be optimized and error-free. To address potential implementation challenges, Gartner advises adhering to validated, co-certified implementation guides offered jointly by network and GPU suppliers. Following co-certified designs minimizes the likelihood of technical issues and shortens recovery times when problems do arise.
This content draws from insights provided in the Gartner report, which details key networking practices to support AI workloads in data centers.