Ever since Nvidia agreed to acquire Mellanox for $6.9 billion in 2019, the company has had a strong focus on InfiniBand, which was a core element of the Mellanox portfolio. In the generative artificial intelligence (AI) era, InfiniBand has become increasingly relevant as a high-throughput, low-latency fabric.
Despite InfiniBand's strengths, Ethernet remains the default for most data centers, and that reality isn't likely to change. To serve the AI networking needs of modern data centers, Nvidia is pushing its Spectrum-X Ethernet platform as a path forward.
In a keynote ahead of the Computex silicon show in Taiwan, Nvidia CEO and founder Jensen Huang detailed what Spectrum-X is all about and what it will enable.
“We have two types of networking, we have InfiniBand, which has been used in supercomputing and AI factories all over the world and it is growing incredibly fast for us,” Huang said. “However, not every data center can handle InfiniBand, because they've already invested their ecosystem in Ethernet for too long, so what we've done is we've brought the capabilities of InfiniBand to the Ethernet architecture, which is incredibly hard.”
What's wrong with Ethernet for AI?
During his keynote, Huang explained what he sees as the shortcomings of regular Ethernet for AI use cases.
“Ethernet was designed for high average throughput,” he said.
Modern Ethernet is fundamentally designed for multi-node communication across a data center or across the internet. In contrast, he noted, in deep learning and AI use cases GPUs are not communicating with people on the internet; they are mostly communicating with each other.
That type of use case has different demands, which Spectrum-X is designed to address.
According to Nvidia, Spectrum-X is the world's first Ethernet fabric purpose-built for AI workloads, accelerating generative AI network performance by 1.6 times over traditional Ethernet fabrics. The platform features the NVIDIA Spectrum-X Ethernet switch and the NVIDIA BlueField-3 SuperNIC.
The tech inside Spectrum-X
The Spectrum-X architecture addresses a few core issues that Huang said are typical shortcomings of traditional Ethernet fabrics.
Huang said that Nvidia's Spectrum-X integrates advanced network-level remote direct memory access (RDMA), which delivers significant performance gains. Nvidia is also integrating enhanced congestion controls on the Spectrum-X switches.
“The switch does telemetry at all times incredibly fast,” Huang said. “Whenever the GPUs or the NICs are sending too much information, we can tell them to back off so that it doesn't create hotspots.”
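As a rough illustration of the back-off behavior Huang describes, the Python sketch below models a single switch port whose telemetry compares the offered load against the port's capacity and tells every sender to slow down when the port is oversubscribed. It is a toy model under stated assumptions, not Nvidia's algorithm: the `Sender` class, the `adjust_rates` and `simulate` functions, the 0.5 back-off factor and the 10 Gb/s recovery step are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Sender:
    """A GPU or NIC pushing traffic toward one switch port (hypothetical model)."""
    name: str
    rate_gbps: float


def adjust_rates(senders, port_capacity_gbps, backoff=0.5, step_gbps=10.0):
    """Run one telemetry round for a single port.

    If the combined sending rate exceeds the port capacity, every sender is
    told to multiply its rate by `backoff`. Otherwise each sender may raise
    its rate by `step_gbps`. Returns True if the port was congested.
    """
    offered = sum(s.rate_gbps for s in senders)
    congested = offered > port_capacity_gbps
    for s in senders:
        if congested:
            s.rate_gbps *= backoff   # back off to avoid a hotspot
        else:
            s.rate_gbps += step_gbps  # probe for spare capacity
    return congested


def simulate(rates, capacity_gbps, rounds):
    """Return a list of (congested, rates_after_round) for each telemetry round."""
    senders = [Sender(f"gpu{i}", r) for i, r in enumerate(rates)]
    history = []
    for _ in range(rounds):
        congested = adjust_rates(senders, capacity_gbps)
        history.append((congested, [s.rate_gbps for s in senders]))
    return history


if __name__ == "__main__":
    for congested, rates in simulate([300.0, 300.0, 300.0], 400.0, 3):
        print("congested" if congested else "ok", rates)
```

In this sketch three senders starting at 300 Gb/s each are throttled for two rounds until their combined rate fits within the 400 Gb/s port, after which they begin ramping up again. A real fabric would act on much richer telemetry than a single aggregate rate.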
Adaptive routing is another update to Ethernet that Nvidia is integrating. Huang noted that Ethernet needs to transmit and receive data packets in order, which can lead to congestion. With the adaptive routing approach, whenever Spectrum-X sees congestion or under-utilized ports, data is sent to the available ports irrespective of ordering, and the BlueField NIC puts it back in order. Huang also highlighted noise isolation as another core enhancement to Ethernet that is built into Spectrum-X.
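The adaptive routing idea can be sketched in a few lines of Python: the switch side sends each packet to whichever port currently has the shortest queue, and the receiving side holds out-of-order packets until the missing sequence numbers arrive. This is a minimal sketch under stated assumptions, not a description of how Spectrum-X or BlueField actually implement it: `spray_packets`, `ReorderBuffer` and `deliver_in_order` are hypothetical names, and queue depth stands in for whatever congestion signal the real hardware uses.

```python
import heapq


def spray_packets(num_packets, port_queue_depths):
    """Assign each packet to the least-loaded port, ignoring packet order.

    Returns the chosen port index for each sequence number. Ties go to the
    lowest-numbered port.
    """
    depths = list(port_queue_depths)
    chosen = []
    for _ in range(num_packets):
        port = min(range(len(depths)), key=depths.__getitem__)
        chosen.append(port)
        depths[port] += 1  # the packet now occupies a slot in that queue
    return chosen


class ReorderBuffer:
    """Receiver-side buffer that releases packets strictly in sequence order."""

    def __init__(self):
        self.next_seq = 0
        self.pending = []  # min-heap of sequence numbers that arrived early

    def receive(self, seq):
        """Accept one packet and return the packets that can now be delivered."""
        heapq.heappush(self.pending, seq)
        delivered = []
        while self.pending and self.pending[0] == self.next_seq:
            delivered.append(heapq.heappop(self.pending))
            self.next_seq += 1
        return delivered


def deliver_in_order(arrivals):
    """Feed an arbitrary arrival order through the buffer and return the output."""
    buf = ReorderBuffer()
    out = []
    for seq in arrivals:
        out.extend(buf.receive(seq))
    return out


if __name__ == "__main__":
    print(spray_packets(6, [3, 0, 1]))
    print(deliver_in_order([2, 0, 1, 4, 3]))
```

Because packets that belong to one flow can now travel over different ports, they may reach the destination out of sequence, which is why the reordering step at the NIC is essential to the scheme.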
Ethernet to support millions of GPUs
During his keynote, Huang talked about the product direction for Spectrum-X.
The current pipeline of products includes the Spectrum-X800, which has 51.2 terabits per second of switching capacity. Looking forward, he said the Spectrum-X800 Ultra and X1600 are next in line, boosting capacity even further.
Huang said that the X800 is designed to support tens of thousands of GPUs, the X800 Ultra is designed for hundreds of thousands of GPUs and the X1600 is designed for millions of GPUs.
“The amount of generation we're going to do in the future is going to be extraordinary,” Huang said.