Nvidia today claimed a new record for data processing unit (DPU) performance that saw a pair of BlueField-2 accelerators achieve 41 million IOPS of storage performance. The claim bests Fungible’s own 10 million IOPS record from late last month — at least on paper.
Input/output operations per second (IOPS) is a measure of how quickly read and write commands can be issued, and it’s a key metric for storage performance alongside sheer bandwidth.
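The two metrics are directly related: throughput is simply IOPS multiplied by the I/O size. As a rough sketch (the helper function below is illustrative, not part of any vendor's tooling, and it ignores protocol overhead):

```python
def iops_to_gbps(iops: float, io_size_bytes: int) -> float:
    """Convert an IOPS figure at a given I/O size into throughput in Gb/s."""
    # bytes per second * 8 bits per byte / 1e9 bits per gigabit
    return iops * io_size_bytes * 8 / 1e9

# Nvidia's headline 41 million IOPS at a 512-byte I/O size works out to:
print(round(iops_to_gbps(41e6, 512), 1), "Gb/s")  # 167.9 Gb/s
```

Small I/O sizes stress a device's command-processing rate, while large ones stress raw bandwidth, which is why the I/O size matters so much when comparing records.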
With higher bandwidth interconnects and high-speed NVMe storage, data processing speed is rapidly becoming the bottleneck for storage servers, especially in high-performance computing (HPC) environments like artificial intelligence (AI) and machine learning, Kevin Deierling, SVP of networking at Nvidia, told SDxCentral.
High-performance storage is no longer limited by IOPS thanks to technologies like BlueField-2, he said. “What we think is important here is that we set the record straight that what we're limited by is the bandwidth.”
And Nvidia aims to address that particular challenge and raise the bar for IOPS again with its upcoming BlueField-3 DPU when it hits the market next year.
The Benchmark

Nvidia’s record was achieved using two Hewlett Packard Enterprise (HPE) ProLiant servers, each boasting dual 40-core Intel Xeon Ice Lake processors, 512 gigabytes of memory, an unspecified quantity of PCIe Gen 4.0 NVMe storage, and a pair of BlueField-2 DPUs.
Nvidia’s BlueField-2 effectively melds an Arm-based computer, complete with data-centric hardware accelerators, into a NIC form factor. “We have a computer in front of the computer with BlueField-2,” Deierling said.
The two servers, one acting as the initiator and the other the target, were connected to one another by four Nvidia LinkX 100 Gb/s direct-attach passive copper cables for a total of 400 Gb/s of aggregate bandwidth.
Finally, the benchmark was conducted on the Storage Performance Development Kit (SPDK) software platform using both 4-kilobyte and 512-byte input/output (I/O) sizes. Connectivity was achieved using the NVMe over fabrics (NVMe-oF) storage protocol running over either the transmission control protocol (TCP) or the oh-so-delightfully named remote direct memory access over converged Ethernet (RoCE) protocol.
TCP was used to achieve the 41 million IOPS performance record in a 100% read test using a 512-byte I/O size, a figure four times greater than that of any other DPU on the market, the company boasted. Meanwhile, Nvidia achieved 46 million IOPS in a similar test using the RoCE protocol.
Besting Fungible?

At first glance, Nvidia’s latest record appears to be a clear and decisive victory over rival DPU vendor Fungible’s recent 10 million IOPS record. However, the two tests aren’t directly comparable.
Fungible’s testing was conducted using a more powerful GIGABYTE server equipped with dual 64-core AMD EPYC processors, 2 terabytes of memory, and five of the vendor’s FC200 storage initiator cards. The testing methodology also differed: Nvidia’s test was conducted internally, while Fungible’s record was set in collaboration with the San Diego Supercomputer Center.
While both vendors' tests were conducted using NVMe-oF and TCP, Nvidia's headline 41 million IOPS claim was set using a smaller 512-byte I/O size, while Fungible's score was achieved using a larger 4-kilobyte I/O size.
When running the same tests using a 4-kilobyte I/O size, Nvidia reported its dual BlueField-2 system achieved between 5.6 million and 10.8 million IOPS in a 100% read test over TCP. The highest performance was achieved using SPDK, while the lowest was observed using Linux kernel 4.18.
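Converting the published 4-kilobyte figures to throughput helps explain Deierling’s point about bandwidth: the top results approach the 400 Gb/s of link capacity in Nvidia’s test rig. A back-of-the-envelope sketch (assuming 4 kilobytes means 4,096 bytes and ignoring protocol overhead):

```python
def iops_to_gbps(iops, io_size_bytes):
    # Throughput in Gb/s = IOPS x I/O size in bytes x 8 bits / 1e9
    return iops * io_size_bytes * 8 / 1e9

# Published 100%-read results at a 4-kilobyte (4,096-byte) I/O size
results = {
    "Nvidia BlueField-2, SPDK over TCP": 10.8e6,
    "Nvidia BlueField-2, Linux kernel 4.18": 5.6e6,
    "Fungible FC200 record": 10e6,
}
for label, iops in results.items():
    print(f"{label}: {iops_to_gbps(iops, 4096):.1f} Gb/s")
```

At roughly 354 Gb/s, Nvidia’s best 4-kilobyte result sits close to the rig’s 400 Gb/s of aggregate cabling, consistent with the claim that bandwidth, not IOPS, is the remaining limit.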
Correction: This story was updated to correct inaccurate information provided by Nvidia and to clarify the network protocol used in the vendor's testing. Nvidia’s 41 million IOPS score was achieved using two BlueField-2 DPUs operating over TCP.
Editor’s note: This story was updated with additional information from Fungible regarding its test parameters.