NVIDIA says its InfiniBand-influenced Spectrum-X Ethernet networking technology can increase storage fabric network bandwidth by almost 50%.
Spectrum-X is a combination of the Spectrum-4 Ethernet switch ASIC, which complements NVIDIA's InfiniBand products and supports RoCE v2 (RDMA over Converged Ethernet), and the BlueField-3 SuperNIC. NVIDIA's InfiniBand products feature adaptive routing, which sends data packets along the least congested network routes when the initially selected routes are busy or a link failure occurs. The Spectrum-4 SN5000 switch provides up to 51.2 Tbps of bandwidth across 64 x 800 Gbps Ethernet ports. There are RoCE extensions for adaptive routing and congestion control, and these operate with the BlueField-3 product.

Adaptively routed packets can arrive at their destination out of sequence, and NVIDIA's BlueField-3 product can reorder them correctly, "placing the data in order in host memory and keeping the adaptive routing transparent to the application."
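Conceptually, this kind of direct data placement works because each packet carries enough addressing information to land at the right spot in host memory regardless of arrival order. A minimal sketch of the idea in Python, where the `(offset, payload)` packet format and the host buffer are illustrative simplifications, not NVIDIA's actual RoCE wire format:

```python
# Sketch: out-of-order packets placed directly at their correct
# offsets in a host buffer, so arrival order does not matter.
# The (offset, payload) tuple is a stand-in for the addressing
# information a real RDMA packet carries.

def place_packets(packets, message_size):
    """Write each packet's payload at its offset in a host buffer."""
    host_buffer = bytearray(message_size)
    for offset, payload in packets:          # any arrival order works
        host_buffer[offset:offset + len(payload)] = payload
    return bytes(host_buffer)

# Packets arriving out of sequence over different routes:
out_of_order = [(4, b"5678"), (0, b"1234"), (8, b"9abc")]
assert place_packets(out_of_order, 12) == b"123456789abc"
```

Because each payload is written to its own offset, the receiving application sees a complete, in-order buffer even though the network delivered the pieces out of sequence.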
An NVIDIA blog explains that, because Spectrum-X adaptive routing can alleviate flow collisions and increase effective bandwidth, effective storage performance is much higher than with RoCE v2, "the Ethernet networking protocol used by a majority of AI compute and storage fabrics."
The blog discusses checkpointing during large language model (LLM) training, which can take days, weeks, or even months. The job state is saved periodically so that if the training run fails for any reason, it can be restarted from a recorded checkpoint instead of from the beginning. It says: "With model sizes in the billions and even trillions of parameters, these checkpoint states become large enough – up to several terabytes of data for today's largest LLMs – that saving or restoring them generates 'elephant flows' … which can overwhelm switch buffers and links."
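The save-and-resume pattern described above can be sketched in a few lines of Python. This is a generic illustration of periodic checkpointing, not NVIDIA's or any framework's actual implementation; the file layout, interval, and training-state dict are all assumed for the example:

```python
# Sketch: periodic checkpointing so a failed training run can
# resume from the latest saved state instead of starting over.
import os
import pickle
import tempfile

CHECKPOINT_EVERY = 100  # steps between checkpoints (assumed value)

def save_checkpoint(path, step, state):
    with open(path, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)

def load_checkpoint(path):
    if not os.path.exists(path):
        return 0, {}                       # no checkpoint: fresh start
    with open(path, "rb") as f:
        ckpt = pickle.load(f)
    return ckpt["step"], ckpt["state"]

def train(total_steps, ckpt_path):
    step, state = load_checkpoint(ckpt_path)   # resume if possible
    while step < total_steps:
        state["loss"] = 1.0 / (step + 1)       # stand-in for real work
        step += 1
        if step % CHECKPOINT_EVERY == 0:
            save_checkpoint(ckpt_path, step, state)
    return step, state
```

In a multi-terabyte LLM run, each `save_checkpoint` call is what generates the burst of storage traffic (the "elephant flow") the blog describes.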
This assumes that the checkpoint data is sent over a network to shared storage, an array for example, and not to local storage in the GPU servers, a technique used in Microsoft's LLM training.
NVIDIA also notes that these network traffic spikes can occur in LLM inference operations when RAG (retrieval-augmented generation) data is sent to the LLM from a networked storage source holding the RAG data in a vector database. It explains that "vector databases can have many dimensions and can be quite large, especially in the case of knowledge bases made up of images and videos."
RAG data must be sent to the LLM with minimal latency, and this becomes even more important when the LLM runs in "multi-tenant generative AI factories, where the number of queries per second is massive."
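The retrieval step at the heart of RAG is a nearest-neighbor search over embedding vectors. A minimal in-memory sketch in Python, bearing in mind that production systems run this against a networked vector database and that the documents and embeddings here are invented for illustration:

```python
# Sketch: the retrieval step of RAG as a cosine-similarity
# nearest-neighbor search over a tiny in-memory vector store.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, store, top_k=2):
    """Return the top_k documents most similar to the query vector."""
    ranked = sorted(store,
                    key=lambda d: cosine_similarity(query_vec, d["vec"]),
                    reverse=True)
    return [d["text"] for d in ranked[:top_k]]

# Toy two-dimensional embeddings; real ones have hundreds or
# thousands of dimensions, which is why the stores get so large.
store = [
    {"text": "doc A", "vec": [1.0, 0.0]},
    {"text": "doc B", "vec": [0.0, 1.0]},
    {"text": "doc C", "vec": [0.9, 0.1]},
]
assert retrieve([1.0, 0.0], store, top_k=2) == ["doc A", "doc C"]
```

Every inference request triggers a search like this, which is why a large multi-tenant deployment pushes heavy, latency-sensitive read traffic across the storage network.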
NVIDIA says it tested these Spectrum-X features on its Israel-1 AI supercomputer. The test process measured the read and write bandwidth generated by NVIDIA HGX H100 GPU server clients accessing storage, first with the network configured as a standard RoCE v2 fabric, then with Spectrum-X adaptive routing and congestion control enabled.


The tests were performed with different numbers of GPU servers as clients, ranging from 40 to 800 GPUs. In all cases, Spectrum-X performed better, with read bandwidth improving by 20 to 48 percent and write bandwidth by 9 to 41 percent.
NVIDIA says Spectrum-X works well with its other offerings to accelerate storage-to-GPU data access:
- Air: a cloud-based network simulation tool for modeling switches, SuperNICs, and storage.
- Cumulus Linux: a network operating system built around automation and APIs, "ensuring operations and management at scale."
- DOCA: an SDK for SuperNICs and DPUs, offering programmability and performance for storage, security, and more.
- NetQ: a network validation toolset that integrates with switch telemetry.
- GPUDirect Storage: a direct data path between storage and GPU memory, which makes data transfers more efficient.
We can expect NVIDIA partners such as DDN, Dell, HPE, Lenovo, VAST Data, and WEKA to support these Spectrum-X features. Indeed, DDN, VAST Data, and WEKA have already done so.