The National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign has launched its highly anticipated DeltaAI system.
DeltaAI is an advanced AI computing and data resource that serves as a companion system to NCSA’s Delta, a 338-node HPE Cray supercomputer installed in 2021. The new DeltaAI was funded by the National Science Foundation with nearly $30 million in awards and will be available to researchers across the country through the NSF ACCESS program and the National Artificial Intelligence Research Resource (NAIRR) pilot project.
The system will accelerate complex AI, machine learning, and HPC applications that churn through terabytes of data, using advanced AI hardware including Nvidia H100 Hopper GPUs and GH200 Grace Hopper superchips.
This week, HPCwire sat down with NCSA Director Bill Gropp at SC24 in Atlanta to get the story on the new DeltaAI system, which became fully operational last Friday.
Gropp says DeltaAI was inspired by the growing demand NCSA saw for GPUs when designing and deploying the original Delta system.
“The name Delta comes from the fact that we have seen these advances in computer architecture, particularly in GPUs and other interfaces. And some of the community has adopted them, but not the entire community, and we really think that’s an important direction for people to take,” Gropp told HPCwire.
“So we proposed Delta to the NSF and got their funding. I think it was the first, basically, almost all-GPU resource since Keeneland, a long, long time ago, and we expected it to be a mix of modeling and simulation, like molecular dynamics and fluid flows, and AI. But as we rolled out (Delta), AI took off and the demand got bigger and bigger.”
The original Delta system, with its Nvidia A100 GPUs and more modest amounts of GPU memory, was cutting edge for its time, Gropp says, but after the emergence and proliferation of large language models and other forms of generative AI, the situation has changed.
“We looked at what people needed and realized there was a huge demand for GPU resources for AI research, and that more GPU memory would be needed for these larger models,” he said.
Increasing GPU Power to Demystify AI
The new DeltaAI system delivers approximately twice the performance of the original Delta: double-precision (FP64) petaflops for tasks requiring high numerical precision, such as fluid dynamics or climate modeling, and a staggering 633 half-precision petaflops (FP16) optimized for machine learning and AI workloads.
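To make the FP64/FP16 distinction concrete, here is a minimal sketch, assuming PyTorch and a CUDA-capable GPU (this is illustrative code, not anything DeltaAI-specific): HPC kernels keep full double precision, while AI workloads route their matrix math through half-precision tensor cores, which is where headline FP16 petaflop figures come from.

```python
# Illustrative sketch only (not NCSA/DeltaAI code); assumes PyTorch
# and a CUDA-capable GPU.
import torch

# HPC-style kernel: keep full FP64 precision for numerical stability.
a = torch.randn(4096, 4096, dtype=torch.float64, device="cuda")
b = torch.randn(4096, 4096, dtype=torch.float64, device="cuda")
c_fp64 = a @ b  # runs on the GPU's much scarcer FP64 units

# AI-style kernel: autocast routes the matmul through FP16 tensor
# cores, the units behind headline half-precision petaflop figures.
x = torch.randn(4096, 4096, device="cuda")
w = torch.randn(4096, 4096, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.float16):
    y = x @ w  # computed in FP16

print(c_fp64.dtype, y.dtype)  # torch.float64 torch.float16
```

On Hopper-class GPUs, the FP16 tensor-core path can be an order of magnitude or more faster than FP64 for dense matrix math, which is why the two figures are quoted separately.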
This extraordinary computing capacity is driven by 320 NVIDIA Grace Hopper GPUs, each equipped with 96 GB of memory, for a total of 384 GB of GPU memory per node. The nodes are further supported by 14 PB of storage delivering up to 1 TB/s and are interconnected with a highly scalable Slingshot fabric.
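As a hypothetical illustration (again assuming PyTorch, and not any DeltaAI-specific tooling), a user landing on such a node could verify that layout directly:

```python
# Hypothetical sketch: inspect what a compute node actually exposes.
# On a DeltaAI-style node you would expect 4 GPUs at ~96 GB each,
# ~384 GB of GPU memory in total.
import torch

total_gb = 0.0
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    gb = props.total_memory / 1e9
    total_gb += gb
    print(f"GPU {i}: {props.name}, {gb:.0f} GB")
print(f"Total GPU memory on this node: {total_gb:.0f} GB")
```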
Gropp says the additional NSF funding for Delta and DeltaAI will allow them to deploy additional nodes with more than a terabyte of GPU memory per node, which will support AI research, particularly studies dedicated to understanding training and inference with LLMs. Gropp hopes that this aspect of DeltaAI’s research potential will be a boon for explainable AI, as these enormous memory resources allow researchers to manage larger models, process more data simultaneously, and conduct deeper explorations of the mechanics of AI systems.
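A rough back-of-envelope sketch shows why per-node GPU memory is the binding constraint for LLM work. The byte counts below are common rules of thumb for FP16 inference and mixed-precision training with Adam, not DeltaAI measurements:

```python
# Back-of-envelope illustration (author's own, not an NCSA benchmark):
# why large models need nodes with hundreds of GB of GPU memory.
def gpu_memory_gb(params_billions: float, training: bool = True) -> float:
    """Estimate GPU memory for an LLM, ignoring activations and overhead."""
    params = params_billions * 1e9
    if training:
        # Mixed-precision training commonly adds gradients plus Adam
        # optimizer states, often quoted as ~16 bytes per parameter.
        return params * 16 / 1e9
    return params * 2 / 1e9  # FP16 weights: 2 bytes per parameter

for size in (7, 70):
    print(f"{size}B inference: ~{gpu_memory_gb(size, training=False):.0f} GB, "
          f"training: ~{gpu_memory_gb(size, training=True):.0f} GB")
```

By this estimate, even a 70-billion-parameter model needs on the order of a terabyte of GPU memory to train without sharding across nodes, which is exactly the regime the expanded nodes target.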
“We’ve done a tremendous amount of research on explainable AI, trustworthy AI, and understanding how inference works,” says Gropp, pointing to the key questions driving this work: “Why do models work the way they do? How can you improve their quality and reliability?”
Understanding how AI models reach specific conclusions is crucial for identifying bias, ensuring fairness, and increasing accuracy, especially in high-stakes applications like healthcare and finance. Explainable AI emerged as a response to “black box” AI systems and models whose inner workings are not easily understood and that often lack transparency in how they process inputs to generate results.
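One of the simplest techniques in the explainable-AI toolbox is input-gradient saliency, sketched below on a toy PyTorch model. This is purely illustrative; research on systems like DeltaAI uses far more sophisticated methods:

```python
# Minimal sketch of input-gradient saliency, a classic explainability
# technique, on a toy model. Illustrative only.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
x = torch.randn(1, 8, requires_grad=True)

logits = model(x)
score = logits[0, logits.argmax()]  # the model's top-class score
score.backward()

# The gradient of the score w.r.t. the input is a crude measure of
# how much each input feature influenced the prediction.
saliency = x.grad.abs().squeeze()
print("Per-feature influence:", saliency)
```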
As AI adoption accelerates, the demand for explainability and accuracy increases in tandem, raising questions like “How can you reduce what is essentially interpolation error in these models so that people can count on what they get out of them?” said Gropp. “Seeing that demand is why we proposed this. I think that’s why the NSF funded it, and that’s why we’re so excited.”
Democratizing AI… and HPC?
DeltaAI will be made available to researchers nationwide through the NSF ACCESS program and the National Artificial Intelligence Research Resource (NAIRR) pilot initiative. This broad accessibility is designed to foster collaboration and extend the reach of DeltaAI’s advanced computing capabilities.
“We’re really excited to see more and more users take advantage of our cutting-edge GPUs, as well as the kind of support we can offer and the ability to work with other groups and share our resources,” Gropp said.
Gropp says the new system will play a dual role in advancing AI and more conventional computational science. While DeltaAI’s nodes are optimized for AI-specific workloads and tools, they are also accessible to HPC users; the system design makes it a versatile platform that serves both AI research and traditional HPC applications.
HPC workloads such as molecular dynamics, fluid mechanics, and structural mechanics will benefit significantly from the system’s advanced architecture, particularly its multi-GPU nodes and unified memory. These features address common HPC bottlenecks, such as limited memory bandwidth, improving performance for compute-intensive tasks.
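A STREAM-triad-style microbenchmark is the standard way to see the memory-bandwidth ceiling such workloads run into. A minimal sketch in PyTorch (an illustration, not an official benchmark) might look like this:

```python
# Illustrative STREAM-triad-style measurement of effective GPU memory
# bandwidth. Assumes PyTorch and a CUDA-capable GPU.
import time
import torch

n = 1 << 27  # ~134M doubles, ~1 GB per array
a = torch.empty(n, dtype=torch.float64, device="cuda")
b = torch.rand(n, dtype=torch.float64, device="cuda")
c = torch.rand(n, dtype=torch.float64, device="cuda")

torch.cuda.synchronize()
t0 = time.perf_counter()
iters = 20
for _ in range(iters):
    torch.add(b, c, alpha=3.0, out=a)  # triad: a = b + 3.0 * c
torch.cuda.synchronize()
elapsed = time.perf_counter() - t0

# The triad touches three 8-byte arrays per iteration.
gb_moved = 3 * n * 8 * iters / 1e9
print(f"Effective bandwidth: {gb_moved / elapsed:.0f} GB/s")
```

Because the triad does almost no arithmetic per byte moved, its throughput is set almost entirely by memory bandwidth, which is the limitation the unified-memory, multi-GPU node design aims to ease.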
DeltaAI is integrated with the original Delta system on the same Slingshot network and shared file system, representing a forward-thinking approach to infrastructure design. This interconnected configuration not only maximizes resource efficiency, but also lays the foundation for future scalability.
Gropp says plans are already in place to add new systems over the next two years, reflecting a move toward a continuous upgrade model rather than waiting until current hardware becomes obsolete. While this approach may present challenges in managing a more heterogeneous system, the benefits of staying at the forefront of innovation far outweigh the complexities.
This innovative approach to infrastructure design ensures that traditional computing workloads are maintained and seamlessly integrated with AI advancements, fostering a balanced and versatile research environment in a modern computing landscape so saturated with AI that it can lead to AI fatigue.
“The hype around AI can be exhausting,” notes Gropp. “We need to be careful, because what AI can do has tremendous value. But there are a lot of things it can’t do, and that I don’t think it will ever be able to do, at least with the technologies we have.”
DeltaAI exemplifies NCSA’s commitment to advancing both the frontiers of scientific understanding and the practical application of AI and HPC technologies. Scientific applications such as turbulence modeling benefit from the combination of HPC and AI.
“I think it’s an exciting example of what we really want to be able to do. Not only do we want to understand it and satisfy our curiosity, but we would also like to be able to take that knowledge and use it to make life better for humanity. Being able to do that translation is important,” Gropp said.