Join our daily and weekly newsletters for the latest updates and exclusive content covering cutting-edge AI. Learn more
The team of AI researchers known as Nous Research is currently doing something unique in the rapidly evolving field of generative AI (at least to my knowledge): It is pre-training a new 15-billion-parameter large language model (LLM) using machines distributed across the internet and around the world, avoiding the need to concentrate model development in expensive, power-hungry AI data centers and “superclusters” of graphics processing units (GPUs) like the one recently completed by Elon Musk’s xAI in Memphis, Tennessee.
Additionally, Nous is live streaming the pre-training process on a dedicated website, distro.nousresearch.com, showing the model’s performance on evaluation benchmarks as the run progresses, along with a simple map of the various locations of the training hardware participating in the exercise, including several sites across the United States and Europe.
At the time of publishing this article, there were approximately 57 hours (2.3 days) of pre-training remaining with over 75% of the process completed.
Pre-training is the first and arguably more fundamental of the two stages of training an LLM, as it involves training the model on a large corpus of text data to learn its statistical properties and structures. The model processes large textual data sets, capturing patterns, grammar and contextual relationships between words. This step gives the model a broad understanding of language, allowing it to generate coherent text and perform various language-related tasks.
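For readers who want to see what that objective looks like in practice, here is a minimal, purely illustrative sketch of a next-token-prediction training step in PyTorch. It is a toy stand-in, not Nous Research’s training code: the “model” is a trivial embedding plus linear layer, and random token IDs stand in for a real corpus.

```python
# Toy illustration of the pre-training objective (next-token prediction).
# Not Nous Research's code: tiny model, random token IDs instead of real text.
import torch
import torch.nn as nn

vocab_size, dim, seq_len, batch = 1000, 64, 32, 8

model = nn.Sequential(
    nn.Embedding(vocab_size, dim),   # token IDs -> vectors
    nn.Linear(dim, vocab_size),      # vectors -> next-token logits
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    # Stand-in for a batch sampled from a large text corpus.
    tokens = torch.randint(0, vocab_size, (batch, seq_len + 1))
    inputs, targets = tokens[:, :-1], tokens[:, 1:]

    logits = model(inputs)  # (batch, seq_len, vocab_size)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))

    opt.zero_grad()
    loss.backward()   # gradients of the next-token loss
    opt.step()        # update parameters; repeat over the whole corpus
```

In real pre-training, this loop runs over trillions of tokens across many GPUs, which is exactly where the communication costs discussed below come from.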
After pre-training, the model is fine-tuned on a more specific dataset tailored to particular tasks or domains.
If successful, Nous Research will show that it is possible to train frontier-class LLMs without expensive superclusters or low-latency connections, using a new open-source training method. This could usher in a new era of distributed AI training as a major, even potentially dominant, source of new AI models, and shift the balance of power in generative AI away from large, well-funded tech companies and toward smaller groups and non-corporate actors.
Nous DisTrO: the technology behind the training run
Nous Research, which made headlines earlier this year with the release of Hermes 3, its permissive and existentially conflicted variant of Meta’s Llama 3.1, and with its overall mission to make AI development personalized and unrestricted, is using its open-source distributed training technology, Nous DisTrO (Distributed Training Over-the-Internet), which it first described in a research paper in August 2024.
According to the recent Nous Research publication, DisTrO reduces inter-GPU communication bandwidth requirements by up to 10,000 times during pre-training. This innovation allows models to be trained on slower, more affordable Internet connections (potentially as low as 100 Mbps download and 10 Mbps upload) while maintaining competitive convergence rates and loss curves.
The main advancement of DisTrO lies in its ability to efficiently compress data exchanged between GPUs without sacrificing model performance.
As described in an August 2024 VentureBeat article, the method reduced communication requirements from 74.4 GB to just 86.8 MB in a test using a Llama 2 architecture, an efficiency gain of nearly 857 times. This dramatic improvement paves the way for a new era of decentralized and collaborative AI research.
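A quick back-of-the-envelope check of those figures (assuming decimal gigabytes and megabytes and an idealized link, so the exact numbers may differ slightly from the original report) shows both the compression ratio and why a 100 Mbps consumer connection becomes workable:

```python
# Sanity check of the figures quoted above (decimal units assumed).
baseline_bytes = 74.4e9   # 74.4 GB per exchange with conventional data-parallel training
distro_bytes   = 86.8e6   # 86.8 MB per exchange with DisTrO

print(f"compression ratio: ~{baseline_bytes / distro_bytes:.0f}x")   # ~857x

# Time to move each payload over an ideal 100 Mbps downlink (12.5 MB/s).
link_bytes_per_s = 100e6 / 8
print(f"baseline: ~{baseline_bytes / link_bytes_per_s / 3600:.1f} hours")  # ~1.7 hours
print(f"DisTrO:   ~{distro_bytes / link_bytes_per_s:.0f} seconds")         # ~7 seconds
```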
DisTrO builds on previous work on Decoupled Momentum Optimization (DeMo), an algorithm designed to reduce inter-GPU communication by orders of magnitude while maintaining training performance comparable to traditional methods.
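At a high level, the idea is that each GPU keeps its full optimizer state locally and exchanges only a heavily compressed slice of the update with its peers. The sketch below illustrates that communication pattern only: the top-k sparsification here is a generic stand-in for DeMo’s actual momentum-decomposition scheme, and the helper function is hypothetical, not part of the Nous codebase.

```python
# Illustrative sketch of compressed gradient synchronization, NOT Nous Research's
# implementation. Top-k sparsification stands in for DeMo's actual scheme; the
# function name is hypothetical. Requires an initialized torch.distributed group.
import torch
import torch.distributed as dist

def compressed_sync(grad: torch.Tensor, k_frac: float = 0.001) -> torch.Tensor:
    """Share only each worker's k largest-magnitude gradient entries."""
    flat = grad.flatten()
    k = max(1, int(flat.numel() * k_frac))
    _, idx = flat.abs().topk(k)          # positions of the largest entries
    vals = flat[idx]

    world = dist.get_world_size()
    all_vals = [torch.empty_like(vals) for _ in range(world)]
    all_idx = [torch.empty_like(idx) for _ in range(world)]
    dist.all_gather(all_vals, vals)      # only k values per worker cross the network
    dist.all_gather(all_idx, idx)

    merged = torch.zeros_like(flat)
    for v, i in zip(all_vals, all_idx):  # reassemble a dense gradient locally
        merged.index_add_(0, i, v)
    return (merged / world).view_as(grad)

# Inside each worker's training step (optimizer state and residuals stay local):
# for p in model.parameters():
#     p.grad = compressed_sync(p.grad)
# optimizer.step()
```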
Both the DeMo algorithm and the DisTrO stack are part of Nous Research’s ongoing mission to decentralize AI capabilities and bring advanced AI development to a broader audience.
The team also made the DeMo algorithm available as open source code on GitHub, inviting researchers and developers around the world to experiment and build on their findings.
Hardware partners
Pre-training Nous Research’s 15-billion-parameter language model involved hardware contributions from several notable partners, including Oracle, Lambda Labs, Northern Data Group, Crusoe Cloud, and the Andromeda Cluster.
Together, they provided the heterogeneous hardware needed to test DisTrO’s capabilities in a real-world distributed environment.
Profound implications for future development of AI models
The implications of DisTrO extend beyond technical innovation. By reducing reliance on centralized data centers and specialized infrastructure, DisTrO paves the way for a more inclusive and collaborative AI research ecosystem.
Small institutions, independent researchers, and even hobbyists with access to consumer internet and GPUs can potentially train large models, a feat previously reserved for companies with significant capital and expertise.
Diederik P. Kingma, co-author of the research paper and co-inventor of the Adam optimizer, joined Nous Research as a collaborator on the development of DeMo and DisTrO. Kingma’s contributions, alongside those of Nous Research co-founders Bowen Peng and Jeffrey Quesnelle, lend credibility to the project and signal its potential impact on the broader AI community.
Next steps
Nous Research has opened the door to a future in which AI development is no longer dominated by a handful of companies. Its work on DisTrO demonstrates that, with the right optimizations, large-scale AI models can be trained efficiently in a decentralized manner.
Although the current demonstration uses cutting-edge GPUs like the Nvidia H100, the scalability of DisTrO to less specialized hardware remains an area to explore further.
As Nous Research continues to refine its methods, the potential applications of this technology, ranging from decentralized federated learning to training diffusion models for image generation, could redefine the boundaries of AI innovation.