- Cerebras reaches 969 tokens/second on Llama 3.1 405B, 75 times faster than AWS
- Claims the industry's lowest time to first token at 240ms, almost twice as fast as Google Vertex
- Cerebras Inference runs on the CS-3 with the WSE-3 AI processor
Cerebras Systems claims to have set a new benchmark for AI performance with Meta’s Llama 3.1 405B model, achieving an unprecedented generation speed of 969 tokens per second.
According to third-party benchmarking firm Artificial Analysis, this performance is up to 75 times faster than GPU-based offerings from leading hyperscalers: nearly six times faster than SambaNova at 164 tokens per second, more than 14 times faster than Google Vertex at 30 tokens per second, and far ahead of Azure at just 20 tokens per second and AWS at 13 tokens per second.
Additionally, the system demonstrated the world's fastest time to first token, at just 240 milliseconds, almost twice as fast as Google Vertex at 430 milliseconds and well ahead of AWS at 1,770 milliseconds.
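Throughput and time to first token combine into a rough end-to-end response time. As an illustrative back-of-the-envelope check using only the figures quoted above, and assuming a hypothetical 500-token response with a simple linear decode model, the gap compounds quickly:

```python
# Rough end-to-end latency: time to first token + remaining tokens / throughput.
# Figures are the benchmark numbers quoted in the article; the linear model
# and the 500-token response length are simplifying assumptions.
systems = {
    "Cerebras":      {"ttft_s": 0.240, "tok_per_s": 969},
    "Google Vertex": {"ttft_s": 0.430, "tok_per_s": 30},
    "AWS":           {"ttft_s": 1.770, "tok_per_s": 13},
}

output_tokens = 500  # illustrative response length

for name, s in systems.items():
    total_s = s["ttft_s"] + (output_tokens - 1) / s["tok_per_s"]
    print(f"{name:14s} ~{total_s:5.1f} s for {output_tokens} tokens")
```

On these assumptions, a 500-token answer finishes in under a second on Cerebras, versus roughly 17 seconds on Google Vertex and 40 seconds on AWS.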
Extending its lead
“Cerebras holds the world record for performance of Llama 3.1 8B and 70B, and with this announcement we extend our lead to Llama 3.1 405B – delivering 969 tokens per second,” noted Andrew Feldman, co-founder and CEO of Cerebras.
“By running the largest models at instantaneous speed, Cerebras enables real-time responses from the world’s first open-boundary model. This paves the way for powerful new use cases, including multi-agent reasoning and collaboration, in the AI landscape.”
The Cerebras Inference system, powered by the CS-3 supercomputer and its Wafer Scale Engine 3 (WSE-3), supports the model's full 128K-token context length at 16-bit precision. The WSE-3, billed as the “world’s fastest AI chip,” features 44 GB of on-chip SRAM, four trillion transistors, and 900,000 AI-optimized cores. It delivers peak AI performance of 125 petaflops and has 7,000 times the memory bandwidth of the Nvidia H100.
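Memory bandwidth is the relevant spec here because single-stream decoding of a dense model is typically memory-bound: every generated token requires streaming the model's weights. A minimal sketch of that arithmetic, assuming 16-bit weights, ignoring KV-cache traffic and batching, and taking the commonly cited ~3.3 TB/s HBM bandwidth for the H100 (the WSE-3 figure below is simply implied by the article's 7,000x claim), shows why a 405B model is so hard to serve quickly:

```python
# Naive memory-bound ceiling on single-stream decode speed: each token
# requires reading all weights once at 16-bit precision. KV cache,
# activations, and batching are ignored; bandwidth figures are assumptions
# (the WSE-3 value is derived from the article's "7,000x H100" claim).
PARAMS = 405e9
BYTES_PER_PARAM = 2                          # 16-bit precision, as quoted
bytes_per_token = PARAMS * BYTES_PER_PARAM   # ~810 GB read per token

for name, bw_bytes_per_s in [("H100 (one GPU)", 3.3e12),
                             ("WSE-3 (implied)", 7000 * 3.3e12)]:
    print(f"{name:16s} <= {bw_bytes_per_s / bytes_per_token:8.0f} tokens/s")
```

These are theoretical ceilings rather than measured numbers, but they illustrate why per-stream throughput from HBM-fed GPUs lands in the single digits or tens of tokens per second, while on-wafer SRAM leaves ample headroom above the measured 969.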
Ahmad Al-Dahle, VP of GenAI at Meta, also welcomed the latest results from Cerebras, saying: “Scaling inference is critical to accelerating AI and open source innovation. Thanks to the incredible work of the Cerebras team, Llama 3.1 405B is now the fastest frontier model in the world. Thanks to the power of Llama and our open approach, ultra-fast, affordable inference is now within reach of more developers than ever.”
Customer trials of the system are ongoing, with general availability planned for Q1 2025. Pricing starts at $6 per million input tokens and $12 per million output tokens.
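At those rates, per-request cost is simple to estimate. The sketch below assumes a hypothetical request shape (2,000 input tokens, 500 output tokens) purely for illustration:

```python
# Cost estimate at the announced Cerebras pricing for Llama 3.1 405B.
PRICE_IN = 6.00 / 1_000_000    # dollars per input token
PRICE_OUT = 12.00 / 1_000_000  # dollars per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of one request at the announced rates."""
    return input_tokens * PRICE_IN + output_tokens * PRICE_OUT

# Hypothetical request shape, for illustration only.
print(f"${request_cost(2_000, 500):.4f} per request")  # ~$0.018
```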