[Photo caption: Newly launched Chinese AI app DeepSeek has surged to number one in Apple’s App Store.]
DeepSeek spells the end of the dominance of Big Data and Big AI, not the end of Nvidia. Its focus on efficiency jump-starts the race for small AI models built on lean data and consuming modest computing resources. The probable impact of DeepSeek’s low-cost, freely available, state-of-the-art AI model will be to push U.S. Big Tech away from its exclusive reliance on a “bigger is better” strategy and to accelerate the proliferation of AI startups betting that “small is beautiful.”
Most of the coverage of DeepSeek and all of Wall Street’s reaction focused on its claim of developing an AI model that performs as well as leading U.S. models at a fraction of the training cost. Beyond being “compute-efficient” and using a relatively small model (derived from larger ones), however, DeepSeek’s approach is data-efficient.
DeepSeek engineers collected and curated a training dataset of “only” 800,000 examples (600,000 of them reasoning-related answers), demonstrating how to transform any large language model into a reasoning model. Anthropic’s Jack Clark called this “the most underhyped part of this [DeepSeek model] release.” A Hong Kong University of Science and Technology team then announced it had replicated the DeepSeek model with only 8,000 examples.
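To make the small-data idea concrete, here is a minimal sketch of supervised fine-tuning of an existing base model on a small, curated file of reasoning examples, written in Python with the Hugging Face transformers and datasets libraries. It illustrates the general recipe only; the model name, file name, and field names are hypothetical placeholders, not DeepSeek’s actual pipeline or data.

```python
# Minimal sketch (assumptions: a causal base model on the Hugging Face Hub and a
# small JSONL file of {"prompt": ..., "answer": ...} records). Not DeepSeek's code.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

BASE_MODEL = "my-org/base-llm"          # hypothetical base model checkpoint
DATA_FILE = "curated_reasoning.jsonl"   # hypothetical small curated dataset

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

def format_example(example):
    # Concatenate the prompt and the worked answer into one training text.
    return {"text": example["prompt"] + "\n" + example["answer"]}

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=2048)

dataset = (load_dataset("json", data_files=DATA_FILE, split="train")
           .map(format_example)
           .map(tokenize, remove_columns=["prompt", "answer", "text"]))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-small-data", num_train_epochs=2,
                           per_device_train_batch_size=4, learning_rate=1e-5),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The point of the sketch is how little machinery is involved: with a few hundred thousand (or, per the Hong Kong team, a few thousand) carefully chosen examples, the heavy lifting is in the curation, not the compute.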
There you have it: we are off to the races, specifically starting a new AI race—the Small Data competition.
The Turing Post, a newsletter reporting on AI developments, called DeepSeek “one of the most exciting examples of curiosity-driven research in AI… unlike many others racing to beat benchmarks, DeepSeek pivoted to addressing specific challenges, fostering innovation that extended beyond conventional metrics.”
In the paper describing their latest AI model, DeepSeek engineers highlight one of these specific challenges: “Can reasoning performance be further improved or convergence accelerated by incorporating a small amount of high-quality data as a cold start?” The “cold start” problem refers to a reinforcement learning program’s lack of “experience” in a new situation, where no prior data exists to show it examples of right or wrong actions. DeepSeek engineers describe the multiple stages they devised for generating, collecting, and fine-tuning on relevant data, culminating in: “For each prompt, we sample multiple responses and retain only the correct ones.” Human ingenuity, not data-cleaning automation, at work.
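The quoted filtering step, sampling several candidates per prompt and keeping only the verifiably correct ones, can be sketched in a few lines. This is an illustration of the general idea, not DeepSeek’s code; sample_responses and extract_final_answer are hypothetical stand-ins for a model-sampling call and an answer parser.

```python
# Sketch of a "keep only the correct responses" filter for building cold-start data.
# The two callables passed in are assumptions, standing in for a model sampler and
# an answer-extraction routine; they are not part of any published DeepSeek API.
from typing import Callable, Dict, List

def build_cold_start_data(
    prompts: List[Dict],                                  # each: {"prompt": str, "reference": str}
    sample_responses: Callable[[str, int], List[str]],    # hypothetical model sampler
    extract_final_answer: Callable[[str], str],           # hypothetical answer parser
    samples_per_prompt: int = 8,
) -> List[Dict]:
    """Return (prompt, response) pairs whose final answer matches the reference."""
    kept = []
    for item in prompts:
        for response in sample_responses(item["prompt"], samples_per_prompt):
            if extract_final_answer(response) == item["reference"]:
                kept.append({"prompt": item["prompt"], "response": response})
    return kept
```

Only tasks with automatically checkable answers, such as a numeric result or a passing unit test, lend themselves to this kind of filtering; the rest of the curation still comes down to human judgment.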
Why is this innovation the most underhyped part of the DeepSeek release? Why did the $6 million training cost grab all the headlines, and not the mere 800,000 examples that successfully retrained large language models? Because of what I would call the Moore’s Law addiction.
Two dominant old-time U.S. Big Tech companies have been responsible for feeding and promoting this addiction. In the 1950s, IBM coined the term “data processing” and became the most important computer company by stressing processing: selling speed of calculation, the superior “performance” of whatever its large mainframes did. Any time the mainframe choked (often because of the challenge of retrieving ever-expanding volumes of data from wherever they were stored), IBM told its customers to buy a bigger mainframe.
When the PC era arrived, Intel took over, promoting “Moore’s Law” and convincing enterprises (and later, consumers) that bigger and faster is better. This paradigm was so entrenched that even the new “digital-born” Silicon Valley startups (e.g., Google) adopted it as their “at scale” mantra. This brings us to today’s AI “scaling laws”: the conviction that only bigger models with more data, running on the latest and greatest processors, i.e., Nvidia chips, will get us to “AGI” as soon as 2026 or 2027 (per Anthropic’s Dario Amodei, completely ignoring DeepSeek’s data-efficiency and his colleague Jack Clark’s observations).
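For readers who want to see what a “scaling law” looks like, the commonly cited form comes from the Chinchilla study (Hoffmann et al., 2022), not from DeepSeek: it models a language model’s loss L as a function of parameter count N and training tokens D,

L(N, D) = E + A / N^α + B / D^β

where E, A, and B are fitted constants and the exponents α and β came out well below 1 (roughly 0.3 in that study). That is precisely why chasing lower loss this way demands ever larger models and ever more data, the habit DeepSeek’s data-efficiency calls into question.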
Nvidia was born when a new era of “data processing” started to emerge with an added, progressively stronger emphasis on data, as in “Big Data.” In 1993, Nvidia’s three cofounders identified the emerging market for specialized chips that would generate faster and more realistic graphics for video games. But they also believed that these graphics processing units could solve new challenges that general-purpose computer chips could not.
The new challenges mostly had to do with the storage, distribution and use of the rapidly growing quantities of data and the digitization of all types of information, whether in text, audio, images, or video. In 1986, 99.2% of all storage capacity in the world was analog, but in 2007, 94% of storage capacity was digital, a complete reversal of roles. The web drove this digitization and data explosion and the development of new data management software and algorithms specially designed to take advantage of Big Data. The “perfect storm” of Big Data, improved algorithms, and GPUs led to the re-branding of a machine learning pattern recognition methodology (artificial neural networks) as “deep learning” and later as “AI.”
Already last year, we saw some movement away from the “bigger is better” paradigm. In addition to questions by practitioners and observers about the possible limits of “scaling laws,” a number of startups presented credible attempts at doing what the big guys were doing but with smaller models and/or less data. Even Nvidia has been hedging its bets, going beyond the data center by pursuing edge computing and bringing its chips to developers’ desktops.
The attention paid to DeepSeek, for right and wrong reasons, will probably accelerate this trend towards “small is beautiful.” Here’s to the new paradigm, which may become a new addiction: smaller models, or even more elaborate ones, all using Small Data.