Jimmy Cerone

August 27, 2025

Link of the Day: There are no new ideas in AI — only new datasets

Yesterday I wrote about the applications of AI in robotics. Frankly, it was more theory than substance. Today I'm bringing a little bit more substance. First, this article tracks the progression of AI development better than I did, laying it out like so: 

  1. Deep neural networks: These first took off after the AlexNet model won the ImageNet image-recognition competition in 2012.
  2. Transformers + LLMs: In 2017, Google proposed the transformer in "Attention Is All You Need," which led to BERT (Google, 2018) and the original GPT (OpenAI, 2018); a minimal sketch of the attention operation at its core follows this list.
  3. RLHF: This was first proposed (to my knowledge) in the InstructGPT paper from OpenAI in 2022.
  4. Reasoning: In 2024, OpenAI released o1, which led to DeepSeek R1.
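
To make item 2 above more concrete, here is a minimal sketch of the scaled dot-product attention operation the transformer paper introduced. This is a toy single-head version in plain NumPy with illustrative shapes and random inputs, not the full multi-head architecture from the paper:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # how much each query "looks at" each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted mix of the value vectors

# Toy self-attention over 4 tokens with 8-dimensional embeddings (sizes chosen arbitrarily)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)         # self-attention: Q, K, V all come from x
print(out.shape)                                    # (4, 8)
```

In the real architecture, Q, K, and V are learned linear projections of the token embeddings and many such heads run in parallel, but this softmax-weighted mixing over the whole sequence is the core operation.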

Better yet, the author lays out a strong case for my "commoditization of LLMs" thesis, and unlike me, they have some theory to back it up. The theory is based on Rich Sutton's essay The Bitter Lesson, which says:

The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin... Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation... the human-knowledge approach tends to complicate methods in ways that make them less suited to taking advantage of general methods leveraging computation.

What's interesting is the implication: we will see mostly incremental improvements in LLMs rather than step changes. As noted in the article, we've mostly used up the text available for training:

Transformers unlocked training on “The Internet” and a race to download, categorize, and parse all the text on the web (which it seems we’ve mostly done by now).

Outside of the Reinforcement Learning from Human Feedback (RLHF) and reasoning breakthroughs of 2022 and 2024, we've mostly seen incremental improvements in our ability to run these systems, such as Google's reported 33x reduction in the energy used per inference. We need those engineering gains to scale and apply the technology, but they are not step changes, and they are not going to get us to AGI.

For that, this article argues, we need new datasets. My prediction is that we'll see a bunch of specialized AI applications built on proprietary data (Facebook is doing this with ad targeting, Waymo with cars). The new data will likely come from, and be collected for, robotics and 3D world mapping, as I argued yesterday. That said, the article does make a strong case for Google's data advantage, though you could argue they are already putting that data to good use, given their latest video model breakthrough.

A couple of links come to mind that I can't connect well here, but still feel relevant:

- Elon Dreams and Bitter Lessons
- Of Flying Cars and the Declining Rate of Profit