The ChatGPT Moment For Robotics Is Near
How breakthroughs in LLMs will soon transfer to robotics.
The software powering robots is undergoing a revolution, driven by lessons learned from Large Language Models (LLMs) like ChatGPT over the last few years. The current state of the art is very impressive, but what is even more impressive is that this level of performance was achieved with a model estimated to have been trained on roughly 3,000 times less compute than GPT-4, the revolutionary LLM launched in March 2023.
If you are not convinced that the dramatic scaling of these models will bring about the ChatGPT moment for robotics, here are a couple of examples from current research that might change your mind.
The model running the mobile robot shown above is called Pi-0. It's an open-source model that was announced in late October 2024. Based on the rough amount of compute required for a reimplementation, we can estimate that Pi-0 required approximately 3,000 times fewer GPU hours during training than GPT-4 (not counting the small VLM base model).
The video above is from Google DeepMind’s recent “Gemini Robotics” announcement earlier this month, which demonstrated capabilities similar to Pi-0.
Vision-Language-Action (VLA) models represent a new architecture in robotics that adapts the transformer technology used in LLMs for general robot control.
To understand VLA models, let's first look at how language models work:
Language models (LLMs) have one primary job: predict the next word in a sequence.
Instead of working with individual letters, LLMs use "tokens" - chunks of text that might be:
Part of a word
A complete word
Several common words grouped together
While humans think in words built from 26 letters, LLMs construct sentences using a vocabulary of roughly 100,000 unique tokens, each identified by a numeric ID.
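To make this concrete, here is a minimal sketch using the open-source tiktoken library (the tokenizer family used by GPT-4-era models); the example sentence is my own, and the exact token boundaries and IDs depend on which encoding you load, so treat the output as illustrative:

```python
# pip install tiktoken
import tiktoken

# Load the encoding used by GPT-4-era models; other models use other vocabularies.
enc = tiktoken.get_encoding("cl100k_base")

text = "Robots will soon understand language."
token_ids = enc.encode(text)

print(token_ids)                              # a list of integer IDs
print([enc.decode([t]) for t in token_ids])   # the text chunk behind each ID
print(enc.n_vocab)                            # vocabulary size, roughly 100,000
```

Notice that common words usually map to a single ID, while rarer words get split into several pieces.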
During training, Large Language Models (LLMs) learn to predict the next token in a sequence through a process of pattern recognition across massive datasets. As the model processes billions of text examples, it develops statistical associations between tokens by minimizing prediction errors. When an LLM encounters a sequence like "The capital of France is," it calculates probability distributions across its entire vocabulary, assigning higher likelihood to tokens that frequently followed similar contexts in its training data. Through repeated exposure and optimization, the model gradually refines these probability distributions, encoding complex linguistic patterns, factual relationships, and even rudimentary reasoning capabilities within its neural network weights. This training approach enables LLMs to generate contextually appropriate continuations without explicitly programmed rules.
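As a rough illustration of what "calculating a probability distribution over the vocabulary" looks like in practice, the sketch below queries a small open model (GPT-2 via Hugging Face transformers, my choice purely for convenience; any causal language model would do) for its top guesses after "The capital of France is":

```python
# pip install torch transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape: (batch, sequence, vocab)

# Probability distribution over the entire vocabulary for the *next* token.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)

top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode([int(token_id)])!r}: {prob:.3f}")
```

Training simply nudges the model's weights so that, across billions of such contexts, the probability assigned to the token that actually came next goes up.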
VLA models take this same basic transformer architecture and adapt it to help robots understand visual information and perform physical actions. The fundamental difference between VLA models and regular LLMs or VLMs (vision-language models trained on paired images and text) lies in the information encoded within each token.
An LLM's job is to answer one question: "Given the context, what is most likely to come next?" For an LLM, this context is simply a list of tokens. To us, these tokens look like words or word fragments, but the language model sees only a list of numbers. Although these numbers are indecipherable to humans, during training the LLM encodes complex statistical patterns between each unique token ID, capturing a web of meaning and associations that allows it to accurately predict what is likely to come next, just as we can guess the next word in a sentence by understanding the meaning underlying the words.
The breakthrough in developing Vision-Language-Action models comes from recognizing that tokens don't need to represent text; they can represent any data containing patterns. With clever encoding, an AI can be trained to meaningfully predict the next token in any contextual sequence. In VLA models, tokens encode a complex representation of the robot's current state rather than language: camera imagery, joint positions, sensor readings, and a natural language task description. VLA models are able to follow natural language commands partly because of their VLM backbone, which gives them a degree of semantic understanding useful for predicting these state tokens with a natural language task as context. This is why, when a VLA model is given a task like "Put the dishes in the bus tub and throw away the trash," it can readily understand what that means based on its background training on labeled images.
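To give a feel for how continuous robot quantities can become tokens at all, here is a toy sketch of the kind of discretization used in earlier VLA work such as RT-2. Pi-0's actual action representation differs (it uses a continuous action head rather than discrete action tokens), so the bin count, joint limits, and function names below are purely illustrative assumptions:

```python
import numpy as np

# Illustrative only: one common scheme for turning continuous joint angles
# into discrete "action tokens". This is NOT Pi-0's actual encoding.

NUM_BINS = 256                         # assumed discrete vocabulary per joint
JOINT_LOW, JOINT_HIGH = -np.pi, np.pi  # assumed joint limits in radians

def joints_to_tokens(joint_angles):
    """Map continuous joint angles (radians) to integer token IDs."""
    normalized = (np.asarray(joint_angles) - JOINT_LOW) / (JOINT_HIGH - JOINT_LOW)
    return np.clip((normalized * (NUM_BINS - 1)).round().astype(int), 0, NUM_BINS - 1)

def tokens_to_joints(tokens):
    """Inverse mapping: recover approximate joint angles from token IDs."""
    return np.asarray(tokens) / (NUM_BINS - 1) * (JOINT_HIGH - JOINT_LOW) + JOINT_LOW

state = [0.1, -1.2, 0.7, 2.0, -0.4, 1.5]   # six joint angles for a toy arm
tokens = joints_to_tokens(state)
print(tokens)                     # integer IDs the transformer can consume
print(tokens_to_joints(tokens))   # reconstructed angles, close to the originals
```

The same idea extends to images (via patch embeddings) and sensor readings: anything with structure can be mapped into the token space the transformer already knows how to predict over.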
Unlike pure text-based models, VLA models continuously integrate feedback from the physical world, creating a learning loop where predictions are verified against real outcomes. This real-world verification process allows VLA models to refine their understanding of how language instructions map to physical actions, ultimately producing robots that can adapt to new situations based on their accumulated experience—something impossible to achieve with traditional programming approaches.
What is the potential for VLA models?

As impressive as the current models are, they are achieving this performance with a tiny fraction of the data used to train comparable transformers. The visualization above shows the dramatic contrast between the data used to train Pi-0 (camera images plus robot joint positions) and the massive datasets used to train LLMs. You might assume that data recorded from real robots is far more limited than text scraped from the web, but I would argue that it is neither as limited as it seems nor lower quality; in fact, it is of significantly higher quality than internet text.
Firstly, the quality of data must always be judged in reference to the task you want to perform. If your task is to train an AI to be a near-perfect autocomplete engine, it makes sense to train it to find statistical patterns in the internet's text. However, if your goal is for a robot to perform useful tasks in the real world, you need to evaluate your model differently, because text prediction will not get you all the way to reasoning and agentic capabilities. Generally, the more closely your reward function depends on untampered, truthful feedback from the real world, the more effective your AI will be in the physical domain. Smaller LLMs have mastered next-word prediction with far less data than GPT-4, and they are often on par with, or even better than, the bigger models at producing text that sounds good. This is because when tested on tasks closely aligned with the LLM's reward function, such as generating the most probable next word and producing something that sounds plausible, the model naturally excels, since that is precisely what it was optimized to do.
Next word prediction doesn't equate to reasoning or creativity. LLMs operate deterministically, with any response variations stemming solely from deliberately introduced randomness. On their own, LLMs struggle with most productive tasks we'd hope they could handle—tasks that typically demand human-like reasoning and planning. Recent breakthroughs came when researchers implemented post-training techniques using reinforcement learning. This approach shifts from merely predicting text to rewarding and penalizing the model based on how well its output meets verifiable criteria, such as solving mathematical puzzles or writing functional code. By training base LLMs on such verifiable tasks, models can be fine-tuned to excel at these specific objectives.
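As a sketch of what a "verifiable criterion" can look like in practice, consider code generation: the reward is not a human opinion but whether the model's output actually passes tests. The toy task, test cases, and helper function below are invented for illustration and are not any particular lab's training setup:

```python
import os
import subprocess
import sys
import tempfile
import textwrap

def verifiable_reward(candidate_code: str) -> float:
    """Return 1.0 if the model-written code passes all tests, else 0.0."""
    tests = textwrap.dedent("""
        assert add(2, 3) == 5
        assert add(-1, 1) == 0
    """)
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n" + tests)
        path = f.name
    try:
        # Run the candidate plus its tests in a fresh interpreter.
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=5)
        return 1.0 if result.returncode == 0 else 0.0
    finally:
        os.unlink(path)

print(verifiable_reward("def add(a, b):\n    return a + b"))  # -> 1.0
print(verifiable_reward("def add(a, b):\n    return a - b"))  # -> 0.0
```

The signal is binary and objective, which is exactly what reinforcement learning needs to fine-tune a base model toward the task rather than toward plausible-sounding text.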
This focus on verifiable feedback brings us to the core advantage of robot training data. Robot interaction data represents a fundamental advantage over internet text because it's directly sourced from the domain where robots must ultimately perform. Every camera frame contains truthful, physics-abiding information—pixels that reflect unambiguous physical interactions rather than human interpretations. This distinction is critical: while text data is riddled with errors, biases, and abstracted representations that rarely approximate physical reality, visual data from robots captures ground truth. Text simply cannot encode the richness and precision of physical interactions that robots need to master.
The verifiability of robotic data creates another significant advantage. In physical environments, task completion can be verified through dedicated neural networks analyzing visual feedback, environmental markers like QR codes, or even through VLM self-correction. But simulation environments like NVIDIA Omniverse offer even more powerful verification by providing complete access to object positions, rotations, and velocities. Consider how a robotic hand trained in simulation can rapidly learn to manipulate objects it's never encountered before—something impossible to learn from text descriptions alone. This allows for extremely fast iteration cycles, enabling robots to learn generalizable skills that transfer to the physical world with minimal loss.
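Because a simulator exposes ground-truth state, task success can be reduced to a simple numeric check. The sketch below is hypothetical: it does not use any real Omniverse API, and the pose query, target position, and tolerance are assumptions standing in for whatever your simulator provides:

```python
import numpy as np

# Hypothetical verifier: "did the cube end up in the bin?" becomes a
# distance check once the simulator gives you exact object positions.

BIN_CENTER = np.array([0.45, -0.10, 0.02])  # assumed target position (meters)
TOLERANCE = 0.05                            # assumed 5 cm success radius

def task_succeeded(cube_position: np.ndarray) -> bool:
    """Return True if the cube's final position lies within the bin region."""
    return bool(np.linalg.norm(cube_position - BIN_CENTER) < TOLERANCE)

final_pose = np.array([0.47, -0.08, 0.03])  # pose read back from the simulator
print(task_succeeded(final_pose))            # -> True, so the episode is rewarded
```

Checks like this can be evaluated thousands of times per second across parallel simulated environments, which is where the fast iteration cycles come from.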
This highlights a crucial misalignment in current AI approaches: we're using LLMs optimized for text prediction to perform reasoning tasks that fundamentally don't align with their core objective. While you can achieve impressive-sounding text completion with relatively modest data, genuine reasoning requires something different. The remarkable improvements we've seen from reinforcement learning with LLMs demonstrate this point—they perform better not because of more data or compute, but because RL aligns the model with verifiable tasks rather than mere text prediction. This explains the severe diminishing returns we observe when scaling data and compute for LLMs on reasoning tasks, compared to the benefits of improved training alignment.
What makes VLA models revolutionary is how they outperform both traditional hard-coded robots and even those trained through pure reinforcement learning. Unlike conventional robots that require explicit programming for each task variation, or pure RL robots that lack semantic understanding, VLA models combine the physics-grounded learning from real-world interaction with the semantic understanding provided by their VLM backbone. This semantic understanding creates a crucial bridge between natural language instructions and physical actions, allowing robots to generalize across tasks and interpret human intentions in ways that purely hard-coded or RL-trained systems cannot. As both training and inference scaling continue, this hybrid approach has the potential to dramatically outperform traditional robotic systems precisely because the data is inherently higher quality and the semantic understanding allows for unprecedented flexibility in responding to human needs in the physical world.
Looking back at the image I shared earlier, consider this: If LLMs have advanced so dramatically since ChatGPT's release, imagine what will happen when companies apply those same scaling principles to VLA models. The robotics revolution that's coming will fundamentally transform many vital industries. As I continue to explore these capabilities by running Pi-0 on my own robot arm, I'm constantly reminded that we're witnessing the beginning of a profound shift in robotics. What's unfolding now with VLA models is a fundamental turning point, where robots understand our world, interpret our intentions, and seamlessly translate language into physical action. Although there are still very hard technical problems to be solved, the gap between science fiction and reality is closing faster than most realize, and the implications for how this will affect the way we live and work are only beginning to emerge.
What’s next?
I am currently experimenting with running Pi-0 inference on the open-source SO-100 robot arm. Because the project is fully open source, I will be able to fine-tune Pi-0 to perform specific tasks that I come up with. This should be awesome!

Sources
Physical Intelligence. "Our First Generalist Policy." Blog, October 30, 2024.
Official announcement of Pi-0, detailing its training on 10,000 hours of robot data, its Vision-Language-Action (VLA) architecture for general robot control, its performance with less data than comparable models, and its open-source availability for robotics research.

Google DeepMind. "Gemini Robotics." March 11, 2025.
Introduces Gemini Robotics, announced in March 2025, showcasing advanced robotic capabilities that integrate vision, language, and action, similar to Pi-0.

Explains the VLA model architecture, which adapts transformer technology from LLMs for robotic control, providing foundational context for Pi-0 and Gemini Robotics.

TheRobotStudio. "SO-ARM100: Standard Open Arm 100." GitHub repository.
Open-source documentation for the SO-ARM100 robot arm, including 3D printing files and guides, used for running Pi-0 inference and tele-operation experiments.

Provides estimated compute figures for GPT-4 (e.g., $78 million in compute), supporting the article's comparison of Pi-0's efficiency with roughly 3,000 times less compute.

Seeed Studio. "SO-ARM100 3D-Printed Robotic Arm Frame."
Commercial resource for the SO-ARM100, detailing its design for AI robotics projects, complementing the article's personal experimentation section.