Genie 3
DeepMind's new model is mind-blowing, but the videos hint at remaining challenges.
DeepMind recently debuted Genie 3, a video-generation world model, to well-deserved enthusiasm. In less than two years, the model has made an incredible leap in capabilities: the Genie team has scaled from blurry five-second snippets of 2D platformers to immersive, minute-long, high-res interactions in self-consistent worlds. The model even runs in real time (no word on system requirements, though).
To show off the model, they released impressive demos: dragon flying, jet skiing, and even an eerie world within a world. But demos like these are cherry-picked to highlight a model's strengths. I wanted to note a few things I didn't see in the Genie 3 demos, not as a nitpick, but to flag open research directions in world modelling for embodied AI.
Dynamic Environments
Most environments shown in the videos are relatively static. That isn't to say there's no physics: we see water splashes, dust clouds, rigid-body physics and mirrored reflections. But Genie is likely trained mostly on games, and those are game physics. When the model strays away from game-like interactions, things can seem artificial; see the jet ski below. That's corroborated by Tejas Kulkarni on Twitter: behind the curtain, physics simulation is still limited.
Special thanks to @GoogleDeepMind for inviting me to try out Genie 3. I'm excited to share my thoughts on this early research prototype and also some of my live recordings below: I spent the whole day playing with the system and when it works, it is truly mind blowing 🤯. It is… pic.twitter.com/JPW5sPEeF5
— Tejas Kulkarni (@tejasdkulkarni) August 5, 2025
As long as games provide the majority of the pretraining corpus, it's plausible that physics will remain a challenge; each new interaction takes work to implement, so devs naturally leave most of a game world static. Plus, there's plenty that game engines can't render in real time, like cloth-on-cloth interactions, feathers and fur (more on fur rendering here). The static feel is reinforced by a lack of other agents interacting with the environment (unless the user triggers a world event). It's possible that the model struggles to stay coherent with too many dynamic entities in the environment, or that it can't model multi-agent interactions.
Interaction
For the most part, the demos only demonstrate navigation. Agents traverse a world and reveal their surroundings, but they don't do much. That could be due to a limited action space; the agent has access to a single "Act" button that needs to work in all environments. But it's also unclear how to define a richer action space. There's an almost fractal complexity here: should the agent press "play piano", specify individual notes, or puppeteer each finger? Because of the games in the corpus, it's unsurprising that the space is limited to high-level motions (e.g. "press F to pay respects"). And it's not obvious to me how you would generate lower-level action conditioning without real-world interactions, more along the lines of the work being done on robotics foundation models.
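To make that fractal complexity concrete, here's a minimal sketch of the same piano interaction expressed at three levels of granularity. Every name here is hypothetical; this is not Genie 3's actual interface, just an illustration of how the action space fans out as control gets finer:

```python
from dataclasses import dataclass
from typing import Sequence

# Hypothetical action representations at three granularities.

@dataclass
class HighLevelAction:
    """Game-style verb: one button, the scene context decides its meaning."""
    verb: str                      # e.g. "interact" -- the single "Act" button

@dataclass
class MidLevelAction:
    """Task-level command: names the object and the intent."""
    verb: str                      # e.g. "play_piano"
    target: str                    # e.g. "grand_piano_01"
    params: Sequence[str]          # e.g. the notes to play

@dataclass
class LowLevelAction:
    """Robotics-style control: continuous joint targets per timestep."""
    joint_angles: Sequence[float]  # one value per actuated finger joint
    duration_s: float              # how long to hold the pose

# The same user intent at each level:
press_act = HighLevelAction(verb="interact")
play_chord = MidLevelAction(verb="play_piano", target="grand_piano_01",
                            params=("C4", "E4", "G4"))
curl_finger = LowLevelAction(joint_angles=[0.1, 0.4, 0.7], duration_s=0.2)
```

Each step down the hierarchy multiplies the number of valid actions the model must handle, and the lowest level looks much more like robot trajectory data than game footage.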
To be clear, this is a hard problem. Unbounded interactivity means virtually infinite world states that the model has to be able to handle. The bottleneck isn't just physics, it's also real-world knowledge. One demo shows an agent in an industrial bakery. To simulate arbitrary interactions, the model would need to know everything from bread baking and operating industrial mixers to how fire propagates in a floury environment (CC Samuel Pepys). DeepMind's game-playing agent SIMA can walk to a hose, but can't take it off the wall and water the garden. It'll be interesting to see whether future Genie models are capable of modelling deep interaction via self-supervised learning, or whether real-world interaction is a necessary component of the training process.
Action Consistency and Steerability
At different steps in a Genie 3 world, the same action can play out subtly differently. And sometimes no action is required at all. Take these examples:
I think the cause is a conflict between visual consistency and steerability. The critter jumping over a gap is likely consistent with footage in the training data, since most players successfully make the jump. By forcing a jump, Genie keeps the videogen safely in distribution. But railroading the agent impedes steerability: if the agent doesn't jump, the critter should fall. A lack of fidelity to actions could be problematic for agent learning within world models; if the world adjusts itself to paper over mistakes, agents have no incentive to avoid them. That might be OK in training, but useful agents will need to transfer to the less-forgiving real world.
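One way to see this tension concretely: in diffusion-based generators, classifier-free guidance blends a conditional and an unconditional prediction, and the guidance weight directly trades conditioning fidelity against staying in distribution. Whether Genie 3 uses anything like this isn't public; the toy sketch below (all names mine) just illustrates the dial a videogen world model has to turn between "what usually happens in the data" and "what the action demands":

```python
import numpy as np

def guided_denoise(eps_uncond: np.ndarray,
                   eps_action: np.ndarray,
                   w: float) -> np.ndarray:
    """Classifier-free-guidance-style blend of two denoising predictions.

    eps_uncond: noise prediction ignoring the user's action; pulls the
                video toward what the training data says usually happens
                (the critter makes the jump).
    eps_action: prediction conditioned on the action actually taken.
    w:          guidance weight. w=0 ignores the action entirely; large w
                follows it faithfully but risks drifting out of
                distribution and breaking visual coherence.
    """
    return eps_uncond + w * (eps_action - eps_uncond)

# Toy illustration: the two predictions disagree (the player did NOT jump,
# but the training data says players usually do).
rng = np.random.default_rng(0)
eps_uncond = rng.normal(size=(4, 4))   # "jump anyway" tendency
eps_action = rng.normal(size=(4, 4))   # "fall into the gap" conditioning

for w in (0.0, 1.0, 4.0):
    blended = guided_denoise(eps_uncond, eps_action, w)
    drift = np.linalg.norm(blended - eps_uncond)
    print(f"w={w}: distance from unconditional prediction = {drift:.2f}")
```

Turn the weight up and the action wins but coherence can suffer; turn it down and you get exactly the railroading described above.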
Conclusion
To reiterate, the model is extremely impressive. With Genie, DeepMind has energised research into videogen world models (see Matrix-Game and Oasis). The models reflect a natural evolution of the Open-Endedness research agenda: developing truly general agents which can continually interact and learn in an environment. Given the blistering rate of progress, I wouldn't be shocked to see these limitations solved in future releases. Fingers crossed.
Posted on August 15, 2025