World model (artificial intelligence)

From Wikipedia, the free encyclopedia

A world model in artificial intelligence is a machine learning system that builds an internal representation of an environment. The model predicts how that environment changes over time in response to actions. Researchers design world models to help agents plan, reason, and act without constant real-world trial and error. World models differ from systems that merely classify or generate outputs. They simulate dynamics such as physics, object interactions, and causality. Early ideas date to the 1990s. Modern versions power robots, autonomous driving, and interactive video generation.

History

Jürgen Schmidhuber introduced the term world model in machine learning in 1990.[1] He proposed recurrent neural networks that predict future states from observations and use those predictions to train agents. David Ha and Schmidhuber revived the concept in a 2018 paper. Their agents learned to drive virtual cars and play video games inside self-generated simulations.

Yann LeCun advanced the idea in a 2022 position paper titled "A Path Towards Autonomous Machine Intelligence".[2] He argued that intelligence requires predictive models of the world rather than pure pattern matching, and proposed the joint embedding predictive architecture (JEPA) as a practical foundation. LeCun and collaborators developed several JEPA variants. V-JEPA 2 reached state-of-the-art performance at the time on video understanding and physical reasoning, and supports zero-shot robot control in unfamiliar environments.[3] LeWorldModel, introduced in March 2026, trains stably end to end from raw pixels using two loss terms and avoids hand-crafted heuristics.[4]

Google DeepMind introduced Genie in 2024. The model learned interactive environments from unlabeled internet videos. Genie 2 followed in late 2024 and added three-dimensional generation. The Genie series set benchmarks for general-purpose simulation.

Genie 3 was introduced in August 2025. It produces photorealistic, persistent three-dimensional worlds from text or image prompts, rendered at 24 frames per second and explored interactively in real time.[5] Waymo adopted Genie 3 in February 2026 to build the Waymo World Model, a specialized world model for autonomous driving simulation. It produces synchronized camera and lidar outputs and generates edge cases that real robotaxis rarely encounter, which PCMag described as unusual.[6]

Several world-model startups attracted large investments: General Intuition announced a $133.7 million seed round, World Labs raised $1 billion, and AMI raised $1.03 billion.[7]

In April 2026, Alibaba announced Happy Oyster, a world model designed for real-time, “flowy” world generation. It includes a directing mode for building worlds from text and image prompts and a wandering mode for exploring the resulting world, and it can generate three-minute in-world video clips.[8]

Also in April 2026, World Labs, co-founded by Fei-Fei Li, unveiled Spark 2.0, an open-source 3D Gaussian splatting rendering engine that targets smartphone-class devices.[8]

Architecture

World models process raw sensory data such as video frames or lidar scans. They compress this input into compact latent representations. The system then predicts future representations rather than pixel-by-pixel reconstructions.
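The compression step above can be illustrated with a minimal sketch. The linear "encoder" here is a fixed random projection standing in for a learned network, and all shapes and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# A 64x64 RGB frame holds 12,288 values; the latent keeps only 32.
frame = rng.standard_normal((64, 64, 3))

# Stand-in "encoder": a fixed random projection in place of a trained model.
W_enc = rng.standard_normal((32, frame.size)) / np.sqrt(frame.size)
latent = W_enc @ frame.ravel()

# The dynamics model predicts the next latent state, not the next frame.
W_dyn = rng.standard_normal((32, 32)) / np.sqrt(32)
next_latent = W_dyn @ latent

print(frame.size, latent.shape, next_latent.shape)  # 12288 (32,) (32,)
```

The point of the sketch is the dimensionality gap: prediction happens in the 32-dimensional latent space, never in the 12,288-dimensional pixel space.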

Many modern world models use the joint embedding predictive architecture (JEPA). An encoder turns observations into embeddings. A predictor estimates one or more future embeddings from the current embedding and an action. In some designs a critic selects the best candidate embedding. A regularizer keeps the embeddings well-behaved.[3]

The model trains by minimizing prediction error in embedding space. This approach avoids the high cost of generating every detail. Some architectures add explicit components. A fast reactive path handles immediate responses. A slower deliberative path performs longer-horizon planning. Video prediction accuracy or robot success rates are key metrics, but do not always predict real-world performance.
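Training in embedding space can be sketched with a toy example. The frozen encoder, the linear environment dynamics, and all hyperparameters below are synthetic stand-ins, not any published system:

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 16

# Frozen "encoder": a fixed random projection plus tanh, standing in
# for a learned network.
W_enc = rng.standard_normal((dim, dim)) / np.sqrt(dim)
def encode(x):
    return np.tanh(x @ W_enc.T)

# Synthetic environment: the next observation is a fixed linear map
# of the current one.
A = rng.standard_normal((dim, dim)) * 0.3
obs = rng.standard_normal((256, dim))
next_obs = obs @ A.T

# The predictor trains entirely in embedding space: it never
# reconstructs pixels, only future embeddings.
W_pred = np.zeros((dim, dim))
lr, losses = 0.1, []
z, z_next = encode(obs), encode(next_obs)
for _ in range(200):
    err = z @ W_pred.T - z_next          # embedding-space prediction error
    losses.append(float((err ** 2).mean()))
    W_pred -= lr * err.T @ z / len(z)    # gradient step on the MSE loss

print(losses[0], losses[-1])             # loss falls over training
```

Because the loss is measured between low-dimensional embeddings rather than full observations, each gradient step is cheap, which is the cost advantage the paragraph above describes.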

Generative world models such as Genie 3 combine these components with a simulator. They accept text prompts or layouts and output consistent video, lidar, or three-dimensional scenes. World models often train with self-supervised learning on large unlabeled datasets of video or robot interactions, which can speed learning. Reinforcement learning can then fine-tune a model for specific tasks.

Applications

World models support robot learning. Agents train inside simulations and transfer skills to the physical world, reducing the need for dangerous or expensive real-world trials.

Autonomous vehicles use world models to test rare events.[9] Waymo’s system simulates tornadoes or unusual pedestrian behavior, letting companies train planners without putting vehicles on public roads.

Interactive entertainment also benefits. Genie 3 lets users generate playable environments from simple descriptions, and game studios prototype levels faster.

Scientific simulation gains from these models as well. Researchers model physical systems or biological processes at scale, and planners in logistics or urban design test strategies inside accurate digital twins.

Comparison with large language models

Both world models and large language models (LLMs) perform inference on their inputs to make predictions.

LLMs operate on textual inputs. They predict the next token in text sequences. They excel at language-oriented tasks such as translation or summarization. However, they lack understanding of physics.

World models operate on sensor inputs such as pixels. They predict state changes in that data in latent space. This design supports planning and causal reasoning.

LLMs generate fluent text but often fail at consistent physical predictions. Their architecture employs transformers with refinements such as mixture of experts.

World models divide inference across encoders, predictors, simulators, and other components. They typically handle multimodal inputs such as video, lidar, radar, and audio, guided by textual prompting.

LLMs power chatbots and code assistants. World models drive embodied agents that act in dynamic environments, such as autonomous driving. The two may be combined in hybrid systems: for example, an LLM handles instructions while a world model manages low-level control. World model proponents such as LeCun argue that because LLMs are trained only on text, they cannot predict anything beyond text, such as real-world events.[2]

Benchmarks

World model benchmarks test physical understanding, long-term consistency, planning, and generalization from sensor data.

Meta introduced three benchmarks for V-JEPA 2.[10]

  • IntPhys 2 measures a model's ability to detect physics violations. It presents pairs of videos that diverge when one breaks physical rules. Humans score near 100% accuracy, while V-JEPA 2 performs little better than chance in many conditions.[11]
  • Minimal Video Pairs (MVPBench) tests physical understanding through multiple-choice questions based on short video clips. It probes object interactions and causality.[12]
  • Something-Something tests action recognition.[13]
  • Epic-Kitchens-100 tests human action anticipation.
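The paired-video protocol used by IntPhys 2 amounts to a two-alternative forced choice: the model should assign higher "surprise" to the clip that violates physics. A minimal scoring sketch, with entirely synthetic surprise scores:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical per-clip surprise scores for 100 video pairs; in each
# pair one clip obeys physics and one breaks a rule. All numbers here
# are synthetic, not real model outputs.
plausible = rng.normal(1.0, 0.5, size=100)
implausible = rng.normal(1.5, 0.5, size=100)

# A pair is solved when the violating clip receives the higher surprise;
# chance level for this two-alternative choice is 50%.
accuracy = float(np.mean(implausible > plausible))
print(accuracy)
```

Under this scoring, a model that assigns surprise independently of physical plausibility lands near 0.5, which is why near-chance results on IntPhys 2 indicate a lack of physical understanding.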

DeepMind benchmark:

  • Interactive evaluation measures consistency over minutes of interaction, memory of off-screen objects, and response to user actions or text prompts ("Genie 3: A new frontier for world models", Google DeepMind, August 5, 2025).

Waymo benchmark:

  • Output generation quality: Metrics include realism, controllability (via text prompts), and usefulness for training planners in simulated worlds. Pixel reconstruction error combined with episodic rewards, however, often fails to predict these qualities.

Other:

  • Epic-Kitchens-100 (often measured with Recall@5)[14]
  • Ego4D
  • 50 Salads, Breakfast, etc.

Potential benchmarks:

  • Zero-shot transfer to robots
  • Long-horizon planning
  • Implausible prediction rate

References
