What is a world model, in one short answer?
If a language model predicts the next token, a world model tries to predict the next state of a world. That world can be a generated 3D space, a simulated driving scene, an environment for an agent, or an interactive scene that responds when a user moves through it.
This is why the category matters. An image or video model only has to produce one plausible-looking output. A world model has to stay consistent as the camera moves, objects change, actions happen, and time passes.
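The next-state framing can be sketched in a few lines of Python. This is a toy illustration only, not how any production system works: `State`, `MOVES`, `predict_next_state`, and `rollout` are hypothetical names, and a hand-written rule stands in for what would be a learned model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    """Hypothetical world state: an agent's position on a 2D grid."""
    x: int
    y: int


# Assumed action set for this toy world; real systems define their own.
MOVES = {"left": (-1, 0), "right": (1, 0), "up": (0, 1), "down": (0, -1)}


def predict_next_state(state: State, action: str) -> State:
    """Map (current state, action) to the next state.

    A learned world model would approximate this function from data;
    here it is a fixed rule so the example stays runnable.
    """
    dx, dy = MOVES[action]
    return State(state.x + dx, state.y + dy)


def rollout(state: State, actions: list[str]) -> list[State]:
    """Apply the model step by step, feeding each prediction back in."""
    trajectory = [state]
    for action in actions:
        state = predict_next_state(state, action)
        trajectory.append(state)
    return trajectory


path = rollout(State(0, 0), ["right", "up", "left", "down"])
```

The point of the sketch is the signature: the input includes an action, and each predicted state becomes the input to the next step, which is where consistency over time has to hold.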
What it is not
A world model is not just another word for a video model. Video can be part of the interface, but the deeper goal is simulation: representing how a world behaves after a prompt, movement, or action.
| Dimension | Video model | World model |
|---|---|---|
| Primary output | A fixed video sequence | A stateful environment that can change with actions |
| Interaction | Usually prompt to clip | Prompt or action to evolving world state |
| Core challenge | Visual realism and temporal coherence | Spatial memory, causality, controllability, and persistence |
| Typical use | Creative media generation | Simulation, spatial design, robotics, agent training, interactive media |
| Evaluation question | Does the clip look plausible? | Does the world behave consistently when explored or acted on? |
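The evaluation question in the last row can be made concrete with a small revisit check: look at a spot, move away, come back, and compare. The sketch below uses two hypothetical toy worlds, `PersistentWorld` and `DriftingWorld`, invented for illustration; neither reflects a real product's API.

```python
class PersistentWorld:
    """Toy world whose content depends only on where the camera is."""

    def __init__(self):
        self.camera = 0
        self.objects = {0: "door", 1: "lamp", 2: "chair"}

    def move(self, delta: int) -> None:
        self.camera += delta

    def observe(self) -> str:
        return self.objects.get(self.camera, "empty")


class DriftingWorld(PersistentWorld):
    """Toy world whose content also depends on elapsed steps,
    so a revisited location may no longer show the same thing."""

    def __init__(self):
        super().__init__()
        self.ticks = 0

    def move(self, delta: int) -> None:
        super().move(delta)
        self.ticks += 1

    def observe(self) -> str:
        names = ["door", "lamp", "chair"]
        return names[(self.camera + self.ticks) % len(names)]


def revisit_is_consistent(world, deltas: list[int]) -> bool:
    """Walk a path, retrace it in reverse, and compare observations."""
    before = world.observe()
    for d in deltas:
        world.move(d)
    for d in reversed(deltas):
        world.move(-d)
    return world.observe() == before


persistent_ok = revisit_is_consistent(PersistentWorld(), [1, 1])
drifting_ok = revisit_is_consistent(DriftingWorld(), [1, 1])
```

A single probe like this is a necessary check rather than a sufficient one: a world can pass one path by coincidence and still fail on another, so a real evaluation would try many paths and actions.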
Why it is becoming a separate category
The phrase now spans several visible product and research tracks: DeepMind uses it for interactive generated worlds, World Labs for spatial 3D worlds, and Runway for a general world model research direction, while NVIDIA uses the term world foundation models for physical AI workflows.
That spread is exactly why World Models Watch treats the term as a category, not a single product label.



