World model concept map
Use the world model concept map to connect AI video, spatial computing, digital twins, physical AI, and generated worlds.
Core sentence
This sentence is the spine of the site: Minecraft and Roblox explain the user mental model; the metaverse explains persistence and social space; Vision Pro explains spatial computing; EMO, Veo, Wan, Kling, and Ray explain controllable video; Cosmos and digital twins explain simulation; and world models connect all of it into one technical future.
Scene explainer
The map works best as a consumer story: start from what people already know, then reveal the new layer.
01 Known worlds
Games and the metaverse taught the interface. People already understand avatars, spaces, inventories, maps, and shared places.
02 Generated media
Synthetic scenes became easy to watch, remix, and share, but still behaved like clips.
03 World models
The scene starts to remember space, respond to action, and support agents or physical simulation.
Concept flow
Past interface
Before world models, people already understood avatars, sandbox worlds, social rooms, and user-built spaces.
Simple characters make virtual presence easy to understand. The form is basic, but the mental model is powerful: a person can enter a world.
Minecraft and Roblox trained users to expect worlds that can be modified, extended, and shared.
The metaverse idea framed virtual worlds as social, persistent, and identity-driven, even when the tooling was still manual.
Current surface
AI video is the visible surface of the shift. The deeper issue is control, consistency, and memory across time.
EMO makes the control problem visible: the same identity needs to move, emote, sing, and stay coherent over time.
Veo, Wan, Kling, Ray, and earlier systems like Sora turn text, images, audio, and references into moving scenes.
MetaHuman-style characters, Roblox avatars, and EMO-like portraits point to a future where generated characters need continuity.
Spatial interface
Vision Pro and spatial computing are not the same thing as world models. They are the interface through which generated worlds may be seen and operated.
Apple Vision Pro reframes computing as something placed into space instead of locked inside a flat screen.
NeRF, Gaussian splatting, and scan-to-3D workflows turn real or imagined spaces into 3D representations that software can render and work with.
VR, AR, and mixed reality make the user feel located inside a generated or captured environment.
Industrial layer
The industrial version of world models is not entertainment. It is simulation for robots, vehicles, factories, and cities.
Digital twins model real places and systems so teams can test changes before touching the physical world.
Robots and autonomous vehicles need models of how environments respond to motion, contact, and decisions.
Large geospatial models connect AI to real-world places, maps, and location-aware behavior.
Core capability
The core shift is from generating isolated outputs to modeling how a world changes under time, viewpoint, and action.
A world model should preserve a coherent state when the user moves, edits, or acts.
Foundation models can become reusable infrastructure for generating, predicting, and testing world states.
Agents need environments where they can observe, act, fail, and learn.
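The idea of a coherent state that changes under time, viewpoint, and action can be made concrete with a toy example. The sketch below is not how any of the systems named on this page work; it is a hand-written grid world, and every name in it (`WorldState`, `step`, `observe`, `rollout`) is a hypothetical chosen for illustration. It only shows the contract a world model is expected to honor: actions change the state, edits persist, some actions fail, and what is seen depends on where the viewer stands.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical toy world, assumed for illustration only.
MOVES = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}


@dataclass(frozen=True)
class WorldState:
    """Immutable snapshot: agent position, placed blocks, elapsed ticks."""
    agent: tuple[int, int] = (0, 0)
    blocks: frozenset[tuple[int, int]] = frozenset()
    tick: int = 0


def step(state: WorldState, action: str) -> WorldState:
    """Return the next state after one action (time always advances)."""
    x, y = state.agent
    if action in MOVES:
        dx, dy = MOVES[action]
        target = (x + dx, y + dy)
        # Blocks are solid: a move into a block fails and the agent stays put.
        agent = state.agent if target in state.blocks else target
        return WorldState(agent, state.blocks, state.tick + 1)
    if action == "place":
        # An edit: put a block on the cell directly east of the agent.
        return WorldState(state.agent, state.blocks | {(x + 1, y)}, state.tick + 1)
    raise ValueError(f"unknown action: {action}")


def observe(state: WorldState, radius: int = 1) -> list[tuple[int, int]]:
    """Viewpoint-dependent view: blocks within `radius` cells of the agent."""
    x, y = state.agent
    return sorted(
        b for b in state.blocks
        if abs(b[0] - x) <= radius and abs(b[1] - y) <= radius
    )


def rollout(state: WorldState, actions: list[str]) -> list[WorldState]:
    """Apply actions in order and keep every intermediate state."""
    states = [state]
    for action in actions:
        state = step(state, action)
        states.append(state)
    return states


if __name__ == "__main__":
    history = rollout(WorldState(), ["place", "east", "north", "east", "south"])
    here = history[-1]
    away = rollout(here, ["north", "north"])[-1]
    back = rollout(away, ["south", "south"])[-1]
    print(observe(here), observe(away), observe(back))
```

In the demo, the agent places a block, fails to walk through it, goes around, walks out of view, and returns. The block is visible again on return because it lives in the state, not in any single frame. That is the difference between a clip and a world in miniature; a learned world model would have to predict such transitions rather than follow hand-written rules.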
Bridge table
| Entry concept | Known for | Connects to | Meaning inside world models |
|---|---|---|---|
| Blocky avatars / Minecraft | Simple identity inside a buildable world | Avatars, sandbox worlds, UGC | Generated worlds need persistent users, objects, and editable structure. |
| Metaverse | Persistent social virtual spaces | VR, Horizon Worlds, social identity | World models automate world creation instead of relying only on manual building. |
| Vision Pro | Spatial computing and immersive interface | AR, spatial video, 3D interaction | Generated worlds need a spatial interface for viewing, editing, and operation. |
| AI video | Generated motion, characters, and scenes | EMO, Veo 3.1, Wan2.7-Video, Kling, Ray | The video layer must become controllable, continuous, and stateful. |
| Digital twins | Simulation of real systems | Omniverse, robotics, LingBot-VA, LingBot-VLA, city and factory models | World models become useful when they predict and test real-world behavior. |
| World model | Predicting and generating world state | Genie 3, Marble, Cosmos, GWM-1 | The final category is not a place or device; it is the model that makes worlds behave. |