Fei-Fei Li's RTFM: Real-Time 3D World Model Running on Single H100 GPU

October 16, 2025
13 min read

On October 16, 2025, Fei-Fei Li and World Labs unveiled RTFM (Real-Time Foundation Model), a real-time, persistent, and 3D-consistent generative world model that runs on a single NVIDIA H100 GPU. The result marks a shift in spatial AI: interactive 3D environments that stay coherent across time and space, with no need for massive compute clusters. Arriving on the heels of World Labs’ $230 million funding round, RTFM positions Li’s vision of Large World Models (LWMs) as a credible counterpart to Large Language Models (LLMs), with implications for robotics, gaming, AR/VR, autonomous vehicles, and virtual production.

What is RTFM?

Real-Time Foundation Model for 3D Worlds

RTFM stands for Real-Time Foundation Model, a neural architecture designed to generate and maintain coherent 3D environments with the following characteristics:

1. Real-Time Performance

  • Runs inference on a single H100 GPU (~$30,000 hardware)
  • Frame rates sufficient for interactive applications
  • No need for distributed computing or GPU clusters

2. Persistence

  • Maintains scene state across interactions
  • Objects remain where placed; changes persist over time
  • Supports continuous exploration and modification

3. 3D Consistency

  • Geometrically coherent from all viewing angles
  • No “cardboard cutout” effects or 2D projections
  • Objects have depth, volume, and spatial relationships

4. Generative Capabilities

  • Create 3D scenes from text prompts or images
  • Modify existing environments dynamically
  • Add, remove, or transform objects in real-time

Comparison to Existing Models:

  • NeRF (Neural Radiance Fields): 3D-consistent but requires hours of per-scene optimization and slow rendering
  • Gaussian Splatting: Fast but limited generative capabilities
  • Diffusion-based 3D (DreamFusion, Magic3D): Generative but slow, not real-time
  • RTFM: Combines speed, consistency, and generative power

Technical Architecture

How RTFM Works

World Labs hasn’t released full technical details, but Li’s announcement and industry analysis suggest the following likely ingredients:

Core Components:

1. 3D Latent Diffusion

  • Operates in a learned 3D latent space (not 2D image space)
  • Encodes geometric relationships and spatial structure
  • Enables consistent multi-view rendering

2. Neural Radiance Field Integration

  • Leverages NeRF-style representations for 3D geometry
  • Optimized for real-time inference via distillation or sparse sampling
  • Supports dynamic scene updates

3. Transformer-Based Attention

  • Spatial attention mechanisms understand object relationships
  • Temporal attention maintains consistency across frames
  • Text-conditioned attention for prompt-guided generation
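
As an illustration of the spatial-attention idea, here is a minimal PyTorch sketch of self-attention over the voxels of a 3D latent grid. World Labs has not published RTFM’s architecture, so every shape, dimension, and class name below is an assumption made for illustration; temporal attention across frames would follow the same pattern with time-indexed tokens.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: self-attention over the voxels of a 3D latent grid.
# World Labs has not published RTFM's architecture; shapes and names here
# are illustrative assumptions, not the actual model.

class SpatialAttentionBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, latent: torch.Tensor) -> torch.Tensor:
        # latent: (batch, D, H, W, dim) -- a coarse 3D grid of features
        b, d, h, w, c = latent.shape
        tokens = latent.reshape(b, d * h * w, c)       # flatten voxels to tokens
        x = self.norm(tokens)
        out, _ = self.attn(x, x, x)                    # every voxel attends to every other
        return (tokens + out).reshape(b, d, h, w, c)   # residual, back to grid shape

# Toy usage: an 8x8x8 latent grid with 256-dim features per voxel.
block = SpatialAttentionBlock()
grid = torch.randn(1, 8, 8, 8, 256)
print(block(grid).shape)  # torch.Size([1, 8, 8, 8, 256])
```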

4. Efficient Rendering

  • Novel view synthesis in real-time (not pre-rendered)
  • Likely uses techniques like:
    • Sparse voxel grids for efficient memory usage
    • Level-of-detail (LOD) rendering for distant objects
    • Deferred shading for complex lighting
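
Of the techniques just listed, sparse voxel grids are the simplest to illustrate. The sketch below is a generic version of the idea — store features only for occupied voxels, keyed by integer coordinates — not RTFM’s actual data structure, which has not been disclosed.

```python
import numpy as np

# Generic sparse voxel grid: features are stored only for occupied voxels,
# so empty space costs no memory. Not RTFM's actual (undisclosed) structure.

class SparseVoxelGrid:
    def __init__(self, feature_dim: int = 32):
        self.feature_dim = feature_dim
        self.voxels: dict[tuple[int, int, int], np.ndarray] = {}

    def set(self, x: int, y: int, z: int, feature: np.ndarray) -> None:
        self.voxels[(x, y, z)] = feature

    def get(self, x: int, y: int, z: int) -> np.ndarray:
        # Missing keys fall back to a zero feature: empty space is implicit.
        return self.voxels.get((x, y, z), np.zeros(self.feature_dim))

grid = SparseVoxelGrid()
grid.set(3, 1, 4, np.ones(32))
print(len(grid.voxels))         # 1 -- only the occupied voxel is stored
print(grid.get(0, 0, 0).sum())  # 0.0 -- yet any coordinate can be queried
```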

Training Data:

  • Millions of 3D scans, video sequences, and synthetic environments
  • Paired with text descriptions for language grounding
  • Likely includes datasets like Objaverse, ShapeNet, and proprietary captures

Inference Optimization:

  • Model compression techniques (quantization, pruning)
  • Custom CUDA kernels for H100 Tensor Cores
  • Caching strategies to avoid redundant computation
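
Quantization, the first item above, is a standard compression technique; the sketch below shows the simplest symmetric int8 variant applied to a weight matrix. Production pipelines on H100 (e.g., via TensorRT) are considerably more sophisticated, and whether RTFM uses this exact scheme is speculation.

```python
import numpy as np

# Minimal symmetric int8 post-training quantization of a weight tensor.
# Illustrative only; RTFM's actual optimization stack is not public.

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    scale = np.abs(weights).max() / 127.0            # map max magnitude to 127
    q = np.clip(np.round(weights / scale), -127, 127)
    return q.astype(np.int8), float(scale)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print(np.abs(w - dequantize(q, scale)).max())  # small reconstruction error
```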

Capabilities Demonstrated

1. Text-to-3D Scene Generation

Example Prompt: “Modern loft apartment with floor-to-ceiling windows, minimalist furniture, and cityscape view”

RTFM Output:

  • Full 3D environment navigable in real-time
  • Consistent lighting and shadows from multiple angles
  • Detailed textures and materials (wood grain, glass reflections)

Use Case: Architects and interior designers can visualize spaces before construction
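
World Labs has not announced a public API, so there is no real client to show; the stub below is a purely hypothetical sketch of how a text-to-scene call might be shaped. Every name in it (`Scene`, `generate_scene`) is invented for illustration.

```python
from dataclasses import dataclass

# Purely hypothetical -- World Labs has not published an RTFM API.
# Every name below is invented to illustrate the shape of a call.

@dataclass
class Scene:
    """Stand-in for a handle to a generated, navigable 3D scene."""
    prompt: str
    scene_id: str

def generate_scene(prompt: str, seed: int = 0) -> Scene:
    # A real client would stream rendered frames from a server-side H100;
    # this stub just returns a handle.
    return Scene(prompt=prompt, scene_id=f"scene-{seed:04d}")

scene = generate_scene(
    "Modern loft apartment with floor-to-ceiling windows, "
    "minimalist furniture, and cityscape view"
)
print(scene.scene_id)  # scene-0000
```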

2. Image-to-3D Expansion

Input: Single photo of a living room

RTFM Output:

  • Infers depth and structure from 2D image
  • Generates plausible geometry for occluded areas (behind furniture, around corners)
  • Allows user to “walk around” the reconstructed space

Use Case: Real estate marketing—convert photos into virtual tours

3. Interactive Scene Editing

Scenario: User generates a 3D forest scene

Interactions:

  • Add: “Place a wooden cabin next to the lake” → Cabin appears with appropriate scale and orientation
  • Remove: “Delete the boulder on the left” → Boulder disappears; terrain fills in naturally
  • Transform: “Make the trees taller” → Trees scale up while maintaining realistic proportions

Persistence: Changes remain when user navigates away and returns

Use Case: Game developers prototype environments rapidly
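
The persistence property is easy to state in code: edits mutate durable scene state that survives navigation. The toy sketch below uses a plain dictionary as a stand-in scene graph; RTFM’s internal representation is not public, so this illustrates only the behavioral contract.

```python
# Toy illustration of persistence: edits mutate durable state that survives
# navigation. A dict stands in for the scene; RTFM's internals are not public.

scene = {
    "lake": {"position": (0, 0), "kind": "water"},
    "boulder_left": {"position": (-5, 2), "kind": "rock"},
}

def add(scene: dict, name: str, obj: dict) -> None:
    scene[name] = obj

def remove(scene: dict, name: str) -> None:
    scene.pop(name, None)  # terrain "fills in" by simply dropping the object

add(scene, "cabin", {"position": (1, 0), "kind": "building"})  # next to the lake
remove(scene, "boulder_left")

# Navigate away and return: the edits are still there.
print(sorted(scene))  # ['cabin', 'lake']
```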

4. Dynamic Lighting and Weather

Prompts:

  • “Change time to sunset” → Lighting shifts to golden hour; shadows lengthen
  • “Add heavy rain” → Water effects, puddle reflections, darkened sky

Physics Integration:

  • Rain interacts with geometry (runs down surfaces, pools in depressions)
  • Lighting affects material appearance (wet surfaces become reflective)

Use Case: Film production—previsualize scenes under different conditions

Why Single-GPU Performance Matters

Democratization of 3D AI

Previous State of Art:

  • Google’s DreamFusion required TPU-scale compute and hours of optimization per 3D asset
  • NVIDIA’s Instant NeRF is fast but limited to static scene reconstruction
  • Unreal Engine 5’s Nanite/Lumen render in real time but rely on hand-authored content

RTFM’s Breakthrough:

  • H100 GPU (~$30,000) vs. GPU clusters ($500,000+)
  • Single workstation vs. data center access
  • Researchers, indie devs, and small studios can experiment

Impact on Accessibility:

  • University labs can run cutting-edge 3D AI research
  • Indie game studios can prototype AAA-quality environments
  • Individual creators can build immersive VR experiences

Real-Time Interaction Unlocks New Applications

Latency Thresholds:

  • <100ms: Perceived as instantaneous (VR, AR, gaming)
  • <1s: Acceptable for interactive design tools
  • >10s: Limited to offline/batch processing

RTFM’s Performance: Based on the “real-time on a single H100” claim, RTFM likely achieves sub-second inference for scene modifications, enabling:

  • VR/AR: Users explore generated worlds without motion sickness (low latency critical)
  • Gaming: NPCs and environments adapt dynamically to player actions
  • Robotics: Robots simulate potential actions in 3D before execution
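
Those thresholds reduce to a simple budget check. The tier boundaries below come from the list above; the unlabeled 1–10s band is filled in as an assumption.

```python
# Latency tiers from the list above. The 1-10s "sluggish" band is an
# assumption; the list leaves that range unlabeled.

def interaction_tier(latency_ms: float) -> str:
    if latency_ms < 100:
        return "perceived as instantaneous (VR, AR, gaming)"
    if latency_ms < 1_000:
        return "acceptable for interactive design tools"
    if latency_ms < 10_000:
        return "sluggish but usable"
    return "offline/batch processing only"

for ms in (16.7, 250.0, 12_000.0):  # 60 fps frame, sub-second edit, batch job
    print(f"{ms} ms -> {interaction_tier(ms)}")
```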

Scalability and Deployment

Edge Deployment Potential:

  • H100 is datacenter-class, but future optimization could target:
    • A100 (previous generation, more widely available)
    • L40S or L4 (lower power, more affordable)
    • Eventually, consumer GPUs (RTX 4090/5090 class)

Cloud Services:

  • World Labs could offer RTFM via API (similar to OpenAI, Anthropic)
  • Pay-per-generation or subscription models
  • No hardware investment required for users

Fei-Fei Li’s Vision: Large World Models

From ImageNet to World Models

Fei-Fei Li’s Career Arc:

2009: ImageNet

  • Created ImageNet dataset (14 million labeled images)
  • Enabled deep learning revolution in computer vision
  • Foundation for AlexNet, ResNet, and modern AI

2024: World Labs

  • Founded in 2024 with $230 million funding (Andreessen Horowitz, Radical Ventures)
  • Mission: Build Large World Models that understand 3D space as LLMs understand language

Philosophy: “Language models gave machines the ability to understand words. World models will give them the ability to understand reality.”

What Are Large World Models (LWMs)?

Definition: AI systems that learn spatial reasoning, physical intuition, and temporal dynamics from 3D data—analogous to how LLMs learn from text.

Key Differences from LLMs:

| Aspect | LLMs | LWMs |
| --- | --- | --- |
| Input | Text tokens | 3D geometry, video, sensor data |
| Output | Text generation | 3D scenes, physics predictions |
| Reasoning | Linguistic patterns | Spatial relationships, causality |
| Applications | Chatbots, writing | Robotics, simulation, VR |

Training Paradigm:

  • LLMs: “Read the internet” to learn language
  • LWMs: “Observe the world” (via video, 3D scans, simulations) to learn physics and geometry

RTFM as LWM Prototype

How RTFM Embodies LWM Principles:

  1. Spatial Understanding: Knows that objects have fronts/backs, insides/outsides
  2. Physics Awareness: Generates scenes where objects obey gravity, don’t overlap
  3. Temporal Coherence: Maintains consistency across time (persistence)
  4. Generalization: Creates plausible 3D content for unseen prompts

Long-Term Vision: RTFM is an early step toward general-purpose world simulators—systems that can predict “what happens next” in physical reality, enabling:

  • Embodied AI: Robots that plan actions by simulating outcomes
  • Scientific discovery: Test hypotheses in virtual environments
  • Entertainment: Procedurally generated infinite worlds

Applications and Use Cases

1. Robotics and Embodied AI

Challenge: Robots need to understand 3D space to navigate and manipulate objects

RTFM Solution:

  • Mental simulation: Robot generates 3D model of environment
  • Action planning: Simulates grasping, moving, placing objects
  • Validation: Tests plan in virtual world before physical execution

Example: Robot tasked with “organize cluttered desk”

  1. RTFM generates 3D scene from robot’s camera feed
  2. Robot simulates different arrangements
  3. Selects optimal strategy (fewest movements, stable stacking)
  4. Executes in real world

Impact: Safer, more efficient robots for homes, warehouses, hospitals
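
The simulate-and-select loop in the desk example can be sketched in a few lines. The world model itself is stubbed out here; in the envisioned workflow, an RTFM-style model would roll each candidate plan forward in a generated 3D scene and score the outcome. The scoring criteria (movement count, stability) come from the example, but all numbers are invented.

```python
# Simulate-and-select sketch for the desk-organizing example. The cost
# function stands in for rolling a plan forward in a world model; all
# numbers here are invented for illustration.

def simulated_cost(plan: dict) -> float:
    # Fewer movements is better; unstable stacking is heavily penalized.
    return plan["moves"] + (10.0 if not plan["stable"] else 0.0)

candidates = [
    {"name": "stack everything in one pile", "moves": 3, "stable": False},
    {"name": "sort into drawer and shelf",   "moves": 5, "stable": True},
]

best = min(candidates, key=simulated_cost)
print("execute:", best["name"])  # the stable 5-move plan beats the shaky 3-move one
```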

2. Gaming and Virtual Worlds

Challenge: Hand-crafted game environments are expensive and time-consuming to build

RTFM Solution:

  • Procedural generation: Create infinite unique levels from prompts
  • Dynamic adaptation: Environments change based on player actions
  • Rapid prototyping: Designers iterate on ideas in real-time

Example: Open-world RPG development

  • Designer prompts: “Medieval village on mountainside”
  • RTFM generates layout, buildings, terrain
  • Designer refines: “Add marketplace, move blacksmith closer to gate”
  • Changes appear instantly

Economic Impact: Indie studios produce AAA-quality content; AAA studios slash development time

3. Architecture and Real Estate

Challenge: Clients find it difficult to visualize unbuilt spaces

RTFM Solution:

  • Virtual walkthroughs: Clients explore 3D models before construction
  • Design iteration: Architects test layouts, lighting, materials interactively
  • Cost visualization: See how budget impacts finishes and features

Example: Homebuyer customization

  • Buyer prompts: “Show me with hardwood floors and skylights”
  • RTFM updates 3D model instantly
  • Buyer explores, decides, finalizes purchase

Market Disruption: Reduces reliance on physical showrooms and mockups

4. Film and Virtual Production

Challenge: Pre-visualization (previz) is expensive and requires specialized artists

RTFM Solution:

  • Instant previz: Directors describe shots, RTFM generates 3D scenes
  • Virtual scouting: Explore generated locations before travel
  • Dynamic environments: Test different lighting, weather, camera angles

Example: Action sequence planning

  • Director prompts: “Chase through narrow alleyways at night”
  • RTFM generates environment
  • Director “films” virtual cameras to plan real shoot

Cost Savings: Reduce location scouting trips, minimize on-set changes

5. Autonomous Vehicles

Challenge: Self-driving cars need to predict how scenes evolve (pedestrians, traffic)

RTFM Solution:

  • Predictive simulation: Generate 3D future states (where will pedestrian cross?)
  • Scenario testing: Simulate rare events (child runs into street) without risk
  • Sensor fusion: Combine camera, LIDAR, radar into unified 3D world model

Example: Intersection navigation

  • Car’s sensors feed data to RTFM
  • RTFM predicts trajectories of other vehicles, pedestrians
  • Car plans safe path through intersection

Safety Impact: Better anticipation of edge cases, fewer accidents
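
The prediction step in that example can be reduced to its simplest possible form: extrapolate a pedestrian with a constant-velocity model and test whether the forecast enters the car’s planned corridor. Real AV stacks use learned, multi-hypothesis predictors; everything below (coordinates, corridor test) is an illustrative assumption.

```python
# Simplest form of the prediction step: constant-velocity extrapolation
# plus a crude corridor test. Real AV predictors are learned and
# multi-hypothesis; all numbers here are illustrative.

def predict(pos: tuple, vel: tuple, t: float) -> tuple:
    return (pos[0] + vel[0] * t, pos[1] + vel[1] * t)

pedestrian_pos = (10.0, -3.0)  # meters ahead / lateral offset from the car
pedestrian_vel = (0.0, 1.5)    # walking toward the car's lane (y = 0)

for t in (0.5, 1.0, 2.0):
    x, y = predict(pedestrian_pos, pedestrian_vel, t)
    in_path = abs(y) < 1.0 and 0.0 < x < 30.0  # inside the planned corridor?
    print(f"t={t}s -> ({x:.1f}, {y:.1f}) in_path={in_path}")
```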

6. AR/VR and Spatial Computing

Challenge: AR requires understanding real-world 3D geometry; VR needs compelling content

RTFM Solution:

  • AR occlusion: Virtual objects correctly hidden behind real furniture
  • AR interaction: Place virtual items that respect real surfaces
  • VR content generation: Infinite explorable worlds

Example: AR interior design

  • Point phone at empty room
  • Prompt: “Show me a bohemian bedroom setup”
  • RTFM generates furniture, decor in 3D
  • Walk around to view from all angles

Consumer Appeal: Mainstream AR/VR adoption as content barriers fall

Challenges and Limitations

1. Fine-Detail Realism

Current State: RTFM likely excels at large-scale geometry (room layouts, terrain) but may struggle with:

  • Intricate textures (fabric weaves, wood grain)
  • Small objects (books on shelves, kitchen utensils)
  • Photorealistic materials (subsurface scattering in skin, complex reflections)

Workaround: Combine RTFM for structure with specialized texture synthesis models

2. Physics Accuracy

Limitations:

  • Static scenes easier than dynamic (moving water, cloth simulation)
  • May not perfectly model complex physics (fluid dynamics, soft-body deformation)

Use Case Impact: Sufficient for visualization; insufficient for engineering simulation

3. Prompt Sensitivity

Challenge: Small wording changes may produce vastly different results

  • “Cozy cabin” vs. “Rustic cabin” could yield different styles

Solution: Iterative refinement; style guides for consistent outputs

4. Computational Requirements

H100 Access:

  • Still $30,000+ hardware (or cloud costs)
  • Not yet consumer-accessible
  • Limits experimentation to funded projects

Future Path: Optimization for cheaper GPUs (A100, RTX series)

5. Dataset Bias and Diversity

Training Data Concerns:

  • 3D datasets smaller and less diverse than text/image datasets
  • May favor Western architecture, modern styles
  • Limited representation of historical, cultural, or non-standard environments

Mitigation: Expand training data with global 3D scans, synthetic diversity

Competitive Landscape

World Labs vs. Other 3D AI Companies

Google (DreamFusion, NeRF, Immersive View):

  • Strong research but scattered commercial products
  • RTFM potentially faster and more interactive

NVIDIA (Instant NeRF, Omniverse):

  • Excellent tools for professionals
  • Less focus on generative AI; more on optimization

Meta (Habitat, Reality Labs):

  • VR/AR focus with internal 3D AI research
  • Not yet offering general-purpose world model tools

Unity/Unreal (Game Engines):

  • Powerful but require manual content creation
  • RTFM could integrate as procedural generation plugin

Startups (Luma AI, Poly, Kaedim):

  • Specific niches (NeRF capture, 3D asset generation)
  • RTFM more ambitious in scope (full world modeling)

World Labs’ Advantage:

  • Fei-Fei Li’s reputation attracts top talent and funding
  • Clear vision for LWMs as new AI paradigm
  • First-mover advantage in real-time 3D generative models

What’s Next for World Labs and RTFM?

Short-Term (2025-2026)

Beta Access:

  • Likely rolling out to researchers, partners
  • Blog and live demo available at worldlabs.ai/blog/rtfm

API Launch:

  • Developers integrate RTFM into apps, games, tools
  • Pricing model (pay-per-generation, subscription, enterprise licenses)

Performance Optimization:

  • Support for A100, L40S, consumer GPUs
  • Mobile/edge deployment for AR applications

Mid-Term (2026-2027)

Multimodal Integration:

  • Combine RTFM with LLMs (ChatGPT describes scene → RTFM generates it)
  • Audio integration (spatial sound design for generated environments)

Specialized Verticals:

  • Robotics SDK (ROS integration, sim-to-real transfer)
  • Gaming toolkit (Unity/Unreal plugins)
  • Architecture suite (CAD integration, building codes)

Improved Realism:

  • Photorealistic materials, lighting
  • Dynamic physics (water, smoke, cloth)

Long-Term (2028+)

General World Simulator:

  • Predict physical outcomes (drop glass → shatters realistically)
  • Enable scientific experimentation in virtual physics labs

Embodied AGI Foundation:

  • LWMs as “spatial understanding” component of AGI
  • Combined with LLMs (language) and robotic control (action)

Metaverse Infrastructure:

  • Power persistent, user-modifiable virtual worlds
  • Billions of users creating and exploring 3D content

Implications for the AI Industry

1. Spatial AI as New Frontier

Shift in Focus:

  • 2020-2023: Text (GPT-3, GPT-4, LLMs dominate)
  • 2022-2024: Images (DALL-E 2/3, Midjourney, Stable Diffusion)
  • 2025: Video (Sora 2, Veo 3.1, Runway Gen-3)
  • 2025+: 3D and World Models (RTFM, LWMs)

Investment Surge: Expect more funding for spatial AI startups as investors recognize World Labs’ traction

2. Hardware Acceleration

GPU Demand: RTFM’s H100 requirement drives demand for high-end AI accelerators

  • NVIDIA benefits from continued AI boom
  • AMD, Intel push competing products

Custom Silicon: Future world models may use specialized chips (like Google’s TPUs for LLMs)

3. Convergence of AI Modalities

Unified Models: Future systems may combine:

  • LLMs (language understanding)
  • Diffusion models (image/video)
  • LWMs (3D/spatial reasoning)

Example: “Design a Mediterranean villa” → LLM interprets → LWM generates 3D → Video model creates walkthrough
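
No such unified pipeline exists today; the stub below shows only the hand-off structure the example describes, with every stage faked.

```python
# Hypothetical composition of the three stages in the example above.
# Every function is a stub; no such unified pipeline exists today.

def llm_interpret(request: str) -> dict:
    return {"style": "Mediterranean", "type": "villa"}   # structured scene spec

def lwm_generate(spec: dict) -> str:
    return f"3d-scene[{spec['style']} {spec['type']}]"   # handle to a 3D scene

def video_walkthrough(scene: str) -> str:
    return f"walkthrough.mp4 rendered from {scene}"

print(video_walkthrough(lwm_generate(llm_interpret("Design a Mediterranean villa"))))
```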

4. Regulation and Ethics

Deepfake Concerns: 3D deepfakes of real locations (e.g., White House interior) could spread misinformation

IP Protection: Generating 3D replicas of copyrighted architecture, products raises legal questions

Bias and Representation: Who decides what “realistic” or “beautiful” spaces look like?

Conclusion

Fei-Fei Li’s RTFM is nothing short of a breakthrough in spatial AI. By achieving real-time, 3D-consistent, generative world modeling on a single H100 GPU, World Labs has delivered on the promise of Large World Models—AI systems that understand space, geometry, and physics as deeply as LLMs understand language.

RTFM’s significance extends beyond technical achievement:

  • Democratizes 3D AI: Researchers and creators gain access to tools previously requiring supercomputers
  • Enables new applications: Robotics, gaming, AR/VR, autonomous vehicles all benefit from real-time world simulation
  • Positions spatial AI as the next major frontier after text, image, and video generation

As World Labs opens beta access and moves toward API launch, the AI industry will be watching closely. If RTFM delivers on its promise, we may look back on October 16, 2025 as the moment spatial intelligence joined language and vision as a pillar of artificial intelligence.

The future isn’t just about understanding words or images. It’s about understanding worlds.


Stay updated on the latest spatial AI and world model breakthroughs at AI Breaking.