Abstract
The World-R1 framework improves video generation by incorporating 3D constraints through reinforcement learning and a specialized pure-text dataset, while maintaining visual quality and scalability.
Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies. While existing methods attempt to inject 3D priors via architectural modifications, they often incur high computational costs and limit scalability. We propose World-R1, a framework that aligns video generation with 3D constraints through reinforcement learning. To facilitate this alignment, we introduce a specialized pure-text dataset tailored for world simulation. Using Flow-GRPO, we optimize the model with feedback from pre-trained 3D foundation models and vision-language models, enforcing structural coherence without altering the underlying architecture. We further employ a periodic decoupled training strategy to balance rigid geometric consistency with dynamic scene fluidity. Extensive evaluations show that our approach significantly enhances 3D consistency while preserving the original visual quality of the foundation model, effectively bridging the gap between video generation and scalable world simulation.
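As a concrete reference for the style of update involved, below is a minimal PyTorch sketch of the group-relative objective that GRPO-family methods such as Flow-GRPO build on: each prompt yields a group of sampled videos, their scalar rewards (here they would come from the pre-trained 3D foundation and vision-language reward models described in the abstract) are normalized within the group to form advantages, which then weight a clipped likelihood-ratio loss. The function names, the clip value, and the 1e-8 stabilizer are illustrative assumptions, not details taken from the paper.

import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # rewards: shape (G,), one scalar reward per video in a group of G
    # samples generated from the same prompt (e.g. a 3D-consistency score).
    # Group-relative advantage: normalize rewards within the group.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # 1e-8 avoids div-by-zero

def grpo_loss(logp_new: torch.Tensor,
              logp_old: torch.Tensor,
              advantages: torch.Tensor,
              clip_eps: float = 0.2) -> torch.Tensor:
    # logp_new / logp_old: per-sample log-likelihoods of the sampled
    # trajectories under the current and behavior policies.
    # Clipped likelihood-ratio objective, PPO-style.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()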
Community
World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
We highly recommend visiting our Project Page for a better experience: https://aka.ms/world-r1
If you're interested, please give us a star: https://github.com/microsoft/World-R1
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- GAP-MLLM: Geometry-Aligned Pre-training for Activating 3D Spatial Perception in Multimodal Large Language Models (2026)
- Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding (2026)
- VS3R: Robust Full-frame Video Stabilization via Deep 3D Reconstruction (2026)
- UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models (2026)
- VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward (2026)
- OrbitNVS: Harnessing Video Diffusion Priors for Novel View Synthesis (2026)
- VisionNVS: Self-Supervised Inpainting for Novel View Synthesis under the Virtual-Shift Paradigm (2026)
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Get this paper in your agent:
hf papers read 2604.24764
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash