SketchVLM: Vision language models can annotate images to explain thoughts and guide users Paper • 2604.22875 • Published 9 days ago • 31
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning Paper • 2507.01006 • Published Jul 1, 2025 • 255
Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale Paper • 2603.25040 • Published Mar 26 • 131
FlowScene: Style-Consistent Indoor Scene Generation with Multimodal Graph Rectified Flow Paper • 2603.19598 • Published Mar 20 • 32
Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models Paper • 2603.18002 • Published Mar 18 • 13
The Trinity of Consistency as a Defining Principle for General World Models Paper • 2602.23152 • Published Feb 26 • 201
DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models Paper • 2512.15713 • Published Dec 17, 2025 • 18
Alleviating Sparse Rewards by Modeling Step-Wise and Long-Term Sampling Effects in Flow-Based GRPO Paper • 2602.06422 • Published Feb 6 • 47
Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text Paper • 2601.22975 • Published Jan 30 • 111
ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding Paper • 2512.13586 • Published Dec 15, 2025 • 93
HOIverse: A Synthetic Scene Graph Dataset With Human Object Interactions Paper • 2506.19639 • Published Jun 24, 2025 • 2