Proxy3D-8B
Proxy3D-8B is a vision-language model (VLM) specialized in 3D scene understanding and spatial reasoning. It is a fine-tuned version of Qwen/Qwen2.5-VL-7B-Instruct using the Proxy3D method, which produces compact yet comprehensive 3D proxy representations for the vision modality to overcome the limitations of standard 2D pipelines.
- Paper: arXiv:2605.08064
- Project Page: wzzheng.net/Proxy3D
- GitHub Repository: Spacedreamer2384/Proxy3D
- Dataset: SpaceSpan-318K
Model Description
Spatial intelligence is crucial for VLMs to reason about 3D environments. Proxy3D addresses this by extracting scene features from video frames with semantic and geometric encoders, then performing semantic-aware clustering to obtain a compact set of proxies in 3D space.
Using these compact proxy representations, the model achieves state-of-the-art performance on 3D visual question answering (VQA), visual grounding, and general spatial-intelligence benchmarks while remaining highly efficient.
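As a rough illustration of the proxy idea only, the sketch below pools per-point fused features into a fixed number of proxy tokens with plain k-means. The tensor shapes, the cluster count, and the use of vanilla k-means are assumptions for exposition; the actual semantic-aware clustering and the encoders are implemented in the GitHub repository.

```python
# Illustrative sketch only: plain k-means standing in for Proxy3D's
# semantic-aware clustering. Shapes and k are assumed, not from the paper.
import torch

def build_proxies(features, coords, k=64, iters=10):
    """features: (N, D) fused semantic + geometric features; coords: (N, 3) points."""
    # Initialize proxy centroids from k random points (hypothetical choice).
    centroids = features[torch.randperm(features.shape[0])[:k]].clone()
    for _ in range(iters):
        # Assign every point to its nearest centroid in feature space.
        assign = torch.cdist(features, centroids).argmin(dim=1)  # (N,)
        for j in range(k):
            mask = assign == j
            if mask.any():  # leave empty clusters at their previous centroid
                centroids[j] = features[mask].mean(dim=0)
    # A proxy's 3D center is the mean position of the points it absorbed.
    proxy_pos = torch.stack([
        coords[assign == j].mean(dim=0) if (assign == j).any() else coords.mean(dim=0)
        for j in range(k)
    ])
    return centroids, proxy_pos  # (k, D) proxy features, (k, 3) proxy centers

# Toy usage with random data standing in for encoder outputs.
feats, pts = torch.randn(4096, 256), torch.rand(4096, 3)
proxy_feats, proxy_centers = build_proxies(feats, pts)
```

The payoff of this design is that the language model attends over a few dozen proxy tokens instead of thousands of per-frame patch tokens, which is where the efficiency gains come from.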
Training Procedure
The model was trained with a four-stage progressive pipeline that builds up spatial reasoning skills, starting from initial image-text alignment and culminating in complex 3D reasoning on the SpaceSpan dataset.
Training Hyperparameters
The following hyperparameters were used during training (an illustrative configuration sketch follows the list):
- Learning rate: 5e-06
- Train batch size: 8
- Total train batch size: 128
- Optimizer: adamw_torch (betas=(0.9,0.999), epsilon=1e-08)
- LR scheduler type: cosine
- LR scheduler warmup ratio: 0.1
- Number of epochs: 1.0
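For reference, these values map directly onto a standard transformers TrainingArguments, as sketched below. The split of the 128-sample effective batch into per-device batch size and gradient accumulation steps (and the device count) is an assumption; the actual run used the training pipeline from the GitHub repository.

```python
# Illustrative mapping of the listed hyperparameters onto
# transformers.TrainingArguments. The 8 -> 128 effective-batch split
# across accumulation steps and GPUs is an assumption.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="proxy3d-8b",
    learning_rate=5e-6,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,  # 8 x 16 = 128 (single device assumed here)
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1.0,
)
```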
Framework Versions
- Transformers 4.55.0
- PyTorch 2.6.0+cu118
- Datasets 3.1.0
- Tokenizers 0.21.1
Usage
Running this model requires a specific environment setup and custom configuration files to handle the Qwen2VLBEVForConditionalGeneration architecture. Please refer to the Setup section of the GitHub repository for detailed instructions on how to install and run inference.
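As a rough orientation, loading typically follows the standard transformers remote-code pattern sketched below. The Hub repo id and the trust_remote_code route are assumptions here, not verified steps; the repository's Setup section remains the authoritative guide.

```python
# Hypothetical loading sketch: the custom Qwen2VLBEVForConditionalGeneration
# class and its config files ship with the repository, so the repo id and
# the trust_remote_code route below are assumptions.
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "Spacewanderer8263/Proxy3D-8B"  # assumed Hub id
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype="auto",  # pick bf16/fp16 per the checkpoint
    device_map="auto",   # requires the accelerate package
)
```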
Citation
If you find Proxy3D useful for your research, please cite:
@article{proxy3d2026,
  title={Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment},
  author={Jiang, Jerry and Sun, Haowen and Gudovskiy, Denis and Nakata, Yohei and Okuno, Tomoyuki and Keutzer, Kurt and Zheng, Wenzhao},
  journal={arXiv preprint arXiv:2605.08064},
  year={2026}
}
Acknowledgements
This work builds upon several excellent repositories, including Qwen2.5-VL, LLaMAFactory, and GPT4Scene.