Proxy3D-8B
Proxy3D-8B is a vision-language model (VLM) specialized in 3D scene understanding and spatial reasoning. It is a fine-tuned version of Qwen/Qwen2.5-VL-7B-Instruct using the Proxy3D method, which produces compact yet comprehensive 3D proxy representations for the vision modality to overcome the limitations of standard 2D pipelines.
- Paper: arXiv:2605.08064
- Project Page: wzzheng.net/Proxy3D
- GitHub Repository: Spacedreamer2384/Proxy3D
- Dataset: SpaceSpan-318K
Model Description
Spatial intelligence is crucial for VLMs to reason about 3D environments. Proxy3D addresses this by extracting scene features from video frames with semantic and geometric encoders, then performing semantic-aware clustering to obtain a compact set of proxies in 3D space.
Using these compact proxy representations, the model achieves state-of-the-art performance on 3D visual question answering (VQA), visual grounding, and general spatial-intelligence benchmarks while remaining highly efficient.
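As a rough illustration of the proxy idea only, the sketch below pools per-point fused features into a fixed number of proxy tokens with plain k-means. The tensor shapes, the cluster count, and the use of vanilla k-means are assumptions for exposition; the actual semantic-aware clustering and the encoders are implemented in the GitHub repository.

```python
# Illustrative sketch only: plain k-means standing in for Proxy3D's
# semantic-aware clustering. Shapes and k are assumed, not from the paper.
import torch

def build_proxies(features, coords, k=64, iters=10):
    """features: (N, D) fused semantic + geometric features; coords: (N, 3) points."""
    # Initialize proxy centroids from k random points (hypothetical choice).
    centroids = features[torch.randperm(features.shape[0])[:k]].clone()
    for _ in range(iters):
        # Assign every point to its nearest centroid in feature space.
        assign = torch.cdist(features, centroids).argmin(dim=1)  # (N,)
        for j in range(k):
            mask = assign == j
            if mask.any():  # leave empty clusters at their previous centroid
                centroids[j] = features[mask].mean(dim=0)
    # A proxy's 3D center is the mean position of the points it absorbed.
    proxy_pos = torch.stack([
        coords[assign == j].mean(dim=0) if (assign == j).any() else coords.mean(dim=0)
        for j in range(k)
    ])
    return centroids, proxy_pos  # (k, D) proxy features, (k, 3) proxy centers

# Toy usage with random data standing in for encoder outputs.
feats, pts = torch.randn(4096, 256), torch.rand(4096, 3)
proxy_feats, proxy_centers = build_proxies(feats, pts)
```

The payoff of this design is that the language model attends over a few dozen proxy tokens instead of thousands of per-frame patch tokens, which is where the efficiency gains come from.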
Training Procedure
The model was trained with a four-stage progressive pipeline that builds up spatial reasoning skills, starting from initial image-text alignment and culminating in complex 3D reasoning on the SpaceSpan dataset.
Training Hyperparameters
The following hyperparameters were used during training (an illustrative configuration sketch follows the list):
- Learning rate: 5e-06
- Train batch size: 8
- Total train batch size: 128
- Optimizer: adamw_torch (betas=(0.9,0.999), epsilon=1e-08)
- LR scheduler type: cosine
- LR scheduler warmup ratio: 0.1
- Number of epochs: 1.0
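For reference, these values map directly onto a standard transformers TrainingArguments, as sketched below. The split of the 128-sample effective batch into per-device batch size and gradient accumulation steps (and the device count) is an assumption; the actual run used the training pipeline from the GitHub repository.

```python
# Illustrative mapping of the listed hyperparameters onto
# transformers.TrainingArguments. The 8 -> 128 effective-batch split
# across accumulation steps and GPUs is an assumption.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="proxy3d-8b",
    learning_rate=5e-6,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,  # 8 x 16 = 128 (single device assumed here)
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1.0,
)
```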
Framework Versions
- Transformers 4.55.0
- PyTorch 2.6.0+cu118
- Datasets 3.1.0
- Tokenizers 0.21.1
Usage
Running this model requires a specific environment setup and custom configuration files to handle the Qwen2VLBEVForConditionalGeneration architecture. Please refer to the Setup section of the GitHub repository for detailed instructions on how to install and run inference.
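As a rough orientation, loading typically follows the standard transformers remote-code pattern sketched below. The Hub repo id and the trust_remote_code route are assumptions here, not verified steps; the repository's Setup section remains the authoritative guide.

```python
# Hypothetical loading sketch: the custom Qwen2VLBEVForConditionalGeneration
# class and its config files ship with the repository, so the repo id and
# the trust_remote_code route below are assumptions.
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "Spacewanderer8263/Proxy3D-8B"  # assumed Hub id
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype="auto",  # pick bf16/fp16 per the checkpoint
    device_map="auto",   # requires the accelerate package
)
```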
Citation
If you find Proxy3D useful for your research, please cite:
@article{proxy3d2026,
  title={Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment},
  author={Jiang, Jerry and Sun, Haowen and Gudovskiy, Denis and Nakata, Yohei and Okuno, Tomoyuki and Keutzer, Kurt and Zheng, Wenzhao},
  journal={arXiv preprint arXiv:2605.08064},
  year={2026}
}
Acknowledgements
This work builds upon several excellent repositories, including Qwen2.5-VL, LLaMAFactory, and GPT4Scene.