Abstract
Rubric-based on-policy distillation demonstrates superior sample efficiency compared to traditional logit-based methods while maintaining compatibility with black-box scenarios.
On-policy distillation (OPD) is a powerful paradigm for model alignment, yet its reliance on teacher logits restricts its application to white-box scenarios. We contend that structured semantic rubrics can serve as a scalable alternative to teacher logits, enabling OPD using only teacher-generated responses. To demonstrate this, we introduce ROPD, a simple yet foundational framework for rubric-based OPD. Specifically, ROPD induces prompt-specific rubrics from teacher-student contrasts, and then uses these rubrics to score student rollouts for on-policy optimization. Empirically, ROPD outperforms advanced logit-based OPD methods across most scenarios and achieves up to a 10x gain in sample efficiency. These results position rubric-based OPD as a flexible, black-box-compatible alternative to the prevailing logit-based OPD, offering a simple yet strong baseline for scalable distillation across proprietary and open-source LLMs. Code is available at https://github.com/Peregrine123/ROPD_official.
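The two-step recipe the abstract describes (rubric induction from teacher-student contrasts, then rubric-scored on-policy optimization) can be sketched as below. This is a minimal illustration under assumptions, not the paper's implementation: the function names, the meta-prompt, and the toy judge are all invented here, and the actual RL update that consumes the rewards (e.g. a GRPO/PPO step) is omitted.

```python
from typing import Callable, List

# Type aliases (assumptions for this sketch).
Generate = Callable[[str], str]            # prompt -> response text
Judge = Callable[[str, str, str], float]   # (prompt, rubric, response) -> score


def induce_rubric(prompt: str, teacher: Generate, student: Generate,
                  rubric_writer: Generate) -> str:
    """Step 1 (assumed): induce a prompt-specific rubric from the contrast
    between one teacher response and one student response."""
    teacher_resp = teacher(prompt)
    student_resp = student(prompt)
    meta_prompt = (
        "Write a grading rubric for the task below, focusing on the "
        "criteria where response A is better than response B.\n"
        f"Task: {prompt}\nA (teacher): {teacher_resp}\nB (student): {student_resp}"
    )
    return rubric_writer(meta_prompt)


def rubric_rewards(prompt: str, rubric: str, rollouts: List[str],
                   judge: Judge) -> List[float]:
    """Step 2 (assumed): score on-policy student rollouts against the rubric.
    The resulting scores act as rewards for any on-policy optimizer."""
    return [judge(prompt, rubric, rollout) for rollout in rollouts]


if __name__ == "__main__":
    # Placeholder models, for illustration only.
    teacher = lambda p: "a step-by-step derivation ending in the answer 84"
    student = lambda p: "84"
    writer = lambda p: "1. Shows intermediate steps. 2. States the final answer."
    judge = lambda p, rub, resp: 1.0 if "84" in resp else 0.0  # toy scorer

    prompt = "Compute 12 * 7 and explain your reasoning."
    rubric = induce_rubric(prompt, teacher, student, writer)
    print(rubric_rewards(prompt, rubric, ["12 * 7 = 84", "70"], judge))
```

Note the design implication: because the rubric is induced once per prompt and scoring needs only response text, the scalar rewards can replace a logit-matching loss in any reward-based on-policy update, which is what makes the recipe compatible with black-box teachers.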
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- A Survey of On-Policy Distillation for Large Language Models (2026)
- DP-OPD: Differentially Private On-Policy Distillation for Language Models (2026)
- MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate (2026)
- Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe (2026)
- Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation (2026)
- Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes (2026)
- SODA: Semi On-Policy Black-Box Distillation for Large Language Models (2026)