Themis
Abstract:
Reward models (RMs) have become an indispensable fixture of the language model (LM) post-training playbook, enabling policy alignment and test-time scaling. Research on the application of RMs in code generation, however, has been comparatively sparse, with existing work largely focusing on execution feedback. This choice constrains post-training to optimizing functional correctness over self-contained executable code. In this work, we examine the training and evaluation of multilingual, multi-criteria code RMs. To this end, we first compile Themis-CodeRewardBench, a benchmark to evaluate code RMs across five preference dimensions (i.e., criteria) and eight programming languages, on which we profile 50+ code, math, and general-purpose RMs. Observing the limited proficiency of current RMs beyond scoring for functional correctness, we develop Themis-CodePreference, the largest open-source collection of code preferences to date (more than 350k preference pairs), and use it to train Themis-RM, a suite of multilingual code reward models for flexible multi-criteria scoring, ranging in size from 600M to 32B parameters. Our experiments and ablations demonstrate positive scaling trends, strong cross-lingual transfer when training on diverse preferences, and the importance of multi-criteria training for reliable code reward modeling.
Themis reward models are trained using the Bradley-Terry preference framework with a multi-stage data pipeline that mines, filters, scores, and assembles high-quality code preference pairs from open-source repositories. The models are evaluated on Code RewardBench (CRB), a benchmark of 8,866 preference pairs spanning 5 quality aspects and 8 programming languages.
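Under the Bradley-Terry framework, the model assigns each (prompt, response) pair a scalar reward and is trained so the chosen response outscores the rejected one. A minimal sketch of the pairwise loss in plain Python (the reward values below are placeholders, not outputs of an actual Themis model; the full training objective also adds LM regularisation and a magnitude penalty, omitted here):

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the chosen response wins under the
    Bradley-Terry model: -log(sigmoid(r_chosen - r_rejected))."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(x)) rewritten as log(1 + exp(-x)) for numerical clarity
    return math.log1p(math.exp(-margin))

# The loss shrinks as the reward margin grows in favour of the chosen response.
print(bradley_terry_loss(2.0, 0.0))   # small loss: chosen clearly preferred
print(bradley_terry_loss(0.0, 2.0))   # large loss: ranking is inverted
```

Minimising this loss pushes the reward margin between chosen and rejected responses apart; at a zero margin the loss is exactly log 2, the cost of a coin-flip ranking.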
Pipeline Overview
The end-to-end pipeline has three phases: dataset construction, model training, and evaluation.
```
DATASET CONSTRUCTION
────────────────────

BigQuery (github_repos)
          │
          ▼
┌─────────────────────┐     ┌───────────────────┐     ┌──────────────────┐
│  1. Commit Mining   │────▶│ 2. Repo Filtering │────▶│ 3. Ext Filtering │
│       (SQL)         │     │   (allowlists)    │     │   (lang → ext)   │
└─────────────────────┘     └───────────────────┘     └─────────┬────────┘
                                                                │
          ┌─────────────────────────────────────────────────────┘
          ▼
┌──────────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│ 4. Content Retrieval │───▶│ 5. Deduplication │───▶│ 6. Aspect Filter │
│     (git fetch)      │    │  (MinHash LSH)   │    │   (ModernBERT)   │
└──────────────────────┘    └──────────────────┘    └─────────┬────────┘
                                                              │
          ┌───────────────────────────────────────────────────┘
          ▼
┌──────────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│ 7. LLM Scoring &     │───▶│ 8. LLM-as-a-Judge│───▶│ 9. Training Data │
│    Instruction Synth │    │   (A/B voting)   │    │     Assembly     │
└──────────────────────┘    └──────────────────┘    └─────────┬────────┘
                                                              │
MODEL TRAINING                                                │
──────────────                                                │
          ┌───────────────────────────────────────────────────┘
          ▼
┌───────────────────────────────────────────────────────────────────┐
│  Bradley-Terry preference training with FSDP2 on multi-node GPUs  │
│  (BT loss + LM regularisation + magnitude penalty, Liger kernels) │
└────────────────────────────────┬──────────────────────────────────┘
                                 │
EVALUATION                       │
──────────                       │
          ┌──────────────────────┘
          ▼
┌───────────────────────────────────────────────────────────────────┐
│  Code RewardBench: 8,866 pairs × 5 aspects × 8 languages          │
│  Evaluated across scalar, MoE, and generative RM architectures    │
└───────────────────────────────────────────────────────────────────┘
```
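Step 5's near-duplicate removal can be illustrated with a toy MinHash: each file is reduced to a short signature of per-seed minimum hashes over its shingles, and files whose signatures agree on most positions are treated as near-duplicates. This is a from-scratch sketch; the pipeline's actual shingling scheme, permutation count, and LSH banding are assumptions here:

```python
import hashlib

def shingles(text: str, k: int = 5) -> set[str]:
    """Character k-grams used as the document's feature set."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash_signature(text: str, num_perm: int = 64) -> list[int]:
    """One minimum per seeded hash function approximates a random permutation."""
    feats = shingles(text)
    return [
        min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in feats
        )
        for seed in range(num_perm)
    ]

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Two trivially renamed functions collide far more often than unrelated code.
a = minhash_signature("def add(a, b):\n    return a + b\n")
b = minhash_signature("def add(x, y):\n    return x + y\n")
c = minhash_signature("class Parser:\n    def parse(self): ...\n")
print(estimated_jaccard(a, b), estimated_jaccard(a, c))
```

In a production setting the signatures would additionally be banded into an LSH index so that candidate duplicate pairs are found without comparing every file against every other.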
Results
Themis-RM models achieve best-in-class accuracy on Themis-CodeRewardBench, a code-specific reward model benchmark, while also matching or exceeding much larger models on established general-domain benchmarks (RewardBench V1, RewardBench V2, JudgeBench). Models are grouped by parameter class; bold marks the best in each group.
| Model | Themis-CodeRewardBench | RewardBench V1 | RewardBench V2 | JudgeBench |
|---|---|---|---|---|
| **32B–72B Class** | | | | |
| WorldPM-72B | 76.96 | 90.88 | 67.92 | 55.21 |
| Athene-RM-70B | 78.39 | 91.22 | 68.76 | 63.45 |
| Nemotron-70B-Reward | 81.19 | 93.88 | 70.49 | **73.47** |
| Themis-RM-32B | **91.82** | **94.89** | **72.34** | 71.65 |
| AceCodeRM-32B | 62.95 | 23.58 | 67.98 | 66.77 |
| **7B–14B Class** | | | | |
| Themis-RM-14B | **91.19** | 94.11 | 71.44 | **70.85** |
| Themis-RM-8B | 89.78 | 93.69 | 65.87 | 69.97 |
| Athene-RM-8B | 76.58 | 87.48 | 62.96 | 61.12 |
| CodeScaler-8B | 79.12 | 94.66 | 76.51 | 70.05 |
| Skywork-Reward-V2-8B | 79.97 | **94.76** | **76.93** | 67.90 |
| AceCodeRM-7B | 71.11 | 22.74 | 63.16 | 61.09 |
| **0.6B–4B Class** | | | | |
| Themis-RM-4B | **88.39** | 92.46 | 63.81 | 68.02 |
| CodeScaler-4B | 77.97 | **94.32** | **75.13** | **68.44** |
| Skywork-Reward-V2-4B | 79.27 | 94.06 | 74.26 | 65.43 |
| Themis-RM-1.7B | 83.04 | 89.17 | 56.22 | 63.29 |
| CodeScaler-1.7B | 73.75 | 91.13 | 68.44 | 66.17 |
| Skywork-Reward-V2-1.7B | 75.60 | 91.64 | 67.71 | 66.48 |
| Themis-RM-0.6B | 79.26 | 83.41 | 49.61 | 63.84 |
| Skywork-Reward-V2-0.6B | 72.77 | 86.32 | 60.83 | 63.65 |
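The benchmark numbers above are pairwise accuracies: a reward model is credited for a pair when it assigns the chosen response a higher score than the rejected one. A minimal sketch with hypothetical score pairs standing in for real model outputs (whether ties count as incorrect, as assumed here, varies by benchmark):

```python
def pairwise_accuracy(pairs: list[tuple[float, float]]) -> float:
    """pairs: (score_chosen, score_rejected) tuples.
    A pair counts as correct only when chosen strictly outscores rejected."""
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)

# Hypothetical reward scores for four preference pairs:
# one clear win, one inversion, one tie, one more win.
scores = [(1.3, 0.2), (0.9, 1.1), (2.4, 2.4), (0.5, -0.7)]
print(pairwise_accuracy(scores))  # 0.5: two wins out of four pairs
```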
Datasets
All datasets are available on HuggingFace:
| Dataset | Description | Samples |
|---|---|---|
| Themis-CodeRewardBench | Code RM evaluation benchmark: 5 quality dimensions, 8 languages, 19 source subsets | 8,866 |
| Themis-CodePreference | Training data for the PM stage: code preferences across 5 criteria and 8 languages | 354,010 |
| Themis-GeneralPreference | Training data for the PT stage: general-domain and code retrieval preferences | 110,598 |
| Themis-Git-Commits-Merged | Single-file commits from merged PRs across 24 languages (intermediate, pre-classification) | ~8M |
| Themis-Git-Commits | Raw mined single-file commits from permissively licensed repos (full unfiltered pool) | ~28M |
Related Work
Distributed Training Tutorial β A companion tutorial by us that walks through multi-node distributed training of scalar reward models on cloud GPU clusters. Covers cluster provisioning, high-speed networking, container management, and FSDP-based training. Useful as a standalone guide for anyone looking to reproduce the Themis training setup or adapt it to their own reward modelling workloads. Follows a simplified recipe that leverages the Axolotl framework for training reward models with the Bradley-Terry loss.
Citation
```
@article{themis2025,
  title={Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring},
  author={Paul, Indraneil and Gurevych, Iryna and Glava\v{s}, Goran},
  journal={arXiv preprint arXiv:xxxx.xxxxx},
  year={2025}
}
```
Models
- project-themis/Themis-RM-0.6B
- project-themis/Themis-RM-1.7B
- project-themis/Themis-RM-4B
- project-themis/Themis-RM-8B