Themis
Abstract:
Reward models (RMs) have become an indispensable fixture of the language model (LM) post-training playbook, enabling policy alignment and test-time scaling. Research on the application of RMs in code generation, however, has been comparatively sparse, with existing work largely focusing on execution feedback. This choice constrains post-training to optimizing functional correctness over self-contained executable code. In this work, we examine the training and evaluation of multilingual, multi-criteria code RMs. To this end, we first compile Themis-CodeRewardBench, a benchmark to evaluate code RMs across five preference dimensions (i.e., criteria) and eight programming languages, on which we profile 50+ code, math, and general-purpose RMs. Observing the limited proficiency of current RMs beyond scoring for functional correctness, we develop Themis-CodePreference, the largest open-source collection of code preferences to date (more than 350k preference pairs), and use it to train Themis-RM, a suite of multilingual code reward models for flexible multi-criteria scoring, ranging in size from 600M to 32B parameters. Our experiments and ablations demonstrate positive scaling trends, strong cross-lingual transfer when training on diverse preferences, and the importance of multi-criteria training for reliable code reward modeling.
Themis reward models are trained using the Bradley-Terry preference framework with a multi-stage data pipeline that mines, filters, scores, and assembles high-quality code preference pairs from open-source repositories. The models are evaluated on Code RewardBench (CRB), a benchmark of 8,866 preference pairs spanning 5 quality aspects and 8 programming languages.
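Under the Bradley-Terry framework, the model assigns each (prompt, response) pair a scalar reward and is trained so the chosen response outscores the rejected one. A minimal sketch of the pairwise loss in plain Python (the reward values below are placeholders, not outputs of an actual Themis model; the full training objective also adds LM regularisation and a magnitude penalty, omitted here):

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the chosen response wins under the
    Bradley-Terry model: -log(sigmoid(r_chosen - r_rejected))."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(x)) rewritten as log(1 + exp(-x)) for numerical clarity
    return math.log1p(math.exp(-margin))

# The loss shrinks as the reward margin grows in favour of the chosen response.
print(bradley_terry_loss(2.0, 0.0))   # small loss: chosen clearly preferred
print(bradley_terry_loss(0.0, 2.0))   # large loss: ranking is inverted
```

Minimising this loss pushes the reward margin between chosen and rejected responses apart; at a zero margin the loss is exactly log 2, the cost of a coin-flip ranking.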
Pipeline Overview
The end-to-end pipeline has three phases: dataset construction, model training, and evaluation.
```
DATASET CONSTRUCTION
────────────────────

BigQuery (github_repos)
          │
          ▼
┌─────────────────────┐     ┌───────────────────┐     ┌──────────────────┐
│  1. Commit Mining   │────▶│ 2. Repo Filtering │────▶│ 3. Ext Filtering │
│       (SQL)         │     │   (allowlists)    │     │   (lang → ext)   │
└─────────────────────┘     └───────────────────┘     └─────────┬────────┘
                                                                │
          ┌─────────────────────────────────────────────────────┘
          ▼
┌──────────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│ 4. Content Retrieval │───▶│ 5. Deduplication │───▶│ 6. Aspect Filter │
│     (git fetch)      │    │  (MinHash LSH)   │    │   (ModernBERT)   │
└──────────────────────┘    └──────────────────┘    └─────────┬────────┘
                                                              │
          ┌───────────────────────────────────────────────────┘
          ▼
┌──────────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│ 7. LLM Scoring &     │───▶│ 8. LLM-as-a-Judge│───▶│ 9. Training Data │
│    Instruction Synth │    │   (A/B voting)   │    │     Assembly     │
└──────────────────────┘    └──────────────────┘    └─────────┬────────┘
                                                              │
MODEL TRAINING                                                │
──────────────                                                │
          ┌───────────────────────────────────────────────────┘
          ▼
┌───────────────────────────────────────────────────────────────────┐
│  Bradley-Terry preference training with FSDP2 on multi-node GPUs  │
│  (BT loss + LM regularisation + magnitude penalty, Liger kernels) │
└────────────────────────────────┬──────────────────────────────────┘
                                 │
EVALUATION                       │
──────────                       │
          ┌──────────────────────┘
          ▼
┌───────────────────────────────────────────────────────────────────┐
│  Code RewardBench: 8,866 pairs × 5 aspects × 8 languages          │
│  Evaluated across scalar, MoE, and generative RM architectures    │
└───────────────────────────────────────────────────────────────────┘
```
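Step 5's near-duplicate removal can be illustrated with a toy MinHash: each file is reduced to a short signature of per-seed minimum hashes over its shingles, and files whose signatures agree on most positions are treated as near-duplicates. This is a from-scratch sketch; the pipeline's actual shingling scheme, permutation count, and LSH banding are assumptions here:

```python
import hashlib

def shingles(text: str, k: int = 5) -> set[str]:
    """Character k-grams used as the document's feature set."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash_signature(text: str, num_perm: int = 64) -> list[int]:
    """One minimum per seeded hash function approximates a random permutation."""
    feats = shingles(text)
    return [
        min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in feats
        )
        for seed in range(num_perm)
    ]

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Two trivially renamed functions collide far more often than unrelated code.
a = minhash_signature("def add(a, b):\n    return a + b\n")
b = minhash_signature("def add(x, y):\n    return x + y\n")
c = minhash_signature("class Parser:\n    def parse(self): ...\n")
print(estimated_jaccard(a, b), estimated_jaccard(a, c))
```

In a production setting the signatures would additionally be banded into an LSH index so that candidate duplicate pairs are found without comparing every file against every other.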
Results
Themis-RM models achieve best-in-class accuracy on Themis-CodeRewardBench, a code-specific reward model benchmark, while also matching or exceeding much larger models on established general-domain benchmarks (RewardBench V1, RewardBench V2, JudgeBench). Models are grouped by parameter class; bold marks the best in each group.
| Model | Themis-CodeRewardBench | RewardBench V1 | RewardBench V2 | JudgeBench |
|---|---|---|---|---|
| **32B–72B Class** | | | | |
| WorldPM-72B | 76.96 | 90.88 | 67.92 | 55.21 |
| Athene-RM-70B | 78.39 | 91.22 | 68.76 | 63.45 |
| Nemotron-70B-Reward | 81.19 | 93.88 | 70.49 | **73.47** |
| Themis-RM-32B | **91.82** | **94.89** | **72.34** | 71.65 |
| AceCodeRM-32B | 62.95 | 23.58 | 67.98 | 66.77 |
| **7B–14B Class** | | | | |
| Themis-RM-14B | **91.19** | 94.11 | 71.44 | **70.85** |
| Themis-RM-8B | 89.78 | 93.69 | 65.87 | 69.97 |
| Athene-RM-8B | 76.58 | 87.48 | 62.96 | 61.12 |
| CodeScaler-8B | 79.12 | 94.66 | 76.51 | 70.05 |
| Skywork-Reward-V2-8B | 79.97 | **94.76** | **76.93** | 67.90 |
| AceCodeRM-7B | 71.11 | 22.74 | 63.16 | 61.09 |
| **0.6B–4B Class** | | | | |
| Themis-RM-4B | **88.39** | 92.46 | 63.81 | 68.02 |
| CodeScaler-4B | 77.97 | **94.32** | **75.13** | **68.44** |
| Skywork-Reward-V2-4B | 79.27 | 94.06 | 74.26 | 65.43 |
| Themis-RM-1.7B | 83.04 | 89.17 | 56.22 | 63.29 |
| CodeScaler-1.7B | 73.75 | 91.13 | 68.44 | 66.17 |
| Skywork-Reward-V2-1.7B | 75.60 | 91.64 | 67.71 | 66.48 |
| Themis-RM-0.6B | 79.26 | 83.41 | 49.61 | 63.84 |
| Skywork-Reward-V2-0.6B | 72.77 | 86.32 | 60.83 | 63.65 |
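The benchmark numbers above are pairwise accuracies: a reward model is credited for a pair when it assigns the chosen response a higher score than the rejected one. A minimal sketch with hypothetical score pairs standing in for real model outputs (whether ties count as incorrect, as assumed here, varies by benchmark):

```python
def pairwise_accuracy(pairs: list[tuple[float, float]]) -> float:
    """pairs: (score_chosen, score_rejected) tuples.
    A pair counts as correct only when chosen strictly outscores rejected."""
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)

# Hypothetical reward scores for four preference pairs:
# one clear win, one inversion, one tie, one more win.
scores = [(1.3, 0.2), (0.9, 1.1), (2.4, 2.4), (0.5, -0.7)]
print(pairwise_accuracy(scores))  # 0.5: two wins out of four pairs
```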
Datasets
All datasets are available on HuggingFace:
| Dataset | Description | Samples |
|---|---|---|
| Themis-CodeRewardBench | Code RM evaluation benchmark: 5 quality dimensions, 8 languages, 19 source subsets | 8,866 |
| Themis-CodePreference | Training data for the PM stage: code preferences across 5 criteria and 8 languages | 354,010 |
| Themis-GeneralPreference | Training data for the PT stage: general-domain and code retrieval preferences | 110,598 |
| Themis-Git-Commits-Merged | Single-file commits from merged PRs across 24 languages (intermediate, pre-classification) | ~8M |
| Themis-Git-Commits | Raw mined single-file commits from permissively licensed repos (full unfiltered pool) | ~28M |
Related Work
Distributed Training Tutorial β A companion tutorial by us that walks through multi-node distributed training of scalar reward models on cloud GPU clusters. Covers cluster provisioning, high-speed networking, container management, and FSDP-based training. Useful as a standalone guide for anyone looking to reproduce the Themis training setup or adapt it to their own reward modelling workloads. Follows a simplified recipe that leverages the Axolotl framework for training reward models with the Bradley-Terry loss.
Citation
```
@article{themis2025,
  title={Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring},
  author={Paul, Indraneil and Gurevych, Iryna and Glava\v{s}, Goran},
  journal={arXiv preprint arXiv:xxxx.xxxxx},
  year={2025}
}
```
Models
- project-themis/Themis-RM-0.6B
- project-themis/Themis-RM-1.7B
- project-themis/Themis-RM-4B
- project-themis/Themis-RM-8B