ai_code_detect

Binary classifier: human-written vs. AI-generated code. Trained on 500k samples (Python, Java, C++). Macro F1: 0.9813.

Architecture

Two input streams fused into a single MLP classifier.

Stream 1 — Probabilistic

Code is passed through Salesforce/codegen-350M-mono. Per-token surprisal signals are extracted across a 256-token window:

#	Feature	Description
0	`log_prob`	Log-probability of the actual token
1	`log_rank`	Log-rank within the distribution
2	`entropy`	Shannon entropy of the token distribution
3	`varentropy`	Variance of surprisal (−log p) under the token distribution
4	`top10_mass`	Probability mass in top-10 tokens
5	`gap_1_2`	Log-prob gap between rank-1 and rank-2
6	`surprisal_z`	Per-token surprisal z-score
7	`entropy_delta`	Entropy change from previous position
8	`cum_rank`	Cumulative mean log-rank
9	`is_special`	Special token flag
10	`r10_flag`	Rank ≤ 10
11	`r100_flag`	10 < rank ≤ 100
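
To make the table concrete, here is a minimal sketch of how several of these features could be derived from the next-token logits at one position. It is not the repository's extraction code: the function name `token_features`, the rank convention (rank 1 = most likely token), and the toy four-token vocabulary are assumptions for illustration, and only a subset of the 12 features is shown.

```python
import math

def token_features(logits, actual_id):
    """Surprisal features for one position, given next-token logits (list of floats)."""
    # Numerically stable log-softmax.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    probs = [math.exp(lp) for lp in log_probs]

    log_prob = log_probs[actual_id]
    # Assumed convention: rank 1 is the most likely token.
    rank = 1 + sum(1 for lp in log_probs if lp > log_prob)
    entropy = -sum(p * lp for p, lp in zip(probs, log_probs))
    # Varentropy: variance of surprisal (-log p) under the distribution.
    varentropy = sum(p * (-lp - entropy) ** 2 for p, lp in zip(probs, log_probs))
    ranked = sorted(log_probs, reverse=True)
    return {
        "log_prob": log_prob,
        "log_rank": math.log(rank),
        "entropy": entropy,
        "varentropy": varentropy,
        "top10_mass": sum(math.exp(lp) for lp in ranked[:10]),
        "gap_1_2": ranked[0] - ranked[1],
        "r10_flag": float(rank <= 10),
        "r100_flag": float(10 < rank <= 100),
    }

# Toy vocabulary of four tokens; the observed token is the second most likely.
feats = token_features([2.0, 1.0, 0.0, -1.0], actual_id=1)
uniform = token_features([0.0, 0.0, 0.0, 0.0], actual_id=0)
```

In a real pipeline the logits would come from the causal language model at each of the 256 positions; the running features (`surprisal_z`, `entropy_delta`, `cum_rank`) additionally need values from earlier positions.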

These 12 per-token features aggregate into 32 sequence-level statistics (moments, autocorrelations, burstiness, etc.) passed downstream.
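
As a rough illustration of that aggregation step, the sketch below reduces one per-token series to a handful of statistics. The exact 32 statistics are not listed here, so the selection is an assumption, as are the function name `sequence_stats` and the burstiness formula `(std - mean) / (std + mean)`, which is one common definition rather than a confirmed choice.

```python
import math

def sequence_stats(values):
    """Summarise one non-empty per-token feature series into sequence-level statistics."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    var = sum(c * c for c in centered) / n
    std = math.sqrt(var)
    # Standardised third moment; defined as 0 for a constant series.
    skew = (sum(c ** 3 for c in centered) / n) / std ** 3 if std > 0 else 0.0
    # Lag-1 autocorrelation: how strongly each value predicts the next one.
    autocorr1 = (
        sum(a * b for a, b in zip(centered, centered[1:])) / (n * var) if var > 0 else 0.0
    )
    # Burstiness as (std - mean) / (std + mean); an assumed definition.
    denom = std + mean
    burstiness = (std - mean) / denom if denom != 0 else 0.0
    return {
        "mean": mean,
        "std": std,
        "skew": skew,
        "autocorr1": autocorr1,
        "burstiness": burstiness,
    }

# Made-up surprisal series that alternates between low and high values.
surprisal = [1.0, 3.0, 1.0, 3.0]
stats = sequence_stats(surprisal)
```

Applying a function like this to each feature column and concatenating the results yields a fixed-length vector regardless of the input length.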

Stream 2 — Semantic

Salesforce/codet5-base mean-pools hidden states into a 768-dim embedding capturing style, structure, naming, and comment density.
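
Mean pooling itself is simple; a dependency-free sketch follows. Whether the model excludes padding positions from the average is an assumption here, as are the function name `mean_pool` and the tiny 2-dim vectors (the real hidden states are 768-dim).

```python
def mean_pool(hidden_states, attention_mask):
    """Average the token vectors where attention_mask is 1, ignoring padding."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                total[i] += x
    # Guard against an input that is entirely padding.
    return [t / max(count, 1) for t in total]

# Three positions with 2-dim vectors; the last position is padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
embedding = mean_pool(hidden, attention_mask=[1, 1, 0])
```

Without the mask, the padding row would dominate the average, which is why masked pooling is the usual choice for variable-length inputs.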

Classifier

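The probabilistic stream contributes token-level (256-dim) and sequence-level (64-dim) representations. These are concatenated with the semantic embedding (768-dim) into a 1088-dim vector, which feeds a 3-layer MLP (LayerNorm, GELU, dropout) with a sigmoid output.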
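
The shapes can be checked with a small inference-time sketch in plain Python. Only the input widths (256, 64, 768) and the layer count come from the description above; the hidden widths (256 and 64), the ordering of LayerNorm and GELU, the random weights, and all function names are assumptions. Dropout is omitted because it is the identity at inference time.

```python
import math
import random

def linear(x, weights, bias):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def gelu(x):
    # Exact GELU via the Gaussian CDF.
    return [0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x]

def init_layer(rng, n_in, n_out):
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def classify(token_repr, sequence_repr, semantic_repr, layers):
    # Concatenate the three representations into one fused vector.
    x = token_repr + sequence_repr + semantic_repr
    for weights, bias in layers[:-1]:
        x = gelu(layer_norm(linear(x, weights, bias)))
    logit = linear(x, *layers[-1])[0]
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid

rng = random.Random(0)
fused_dim = 256 + 64 + 768
# Hidden widths 256 and 64 are illustrative assumptions.
layers = [init_layer(rng, fused_dim, 256), init_layer(rng, 256, 64), init_layer(rng, 64, 1)]
score = classify([0.1] * 256, [0.2] * 64, [0.3] * 768, layers)
```

With untrained random weights the score is meaningless; the point is only that the three inputs fuse into a single 1088-dim vector and come out as one probability.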

Performance

Evaluated on 3,000 balanced validation samples (1,000/language):

Metric	Score
Macro F1	0.9813
Accuracy	98.13%
Threshold	0.475
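
For readers who want to reproduce these metrics on their own data, the following sketch applies the 0.475 threshold to model scores and computes macro F1. The scores and labels are made up for illustration, and treating a score exactly equal to the threshold as "AI" is an assumption.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes 0 (human) and 1 (AI)."""
    scores = []
    for cls in (0, 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

THRESHOLD = 0.475  # decision threshold from the table above

probs = [0.02, 0.40, 0.51, 0.97]  # made-up model scores
labels = [0, 0, 0, 1]             # 0 = human, 1 = AI
preds = [int(p >= THRESHOLD) for p in probs]
score = macro_f1(labels, preds)
```

Macro F1 weights both classes equally, so on a balanced validation set it tracks accuracy closely, as the two figures above show.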

Language	Accuracy	Mean P(AI), human code	Mean P(AI), AI code	Gap
Python	99.50%	0.001	0.992	0.991
Java	98.00%	0.043	0.968	0.926
C++	96.90%	0.063	0.966	0.903

Training

Setting	Value
Optimizer	AdamW (encoder lr 8e-6, head lr 3e-5)
Scheduler	OneCycleLR with cosine annealing
Loss	BCEWithLogitsLoss
Regularization	EMA (decay=0.998), dropout, LayerNorm
Precision	fp16 via HuggingFace Accelerate
Hardware	2× GPU
Epochs	4 (500k samples)
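
The EMA entry refers to keeping an exponential moving average of the weights alongside the trained ones. A minimal sketch of the update rule with the listed decay is below; the dictionary-of-floats representation and the name `ema_update` are assumptions for illustration, not the training script's code.

```python
def ema_update(shadow, params, decay=0.998):
    """Blend the current weights into the running average, in place."""
    for name, value in params.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * value
    return shadow

params = {"w": 1.0}
shadow = dict(params)   # the average starts as a copy of the initial weights
params["w"] = 2.0       # pretend an optimizer step changed the weight
ema_update(shadow, params)
```

With a decay of 0.998 each step moves the average only 0.2% of the way toward the current weights, which smooths out step-to-step noise.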

How To Use

import os
import sys

from huggingface_hub import hf_hub_download

REPO_ID = "santh-cpu/ai_code_detect"

# Download model.py from the Hub and make its directory importable.
script_path = hf_hub_download(repo_id=REPO_ID, filename="model.py")
sys.path.append(os.path.dirname(script_path))

from model import predict

# Replace the placeholder string with the source code to classify.
print(predict("your code here"))
