1. Preface
- Anti_9 is an RL layer applied on top of LLMs. It uses the concept of "comparison" to give the model a kind of multidimensional awareness. General intelligence is fundamentally rooted in comparison: from deep machine learning to our daily decision-making, comparison is the most basic lever for the emergence of intelligence. Consider why you chose to wear a T-shirt this morning, or why you decided to buy it in the first place. Although many researchers have devoted enormous effort to analogical reasoning in AI, and techniques such as k-means clustering and graph classification have long been used in unsupervised learning, we believe Anti_9 represents a viable implementation of comparison for LLMs.
2. Benchmark
- We improved the base model DeepSeek-V3.2-thinking from a base score of 85.86% to a final score of 92.6% (a gain of 6.74 percentage points), which places it in the top 5 among LLMs worldwide on this benchmark.
- The test record is available on GitHub.
3. Quick Start
1. General Use
- See the GitHub repository for general-use instructions.
2. Professional Use
- See the GitHub repository for professional-use instructions.
4. How It Works
1. Parallel Generation of Samples
- Instead of generating the answer directly or relying on reasoning as one-dimensional thinking, we have the LLM generate several compare samples as guide tokens. This gives the LLM a degree of association away from the original question. To simulate how humans gain experience and form a unique character, we implement memory as token heads during compare-sample generation. Although more samples make the final analytic process more accurate, we limit this project to 40 compare samples and 30 pre-compare samples, considering FLOPs consumption.
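As an illustration of the parallel-generation step, here is a minimal sketch using a thread pool, with a stub standing in for the real sample-generation API call (the function names and the stub are hypothetical, not the project's actual code):

```python
import concurrent.futures

# Hypothetical stand-in for a call to the sample-generation model API
# (e.g. Ernie 4.5); here it just returns a labeled placeholder string.
def generate_compare_sample(question: str, sample_id: int) -> str:
    return f"compare_sample_{sample_id}: association for {question!r}"

def generate_samples_parallel(question: str, n_samples: int = 40) -> list[str]:
    """Fire off all compare-sample generations concurrently and
    collect the results in submission order."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(generate_compare_sample, question, i)
                   for i in range(n_samples)]
        return [f.result() for f in futures]

samples = generate_samples_parallel("Why wear a T-shirt today?", n_samples=5)
```

Because the samples are independent of one another, they can be requested concurrently; only the final analysis step needs all of them.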
2. Pre-memory
- We select the top 10 memories here as token guides, scored by the following quantities:
a (pick_a): Tracks how many times a memory item has been selected.
b (raised_b): Tracks how many times a memory item has been picked as the final answer.
r_eff(t): A piecewise time-dependent decay rate (learning rate) controlling the forgetting speed.
A, C: Bias terms.
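The exact scoring formula is not spelled out here, so the sketch below is purely illustrative: one plausible way to combine the pick count `a`, the success count `b`, a piecewise time-dependent decay `r_eff(t)`, and bias terms `A` and `C` into a ranking score for selecting the top memories. The breakpoint, constants, and functional form are all assumptions.

```python
import math

def r_eff(t: float) -> float:
    # Piecewise decay rate: older memories are forgotten faster
    # (illustrative breakpoint; the real schedule is not given here).
    return 0.01 if t < 100 else 0.05

def memory_score(a: int, b: int, t: float, A: float = 1.0, C: float = 0.5) -> float:
    # Hypothetical scoring: usefulness (final-answer picks b) relative to
    # exposure (selections a), damped by time-dependent forgetting,
    # with A and C as bias terms.
    usefulness = (b + C) / (a + 1.0)
    return A * usefulness * math.exp(-r_eff(t) * t)

def top_k_memories(memories: list[dict], k: int = 10) -> list[dict]:
    # memories: dicts with keys "a", "b", "t" (age in steps)
    return sorted(memories,
                  key=lambda m: memory_score(m["a"], m["b"], m["t"]),
                  reverse=True)[:k]
```

Any monotone combination with the same qualitative behavior (reward usefulness, decay with age) would serve the same top-10 selection role.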
- The memory used in this project is not overfitted for the benchmark; it is trained on these datasets for general use:
| Dataset Name | Purpose |
|---|---|
| TurtleBench-extended-en | Hidden long-details reasoning |
| hendrycks_competition_math_N_A | Learn scientific reasoning format |
- We limit the memory to 75 rows to manage FLOPs consumption. The full memory file is available as `memery_st_ori.db`.
3. Final Result
- We embed the user input and each sample, calculate the Euclidean distance between the two vectors as the closeness score, and compute the mean Euclidean distance of all samples. We then find the closest answer group to the mean Euclidean distance line.
Euclidean distance: d(u, v) = √(Σᵢ (uᵢ − vᵢ)²)

*Figures: schematic diagram; real-data diagram.*
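A minimal sketch of the mean-line selection described above, using toy 2-D vectors in place of real embeddings (the helper names are assumptions; `mean_line_ratio` is the parameter discussed later in this README):

```python
import math

def euclidean(u: list[float], v: list[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pick_by_mean_line(question_vec, sample_vecs, mean_line_ratio: float = 1.0) -> int:
    """Score each sample by its distance to the question embedding, then
    pick the sample whose distance is closest to the (ratio-shifted) mean
    distance line, rather than the outright nearest sample."""
    dists = [euclidean(question_vec, s) for s in sample_vecs]
    mean_line = mean_line_ratio * sum(dists) / len(dists)
    return min(range(len(dists)), key=lambda i: abs(dists[i] - mean_line))

# Toy embeddings (hypothetical): sample 0 is surface-close to the question,
# samples 1-2 sit near the mean distance, sample 3 is an outlier.
q = [0.0, 0.0]
samples = [[0.1, 0.0], [1.0, 0.0], [1.1, 0.2], [5.0, 0.0]]
```

Note that a plain nearest-neighbor rule would always return sample 0; selecting around the mean line is what lets the associated, non-surface answers win.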
Our models are trained with supervised learning to find the closest answer to the next token, so they behave more like group B in the graph. However, when we use compare samples to raise the answer dimension, we observe something interesting: different answers appear, and a group of answers clusters around the mean closeness-distance line.

Many times we feel that AI is not smart enough and does not give us the answer we want, because its answers focus only on the surface meaning of the request. This is not wrong, since our engineers train models to find the shortest distance to the correct token. But human-like intelligence is not that simple: we use association in our daily decisions, a combination of common sense, experience, and logic that is never explicitly spoken but exists in our minds.
We are not guessing the answers. Even if the LLM makes a mistake once or twice, with the mean Euclidean distance we always find the correct answer group in the parallel sample universe. As seen in the graph below, when the `mean_line_ratio` changes, the selected answer varies, but the correct answer group always leads. Compare samples also differ from DFS: they are parallel to each other, like parallel spaces.
- You can use this tool to see the option variation:
cd tools
python varie_compare_4_options.py

*Options vary with the mean line ratio for samples the LLM previously answered incorrectly.*
Therefore, we find that difficult questions like those in the GPQA benchmark, scientific problems, and the TurtleBench dataset all share something in common: they contain hidden facts not explicitly stated in the question itself, requiring judgment and discovery. For such questions, the correct answer may not be the closest to the question or may be far from the original question. The correct answer requires some association and lies around the mean closeness score.
- You can use the tools below to find the best `mean_line_ratio` for individual tests and for general use:
python best_mean_ratio_randomoption.py
python best_ratio_alldatabase.py

*Best mean line ratio for a single test; best mean line ratio for general use.*
Obviously, the mean closeness score differs across models and also varies between normal and scientific questions. The parameters we use in this project for the GPQA benchmark are in `config_adjust.json`. Here, `mean_line_ratio` shifts the mean closeness score up or down to fit different models' answer groups. This parameter should be adjusted to the behavior of different models and the difficulty of the problems; it is only statistically stable. The `temperature` is for answer generation with DeepSeek-V3.2-thinking, and `temperature_compare` is for sample generation with Ernie 4.5. This combination of model APIs balances ability on different tasks against token pricing.
Creating Multiple Dimensions for the Small Transformer:
- For better performance in the final answer judgment, we add a small transformer. To create dimensions for the model to learn from, we transform the sample closeness-score map into:
Count, Mean, Median, Std, Variance, Min, Max, Range, Skewness, Kurtosis, GMM_pi, GMM_mu, GMM_sigma
We also apply commonly used techniques such as zero-padding and diagonal matrices. Contact us for details.
We achieve about 1–2% improvement for this part in our final benchmark score.
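As a hypothetical sketch, flattening a list of closeness scores into the statistical features listed above might look like the following (the GMM parameters are omitted here; they would come from fitting a Gaussian mixture to the scores, and the function name is an assumption):

```python
import statistics

def closeness_features(scores: list[float]) -> dict:
    """Flatten a sample closeness-score list into a fixed-size feature
    vector of the kind a small transformer could learn from."""
    n = len(scores)
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)  # population standard deviation
    centered = [s - mean for s in scores]
    # Standardized third and fourth moments (skewness, excess kurtosis).
    skew = sum(c ** 3 for c in centered) / (n * std ** 3) if std else 0.0
    kurt = sum(c ** 4 for c in centered) / (n * std ** 4) - 3.0 if std else 0.0
    return {
        "count": n,
        "mean": mean,
        "median": statistics.median(scores),
        "std": std,
        "variance": std ** 2,
        "min": min(scores),
        "max": max(scores),
        "range": max(scores) - min(scores),
        "skewness": skew,
        "kurtosis": kurt,
    }
```

Each score map, whatever its length, thus becomes a fixed-width row, which is what makes zero-padding and batching straightforward downstream.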
Benchmark Test:
- Since our architecture adopts parallel computing, we recommend following the process below to lower overall resource consumption and cost.
Benchmark process for dataset
- We apply a ±(1–2) accuracy correction margin in our benchmark because we identified ambiguous cases, such as Problem No. 73: "A textile dye containing an extensively conjugated π-electron system emits light with energy of 2.3393 eV. What color of light is absorbed by the organic compound?" The dataset applies the complementary-color principle, which gives the answer "red" for this energy value. However, the question states that the compound *emits* light rather than simply reflecting it. This introduces ambiguity: if the material absorbs energy and then emits it, energy loss must be considered, so the absorbed photon energy should be higher than the emitted 2.3393 eV. We therefore introduce a correction margin, since the question can mislead even human researchers, and a machine imitating human reasoning behavior is similarly sensitive to such ambiguities.
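The ambiguity can be made concrete with the Planck relation λ(nm) ≈ 1239.84 / E(eV): 2.3393 eV corresponds to roughly 530 nm, i.e. green light. Under the complementary-color principle a green-appearing compound absorbs red, matching the dataset's answer; but if 2.3393 eV is the *emitted* (Stokes-shifted) energy, the absorbed photon must be more energetic, i.e. shorter-wavelength than 530 nm:

```python
# Planck relation with hc ≈ 1239.84 eV·nm
def ev_to_nm(energy_ev: float) -> float:
    return 1239.84 / energy_ev

emitted_nm = ev_to_nm(2.3393)   # ≈ 530 nm: green light is emitted
# If emission follows a Stokes shift, the absorbed photon satisfies
# E_abs > 2.3393 eV, i.e. an absorbed wavelength shorter than ~530 nm.
```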
5. Base Models
| Model | Task |
|---|---|
| deepseek-v3.2-thinking | Answer generation |
| ernie-4.5 | Sample generation |
| glm-4.7 | Final answer analysis |
6. References
Liu, Z., & Meng, L. (2018). Application of multivariate statistical analysis to ... Proceedings of the 17th National Conference on Mathematical ...
Wang, X., Wei, J., Schuurmans, D., Le, Q., & Zhou, D. (2022). Self-consistency improves chain-of-thought reasoning in language models. arXiv preprint arXiv:2203.11171.
Lewis, P., Perez, E., Piktus, A., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems (NeurIPS 2020).
Rein, D., et al. (2023). GPQA: A graduate-level Google-proof question answering benchmark. arXiv preprint arXiv:2311.12022.
Artificial Analysis. (2024). GPQA Diamond evaluation benchmark.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Brown, T. B., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems (NeurIPS 2020).
7. License
This repository and the model weights are licensed under the MIT License.
8. Final Words
- Our next task will be the ARC Prize. We would be glad if you tried Compare Anti_9 on other benchmarks. If you do, please leave a score here:
| Leaderboard Name | Score |
|---|---|
Model tree for Hotblaz/Compare_Anti_9
- Base model: deepseek-ai/DeepSeek-V3.2-Exp-Base
Evaluation results
- Accuracy on GPQA Diamond test set (Community Eval): 92.6%