1. Preface
- Anti_9 is an RL layer applied on top of LLMs. It uses the concept of "comparison" to give the model a kind of multidimensional awareness. General intelligence is fundamentally rooted in comparison: from deep machine learning to our daily decision-making, comparison is the most basic lever for the emergence of intelligence. Consider why you chose to wear a T-shirt this morning, or why you decided to buy it in the first place. Although many researchers have devoted enormous effort to analogical reasoning in AI, and techniques such as k-means clustering and graph classification have long been used in unsupervised learning, we believe Anti_9 represents a viable implementation of comparison for LLMs.
2. Benchmark
- We improved the base model DeepSeek-V3.2-thinking from a base score of 85.86% to a final score of 92.6% (a gain of 6.74 percentage points), which places it in the top 5 among LLMs worldwide on this benchmark.
- The test record is available on GitHub.
3. Quick Start
1. General Use
- See the GitHub repository for general-use instructions.
2. Professional Use
- See the GitHub repository for professional-use instructions.
4. How It Works
1. Parallel Generation of Samples
- Instead of generating the answer directly or relying on reasoning as one-dimensional thinking, we have the LLM generate several compare samples as guide tokens. This gives the LLM a degree of association away from the original question. To simulate how humans gain experience and form a unique character, we implement memory as token heads during compare-sample generation. Although more samples make the final analytic process more accurate, we limit this project to 40 compare samples and 30 pre-compare samples, considering FLOPs consumption.
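As an illustration of the parallel-generation step, here is a minimal sketch using a thread pool, with a stub standing in for the real sample-generation API call (the function names and the stub are hypothetical, not the project's actual code):

```python
import concurrent.futures

# Hypothetical stand-in for a call to the sample-generation model API
# (e.g. Ernie 4.5); here it just returns a labeled placeholder string.
def generate_compare_sample(question: str, sample_id: int) -> str:
    return f"compare_sample_{sample_id}: association for {question!r}"

def generate_samples_parallel(question: str, n_samples: int = 40) -> list[str]:
    """Fire off all compare-sample generations concurrently and
    collect the results in submission order."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(generate_compare_sample, question, i)
                   for i in range(n_samples)]
        return [f.result() for f in futures]

samples = generate_samples_parallel("Why wear a T-shirt today?", n_samples=5)
```

Because the samples are independent of one another, they can be requested concurrently; only the final analysis step needs all of them.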
2. Pre-memory
- We select the top 10 memories here as token guides, scored by the following quantities:
a (pick_a): Tracks how many times a memory item has been selected.
b (raised_b): Tracks how many times a memory item has been picked as the final answer.
r_eff(t): A piecewise time-dependent decay rate (learning rate) controlling the forgetting speed.
A, C: Bias terms.
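The exact scoring formula is not spelled out here, so the sketch below is purely illustrative: one plausible way to combine the pick count `a`, the success count `b`, a piecewise time-dependent decay `r_eff(t)`, and bias terms `A` and `C` into a ranking score for selecting the top memories. The breakpoint, constants, and functional form are all assumptions.

```python
import math

def r_eff(t: float) -> float:
    # Piecewise decay rate: older memories are forgotten faster
    # (illustrative breakpoint; the real schedule is not given here).
    return 0.01 if t < 100 else 0.05

def memory_score(a: int, b: int, t: float, A: float = 1.0, C: float = 0.5) -> float:
    # Hypothetical scoring: usefulness (final-answer picks b) relative to
    # exposure (selections a), damped by time-dependent forgetting,
    # with A and C as bias terms.
    usefulness = (b + C) / (a + 1.0)
    return A * usefulness * math.exp(-r_eff(t) * t)

def top_k_memories(memories: list[dict], k: int = 10) -> list[dict]:
    # memories: dicts with keys "a", "b", "t" (age in steps)
    return sorted(memories,
                  key=lambda m: memory_score(m["a"], m["b"], m["t"]),
                  reverse=True)[:k]
```

Any monotone combination with the same qualitative behavior (reward usefulness, decay with age) would serve the same top-10 selection role.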
- The memory used in this project is not overfitted for the benchmark; it is trained on these datasets for general use:
| Dataset Name | Purpose |
|---|---|
| TurtleBench-extended-en | Hidden long-details reasoning |
| hendrycks_competition_math_N_A | Learn scientific reasoning format |
- We limit the memory to 75 rows to manage FLOPs consumption. The full memory file is available as `memery_st_ori.db`.
3. Final Result
- We embed the user input and each sample, calculate the Euclidean distance between the two vectors as the closeness score, and compute the mean Euclidean distance of all samples. We then find the closest answer group to the mean Euclidean distance line.
Euclidean distance: d(u, v) = √(Σᵢ (uᵢ − vᵢ)²)

*Figures: schematic diagram; real-data diagram.*
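A minimal sketch of the mean-line selection described above, using toy 2-D vectors in place of real embeddings (the helper names are assumptions; `mean_line_ratio` is the parameter discussed later in this README):

```python
import math

def euclidean(u: list[float], v: list[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pick_by_mean_line(question_vec, sample_vecs, mean_line_ratio: float = 1.0) -> int:
    """Score each sample by its distance to the question embedding, then
    pick the sample whose distance is closest to the (ratio-shifted) mean
    distance line, rather than the outright nearest sample."""
    dists = [euclidean(question_vec, s) for s in sample_vecs]
    mean_line = mean_line_ratio * sum(dists) / len(dists)
    return min(range(len(dists)), key=lambda i: abs(dists[i] - mean_line))

# Toy embeddings (hypothetical): sample 0 is surface-close to the question,
# samples 1-2 sit near the mean distance, sample 3 is an outlier.
q = [0.0, 0.0]
samples = [[0.1, 0.0], [1.0, 0.0], [1.1, 0.2], [5.0, 0.0]]
```

Note that a plain nearest-neighbor rule would always return sample 0; selecting around the mean line is what lets the associated, non-surface answers win.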
Our models are trained with supervised learning to find the closest answer to the next token, so they behave more like group B in the graph. However, when we use compare samples to raise the answer dimension, we observe something interesting: different answers appear, and a group of answers clusters around the mean closeness-distance line.

Many times we feel that AI is not smart enough and does not give us the answer we want, because its answers focus only on the surface meaning of the request. This is not wrong, since our engineers train models to find the shortest distance to the correct token. But human-like intelligence is not that simple: we use association in our daily decisions, a combination of common sense, experience, and logic that is never explicitly spoken but exists in our minds.
We are not guessing the answers. Even if the LLM makes a mistake once or twice, with the mean Euclidean distance we always find the correct answer group in the parallel sample universe. As seen in the graph below, when the `mean_line_ratio` changes, the selected answer varies, but the correct answer group always leads. Compare samples also differ from DFS: they are parallel to each other, like parallel spaces.
- You can use this tool to see the option variation:
cd tools
python varie_compare_4_options.py

*Options vary with the mean line ratio for samples the LLM previously answered incorrectly.*
Therefore, we find that difficult questions like those in the GPQA benchmark, scientific problems, and the TurtleBench dataset all share something in common: they contain hidden facts not explicitly stated in the question itself, requiring judgment and discovery. For such questions, the correct answer may not be the closest to the question or may be far from the original question. The correct answer requires some association and lies around the mean closeness score.
- You can use the tools below to find the best `mean_line_ratio` for individual tests and for general use:
python best_mean_ratio_randomoption.py
python best_ratio_alldatabase.py

*Best mean line ratio for a single test; best mean line ratio for general use.*
Obviously, the mean closeness score differs across models and also varies between normal and scientific questions. The parameters we use in this project for the GPQA benchmark are in `config_adjust.json`. Here, `mean_line_ratio` shifts the mean closeness score up or down to fit different models' answer groups. This parameter should be adjusted to the behavior of different models and the difficulty of the problems; it is only statistically stable. The `temperature` is for answer generation with DeepSeek-V3.2-thinking, and `temperature_compare` is for sample generation with Ernie 4.5. This combination of model APIs balances ability on different tasks against token pricing.
Creating Multiple Dimensions for the Small Transformer:
- For better performance in the final answer judgment, we add a small transformer. To create dimensions for the model to learn from, we transform the sample closeness-score map into:
Count, Mean, Median, Std, Variance, Min, Max, Range, Skewness, Kurtosis, GMM_pi, GMM_mu, GMM_sigma
We also apply commonly used techniques such as zero-padding and diagonal matrices. Contact us for details.
We achieve about 1–2% improvement for this part in our final benchmark score.
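As a hypothetical sketch, flattening a list of closeness scores into the statistical features listed above might look like the following (the GMM parameters are omitted here; they would come from fitting a Gaussian mixture to the scores, and the function name is an assumption):

```python
import statistics

def closeness_features(scores: list[float]) -> dict:
    """Flatten a sample closeness-score list into a fixed-size feature
    vector of the kind a small transformer could learn from."""
    n = len(scores)
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)  # population standard deviation
    centered = [s - mean for s in scores]
    # Standardized third and fourth moments (skewness, excess kurtosis).
    skew = sum(c ** 3 for c in centered) / (n * std ** 3) if std else 0.0
    kurt = sum(c ** 4 for c in centered) / (n * std ** 4) - 3.0 if std else 0.0
    return {
        "count": n,
        "mean": mean,
        "median": statistics.median(scores),
        "std": std,
        "variance": std ** 2,
        "min": min(scores),
        "max": max(scores),
        "range": max(scores) - min(scores),
        "skewness": skew,
        "kurtosis": kurt,
    }
```

Each score map, whatever its length, thus becomes a fixed-width row, which is what makes zero-padding and batching straightforward downstream.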
Benchmark Test:
- Since our architecture adopts parallel computing, we recommend following the process below to lower overall resource consumption and cost.
Benchmark process for dataset
- We apply a ±(1–2) accuracy correction margin in our benchmark because we identified ambiguous cases, such as Problem No. 73: "A textile dye containing an extensively conjugated π-electron system emits light with energy of 2.3393 eV. What color of light is absorbed by the organic compound?" The dataset applies the complementary-color principle, which gives the answer "red" for this energy value. However, the question states that the compound *emits* light rather than simply reflecting it. This introduces ambiguity: if the material absorbs energy and then emits it, energy loss must be considered, so the absorbed photon energy should be higher than the emitted 2.3393 eV. We therefore introduce a correction margin, since the question can mislead even human researchers, and a machine imitating human reasoning behavior is similarly sensitive to such ambiguities.
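The ambiguity can be made concrete with the Planck relation λ(nm) ≈ 1239.84 / E(eV): 2.3393 eV corresponds to roughly 530 nm, i.e. green light. Under the complementary-color principle a green-appearing compound absorbs red, matching the dataset's answer; but if 2.3393 eV is the *emitted* (Stokes-shifted) energy, the absorbed photon must be more energetic, i.e. shorter-wavelength than 530 nm:

```python
# Planck relation with hc ≈ 1239.84 eV·nm
def ev_to_nm(energy_ev: float) -> float:
    return 1239.84 / energy_ev

emitted_nm = ev_to_nm(2.3393)   # ≈ 530 nm: green light is emitted
# If emission follows a Stokes shift, the absorbed photon satisfies
# E_abs > 2.3393 eV, i.e. an absorbed wavelength shorter than ~530 nm.
```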
5. Base Models
| Model | Task |
|---|---|
| deepseek-v3.2-thinking | Answer generation |
| ernie-4.5 | Sample generation |
| glm-4.7 | Final answer analysis |
6. References
Liu, Z., & Meng, L. (2018). Application of multivariate statistical analysis to ... Proceedings of the 17th National Conference on Mathematical ...
Wang, X., Wei, J., Schuurmans, D., Le, Q., & Zhou, D. (2022). Self-consistency improves chain-of-thought reasoning in language models. arXiv preprint arXiv:2203.11171.
Lewis, P., Perez, E., Piktus, A., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems (NeurIPS 2020).
Rein, D., et al. (2023). GPQA: A graduate-level Google-proof question answering benchmark. arXiv preprint arXiv:2311.12022.
Artificial Analysis. (2024). GPQA Diamond evaluation benchmark.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Brown, T. B., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems (NeurIPS 2020).
7. License
This repository and the model weights are licensed under the MIT License.
8. Final Words
- Our next task will be the ARC Prize. We would be glad if you tried Compare Anti_9 on other benchmarks. If you do, please leave a score here:
| Leaderboard Name | Score |
|---|---|
Model tree for Hotblaz/Compare_Anti_9
- Base model: deepseek-ai/DeepSeek-V3.2-Exp-Base
Evaluation results
- Accuracy on GPQA Diamond test set (Community Eval): 92.6%