# Atto: Extreme Intelligence Density Research
Atto is an exploration into the fundamental limits of Intelligence Density — how much knowledge and generative capability can be packed into a neural network with a strictly limited parameter budget.
This project focuses on the "sub-kiloparameter" and "low-kiloparameter" regimes, training models to generate Shakespearean text with as few as 64 parameters.
## The Atto Series
| Model | Parameters | Context (chars) | Weight File Size (JSON) | Val Loss |
|---|---|---|---|---|
| atto-64 | 64 | 3 | 1.8 KB | 2.59 |
| atto-128 | 128 | 7 | 3.5 KB | 2.83 |
| atto-256 | 256 | 8 | 6.0 KB | 2.33 |
| atto-512 | 512 | 16 | 11.8 KB | 2.44 |
| atto-1024 | 1,024 | 8 | 22.3 KB | 2.11 |
| atto-2048 | 2,048 | 24 | 44.3 KB | 2.15 |
| atto-4096 | 4,096 | 56 | 86.4 KB | 2.40 |
| atto-8192 | 8,192 | 28 | 172.7 KB | 1.91 |
| atto-16384 | 16,384 | 60 | ~640 KB | 2.11 |
## Research Findings: Intelligence Density
- Architecture Matters: At the sub-1,000-parameter scale, standard Transformers are highly inefficient due to the overhead of attention and LayerNorm. Our custom Neural N-Gram (AttoLM) architecture ensures that every single parameter directly participates in character prediction (see the sketch after this list).
- The Embedding Threshold: We found that moving from 8-dimensional to 16-dimensional embeddings (at 8,192 parameters) creates a significant jump in coherence, allowing the model to represent complex character relationships.
- Context vs. Width: In extremely small models, there is a sharp trade-off between the context window (memory) and embedding dimensionality (representation). Our 8,192 and 16,384 models prioritize a balance that favors realistic word formation.
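To make the neural n-gram idea concrete, here is a minimal sketch of how such a model can be wired so that every weight feeds directly into next-character prediction. The configuration values and the per-position mixing scheme are illustrative assumptions, not the exact AttoLM implementation in train_atto.py.

```python
import numpy as np

# Hypothetical atto-scale configuration (not a row from the table above).
vocab, ctx, embd = 64, 8, 16

rng = np.random.default_rng(0)
emb = rng.normal(0.0, 0.1, (vocab, embd))  # character embedding table
mix = rng.normal(0.0, 0.1, (ctx, embd))    # learned per-position weights
out = rng.normal(0.0, 0.1, (embd, vocab))  # projection to next-char logits

# Total parameter count: every entry below is used in every prediction.
n_params = emb.size + mix.size + out.size  # 64*16 + 8*16 + 16*64 = 2,176

def forward(context_ids):
    """Next-character logits from the last `ctx` character ids (no attention, no LayerNorm)."""
    h = (emb[context_ids] * mix).sum(axis=0)  # position-weighted sum -> (embd,)
    return h @ out                            # (vocab,) logits

logits = forward(np.zeros(ctx, dtype=np.int64))
print(n_params, logits.shape)  # 2176 (64,)
```

Compared with a Transformer block of similar size, none of the budget is spent on attention projections or normalization, which is the point of the first finding above.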
## Next Steps
This is only a first step toward denser intelligence. By optimizing weight initialization, exploring custom activation functions, and pushing parameter tying even further, we believe "readable Shakespeare" is achievable with fewer than 1,000 parameters.
## Usage
### Training
To train the base series, run:

    python3 train_atto.py
### Sampling
To evaluate all trained models, run:

    python3 sample.py
The models are exported as dependency-free JSON files in the models/ directory, ready for client-side inference in a web browser.
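As an illustration of how the exported weights could be consumed, the following sketch loads one of the JSON files and samples characters greedily. The file name, the key names ("emb", "mix", "out", "chars"), and the weight shapes are assumptions that reuse the layout from the sketch above; the real files in models/ may use a different schema, and the same loop ports directly to JavaScript for in-browser use.

```python
import json
import numpy as np

# Assumed file name and JSON schema; adjust to match the actual export format.
with open("models/atto-8192.json") as f:
    w = json.load(f)

emb = np.array(w["emb"])    # (vocab, embd) embedding table
mix = np.array(w["mix"])    # (ctx, embd) per-position weights
out = np.array(w["out"])    # (embd, vocab) output projection
chars = w["chars"]          # list mapping index -> character
ctx = mix.shape[0]

def generate(prompt, n_chars=120):
    """Greedy character-by-character generation from a text prompt."""
    ids = [chars.index(c) for c in prompt]
    for _ in range(n_chars):
        window = np.array(([0] * ctx + ids)[-ctx:])  # left-pad to the context size
        h = (emb[window] * mix).sum(axis=0)
        ids.append(int(np.argmax(h @ out)))          # pick the most likely next char
    return "".join(chars[i] for i in ids)

print(generate("to be"))
```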
Sample generations:

    ============================================================
    atto-8192 | 8192 params | embd=16 ctx=28 vocab=64
    ============================================================
    prompt="the":
    Math Laer axfourith tipht's gord me hour hace (remaat ond,
    I'll wore ser ar now pre's for word to styous the mall, stpoul folthis yow apt and be a
    prompt="to be":
    CPon. How gue. O- whut feathent. Thou the in ap bast. gos A thing of be rith nosset?
    [Tiths that hintend kyele in younk hore;
    Gat sgees wis
    prompt="Ham":
    . HaCleata,
    Wlotsef yow preerant fore thipe matte of iche in you?
    And spour, the tang offe herees welr then[foritr her veut arve id for houn w