# Garnet-OCR-3B-0422
The Garnet-OCR-3B-0422 model is a fine-tuned and optimized evolution of Megalodon-OCR-Sync-0713, built on top of the Qwen2.5-VL-3B-Instruct architecture. This version is specifically designed for high-precision mathematical formula extraction, structured markdown generation, and accurate table reconstruction, making it highly effective for technical, scientific, and structured documents. Trained on an enhanced mixture of document-centric datasets, including large-scale OCR-caption pairs and structured document corpora, the model improves layout fidelity, symbolic reasoning, and content structuring across diverse document types such as research papers, scanned PDFs, handwritten equations, and analytical reports.
GGUF: https://huggingface.co/prithivMLmods/Garnet-OCR-3B-0422-GGUF
## Key Enhancements
- Advanced Mathematical Formula Extraction: Improved recognition and LaTeX conversion of complex equations, symbols, and multi-line mathematical expressions.
- High-Precision Markdown Generation: Generates clean, consistent, and semantically structured markdown outputs.
- Robust Table Reconstruction: Enhanced ability to detect, align, and reconstruct tables with accurate row-column relationships and hierarchy.
- Optimized Document Visualization and OCR Pipeline: Better understanding of layout, typography, and embedded visual elements.
- Context-Aware Multimodal Linking: Stronger alignment between text, images, and spatial document structure.
- Enhanced Document Retrieval: More accurate extraction from complex, multi-layout, and multi-page documents.
- Analytical Recognition: Improved reasoning over charts, graphs, tables, and mathematical content.
- Efficient 3B Architecture Optimization: Maintains strong performance while being lighter and more efficient than larger variants.
- Extended Multimodal Support: Handles long document sequences and extended visual inputs, including videos.
- Stability and Consistency Improvements: Refined outputs with reduced formatting errors and hallucinations.
## Quick Start with Transformers
Quick start with a Colab notebook: Garnet-OCR-3B-0422-4bit-Demo
```bash
pip install transformers==5.6.0
# or install the latest development build
pip install git+https://github.com/huggingface/transformers.git
```
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

# Load the model and its processor.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Garnet-OCR-3B-0422", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("prithivMLmods/Garnet-OCR-3B-0422")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {
                "type": "text",
                "text": "Extract equations, reconstruct tables, and convert the document into structured markdown.",
            },
        ],
    }
]

# Build the chat prompt and collect the image/video inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate, then decode only the newly produced tokens.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
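The decoded output is plain markdown. A small post-processing step can, for example, pull the LaTeX formulas back out of it. This is a minimal sketch assuming the model wraps display equations in `$$ … $$` delimiters (a common markdown convention, not a guarantee of this model's output format):

```python
import re

def extract_display_equations(markdown_text: str) -> list[str]:
    """Return the LaTeX bodies of all $$ ... $$ display blocks."""
    # DOTALL so multi-line equations are captured; non-greedy so each
    # match stops at the first closing delimiter.
    return [m.strip() for m in re.findall(r"\$\$(.*?)\$\$", markdown_text, re.DOTALL)]

sample = "Intro text\n$$E = mc^2$$\nand\n$$\\int_0^1 x\\,dx = \\frac{1}{2}$$"
print(extract_display_equations(sample))  # ['E = mc^2', '\\int_0^1 x\\,dx = \\frac{1}{2}']
```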
## Training Details
| Parameter | Details |
|---|---|
| Model Name | Garnet-OCR-3B-0422 |
| Model State | Experimental |
| Model Architecture | Qwen2_5_VLForConditionalGeneration |
| Base Model | prithivMLmods/Megalodon-OCR-Sync-0713 |
| Dataset Size | ~12K samples (Modular combination of datasets) |
| Dataset Composition | ~75% from allenai/olmOCR-mix-0225, remaining from mixed OCR and document datasets |
| Training Time | 4,320 (±250) seconds (~1.20 hours) |
| Hardware | NVIDIA A100-SXM4-80GB |
| Training Framework | TRL (Transformer Reinforcement Learning) |
| Transformers Version | 5.6.0 |
| Compute Service | Hugging Face Jobs / Spaces |
| Warmup Steps | 750 |
| Precision | bfloat16 |
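The table above can be approximated with a TRL SFT configuration along these lines. Only `warmup_steps` and `bf16` come from the table; the output path, batch size, and learning rate are placeholders, not the actual (undisclosed) training settings:

```python
from trl import SFTConfig

# Hypothetical reconstruction of the run described in the table above.
training_args = SFTConfig(
    output_dir="garnet-ocr-3b-0422",   # placeholder path
    warmup_steps=750,                  # from the table
    bf16=True,                         # precision: bfloat16
    per_device_train_batch_size=1,     # placeholder, not disclosed
    learning_rate=2e-5,                # placeholder, not disclosed
)
```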
## Intended Use
- Mathematical formula extraction and LaTeX conversion.
- High-quality markdown generation from documents.
- Accurate table detection and reconstruction.
- OCR and structured data extraction from technical documents.
- Scientific, academic, and research document processing.
- Document-based QA and summarization.
- Multimodal document understanding and reasoning.
- Automation workflows requiring structured document parsing.
## Limitations
- Performance may degrade on heavily distorted or low-resolution inputs.
- Still computationally demanding for real-time edge deployment.
- Reduced accuracy on low-resource languages, including some Indian languages.
- Complex layouts may occasionally produce minor alignment errors.
- Output quality depends on visual token configuration and context length.
- Rare cases of hallucinated or structurally inconsistent outputs may occur.
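On the visual-token point: Qwen2.5-VL-based models map each image to a number of visual tokens determined by its resolution, so the processor's `min_pixels`/`max_pixels` settings directly affect both cost and OCR fidelity. The helper below is a rough estimate only, assuming the architecture's roughly 28x28-pixels-per-token granularity (14-pixel patches merged 2x2) and an image area clamped into the allowed band; the real processor additionally rounds dimensions to multiples of 28:

```python
import math

def estimate_visual_tokens(width: int, height: int,
                           min_pixels: int = 256 * 28 * 28,
                           max_pixels: int = 1280 * 28 * 28) -> int:
    """Rough visual-token estimate for a Qwen2.5-VL-style processor."""
    area = width * height
    # Rescale the area into [min_pixels, max_pixels], keeping aspect ratio.
    if area > max_pixels:
        scale = math.sqrt(max_pixels / area)
    elif area < min_pixels:
        scale = math.sqrt(min_pixels / area)
    else:
        scale = 1.0
    scaled_area = (width * scale) * (height * scale)
    # One token per 28x28 region of the rescaled image.
    return max(1, round(scaled_area / (28 * 28)))

# A 1000 x 1400 pixel scan exceeds the default band and is downscaled:
print(estimate_visual_tokens(1000, 1400))  # 1280
```

Lowering `max_pixels` trades OCR detail for shorter sequences, which matters most for multi-page documents.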