# Garnet-OCR-3B-0422
The Garnet-OCR-3B-0422 model is a fine-tuned and optimized evolution of Megalodon-OCR-Sync-0713, built on top of the Qwen2.5-VL-3B-Instruct architecture. This version is specifically designed for high-precision mathematical formula extraction, structured markdown generation, and accurate table reconstruction, making it highly effective for technical, scientific, and structured documents. Trained on an enhanced mixture of document-centric datasets, including large-scale OCR-caption pairs and structured document corpora, the model improves layout fidelity, symbolic reasoning, and content structuring across diverse document types such as research papers, scanned PDFs, handwritten equations, and analytical reports.
GGUF: https://huggingface.co/prithivMLmods/Garnet-OCR-3B-0422-GGUF
## Key Enhancements
- Advanced Mathematical Formula Extraction: Improved recognition and LaTeX conversion of complex equations, symbols, and multi-line mathematical expressions.
- High-Precision Markdown Generation: Generates clean, consistent, and semantically structured markdown outputs.
- Robust Table Reconstruction: Enhanced ability to detect, align, and reconstruct tables with accurate row-column relationships and hierarchy.
- Optimized Document Visualization and OCR Pipeline: Better understanding of layout, typography, and embedded visual elements.
- Context-Aware Multimodal Linking: Stronger alignment between text, images, and spatial document structure.
- Enhanced Document Retrieval: More accurate extraction from complex, multi-layout, and multi-page documents.
- Analytical Recognition: Improved reasoning over charts, graphs, tables, and mathematical content.
- Efficient 3B Architecture Optimization: Maintains strong performance while being lighter and more efficient than larger variants.
- Extended Multimodal Support: Handles long document sequences and extended visual inputs, including videos.
- Stability and Consistency Improvements: Refined outputs with reduced formatting errors and hallucinations.
## Quick Start with Transformers
Quick start with a Colab notebook: Garnet-OCR-3B-0422-4bit-Demo
```bash
pip install transformers==5.6.0
# or install the latest development build
pip install git+https://github.com/huggingface/transformers.git
```
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

# Load the model and its processor.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Garnet-OCR-3B-0422", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("prithivMLmods/Garnet-OCR-3B-0422")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {
                "type": "text",
                "text": "Extract equations, reconstruct tables, and convert the document into structured markdown.",
            },
        ],
    }
]

# Build the chat prompt and collect the image/video inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate, then decode only the newly produced tokens.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
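The decoded output is plain markdown. A small post-processing step can, for example, pull the LaTeX formulas back out of it. This is a minimal sketch assuming the model wraps display equations in `$$ … $$` delimiters (a common markdown convention, not a guarantee of this model's output format):

```python
import re

def extract_display_equations(markdown_text: str) -> list[str]:
    """Return the LaTeX bodies of all $$ ... $$ display blocks."""
    # DOTALL so multi-line equations are captured; non-greedy so each
    # match stops at the first closing delimiter.
    return [m.strip() for m in re.findall(r"\$\$(.*?)\$\$", markdown_text, re.DOTALL)]

sample = "Intro text\n$$E = mc^2$$\nand\n$$\\int_0^1 x\\,dx = \\frac{1}{2}$$"
print(extract_display_equations(sample))  # ['E = mc^2', '\\int_0^1 x\\,dx = \\frac{1}{2}']
```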
## Training Details
| Parameter | Details |
|---|---|
| Model Name | Garnet-OCR-3B-0422 |
| Model State | Experimental |
| Model Architecture | Qwen2_5_VLForConditionalGeneration |
| Base Model | prithivMLmods/Megalodon-OCR-Sync-0713 |
| Dataset Size | ~12K samples (Modular combination of datasets) |
| Dataset Composition | ~75% from allenai/olmOCR-mix-0225, remaining from mixed OCR and document datasets |
| Training Time | 4,320 (±250) seconds (~1.20 hours) |
| Hardware | NVIDIA A100-SXM4-80GB |
| Training Framework | TRL (Transformer Reinforcement Learning) |
| Transformers Version | 5.6.0 |
| Compute Service | Hugging Face Jobs / Spaces |
| Warmup Steps | 750 |
| Precision | bfloat16 |
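The table above can be approximated with a TRL SFT configuration along these lines. Only `warmup_steps` and `bf16` come from the table; the output path, batch size, and learning rate are placeholders, not the actual (undisclosed) training settings:

```python
from trl import SFTConfig

# Hypothetical reconstruction of the run described in the table above.
training_args = SFTConfig(
    output_dir="garnet-ocr-3b-0422",   # placeholder path
    warmup_steps=750,                  # from the table
    bf16=True,                         # precision: bfloat16
    per_device_train_batch_size=1,     # placeholder, not disclosed
    learning_rate=2e-5,                # placeholder, not disclosed
)
```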
## Intended Use
- Mathematical formula extraction and LaTeX conversion.
- High-quality markdown generation from documents.
- Accurate table detection and reconstruction.
- OCR and structured data extraction from technical documents.
- Scientific, academic, and research document processing.
- Document-based QA and summarization.
- Multimodal document understanding and reasoning.
- Automation workflows requiring structured document parsing.
## Limitations
- Performance may degrade on heavily distorted or low-resolution inputs.
- Still computationally demanding for real-time edge deployment.
- Reduced accuracy on low-resource languages, including some Indian languages.
- Complex layouts may occasionally produce minor alignment errors.
- Output quality depends on visual token configuration and context length.
- Rare cases of hallucinated or structurally inconsistent outputs may occur.
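On the visual-token point: Qwen2.5-VL-based models map each image to a number of visual tokens determined by its resolution, so the processor's `min_pixels`/`max_pixels` settings directly affect both cost and OCR fidelity. The helper below is a rough estimate only, assuming the architecture's roughly 28x28-pixels-per-token granularity (14-pixel patches merged 2x2) and an image area clamped into the allowed band; the real processor additionally rounds dimensions to multiples of 28:

```python
import math

def estimate_visual_tokens(width: int, height: int,
                           min_pixels: int = 256 * 28 * 28,
                           max_pixels: int = 1280 * 28 * 28) -> int:
    """Rough visual-token estimate for a Qwen2.5-VL-style processor."""
    area = width * height
    # Rescale the area into [min_pixels, max_pixels], keeping aspect ratio.
    if area > max_pixels:
        scale = math.sqrt(max_pixels / area)
    elif area < min_pixels:
        scale = math.sqrt(min_pixels / area)
    else:
        scale = 1.0
    scaled_area = (width * scale) * (height * scale)
    # One token per 28x28 region of the rescaled image.
    return max(1, round(scaled_area / (28 * 28)))

# A 1000 x 1400 pixel scan exceeds the default band and is downscaled:
print(estimate_visual_tokens(1000, 1400))  # 1280
```

Lowering `max_pixels` trades OCR detail for shorter sequences, which matters most for multi-page documents.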