# Garnet-OCR-7B-0422
The Garnet-OCR-7B-0422 model is a refined and highly optimized evolution of Gliese-OCR-7B-Post2.0-final, built upon the Qwen2.5-VL architecture. This release focuses on high-precision mathematical formula extraction, structured markdown generation, and accurate table reconstruction, making it especially effective for technical and scientific documents. Fine-tuned with enhanced datasets targeting mathematical notation, document structure, and layout fidelity, the model delivers superior performance across complex documents including research papers, scanned PDFs, handwritten equations, structured forms, and analytical reports.
GGUF: https://huggingface.co/prithivMLmods/Garnet-OCR-7B-Post-0422-GGUF
## Key Enhancements
- Advanced Mathematical Formula Extraction: Significantly improved recognition and LaTeX conversion of complex equations, symbols, and multi-line expressions.
- High-Precision Markdown Generation: Produces clean, structured, and semantically accurate markdown outputs for documents.
- Robust Table Reconstruction: Enhanced detection and reconstruction of tables with correct alignment, hierarchy, and formatting.
- Optimized Document Visualization and OCR Pipeline: Improved understanding of text, layout, and embedded visuals for structured document parsing.
- Context-Aware Multimodal Linking: Stronger alignment between textual, visual, and spatial elements within documents.
- High-Fidelity Content Extraction: Accurate extraction of structured, semi-structured, and unstructured data with normalization.
- Analytical Recognition: Improved reasoning over charts, graphs, tables, and mathematical content.
- Enhanced Layout Awareness: Better spatial and semantic comprehension for complex document structures.
- Extended Multimodal Sequence Support: Handles long document sequences and extended visual inputs.
- Stable Final Optimization: Consolidated improvements for reliable and consistent outputs.
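Because the model emits LaTeX inside markdown, a lightweight sanity check on its output can catch truncated or malformed math spans before downstream rendering. The following helper is a hypothetical sketch, not part of the Garnet-OCR release; it assumes the common `$...$` / `$$...$$` delimiter convention in generated markdown.

```python
import re

def check_math_delimiters(markdown: str) -> bool:
    """Return True if $$...$$ and $...$ math spans are balanced.

    Hypothetical sanity check for model output; assumes standard
    dollar-sign LaTeX delimiters, which the model card does not guarantee.
    """
    # Display-math fences ($$) must come in pairs.
    if len(re.findall(r"\$\$", markdown)) % 2 != 0:
        return False
    # Strip complete display-math blocks, then pair up the remaining
    # single $ signs as inline math.
    inline_only = re.sub(r"\$\$.*?\$\$", "", markdown, flags=re.DOTALL)
    return inline_only.count("$") % 2 == 0
```

A failed check is a cheap signal to re-run generation with a larger `max_new_tokens` budget.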
## Quick Start with Transformers
Quick start with a Colab notebook: Garnet-OCR-7B-0422-4bit-Demo
```bash
pip install transformers==5.6.0
# or
pip install git+https://github.com/huggingface/transformers.git
```
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Garnet-OCR-7B-0422", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("prithivMLmods/Garnet-OCR-7B-0422")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "text", "text": "Extract equations, reconstruct tables, and convert the document into structured markdown."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
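Once the generated markdown is decoded, the display-math blocks can be pulled out for separate validation or rendering. This is an illustrative post-processing step only; the `$$...$$` delimiter convention is an assumption about typical model output, not a documented guarantee.

```python
import re

def extract_equations(markdown: str) -> list[str]:
    """Collect display-math blocks ($$...$$) from generated markdown.

    Illustrative post-processing sketch; assumes the model wraps
    display equations in $$ delimiters.
    """
    return [m.strip() for m in re.findall(r"\$\$(.*?)\$\$", markdown, flags=re.DOTALL)]
```

Each returned string is a raw LaTeX body that can be fed to a renderer such as KaTeX or MathJax.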
## Training Details
| Parameter | Details |
|---|---|
| Model Name | Garnet-OCR-7B-0422 |
| Model State | Experimental |
| Model Architecture | Qwen2_5_VLForConditionalGeneration |
| Base Model | prithivMLmods/Gliese-OCR-7B-Post2.0-final |
| Dataset Size | ~30K samples (Modular combination of datasets) |
| Dataset Composition | ~20% from allenai/olmOCR-mix-0225, remaining from mixed OCR and document datasets |
| Training Time | 9,240 (±250) seconds (~2.57 hours) |
| Hardware | NVIDIA A100-SXM4-80GB |
| Training Framework | TRL (Transformer Reinforcement Learning) |
| Transformers Version | 5.6.0 |
| Compute Service | Hugging Face Jobs / Spaces |
| Warmup Steps | 750 |
| Precision | bfloat16 |
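For orientation, the table's hyperparameters map onto a TRL training configuration roughly as sketched below. Only `warmup_steps` and `bf16` come from the table above; the output directory and every omitted argument are illustrative assumptions, not the actual training recipe.

```python
# Hypothetical sketch of the TRL fine-tuning configuration.
# Only warmup_steps and bf16 are taken from the training table;
# output_dir and all defaults are assumptions.
from trl import SFTConfig

config = SFTConfig(
    output_dir="garnet-ocr-7b-0422",  # assumed name
    bf16=True,                        # Precision: bfloat16
    warmup_steps=750,                 # Warmup Steps: 750
)
```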
## Intended Use
- Mathematical formula extraction and LaTeX generation.
- High-quality markdown conversion from documents.
- Accurate table detection and reconstruction.
- Document OCR and structured data extraction.
- Scientific, academic, and technical document processing.
- Multimodal document understanding and reasoning.
- Summarization and QA over structured documents.
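For the table-reconstruction use case, the model's pipe-delimited markdown tables can be converted back into row data for downstream analysis. This is a minimal sketch under the assumption that the model emits standard `| a | b |` rows with a `|---|---|` separator; it is not part of the Garnet-OCR release.

```python
def parse_markdown_table(table: str) -> list[list[str]]:
    """Parse a simple pipe-delimited markdown table into rows of cells.

    Assumes standard GitHub-style tables; ignores lines that do not
    start with a pipe and skips the header separator row.
    """
    rows = []
    for line in table.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        # Skip the header separator row (---, :---:, etc.).
        if all(c and set(c) <= set(":- ") for c in cells):
            continue
        rows.append(cells)
    return rows
```

The resulting rows can be handed to `csv.writer` or a DataFrame constructor as needed.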
## Limitations
- Reduced accuracy on heavily degraded or low-quality scans.
- High computational requirements for large-scale inference.
- Limited optimization for edge devices.
- Occasional layout misalignment in highly complex documents.
- Performance sensitivity to visual token configuration and long context inputs.
Model tree for prithivMLmods/Garnet-OCR-7B-0422: base model Qwen/Qwen2.5-VL-7B-Instruct.