
Garnet-OCR-7B-0422

The Garnet-OCR-7B-0422 model is a refined and highly optimized evolution of Gliese-OCR-7B-Post2.0-final, built upon the Qwen2.5-VL architecture. This release focuses on high-precision mathematical formula extraction, structured markdown generation, and accurate table reconstruction, making it especially effective for technical and scientific documents. Fine-tuned with enhanced datasets targeting mathematical notation, document structure, and layout fidelity, the model delivers superior performance across complex documents including research papers, scanned PDFs, handwritten equations, structured forms, and analytical reports.

GGUF: https://huggingface.co/prithivMLmods/Garnet-OCR-7B-Post-0422-GGUF
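
For local inference with the GGUF build, recent llama.cpp releases ship a multimodal CLI. The invocation below is a hypothetical sketch: the quant filename, the presence of a separate mmproj projector file, and the page.png input are all assumptions, so check the GGUF repository contents before running.

llama-mtmd-cli -m Garnet-OCR-7B-0422.Q4_K_M.gguf \
    --mmproj mmproj-Garnet-OCR-7B-0422.gguf \
    --image page.png \
    -p "Convert this document page into structured markdown."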

Key Enhancements

  • Advanced Mathematical Formula Extraction: Significantly improved recognition and LaTeX conversion of complex equations, symbols, and multi-line expressions (see the illustration after this list).
  • High-Precision Markdown Generation: Produces clean, structured, and semantically accurate markdown outputs for documents.
  • Robust Table Reconstruction: Enhanced detection and reconstruction of tables with correct alignment, hierarchy, and formatting.
  • Optimized Document Visualization and OCR Pipeline: Improved understanding of text, layout, and embedded visuals for structured document parsing.
  • Context-Aware Multimodal Linking: Stronger alignment between textual, visual, and spatial elements within documents.
  • High-Fidelity Content Extraction: Accurate extraction of structured, semi-structured, and unstructured data with normalization.
  • Analytical Recognition: Improved reasoning over charts, graphs, tables, and mathematical content.
  • Enhanced Layout Awareness: Better spatial and semantic comprehension for complex document structures.
  • Extended Long-Context Multimodal Support: Handles long document sequences and extended visual inputs.
  • Stable Final Optimization: Consolidated improvements for reliable and consistent outputs.
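
As a hypothetical illustration of the target output format (placeholder content, not actual model output), a scanned page containing a display equation and a small table might be converted to markdown along these lines:

## Extracted Section (sample)

Inline math such as $E = mc^2$ stays inline, while display equations are emitted as LaTeX blocks:

$$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$$

| Column A | Column B |
|----------|----------|
| value 1  | value 2  |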

Quick Start with Transformers

Quick start with a Colab notebook: Garnet-OCR-7B-0422-4bit-Demo

pip install transformers==5.6.0
# or
pip install git+https://github.com/huggingface/transformers.git
pip install qwen-vl-utils accelerate  # qwen-vl-utils supplies process_vision_info; accelerate enables device_map="auto"

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model with automatic dtype selection and device placement
# (device_map="auto" requires the accelerate package).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Garnet-OCR-7B-0422", torch_dtype="auto", device_map="auto"
)

# The processor bundles the tokenizer and image preprocessor, so a separate
# AutoTokenizer is not needed.
processor = AutoProcessor.from_pretrained("prithivMLmods/Garnet-OCR-7B-0422")

# A single-turn request pairing a document image with an OCR instruction.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "text", "text": "Extract equations, reconstruct tables, and convert the document into structured markdown."},
        ],
    }
]

# Build the chat prompt, preprocess the image, and move tensors to the model's device.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate, then strip the prompt tokens so only the new output is decoded.
generated_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(output_text)
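
The Colab demo linked above runs the model in 4-bit. A minimal sketch of equivalent 4-bit loading with bitsandbytes (the quantization settings are assumptions meant to approximate the demo, and require a CUDA GPU plus pip install bitsandbytes):

import torch
from transformers import BitsAndBytesConfig, Qwen2_5_VLForConditionalGeneration

# NF4 weights with bfloat16 compute keep memory low while preserving accuracy.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Garnet-OCR-7B-0422",
    quantization_config=bnb_config,
    device_map="auto",
)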

Training Details

  • Model Name: Garnet-OCR-7B-0422
  • Model State: Experimental
  • Model Architecture: Qwen2_5_VLForConditionalGeneration
  • Base Model: prithivMLmods/Gliese-OCR-7B-Post2.0-final
  • Dataset Size: ~30K samples (modular combination of datasets)
  • Dataset Composition: ~20% from allenai/olmOCR-mix-0225, the remainder from mixed OCR and document datasets
  • Training Time: 9,240 (±250) seconds (~2.57 hours)
  • Hardware: NVIDIA A100-SXM4-80GB
  • Training Framework: TRL (Transformer Reinforcement Learning)
  • Transformers Version: 5.6.0
  • Compute Service: Hugging Face Jobs / Spaces
  • Warmup Steps: 750
  • Precision: bfloat16

Intended Use

  • Mathematical formula extraction and LaTeX generation.
  • High-quality markdown conversion from documents.
  • Accurate table detection and reconstruction.
  • Document OCR and structured data extraction (see the page-by-page PDF sketch after this list).
  • Scientific, academic, and technical document processing.
  • Multimodal document understanding and reasoning.
  • Summarization and QA over structured documents.
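
For multi-page documents, a common pattern is to rasterize each PDF page and run the extraction prompt page by page. A minimal sketch assuming the pdf2image package (which needs the poppler system library), a hypothetical paper.pdf, and the model and processor objects from the quick start above:

from pdf2image import convert_from_path
from qwen_vl_utils import process_vision_info

# Rasterize every page at 200 DPI; higher DPI can help with small math symbols.
pages = convert_from_path("paper.pdf", dpi=200)

markdown_pages = []
for page in pages:
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": page},  # PIL images are accepted directly
            {"type": "text", "text": "Convert this page into structured markdown."},
        ],
    }]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                       padding=True, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=1024)
    trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out)]
    markdown_pages.append(processor.batch_decode(trimmed, skip_special_tokens=True)[0])

full_markdown = "\n\n".join(markdown_pages)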

Limitations

  • Reduced accuracy on heavily degraded or low-quality scans.
  • High computational requirements for large-scale inference.
  • Limited optimization for edge devices.
  • Occasional layout misalignment in highly complex documents.
  • Performance sensitivity to visual token configuration and long-context inputs (see the processor sketch after this list).
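
Because speed and accuracy are sensitive to the visual token budget, Qwen2.5-VL-style processors expose min_pixels and max_pixels to bound how many image tokens a page consumes. A minimal sketch, with the 256/1280-token bounds as assumed example values to tune for your documents:

from transformers import AutoProcessor

# Each visual token corresponds to a 28x28 pixel patch, so these bounds
# control how aggressively pages are resized before patching.
min_pixels = 256 * 28 * 28   # floor of 256 visual tokens per image
max_pixels = 1280 * 28 * 28  # ceiling of 1280 visual tokens per image

processor = AutoProcessor.from_pretrained(
    "prithivMLmods/Garnet-OCR-7B-0422",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)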