VLM OCR Pipeline
A modular document processing system that combines layout detection (DocLayout-YOLO, MinerU, olmOCR, PaddleOCR) with Vision Language Models (OpenAI, Gemini, PaddleOCR-VL) for intelligent text extraction and correction.
Based On
This project is adapted from Versatile-OCR-Program.
Key Features
🎯 Modular Architecture
- Flexible Detector/Sorter Combinations: Mix and match detectors (DocLayout-YOLO, PaddleOCR PP-DocLayoutV2, MinerU, olmOCR) with sorters (multi-column, LayoutReader, XY-Cut, VLM)
- 8-Stage Pipeline: Document loading → Detection → Ordering → Recognition → Block correction → Rendering → Page correction → Output
- Unified BBox System: Integer-based bounding boxes (internal: xyxy, JSON: xywh) with automatic conversion between 6+ formats
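
The xyxy/xywh distinction above can be illustrated with a minimal sketch. The `BBox` class and its method names here are hypothetical, chosen for illustration; they are not taken from the project's code, and only two of the supported formats are shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Hypothetical integer bounding box stored as corner coordinates (xyxy)."""

    x0: int
    y0: int
    x1: int
    y1: int

    @classmethod
    def from_xywh(cls, x: int, y: int, w: int, h: int) -> "BBox":
        # xywh stores the top-left corner plus width and height.
        return cls(x, y, x + w, y + h)

    def to_xywh(self) -> tuple[int, int, int, int]:
        # Width and height are the differences between opposite corners.
        return (self.x0, self.y0, self.x1 - self.x0, self.y1 - self.y0)


box = BBox(10, 20, 110, 70)           # internal form: xyxy
xywh = box.to_xywh()                  # serialized form: xywh
round_trip = BBox.from_xywh(*xywh)    # converting back recovers the same box
```

Keeping one canonical internal form and converting only at the boundaries means each extra format needs just one pair of conversion functions.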
🤖 Multiple Recognition Backends
- Cloud VLM APIs: OpenAI GPT-4 Vision, Gemini 2.5 Flash
- Local VLM: PaddleOCR-VL-0.9B (0.9B parameters, NaViT + ERNIE-4.5-0.3B)
- 109+ Languages: Extensive multilingual support
📄 Advanced Document Understanding
- Layout Detection: Automatically detects text, tables, figures, equations, lists
- Reading Order Analysis: Multi-column detection, LayoutLMv3, XY-Cut, VLM-based ordering
- Special Content Processing: Enhanced analysis of tables and figures with structured output
- AI-Powered Correction: Intelligent text correction using VLMs
⚡ Performance & Efficiency
- Intelligent Caching: Content-based hashing to avoid reprocessing
- Rate Limiting: Automatic throttling for Gemini API (free/tier1/tier2/tier3)
- Batch Processing: Process entire directories of PDFs
- Model-Specific Prompts: YAML-based prompt templates for optimal results
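
As a rough illustration of content-based caching, the sketch below keys cached results on a SHA-256 digest of the file bytes, so a renamed or copied file with identical content is not reprocessed. The function names, the cache layout, and the choice of SHA-256 are assumptions made for this example, not a description of the project's cache.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def content_key(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file contents (not the name) in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_result(path: Path, cache_dir: Path, process: Callable[[Path], str]) -> str:
    """Return a cached result for identical content, else compute and store it."""
    cache_file = cache_dir / f"{content_key(path)}.txt"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    result = process(path)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(result, encoding="utf-8")
    return result


calls = []


def fake_ocr(path: Path) -> str:
    # Stand-in for an expensive recognition call (hypothetical).
    calls.append(path.name)
    return "recognized text"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.pdf").write_bytes(b"same bytes")
    (root / "b.pdf").write_bytes(b"same bytes")
    first = cached_result(root / "a.pdf", root / "cache", fake_ocr)
    second = cached_result(root / "b.pdf", root / "cache", fake_ocr)
```

In this demo `fake_ocr` runs only for `a.pdf`; the second call is served from the cache because both files hash to the same key.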
Quick Example
# Basic usage with Gemini
python main.py --input document.pdf --backend gemini
# Advanced: Custom detector + sorter + recognizer
python main.py --input doc.pdf \
    --detector paddleocr-doclayout-v2 \
    --sorter mineru-layoutreader \
    --recognizer paddleocr-vl
# Batch processing with page limits
python main.py --input pdfs/ --backend openai --max-pages 5
Architecture Overview
graph LR
A[Input PDF/Image] --> B[Stage 1: Load Document]
B --> C[Stage 2: Layout Detection]
C --> D[Stage 3: Reading Order]
D --> E[Stage 4: Text Recognition]
E --> F[Stage 5: Block Correction]
F --> G[Stage 6: Markdown Rendering]
G --> H[Stage 7: Page Correction]
H --> I[Stage 8: Save Output]
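
One way to picture the eight stages is as an ordered list of functions that each take and return a shared context. The sketch below is only a schematic under that assumption: the stage names, the `make_stage` helper, and the dictionary context are hypothetical and do not reflect the project's actual classes.

```python
from typing import Callable

Stage = Callable[[dict], dict]

# Hypothetical short names for the eight stages, in diagram order.
STAGE_NAMES = [
    "load",
    "detect",
    "order",
    "recognize",
    "correct_blocks",
    "render",
    "correct_page",
    "save",
]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran."""

    def stage(ctx: dict) -> dict:
        # Return a new context so no stage mutates its input in place.
        return {**ctx, "trace": ctx.get("trace", []) + [name]}

    return stage


def run_pipeline(ctx: dict, stages: list[Stage]) -> dict:
    """Feed the output of each stage into the next one."""
    for stage in stages:
        ctx = stage(ctx)
    return ctx


result = run_pipeline({"input": "document.pdf"}, [make_stage(n) for n in STAGE_NAMES])
```

Because every stage shares one signature, swapping a detector or sorter amounts to replacing a single entry in the list.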
Each stage is modular and can be configured independently:
- Detection: `doclayout-yolo`, `paddleocr-doclayout-v2`, `mineru-vlm`, `olmocr-vlm`
- Ordering: `pymupdf`, `mineru-layoutreader`, `mineru-xycut`, `mineru-vlm`, `olmocr-vlm`
- Recognition: `openai`, `gemini`, `paddleocr-vl`
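
The option names listed above can be checked up front before any model is loaded. The following is a minimal sketch: the `PipelineConfig` class is hypothetical, and it validates names only. Whether every detector/sorter pairing is actually compatible is not something this example knows.

```python
from dataclasses import dataclass

# Option names copied from the lists above.
DETECTORS = {"doclayout-yolo", "paddleocr-doclayout-v2", "mineru-vlm", "olmocr-vlm"}
SORTERS = {"pymupdf", "mineru-layoutreader", "mineru-xycut", "mineru-vlm", "olmocr-vlm"}
RECOGNIZERS = {"openai", "gemini", "paddleocr-vl"}


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical container for one detector/sorter/recognizer choice."""

    detector: str
    sorter: str
    recognizer: str

    def __post_init__(self) -> None:
        checks = (
            ("detector", DETECTORS),
            ("sorter", SORTERS),
            ("recognizer", RECOGNIZERS),
        )
        for field_name, allowed in checks:
            value = getattr(self, field_name)
            if value not in allowed:
                raise ValueError(
                    f"unknown {field_name} {value!r}; expected one of {sorted(allowed)}"
                )


# Mirrors the "Advanced" command in the Quick Example.
config = PipelineConfig(
    detector="paddleocr-doclayout-v2",
    sorter="mineru-layoutreader",
    recognizer="paddleocr-vl",
)
```

Failing fast on a misspelled name avoids loading large model weights only to discover the typo later.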
What's Next?
- Installation: Install VLM OCR Pipeline in minutes
- User Guides: Learn about BBox formats, detectors, and advanced usage
- API Reference: Explore the complete API documentation
- Architecture: Deep dive into pipeline stages and design patterns
Community
- GitHub: NoUnique/vlm-ocr-pipeline
- Issues: Report bugs or request features
- Contributing: See our Contributing Guide