Basic Usage

Learn the core features and command-line options of VLM OCR Pipeline.

Command-Line Interface

The main entry point is main.py, which provides a comprehensive CLI.

Basic Syntax

python main.py [OPTIONS]

Input Options

Single File

# Process a PDF
python main.py --input document.pdf --backend gemini

# Process an image
python main.py --input photo.jpg --backend gemini

Batch Processing

# Process all PDFs in a directory
python main.py --input documents/ --backend gemini

Page Limiting

Control which pages to process:

Process only the first N pages:

python main.py --input doc.pdf --backend gemini --max-pages 5

Process a specific range (inclusive):

python main.py --input doc.pdf --backend gemini --page-range 10-20

Process selected pages:

python main.py --input doc.pdf --backend gemini --pages 1,5,10,15
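
The three selectors above can be read as different ways of producing a list of page numbers. The sketch below illustrates that reading. `resolve_pages` is a hypothetical helper written for this example and is not part of main.py; it assumes pages are numbered from 1 and that out-of-range page numbers are dropped, neither of which is confirmed by the CLI itself.

```python
def resolve_pages(total_pages, max_pages=None, page_range=None, pages=None):
    """Return the sorted 1-based page numbers selected by one option.

    Hypothetical helper: mirrors --max-pages, --page-range and --pages as
    described above; the real CLI may resolve them differently.
    """
    if pages is not None:                  # e.g. "1,5,10,15"
        selected = {int(p) for p in pages.split(",")}
    elif page_range is not None:           # e.g. "10-20", both ends included
        start, end = (int(p) for p in page_range.split("-"))
        selected = set(range(start, end + 1))
    elif max_pages is not None:            # first N pages
        selected = set(range(1, max_pages + 1))
    else:                                  # no limit: every page
        selected = set(range(1, total_pages + 1))
    # Assumed behavior: silently drop pages outside the document.
    return sorted(p for p in selected if 1 <= p <= total_pages)


print(resolve_pages(30, page_range="10-20"))
print(resolve_pages(12, pages="1,5,10,15"))
```

Note that the range is inclusive on both ends, so `10-20` selects eleven pages.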

Backend Selection

Cloud VLM APIs

Google's Gemini API (free tier available):

export GEMINI_API_KEY="your_key"
python main.py --input doc.pdf --backend gemini --model gemini-2.5-flash

Tier Options: free, tier1, tier2, tier3

python main.py --input doc.pdf --backend gemini --gemini-tier free

OpenAI's vision-capable models (for example, GPT-4o):

export OPENAI_API_KEY="your_key"
python main.py --input doc.pdf --backend openai --model gpt-4o

Access multiple VLMs through OpenRouter (note that the command uses the openai backend):

export OPENROUTER_API_KEY="your_key"
python main.py --input doc.pdf --backend openai --model google/gemini-2.5-flash

Local Recognition

PaddleOCR-VL (no API required):

python main.py --input doc.pdf --recognizer paddleocr-vl

Detector Selection

Choose the layout detection model:

| Detector | Source | Speed | Quality | Use Case |
|---|---|---|---|---|
| doclayout-yolo | This project | ⚡⚡⚡ | ⭐⭐⭐ | Default, fast |
| mineru-doclayout-yolo | MinerU | ⚡⚡ | ⭐⭐⭐ | MinerU pipeline |
| paddleocr-doclayout-v2 | PaddleOCR | ⚡⚡ | ⭐⭐⭐⭐ | High quality |
| mineru-vlm | MinerU | | ⭐⭐⭐⭐ | VLM-based |
| olmocr-vlm | olmOCR | | ⭐⭐⭐⭐ | VLM-based |

Examples:

# Default detector (doclayout-yolo)
python main.py --input doc.pdf --backend gemini

# High-quality detector
python main.py --input doc.pdf --detector paddleocr-doclayout-v2 --backend gemini

# VLM-based detection
python main.py --input doc.pdf --detector mineru-vlm --backend gemini

Sorter Selection

Choose the reading order algorithm:

| Sorter | Algorithm | Speed | Multi-Column | Use Case |
|---|---|---|---|---|
| pymupdf | Font analysis | ⚡⚡⚡ | | Multi-column docs |
| mineru-xycut | XY-Cut | ⚡⚡⚡ | | Simple layouts |
| mineru-layoutreader | LayoutLMv3 | ⚡⚡ | | Complex layouts |
| mineru-vlm | VLM reasoning | | | Very complex |
| olmocr-vlm | VLM reasoning | | | Research papers |
| paddleocr-doclayout-v2 | Pointer network | ⚡⚡ | | With PP-DocLayoutV2 |

Examples:

# Multi-column documents
python main.py --input doc.pdf --sorter pymupdf --backend gemini

# Complex academic papers
python main.py --input paper.pdf --sorter mineru-layoutreader --backend gemini

# VLM-based ordering
python main.py --input doc.pdf --sorter olmocr-vlm --backend gemini

Detector + Sorter Combinations

Not all detector and sorter combinations are valid; the pipeline validates compatibility. Some example combinations:

# Fast general-purpose
python main.py --input doc.pdf \
    --detector doclayout-yolo \
    --sorter mineru-xycut \
    --backend gemini

# High quality (paddleocr-doclayout-v2 selects its own sorter)
python main.py --input doc.pdf \
    --detector paddleocr-doclayout-v2 \
    --backend gemini

# Maximum quality (slower)
python main.py --input doc.pdf \
    --detector mineru-vlm \
    --sorter mineru-vlm \
    --backend gemini

# Full PaddleOCR pipeline (local)
python main.py --input doc.pdf \
    --detector paddleocr-doclayout-v2 \
    --recognizer paddleocr-vl

Invalid Combinations

  • paddleocr-doclayout-v2 detector auto-selects its sorter (cannot override)
  • VLM detectors (mineru-vlm, olmocr-vlm) require matching VLM sorters
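
The two rules above can be expressed as a small compatibility check. This is a sketch under stated assumptions, not the pipeline's actual validator: `check_combination` is a hypothetical function, "matching" is interpreted as "the sorter has the same name as the detector", and the sorter that paddleocr-doclayout-v2 selects is assumed to be the one of the same name from the sorter table.

```python
# Names taken from the detector and sorter tables in this guide.
VLM_DETECTORS = {"mineru-vlm", "olmocr-vlm"}
AUTO_SORTER_DETECTORS = {"paddleocr-doclayout-v2"}


def check_combination(detector, sorter=None):
    """Return (ok, message) for a detector/sorter pair.

    Hypothetical sketch of the two documented rules; sorter=None means
    the --sorter option was not given on the command line.
    """
    if detector in AUTO_SORTER_DETECTORS:
        # Rule 1: this detector picks its own sorter and cannot be overridden.
        if sorter not in (None, detector):
            return False, f"{detector} selects its own sorter; drop --sorter {sorter}"
        return True, "ok"
    if detector in VLM_DETECTORS:
        # Rule 2 (assumed reading): a VLM detector needs the same-named sorter.
        if sorter not in (None, detector):
            return False, f"{detector} requires the matching sorter {detector}"
        return True, "ok"
    # Other pairs are not restricted by the two rules listed above.
    return True, "ok"


for pair in [("doclayout-yolo", "mineru-xycut"),
             ("paddleocr-doclayout-v2", "pymupdf"),
             ("mineru-vlm", "mineru-vlm")]:
    print(pair, check_combination(*pair))
```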

Output Options

Output Directory

# Custom output directory
python main.py --input doc.pdf --backend gemini --output results/

Default: output/{model}/{filename}/

Example: output/gemini-2.5-flash/document/page_1.json
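
To make the default layout concrete, the following sketch builds the path of a page file from the model name and the input filename. `default_page_path` is a hypothetical helper, and the assumption that `{filename}` means the input name without its extension is inferred from the example above rather than from the pipeline's code. Model names that contain a slash are not handled here.

```python
from pathlib import Path


def default_page_path(model, input_file, page_num, root="output"):
    """Build output/{model}/{filename}/page_N.json (assumed layout)."""
    stem = Path(input_file).stem          # "document.pdf" -> "document"
    return Path(root) / model / stem / f"page_{page_num}.json"


print(default_page_path("gemini-2.5-flash", "document.pdf", 1).as_posix())
```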

Cache Control

# Disable caching
python main.py --input doc.pdf --backend gemini --no-cache

# Custom cache directory
python main.py --input doc.pdf --backend gemini --cache-dir .my-cache/

Rate Limiting (Gemini)

Check Status

python main.py --rate-limit-status --backend gemini --gemini-tier free

Output:

=== Gemini API Rate Limit Status ===
Tier: free
Model: gemini-2.5-flash

Current Limits:
  RPM (Requests Per Minute): 2 / 15 (13.3%)
  TPM (Tokens Per Minute): 45,234 / 1,500,000 (3.0%)
  RPD (Requests Per Day): 156 / 1,500 (10.4%)
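
The percentage on each line is the used amount divided by the limit. The sketch below reproduces the three lines from the sample output; `usage_line` is a hypothetical formatter for illustration, and the numbers are the ones shown above, not live values.

```python
def usage_line(label, used, limit):
    """Format one status line as 'label: used / limit (percent%)'."""
    percent = 100.0 * used / limit
    return f"{label}: {used:,} / {limit:,} ({percent:.1f}%)"


print(usage_line("RPM (Requests Per Minute)", 2, 15))
print(usage_line("TPM (Tokens Per Minute)", 45_234, 1_500_000))
print(usage_line("RPD (Requests Per Day)", 156, 1_500))
```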

Tier Configuration

# Free tier (default)
python main.py --input doc.pdf --backend gemini --gemini-tier free

# Paid tiers (higher limits)
python main.py --input doc.pdf --backend gemini --gemini-tier tier1
python main.py --input doc.pdf --backend gemini --gemini-tier tier2

Advanced Options

DPI Settings

For PDF rendering quality:

# Higher DPI = better quality, larger images
python main.py --input doc.pdf --backend gemini --dpi 300  # Default: 200
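
As a rough guide to how much larger the images get, PDF page sizes are measured in points (72 per inch), so the rendered pixel size scales linearly with DPI and the pixel count scales with its square. The sketch below assumes the renderer applies this standard scaling; `rendered_size` is a hypothetical helper, and US Letter (612 x 792 points) is used only as an example page size.

```python
def rendered_size(width_pt, height_pt, dpi=200):
    """Pixel dimensions of a page rendered at the given DPI (72 pt = 1 inch)."""
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


for dpi in (200, 300):
    width, height = rendered_size(612, 792, dpi)   # US Letter
    print(f"{dpi} DPI: {width} x {height} px ({width * height / 1e6:.1f} MP)")
```

Going from 200 to 300 DPI multiplies the pixel count by 2.25, which is worth keeping in mind for upload size and token usage.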

Temporary Files

# Custom temp directory
python main.py --input doc.pdf --backend gemini --temp-dir /tmp/ocr/

Logging

# Verbose output
python main.py --input doc.pdf --backend gemini -v

# Very verbose (debug level)
python main.py --input doc.pdf --backend gemini -vv
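
Repeated `-v` flags are commonly implemented with argparse's `count` action. The sketch below shows that general pattern only; it is not taken from main.py, and the mapping of no flag to WARNING and a single `-v` to INFO is an assumption (the guide only states that `-vv` is debug level).

```python
import argparse
import logging


def level_from_verbosity(count):
    """Map the number of -v flags to a logging level (assumed mapping)."""
    if count >= 2:
        return logging.DEBUG      # -vv: debug level, as documented
    if count == 1:
        return logging.INFO       # -v: assumed to mean INFO
    return logging.WARNING        # no flag: assumed default


parser = argparse.ArgumentParser()
parser.add_argument("-v", dest="verbosity", action="count", default=0)

args = parser.parse_args(["-vv"])  # same as passing -vv on the command line
logging.basicConfig(level=level_from_verbosity(args.verbosity))
print(logging.getLevelName(level_from_verbosity(args.verbosity)))
```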

Common Workflows

Academic Papers

# High-quality processing for research papers
# (paddleocr-doclayout-v2 selects its own sorter)
python main.py --input paper.pdf \
    --detector paddleocr-doclayout-v2 \
    --backend gemini \
    --dpi 300

Multi-Column Magazines

# Multi-column layout detection
python main.py --input magazine.pdf \
    --detector doclayout-yolo \
    --sorter pymupdf \
    --backend gemini

Large Batch Processing

# Process many PDFs with local model
python main.py --input documents/ \
    --detector paddleocr-doclayout-v2 \
    --recognizer paddleocr-vl \
    --max-pages 10  # Limit for testing

Cost-Optimized Processing

# Use Gemini free tier + caching
python main.py --input doc.pdf \
    --backend gemini \
    --gemini-tier free \
    --cache-dir .cache/

Output Structure

After processing, you'll find:

output/
└── {model}/              # e.g., gemini-2.5-flash/
    └── {document}/       # e.g., research_paper/
        ├── page_1.json   # Detailed page data
        ├── page_1.md     # Markdown output
        ├── page_2.json
        ├── page_2.md
        └── {document}_summary.json  # Processing metadata
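
Because the Markdown output is written one file per page, a common follow-up is to join the pages into a single document. The sketch below does this for a directory laid out as above; `merge_markdown` is a hypothetical helper, not something the pipeline provides. Pages are sorted by their numeric suffix so that page_10.md comes after page_2.md.

```python
import re
import tempfile
from pathlib import Path


def merge_markdown(doc_dir):
    """Concatenate page_N.md files in numeric page order."""
    numbered = []
    for path in Path(doc_dir).glob("page_*.md"):
        match = re.fullmatch(r"page_(\d+)", path.stem)
        if match:                                  # skip files without a number
            numbered.append((int(match.group(1)), path))
    texts = [p.read_text(encoding="utf-8").strip() for _, p in sorted(numbered)]
    return "\n\n".join(texts)


# Demo on a throwaway directory that mimics the layout above.
with tempfile.TemporaryDirectory() as tmp:
    for n in (1, 2, 10):
        (Path(tmp) / f"page_{n}.md").write_text(f"Page {n} text\n", encoding="utf-8")
    print(merge_markdown(tmp))
```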

JSON Structure

{
  "page_num": 1,
  "image_size": [1650, 2200],
  "text": "# Title\n\nBody text...",
  "corrected_text": "# Title\n\nBody text...",
  "correction_ratio": 0.02,
  "processing_stopped": false,
  "blocks": [
    {
      "type": "title",
      "bbox": [100, 50, 500, 120],
      "detection_confidence": 0.95,
      "order": 0,
      "column_index": null,
      "text": "Title",
      "corrected_text": "Title",
      "source": "doclayout-yolo"
    }
  ],
  "auxiliary_info": {
    "text_spans": [...]  # Font metadata for markdown
  }
}
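
A short sketch of consuming this structure: load a page, sort its blocks by `order`, and pick the text of each block. The sample below is a trimmed version of the structure above. The second block (type "text") is invented for illustration so that the sort has something to do, and preferring `corrected_text` over `text` is an assumption about how the two fields are meant to be used.

```python
import json

# Trimmed sample; the "text" block is hypothetical and deliberately out of order.
PAGE_JSON = """
{
  "page_num": 1,
  "blocks": [
    {"type": "text", "order": 1, "text": "Body text...", "corrected_text": "Body text..."},
    {"type": "title", "order": 0, "text": "Title", "corrected_text": "Title"}
  ]
}
"""


def blocks_in_reading_order(page):
    """Return the page's blocks sorted by their 'order' field."""
    return sorted(page["blocks"], key=lambda block: block["order"])


def best_text(block):
    """Prefer corrected_text when present, else fall back to text (assumed)."""
    return block.get("corrected_text") or block.get("text", "")


page = json.loads(PAGE_JSON)
for block in blocks_in_reading_order(page):
    print(f'{block["order"]} [{block["type"]}] {best_text(block)}')
```

In practice the page would be read from a file such as page_1.json with `json.load` instead of an inline string.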

Next Steps