Quick Start

Get up and running with VLM OCR Pipeline in 5 minutes!

Prerequisites

Python 3.11+ installed
Gemini API key (free tier available)

Step 1: Install

# Clone the repository
git clone https://github.com/NoUnique/vlm-ocr-pipeline.git
cd vlm-ocr-pipeline

# Set up environment
uv venv --python 3.11 .venv
source .venv/bin/activate

# Install dependencies
uv pip install -r requirements.txt

# Fix YOLO compatibility
python setup.py

Step 2: Configure API

# Set Gemini API key
export GEMINI_API_KEY="your_api_key_here"

Get a Free Gemini API Key

Visit Google AI Studio to get a free API key.
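
If you drive the pipeline from your own script, it can help to fail early when the key is missing. The following is a minimal sketch and not part of the pipeline itself: `require_api_key` is a hypothetical helper name, and only the `GEMINI_API_KEY` variable comes from the step above.

```python
import os


def require_api_key(name: str = "GEMINI_API_KEY", env=None) -> str:
    """Return the API key stored in `name`, or raise with a setup hint.

    Hypothetical helper for illustration; the pipeline may validate the
    key differently.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f'{name} is not set; run: export {name}="your_api_key_here"'
        )
    return key
```

Calling `require_api_key()` before launching `main.py` turns a missing key into an immediate, readable error instead of a failure partway through processing.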

Step 3: Run Your First Pipeline

# Process a single PDF
python main.py --input document.pdf --backend gemini

That's it! The pipeline will:

📄 Load your PDF and render each page as an image
🔍 Detect layout blocks (text, tables, figures, etc.)
📊 Analyze reading order
📝 Extract text using Gemini Vision
🔧 Correct and improve text quality
💾 Save results to output/gemini-2.5-flash/document/

Understanding the Output

After processing, you'll find:

output/
└── gemini-2.5-flash/
    └── document/
        ├── page_1.json          # Detailed page data
        ├── page_1.md            # Markdown output
        └── document_summary.json  # Processing metadata
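
The layout above suggests that results land in `output/<model>/<input name without extension>/`. Assuming that naming rule holds (it is inferred from this one example, not confirmed by the pipeline), the per-page files can be located as sketched below. `expected_page_files` is a hypothetical helper; the summary file is left out because its naming rule is not clear from the example.

```python
from pathlib import Path


def expected_page_files(
    input_file: str, model: str, num_pages: int, root: str = "output"
) -> list[Path]:
    """List the per-page JSON and Markdown paths for one processed document.

    Assumption: the document directory is named after the input file's stem.
    """
    doc_dir = Path(root) / model / Path(input_file).stem
    files: list[Path] = []
    for page in range(1, num_pages + 1):  # page numbering starts at 1
        files.append(doc_dir / f"page_{page}.json")
        files.append(doc_dir / f"page_{page}.md")
    return files


for path in expected_page_files("document.pdf", "gemini-2.5-flash", 1):
    print(path.as_posix())
```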

Example Output

page_1.md:

# Introduction

This document describes...

## Table of Contents

1. Getting Started
2. Advanced Features
3. API Reference

page_1.json:

{
  "page_num": 1,
  "text": "# Introduction\n\nThis document describes...",
  "corrected_text": "# Introduction\n\nThis document describes...",
  "correction_ratio": 0.05,
  "blocks": [
    {
      "type": "title",
      "bbox": [100, 50, 500, 120],
      "text": "Introduction",
      "order": 0
    }
  ]
}
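
A page file like this can be consumed with the standard `json` module. The sketch below assumes that `order` is the block's position in reading order (suggested by the reading-order step, but not stated explicitly). The second block in the sample is hypothetical and was added only to make the sorting visible. `blocks_in_reading_order` is an illustrative name.

```python
import json

# Sample shaped like page_1.json; the "text" block is a hypothetical addition.
SAMPLE = """
{
  "page_num": 1,
  "blocks": [
    {"type": "text", "bbox": [100, 140, 500, 300],
     "text": "This document describes...", "order": 1},
    {"type": "title", "bbox": [100, 50, 500, 120],
     "text": "Introduction", "order": 0}
  ]
}
"""


def blocks_in_reading_order(page: dict) -> list[dict]:
    """Return the page's blocks sorted by their `order` field."""
    return sorted(page.get("blocks", []), key=lambda block: block["order"])


page = json.loads(SAMPLE)
for block in blocks_in_reading_order(page):
    print(block["order"], block["type"], block["text"])
```

To read a real result, replace `json.loads(SAMPLE)` with `json.load(open(path, encoding="utf-8"))` on one of the generated page files.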

Common Use Cases

Single Image

python main.py --input photo.jpg --backend gemini

Batch Processing

# Process all PDFs in a directory
python main.py --input documents/ --backend gemini

Limit Pages (for testing)

# Process only first 5 pages
python main.py --input document.pdf --backend gemini --max-pages 5

# Process specific page range
python main.py --input document.pdf --backend gemini --page-range 10-20

# Process specific pages
python main.py --input document.pdf --backend gemini --pages 1,5,10

Use OpenAI Instead

# Set OpenAI API key
export OPENAI_API_KEY="your_api_key_here"

# Run with OpenAI backend
python main.py --input document.pdf --backend openai --model gpt-4o

Local Processing (No API)

# Use PaddleOCR-VL (local model, no API calls)
python main.py --input document.pdf \
    --detector paddleocr-doclayout-v2 \
    --recognizer paddleocr-vl

Rate Limiting

The pipeline automatically handles rate limits for the Gemini API:

# Check current rate limit status
python main.py --rate-limit-status --backend gemini --gemini-tier free

Free Tier Limits:

- 15 requests per minute
- 1,500,000 tokens per minute
- 1,500 requests per day

The pipeline will automatically wait when limits are reached.
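
The waiting behaviour can be pictured as a sliding-window limiter. The sketch below is a generic illustration and not the pipeline's implementation: it enforces only the requests-per-minute limit (token and daily limits are omitted), and `SlidingWindowLimiter` is a hypothetical class name. The clock and sleep functions are injectable so the logic can be tested without real delays.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` calls per `window_seconds`."""

    def __init__(self, max_requests=15, window_seconds=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        self._sleep = sleep
        self._sent = deque()  # timestamps of requests inside the window

    def acquire(self) -> float:
        """Block until a request slot is free; return the seconds waited."""
        waited = 0.0
        while True:
            now = self._clock()
            # Forget requests that have left the window.
            while self._sent and now - self._sent[0] >= self.window_seconds:
                self._sent.popleft()
            if len(self._sent) < self.max_requests:
                self._sent.append(now)
                return waited
            # Window is full: sleep until the oldest request expires.
            delay = self.window_seconds - (now - self._sent[0])
            self._sleep(delay)
            waited += delay
```

Calling `limiter.acquire()` before each API request keeps a client under the per-minute limit; with the defaults, the 16th call within a minute blocks until the first one is 60 seconds old.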

Troubleshooting

"GEMINI_API_KEY not set"

export GEMINI_API_KEY="your_api_key_here"

"Rate limit exceeded"

The pipeline will automatically wait when limits are reached. Alternatively, switch to a different backend:

# Use OpenAI instead
python main.py --input doc.pdf --backend openai

# Or use local model (no API)
python main.py --input doc.pdf --recognizer paddleocr-vl

"CUDA out of memory"

If using PaddleOCR-VL locally:

# Reduce the batch size, or force CPU mode by hiding all GPUs from the process
export CUDA_VISIBLE_DEVICES=""
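
If you launch the pipeline from Python rather than from a shell, the same variable can be set in code. This is a small sketch with a hypothetical helper name, `force_cpu`; whether the local model then runs on CPU depends on the installed framework build.

```python
import os


def force_cpu(env=None) -> None:
    """Hide all CUDA devices from this process (and its child processes)."""
    env = os.environ if env is None else env
    env["CUDA_VISIBLE_DEVICES"] = ""


# Call this before importing any GPU-backed library: the variable is read
# when CUDA is first initialised, so setting it afterwards has no effect.
force_cpu()
```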

Next Steps

Now that you've run your first pipeline:

Basic Usage Guide - Learn about all available options
Architecture Overview - Understand how the pipeline works
Advanced Examples - Complex use cases and customizations

Tips for Best Results

Optimize Processing

Use --max-pages to test on a few pages first
Check rate limit status regularly for Gemini
Use caching to avoid reprocessing identical content
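
The caching tip boils down to keying results by content rather than by file name. The sketch below shows the general idea under stated assumptions: `content_key` and `ResultCache` are hypothetical names, the cache lives in memory only, and the pipeline's own cache may be organised differently.

```python
import hashlib


def content_key(image_bytes: bytes, model: str) -> str:
    """Derive a cache key from the model name and the exact image bytes."""
    digest = hashlib.sha256()
    digest.update(model.encode("utf-8"))
    digest.update(b"\0")  # separator so model and content cannot run together
    digest.update(image_bytes)
    return digest.hexdigest()


class ResultCache:
    """In-memory cache that skips recognition for content already seen."""

    def __init__(self):
        self._store = {}

    def get_or_compute(self, image_bytes: bytes, model: str, compute):
        key = content_key(image_bytes, model)
        if key not in self._store:
            # Only call the (possibly paid, rate-limited) backend on a miss.
            self._store[key] = compute(image_bytes)
        return self._store[key]
```

Because the model name is part of the key, switching from one backend to another does not return results produced by the previous one.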

API Costs

The Gemini free tier has daily request limits
OpenAI charges per token
Consider using PaddleOCR-VL for large batch processing

Performance

DocLayout-YOLO is fastest for detection
PaddleOCR-VL provides good quality without API costs
Gemini 2.5 Flash is fast and cost-effective