Installation¶

This guide will help you set up VLM OCR Pipeline on your system.

Prerequisites¶

Python 3.11+: Recommended for best compatibility
uv: Fast Python package manager (optional but recommended)
Git: For cloning the repository and managing submodules

Quick Installation¶

1. Clone the Repository¶

git clone https://github.com/NoUnique/vlm-ocr-pipeline.git
cd vlm-ocr-pipeline

2. Set Up Python Environment¶

Using uv (Recommended)Using pip

# Install uv if you haven't
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment
uv venv --python 3.11 .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
uv pip install -r requirements.txt

# Create virtual environment
python3.11 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

3. Fix DocLayout-YOLO Compatibility¶

python setup.py

What does setup.py do?

The setup.py script fixes compatibility issues between DocLayout-YOLO and the ultralytics package by modifying the YOLO model files.

API Configuration¶

Choose the backend you want to use and configure the corresponding API keys.

Gemini API (Recommended for Free Tier)¶

Visit Google AI Studio
Sign in with your Google account
Click "Create API Key"
Copy the generated key
Set environment variable:

export GEMINI_API_KEY="your_api_key_here"

Free Tier Limits (as of 2024): - 15 requests per minute (RPM) - 1,500,000 tokens per minute (TPM) - 1,500 requests per day (RPD)

OpenAI API¶

Visit OpenAI Platform
Create an API key
Set environment variable:

export OPENAI_API_KEY="your_api_key_here"

OpenRouter API (Alternative)¶

OpenRouter provides access to multiple VLMs through a single API:

export OPENROUTER_API_KEY="your_api_key_here"

Optional Components¶

PaddleOCR-VL (Local Recognition)¶

For local text recognition without API calls:

# PaddleX is already included as a submodule
cd external/PaddleX
git checkout v3.3.1
pip install -e .

Requirements: - GPU with CUDA support (recommended) - ~4GB VRAM for PaddleOCR-VL-0.9B

External Frameworks (Git Submodules)¶

The project includes several external frameworks as submodules:

# Initialize all submodules
git submodule update --init --recursive

# Or initialize specific submodules
git submodule update --init external/MinerU      # MinerU detectors/sorters
git submodule update --init external/olmocr      # olmOCR VLM sorter
git submodule update --init external/PaddleOCR   # PP-DocLayoutV2 detector
git submodule update --init external/PaddleX     # PaddleOCR-VL recognizer

Verification¶

Verify your installation:

# Check Python version
python --version  # Should be 3.11+

# Check dependencies
python -c "import torch; print(f'PyTorch: {torch.__version__}')"
python -c "import cv2; print(f'OpenCV: {cv2.__version__}')"

# Run a simple test
python main.py --help

Troubleshooting¶

CUDA/GPU Issues¶

If you encounter CUDA-related errors:

# Check CUDA availability
python -c "import torch; print(torch.cuda.is_available())"

# Install CPU-only PyTorch (if no GPU)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu

Ultralytics/YOLO Errors¶

If you see errors about ultralytics or YOLO models:

# Re-run the setup script
python setup.py

# Or manually reinstall ultralytics
pip uninstall ultralytics
pip install ultralytics==8.2.0

Missing Dependencies¶

If you encounter import errors:

# Reinstall all dependencies
pip install -r requirements.txt --force-reinstall

Next Steps¶

Now that you have VLM OCR Pipeline installed:

Quick Start Guide - Run your first OCR pipeline
Basic Usage - Learn the core features
Architecture Overview - Understand how it works