Recognition API¶
Text recognition and correction using Vision Language Models.
Overview¶
The recognition module provides:
- TextRecognizer: Main class for text extraction from blocks
- Multiple backends: OpenAI, Gemini, PaddleOCR-VL, DeepSeek-OCR
- Caching: Content-based caching to avoid reprocessing
- Rate limiting: Automatic throttling for API backends
Quick Start¶
from pipeline.recognition import TextRecognizer
import numpy as np

# Create recognizer with Gemini backend
recognizer = TextRecognizer(
    backend="gemini",
    model="gemini-2.5-flash",
    use_cache=True,
)

# Process blocks (`blocks` is a sequence of Block objects prepared beforehand)
image = np.zeros((1000, 800, 3), dtype=np.uint8)
processed_blocks = recognizer.process_blocks(image, blocks)

# Correct text (`raw_text` is the uncorrected string to clean up)
corrected = recognizer.correct_text(raw_text)
TextRecognizer¶
Constructor¶
class TextRecognizer:
    def __init__(
        self,
        backend: str = "gemini",
        model: str = "gemini-2.5-flash",
        use_cache: bool = True,
        cache_dir: str | Path = ".cache",
        gemini_tier: str = "free",
        recognizer_backend: str | None = None,
        prompts_dir: str | Path | None = None,
        **kwargs,
    ):
        """Initialize text recognizer.

        Args:
            backend: API backend ("gemini", "openai", "paddleocr-vl", "deepseek-ocr")
            model: Model name
            use_cache: Enable content-based caching
            cache_dir: Cache directory path
            gemini_tier: Gemini API tier for rate limiting
            recognizer_backend: Inference backend for local models
            prompts_dir: Custom prompts directory
        """
Methods¶
process_blocks¶
def process_blocks(
    self,
    image: np.ndarray,
    blocks: Sequence[Block],
) -> list[Block]:
    """Extract text from blocks.

    Args:
        image: Full page image as numpy array
        blocks: List of blocks to process

    Returns:
        List of blocks with text field populated
    """
correct_text¶
def correct_text(self, text: str) -> str | dict[str, Any]:
    """Correct extracted text using VLM.

    Args:
        text: Raw extracted text

    Returns:
        Corrected text string, or dict with:
        - corrected_text: str
        - correction_ratio: float (0.0 = no change)
    """
Available Backends¶
Cloud VLM APIs¶
| Backend | Models | Rate Limits | Cost |
|---|---|---|---|
| gemini | gemini-2.5-flash, gemini-2.0-flash | 15 RPM (free) | Free tier available |
| openai | gpt-4o, gpt-4-turbo | Varies by tier | Pay per token |
| openrouter | Multiple VLMs | Varies by model | Pay per token |
Local Models¶
| Backend | Model | Parameters | Languages / Notes |
|---|---|---|---|
| paddleocr-vl | PaddleOCR-VL-0.9B | 0.9B | 109 languages |
| deepseek-ocr | DeepSeek-OCR | - | Contextual compression |
Backend Configuration¶
Gemini Backend¶
recognizer = TextRecognizer(
    backend="gemini",
    model="gemini-2.5-flash",
    gemini_tier="free",  # free, tier1, tier2, tier3
)
Environment Variable: GEMINI_API_KEY
Rate Limits (Free Tier):
- 15 requests per minute
- 1,500,000 tokens per minute
- 1,500 requests per day
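To illustrate what throttling against a requests-per-minute limit can look like, here is a minimal sliding-window sketch. It is not the module's actual rate limiter: `SlidingWindowLimiter`, its parameters, and the injectable `clock` are hypothetical names chosen for this example, and it only tracks request counts, not tokens or daily quotas.

```python
import time
from collections import deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most max_requests calls per window_seconds (hypothetical helper)."""

    def __init__(
        self,
        max_requests: int = 15,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        self._stamps: deque[float] = deque()

    def wait_time(self) -> float:
        """Return seconds to wait before the next request (0.0 if allowed now)."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) < self.max_requests:
            return 0.0
        # The oldest request in the window decides when a slot frees up.
        return self.window_seconds - (now - self._stamps[0])

    def record(self) -> None:
        """Record that a request was just sent."""
        self._stamps.append(self._clock())
```

A caller would typically do `time.sleep(limiter.wait_time())` followed by `limiter.record()` before each API request. Passing a custom `clock` keeps the class easy to test without real waiting.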
OpenAI Backend¶
Environment Variable: OPENAI_API_KEY
PaddleOCR-VL Backend¶
recognizer = TextRecognizer(
    backend="paddleocr-vl",
    recognizer_backend="pytorch",  # pytorch, vllm, sglang
)
Requirements: GPU recommended, PaddleX installation
DeepSeek-OCR Backend¶
Recognizer Protocol¶
All recognizers implement the Recognizer protocol:
from typing import Protocol, Any, Sequence
import numpy as np
from pipeline.types import Block

class Recognizer(Protocol):
    """Protocol for text recognizers."""

    def process_blocks(
        self,
        image: np.ndarray,
        blocks: Sequence[Block],
    ) -> list[Block]:
        """Extract text from blocks."""
        ...

    def correct_text(self, text: str) -> str | dict[str, Any]:
        """Correct extracted text."""
        ...
Caching¶
The recognizer uses content-based caching to avoid reprocessing:
# Cache key = hash(block_image + block_type + prompt)
recognizer = TextRecognizer(
    backend="gemini",
    use_cache=True,
    cache_dir=".cache",
)
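The cache-key comment above can be made concrete with a short sketch. The choice of SHA-256, the length-prefixed serialization, and the function name `cache_key` are assumptions made for illustration; the module's real key derivation may differ.

```python
import hashlib


def cache_key(block_image: bytes, block_type: str, prompt: str) -> str:
    """Derive a content-based key from image bytes, block type, and prompt.

    Hypothetical helper: for a numpy crop, the bytes could come from
    crop.tobytes(), ideally combined with its shape and dtype.
    """
    h = hashlib.sha256()
    for part in (block_image, block_type.encode("utf-8"), prompt.encode("utf-8")):
        # Length-prefix each part so ("ab", "c") and ("a", "bc") hash differently.
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.hexdigest()
```

Because the key depends only on content, an unchanged block with the same type and prompt maps to the same entry across runs, while editing the prompt produces a new key.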
Implementing Custom Recognizers¶
from pipeline.types import Block
from typing import Sequence, Any
import numpy as np

# load_model and calculate_ratio are placeholders for your own helpers.

class MyRecognizer:
    """Custom recognizer implementation."""

    def __init__(self, model_path: str):
        self.model = load_model(model_path)

    def process_blocks(
        self,
        image: np.ndarray,
        blocks: Sequence[Block],
    ) -> list[Block]:
        """Extract text from blocks."""
        result_blocks = []
        for block in blocks:
            # Crop the block region from the page and recognize it
            cropped = block.bbox.crop(image)
            text = self.model.recognize(cropped)
            block.text = text  # note: updates the input block in place
            result_blocks.append(block)
        return result_blocks

    def correct_text(self, text: str) -> dict[str, Any]:
        """Correct extracted text."""
        corrected = self.model.correct(text)
        return {
            "corrected_text": corrected,
            "correction_ratio": calculate_ratio(text, corrected),
        }
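The example leaves `calculate_ratio` undefined. One possible way to write it, assuming the ratio should be 0.0 when nothing changed and approach 1.0 as the text is rewritten, is to invert the similarity score from the standard library's `difflib`. This is a sketch under that assumption, not the metric the pipeline itself uses.

```python
from difflib import SequenceMatcher


def calculate_ratio(original: str, corrected: str) -> float:
    """Return a change ratio in [0.0, 1.0]; 0.0 means the text is unchanged.

    Assumed definition: 1 minus difflib's similarity ratio.
    """
    if original == corrected:
        return 0.0
    return 1.0 - SequenceMatcher(None, original, corrected).ratio()
```

A character-level edit distance would also work here; `SequenceMatcher` is used only because it needs no extra dependencies.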
CLI Usage¶
# Default recognizer (gemini)
python main.py --input doc.pdf
# Specific recognizer
python main.py --input doc.pdf --recognizer gpt-4o
# Local model with backend
python main.py --input doc.pdf --recognizer paddleocr-vl --recognizer-backend vllm
# Check rate limit status
python main.py --rate-limit-status --recognizer gemini-2.5-flash --gemini-tier free
See Also¶
- Recognizers Architecture - Detailed backend comparison
- Basic Usage - Usage examples
- Types API - Block class reference