Architecture Overview¶

VLM OCR Pipeline is built on a modular, 8-stage architecture that separates concerns and allows flexible configuration of each processing step.

Core Design Principles¶

1. Modular Stages¶

Each stage has a single responsibility and can be tested independently:

# Each stage is a self-contained module
input_stage = InputStage(...)
detection_stage = DetectionStage(detector)
ordering_stage = OrderingStage(sorter)
recognition_stage = RecognitionStage(recognizer)

2. Factory Pattern¶

Detectors and sorters are created through factory functions:

# Centralized creation with validation
detector = create_detector("doclayout-yolo")
sorter = create_sorter("mineru-xycut")
validate_combination(detector_name, sorter_name)  # Ensures compatibility

3. Protocol-Based Interfaces¶

Type-safe plugin system using Python protocols:

class Detector(Protocol):
    """All detectors must implement this interface."""
    def detect(self, image: np.ndarray) -> list[Block]:
        ...

class Sorter(Protocol):
    """All sorters must implement this interface."""
    def sort(self, blocks: list[Block], image: np.ndarray) -> list[Block]:
        ...

4. Unified BBox System¶

All bounding boxes use the BBox dataclass for automatic format conversion:

@dataclass
class BBox:
    x0: int  # Left
    y0: int  # Top
    x1: int  # Right
    y1: int  # Bottom

    # Automatic conversion from 6+ formats
    @classmethod
    def from_yolo(cls, bbox, image_width, image_height) -> BBox:
        ...

    @classmethod
    def from_mineru(cls, bbox) -> BBox:
        ...

Internal: xyxy (corners) JSON Output: xywh (x, y, width, height)

8-Stage Pipeline¶

graph TD
    A[📄 Stage 1: Input] --> B[🔍 Stage 2: Detection]
    B --> C[📊 Stage 3: Ordering]
    C --> D[📝 Stage 4: Recognition]
    D --> E[✏️ Stage 5: Block Correction]
    E --> F[📋 Stage 6: Rendering]
    F --> G[🔧 Stage 7: Page Correction]
    G --> H[💾 Stage 8: Output]

    style A fill:#e1f5ff
    style B fill:#fff3e1
    style C fill:#e8f5e9
    style D fill:#f3e5f5
    style E fill:#fce4ec
    style F fill:#fff9e1
    style G fill:#e0f2f1
    style H fill:#f1f8e9

Stage 1: Input¶

Responsibility: Load documents and extract auxiliary information

Renders PDF pages to images (pdf2image)
Loads image files directly (OpenCV)
Extracts text spans with font metadata (PyMuPDF)

Output: (image: np.ndarray, auxiliary_info: dict)

Stage 2: Detection¶

Responsibility: Detect layout blocks

Runs selected detector (DocLayout-YOLO, PaddleOCR, MinerU, olmOCR)
Returns blocks with bounding boxes, types, and confidence scores
Detects: text, title, table, figure, equation, list

Output: list[Block] with bbox, type, confidence

Stage 3: Ordering¶

Responsibility: Analyze reading order

Runs selected sorter (PyMuPDF, LayoutReader, XY-Cut, VLM)
Adds order field to blocks
Optionally adds column_index for multi-column documents

Output: list[Block] sorted with order field

Stage 4: Recognition¶

Responsibility: Extract text from blocks

Crops block images from full page
Sends to VLM API or local model
Uses block-type-specific prompts
Handles special content (tables, figures)

Output: list[Block] with text field populated

Stage 5: Block Correction¶

Responsibility: Block-level text correction (placeholder)

Currently disabled by default
Future: VLM-based correction at block level
Currently just copies text to corrected_text

Output: list[Block] with corrected_text

Stage 6: Rendering¶

Responsibility: Convert to output format

Assembles blocks in reading order
Generates Markdown or plaintext
Uses auxiliary info for enhanced formatting
Supports multiple rendering strategies

Output: str (Markdown/plaintext)

Stage 7: Page Correction¶

Responsibility: Page-level VLM correction

Sends entire page text to VLM
Corrects OCR errors and formatting
Calculates correction ratio
Handles rate limits
Skipped for local models

Output: (corrected_text: str, correction_ratio: float, should_stop: bool)

Stage 8: Output¶

Responsibility: Save results

Builds Page objects with metadata
Saves JSON and Markdown files
Generates document summaries
Creates output directory structure

Output: Saved files in output/{model}/{document}/

Data Flow¶

Block Evolution Through Stages¶

A block's data evolves as it passes through stages:

# After Detection (Stage 2)
Block(
    type="text",
    bbox=BBox(100, 50, 500, 120),
    detection_confidence=0.95,
    source="doclayout-yolo"
)

# After Ordering (Stage 3)
Block(
    type="text",
    bbox=BBox(100, 50, 500, 120),
    detection_confidence=0.95,
    order=0,  # Added
    column_index=0,  # Added (optional)
    source="doclayout-yolo"
)

# After Recognition (Stage 4)
Block(
    type="text",
    bbox=BBox(100, 50, 500, 120),
    detection_confidence=0.95,
    order=0,
    text="Chapter 1: Introduction",  # Added
    source="doclayout-yolo"
)

# After Block Correction (Stage 5)
Block(
    type="text",
    bbox=BBox(100, 50, 500, 120),
    detection_confidence=0.95,
    order=0,
    text="Chapter 1: Introduction",
    corrected_text="Chapter 1: Introduction",  # Added
    source="doclayout-yolo"
)

Extensibility¶

Adding a New Detector¶

Implement the Detector protocol
Register in create_detector() factory
Add to validate_combination() if needed

# pipeline/layout/detection/my_detector.py
class MyDetector:
    def detect(self, image: np.ndarray) -> list[Block]:
        # Your detection logic
        return blocks

# pipeline/layout/detection/__init__.py
def create_detector(name: str, **kwargs) -> Detector:
    if name == "my-detector":
        return MyDetector(**kwargs)

Adding a New Sorter¶

Implement the Sorter protocol
Register in create_sorter() factory
Add combination validation

# pipeline/layout/ordering/my_sorter.py
class MySorter:
    def sort(self, blocks: list[Block], image: np.ndarray) -> list[Block]:
        # Your sorting logic
        return sorted_blocks

# pipeline/layout/ordering/__init__.py
def create_sorter(name: str, **kwargs) -> Sorter:
    if name == "my-sorter":
        return MySorter(**kwargs)

Adding a New Recognizer¶

Implement the Recognizer protocol
Add to TextRecognizer backend selection

class Recognizer(Protocol):
    def process_blocks(self, image: np.ndarray, blocks: Sequence[Block]) -> list[Block]:
        ...

    def correct_text(self, text: str) -> str | dict[str, Any]:
        ...

Error Handling¶

The pipeline uses a comprehensive error handling system:

Custom exceptions: Specific exception types for different errors
Graceful degradation: Continue processing on non-critical failures
Error logging: Detailed logs with stack traces
Rate limit handling: Automatic retry and backoff

See Error Handling Guide for details.

Testing Strategy¶

Each component can be tested independently:

# Test detector
detector = create_detector("doclayout-yolo")
blocks = detector.detect(test_image)
assert len(blocks) > 0

# Test sorter
sorter = create_sorter("mineru-xycut")
sorted_blocks = sorter.sort(blocks, test_image)
assert sorted_blocks[0].order is not None

# Test full pipeline
pipeline = Pipeline(detector="doclayout-yolo", sorter="mineru-xycut")
result = pipeline.process_single_pdf(test_pdf)

Performance Considerations¶

Caching¶

The recognition stage uses content-based caching:

# Cache key = hash(block_image + block_type + prompt)
cache_key = hashlib.sha256(
    block_image.tobytes() +
    block_type.encode() +
    prompt.encode()
).hexdigest()

Rate Limiting¶

Gemini API rate limiting is handled globally:

rate_limiter.wait_if_needed(estimated_tokens=1000)
# Automatic throttling based on tier limits

Memory Management¶

Block images are deleted after recognition
Garbage collection is triggered after each block
Temporary files are cleaned up automatically

Configuration¶

Pipeline behavior is controlled through:

CLI Arguments: Runtime configuration
Environment Variables: API keys, paths
YAML Files: Prompts, rate limits
Factory Functions: Component selection

Next Steps¶

Pipeline Stages - Detailed stage documentation
Detectors - Available detection models
Sorters - Reading order algorithms
Recognizers - Text extraction backends