Pipeline Stages¶

Detailed documentation for each of the 8 pipeline stages.

For a high-level overview, see Architecture Overview.

Stage Overview¶

graph TD
    A[Stage 1: Input] --> B[Stage 2: Detection]
    B --> C[Stage 3: Ordering]
    C --> D[Stage 4: Recognition]
    D --> E[Stage 5: Block Correction]
    E --> F[Stage 6: Rendering]
    F --> G[Stage 7: Page Correction]
    G --> H[Stage 8: Output]

    style A fill:#e1f5ff
    style B fill:#fff3e1
    style C fill:#e8f5e9
    style D fill:#f3e5f5
    style E fill:#fce4ec
    style F fill:#fff9e1
    style G fill:#e0f2f1
    style H fill:#f1f8e9

Stage 1: Input¶

File: pipeline/stages/input_stage.py

Responsibility: Load documents and extract auxiliary information

What It Does¶

Renders PDF pages to images using pdf2image (configurable DPI)
Loads image files directly using OpenCV
Extracts text spans with font metadata from PDFs (PyMuPDF)
Supports dual resolution mode (lower DPI for detection, higher for recognition)

Input/Output¶

| Input | Output |
| --- | --- |
| PDF path + page number | `np.ndarray` (image) |
| Image path | `np.ndarray` (image) |
| PDF path + page number | `dict` (`auxiliary_info` with `text_spans`) |

Key Methods¶

from pathlib import Path

import numpy as np


class InputStage:
    def load_pdf_page(self, pdf_path: Path, page_num: int) -> np.ndarray:
        """Render PDF page to image."""

    def load_image(self, image_path: Path) -> np.ndarray:
        """Load image file."""

    def extract_auxiliary_info(self, pdf_path: Path, page_num: int) -> dict:
        """Extract text spans with font info from PDF."""

Configuration¶

| Parameter | Default | Description |
| --- | --- | --- |
| `dpi` | 200 | Single resolution DPI |
| `detection_dpi` | 150 | Detection DPI (dual mode) |
| `recognition_dpi` | 300 | Recognition DPI (dual mode) |
| `use_dual_resolution` | False | Enable dual resolution mode |
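
As a rough illustration of how these four parameters could interact, the sketch below resolves the DPI pair for a run. `InputConfig` and `resolve_dpi` are hypothetical names, not the pipeline's actual API; only the parameter names and defaults come from the table above.

```python
from dataclasses import dataclass


@dataclass
class InputConfig:
    """Hypothetical container mirroring the parameters in the table above."""
    dpi: int = 200
    detection_dpi: int = 150
    recognition_dpi: int = 300
    use_dual_resolution: bool = False


def resolve_dpi(config: InputConfig) -> tuple[int, int]:
    """Return (detection DPI, recognition DPI) for the given config."""
    if config.use_dual_resolution:
        return config.detection_dpi, config.recognition_dpi
    # Single-resolution mode: both stages share the same rendering.
    return config.dpi, config.dpi
```

With the defaults, single mode would render once at 200 DPI, while dual mode would render a 150 DPI image for detection and a 300 DPI image for recognition.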

Stage 2: Detection¶

File: pipeline/stages/detection_stage.py

Responsibility: Detect layout blocks in page images

What It Does¶

Runs selected detector (DocLayout-YOLO, PaddleOCR, MinerU)
Returns blocks with bounding boxes, types, and confidence scores
Extracts column layout information if available
Supports Ray-based distributed detection for multi-GPU

Input/Output¶

| Input | Output |
| --- | --- |
| `np.ndarray` (image) | `list[Block]` |

Block Data After Detection¶

Block(
    type="text",  # or "title", "table", "image", etc.
    bbox=BBox(100, 50, 500, 120),
    detection_confidence=0.95,
    source="doclayout-yolo",
)

Available Detectors¶

| Detector | Speed | Quality | Block Types |
| --- | --- | --- | --- |
| `doclayout-yolo` | Fast | Good | 7 types |
| `paddleocr-doclayout-v2` | Medium | Very Good | 25 types |
| `mineru-doclayout-yolo` | Fast | Good | 10 types |
| `mineru-vlm` | Slow | Excellent | 25+ types |

Stage 3: Ordering¶

File: pipeline/stages/ordering_stage.py

Responsibility: Analyze reading order of detected blocks

What It Does¶

Runs selected sorter algorithm
Adds an `order` field to each block to record the reading sequence
Optionally adds `column_index` for multi-column documents
Scales bounding boxes if using dual resolution mode
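
The bounding-box scaling in dual resolution mode can be sketched as follows. This assumes boxes are stored as `(x0, y0, x1, y1)` pixel coordinates, which is consistent with the four-argument `BBox(...)` calls in this document but not confirmed by it, and that boxes are mapped from the detection image to the recognition image. `scale_bbox` is a hypothetical helper.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Assumed layout: (x0, y0, x1, y1) in pixels."""
    x0: int
    y0: int
    x1: int
    y1: int


def scale_bbox(bbox: BBox, from_dpi: int, to_dpi: int) -> BBox:
    """Map a box from an image rendered at from_dpi to one rendered at to_dpi."""
    factor = to_dpi / from_dpi
    return BBox(*(round(value * factor) for value in bbox))


# A box detected at 150 DPI, mapped onto the 300 DPI recognition image.
scaled = scale_bbox(BBox(100, 50, 500, 120), from_dpi=150, to_dpi=300)
```

Because both images are renderings of the same page, a single uniform factor (`to_dpi / from_dpi`) is enough for all four coordinates.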

Input/Output¶

| Input | Output |
| --- | --- |
| `list[Block]` + image | `list[Block]` (sorted, with `order` field) |

Block Data After Ordering¶

Block(
    type="text",
    bbox=BBox(100, 50, 500, 120),
    detection_confidence=0.95,
    order=0,  # Added by sorter
    column_index=0,  # Added by multi-column sorters
    source="doclayout-yolo",
)

Available Sorters¶

| Sorter | Algorithm | Multi-Column | Speed |
| --- | --- | --- | --- |
| `pymupdf` | Font analysis | Yes | Fast |
| `mineru-xycut` | XY-Cut | No | Fast |
| `mineru-layoutreader` | LayoutLMv3 | Yes | Medium |
| `olmocr-vlm` | VLM reasoning | Yes | Slow |
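
For intuition only, the `order` and `column_index` fields could be filled by a naive geometric sort like the one below. It is not any of the sorters listed above; it assumes `(x0, y0, x1, y1)` boxes and equal-width columns, and `naive_sort` is a hypothetical function.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    """Minimal stand-in for the pipeline's Block; fields are assumptions."""
    bbox: tuple[int, int, int, int]  # assumed (x0, y0, x1, y1)
    order: Optional[int] = None
    column_index: Optional[int] = None


def naive_sort(blocks: list[Block], page_width: int, columns: int = 2) -> list[Block]:
    """Assign column_index by horizontal center, then order top-to-bottom per column."""
    column_width = page_width / columns
    for block in blocks:
        center_x = (block.bbox[0] + block.bbox[2]) / 2
        block.column_index = min(int(center_x // column_width), columns - 1)
    ordered = sorted(blocks, key=lambda b: (b.column_index, b.bbox[1], b.bbox[0]))
    for index, block in enumerate(ordered):
        block.order = index
    return ordered
```

A fixed column split breaks on full-width titles and irregular layouts, which is the kind of case the model-based sorters in the table are meant to handle.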

Stage 4: Recognition¶

File: pipeline/stages/recognition_stage.py

Responsibility: Extract text from each block

What It Does¶

Crops block images from full page using BBox
Sends cropped images to VLM API or local model
Uses block-type-specific prompts for optimal results
Handles special content (tables, figures) with appropriate prompts
Supports Ray-based distributed recognition
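
A minimal sketch of the cropping and prompt-selection steps, assuming `(x0, y0, x1, y1)` boxes. The prompt strings, `PROMPTS`, `prompt_for`, and `crop_bounds` are all hypothetical; the pipeline's real prompts are not shown in this document.

```python
# Hypothetical prompt table keyed by block type.
PROMPTS = {
    "table": "Transcribe this table as Markdown.",
    "image": "Describe this figure in one sentence.",
}
DEFAULT_PROMPT = "Transcribe the text in this image exactly."


def prompt_for(block_type: str) -> str:
    """Pick a block-type-specific prompt, falling back to plain transcription."""
    return PROMPTS.get(block_type, DEFAULT_PROMPT)


def crop_bounds(bbox, width, height):
    """Clamp an (x0, y0, x1, y1) box to the page so slicing stays inside the image."""
    x0, y0, x1, y1 = bbox
    return max(0, x0), max(0, y0), min(width, x1), min(height, y1)


x0, y0, x1, y1 = crop_bounds((100, 50, 500, 120), width=400, height=300)
# With a NumPy page image, the crop would then be: page[y0:y1, x0:x1]
```

Clamping first matters because detectors can emit boxes that extend slightly past the page edge.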

Input/Output¶

| Input | Output |
| --- | --- |
| `list[Block]` + image | `list[Block]` (with `text` field) |

Block Data After Recognition¶

Block(
    type="text",
    bbox=BBox(100, 50, 500, 120),
    detection_confidence=0.95,
    order=0,
    text="Chapter 1: Introduction",  # Added by recognizer
    source="doclayout-yolo",
)

Available Recognizers¶

| Recognizer | Type | Cost | Speed |
| --- | --- | --- | --- |
| `gemini-2.5-flash` | Cloud API | Free tier | Fast |
| `gpt-4o` | Cloud API | Pay per token | Medium |
| `paddleocr-vl` | Local | Free | Medium |
| `deepseek-ocr` | Local | Free | Medium |

Stage 5: Block Correction¶

File: pipeline/stages/block_correction_stage.py

Responsibility: Block-level text correction (optional)

What It Does¶

Intended to apply VLM-based correction to individual blocks
Currently a placeholder stage (disabled by default)
When disabled, copies `text` to `corrected_text` unchanged

Input/Output¶

| Input | Output |
| --- | --- |
| `list[Block]` | `list[Block]` (with `corrected_text` field) |

Configuration¶

Enable with the `--block-correction` CLI flag or `enable_block_correction=True` in config.
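
The pass-through behaviour when the stage is disabled amounts to something like this sketch. `Block` here is a reduced stand-in, and `correct_blocks` and its `corrector` hook are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Block:
    """Reduced stand-in for the pipeline's Block."""
    text: str
    corrected_text: Optional[str] = None


def correct_blocks(
    blocks: list[Block],
    enable_block_correction: bool = False,
    corrector: Optional[Callable[[str], str]] = None,  # hypothetical hook
) -> list[Block]:
    """Fill corrected_text; when disabled, copy text through unchanged."""
    for block in blocks:
        if enable_block_correction and corrector is not None:
            block.corrected_text = corrector(block.text)
        else:
            block.corrected_text = block.text
    return blocks
```

Always populating `corrected_text` means later stages can read one field regardless of whether correction ran.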

Block Data After Correction¶

Block(
    type="text",
    bbox=BBox(100, 50, 500, 120),
    detection_confidence=0.95,
    order=0,
    text="Chapter 1: Introduction",
    corrected_text="Chapter 1: Introduction",  # Added
    source="doclayout-yolo",
)

Stage 6: Rendering¶

File: pipeline/stages/rendering_stage.py

Responsibility: Convert processed blocks to output format

What It Does¶

Assembles blocks in reading order
Generates Markdown or plaintext output
Uses auxiliary info (font sizes) for enhanced formatting
Supports multiple rendering strategies

Input/Output¶

| Input | Output |
| --- | --- |
| `list[Block]` + `auxiliary_info` | `str` (Markdown/plaintext) |

Rendering Strategies¶

| Strategy | Description | Use Case |
| --- | --- | --- |
| Block Type-Based | Maps block types to Markdown | Default, fast |
| Font Size-Based | Uses font sizes for headers | Precise formatting |
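
The block type-based strategy can be pictured as a lookup from block type to a Markdown template. The mapping below is an assumption for illustration (this document does not list the real mapping), and blocks are plain dicts rather than `Block` objects.

```python
# Hypothetical type-to-Markdown templates; unknown types render as plain paragraphs.
TEMPLATES = {
    "title": "# {text}",
    "text": "{text}",
}


def render_markdown(blocks: list[dict]) -> str:
    """Render blocks in reading order, preferring corrected_text when present."""
    parts = []
    for block in sorted(blocks, key=lambda b: b["order"]):
        text = block.get("corrected_text") or block["text"]
        template = TEMPLATES.get(block["type"], "{text}")
        parts.append(template.format(text=text))
    return "\n\n".join(parts)
```

A font size-based variant would replace the fixed `"title"` template with a heading level chosen from the span font sizes in `auxiliary_info`.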

Example Output¶

# Chapter 1: Introduction

This document describes the VLM OCR Pipeline...

## Key Features

- Feature 1
- Feature 2

Stage 7: Page Correction¶

File: pipeline/stages/page_correction_stage.py

Responsibility: Page-level VLM correction (optional)

What It Does¶

Sends entire page text to VLM for holistic correction
Corrects OCR errors and improves formatting consistency
Calculates correction ratio (how much text changed)
Handles rate limits by returning early with `should_stop` set

Input/Output¶

| Input | Output |
| --- | --- |
| `str` (page text) | `PageCorrectionResult` |

PageCorrectionResult¶

from dataclasses import dataclass


@dataclass
class PageCorrectionResult:
    corrected_text: str
    correction_ratio: float  # 0.0 = no change, 1.0 = completely different
    should_stop: bool  # True if rate limit hit
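
One plausible way to compute `correction_ratio` is as one minus a string similarity score. The sketch below uses `difflib.SequenceMatcher` from the standard library; whether the pipeline uses this particular metric is an assumption.

```python
from difflib import SequenceMatcher


def correction_ratio(original: str, corrected: str) -> float:
    """Return 0.0 for identical strings, approaching 1.0 as they diverge."""
    if not original and not corrected:
        return 0.0
    return 1.0 - SequenceMatcher(None, original, corrected).ratio()
```

A high ratio on a page that looked clean before correction is a useful signal that the VLM rewrote content rather than fixing OCR errors.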

Configuration¶

Enable with the `--page-correction` CLI flag or `enable_page_correction=True` in config
Skipped by default for local models (PaddleOCR-VL)

Stage 8: Output¶

File: pipeline/stages/output_stage.py

Responsibility: Save results and generate summaries

What It Does¶

Builds Page objects with all metadata
Saves page results as JSON files
Generates Markdown output files
Creates document-level summaries
Creates final output directory structure

Input/Output¶

| Input | Output |
| --- | --- |
| All processed data | JSON + Markdown files |

Output Structure¶

output/{model}/{document}/
├── page_1.json
├── page_1.md
├── page_2.json
├── page_2.md
└── summary.json
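
The layout above can be reproduced with `pathlib` and `json`. This sketch writes a single page; `save_page` and the JSON payload shape are hypothetical, and only the directory and file naming come from the tree above.

```python
import json
from pathlib import Path


def save_page(root: Path, model: str, document: str, page_num: int,
              data: dict, markdown: str) -> Path:
    """Write page_{n}.json and page_{n}.md under root/{model}/{document}/."""
    page_dir = root / model / document
    page_dir.mkdir(parents=True, exist_ok=True)
    (page_dir / f"page_{page_num}.json").write_text(
        json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    (page_dir / f"page_{page_num}.md").write_text(markdown, encoding="utf-8")
    return page_dir
```

Keying the directory on the model name keeps runs of the same document with different recognizers side by side for comparison.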

Key Methods¶

class OutputStage:
    def build_page_result(self, ...) -> Page:
        """Build Page object with all metadata."""

    def save_page_output(self, output_dir: Path, page_num: int, page: Page):
        """Save page as JSON and Markdown."""

    def create_pdf_summary(self, ...) -> Document:
        """Create document summary with all pages."""

Stage Data Flow¶

Complete Block Evolution¶

# After Stage 2 (Detection)
Block(type="text", bbox=BBox(...), detection_confidence=0.95, source="...")

# After Stage 3 (Ordering)
Block(type="text", bbox=BBox(...), detection_confidence=0.95, order=0, column_index=0, source="...")

# After Stage 4 (Recognition)
Block(type="text", bbox=BBox(...), detection_confidence=0.95, order=0, column_index=0, text="...", source="...")

# After Stage 5 (Block Correction)
Block(type="text", bbox=BBox(...), detection_confidence=0.95, order=0, column_index=0, text="...", corrected_text="...", source="...")
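
The progression above suggests a record whose later fields start empty and are filled in stage by stage. A sketch, assuming optional fields default to `None` and a plain tuple for the box (the real `Block` and `BBox` definitions are not shown in this document):

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Block:
    """Illustrative Block; field names follow this document, defaults are assumed."""
    type: str
    bbox: tuple[int, int, int, int]
    detection_confidence: float
    source: str
    order: Optional[int] = None
    column_index: Optional[int] = None
    text: Optional[str] = None
    corrected_text: Optional[str] = None


# Each stage returns a copy with its own fields filled in.
detected = Block("text", (100, 50, 500, 120), 0.95, "doclayout-yolo")
ordered = replace(detected, order=0, column_index=0)
recognized = replace(ordered, text="Chapter 1: Introduction")
corrected = replace(recognized, corrected_text=recognized.text)
```

Using immutable copies here is a design choice of the sketch; it makes each stage's contribution explicit, whereas the pipeline itself may mutate blocks in place.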

Stage Timing¶

Typical processing time distribution for a single page:

| Stage | Time | % of Total |
| --- | --- | --- |
| Input | ~0.5s | 5% |
| Detection | ~1.0s | 10% |
| Ordering | ~0.2s | 2% |
| Recognition | ~5.0s | 50% |
| Block Correction | ~0.0s | 0% (disabled) |
| Rendering | ~0.1s | 1% |
| Page Correction | ~3.0s | 30% |
| Output | ~0.2s | 2% |
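
Per-stage timings like these can be collected with a small context manager wrapped around each stage call. `StageTimer` is a hypothetical helper, not part of the pipeline.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock seconds per stage name."""

    def __init__(self) -> None:
        self.seconds: dict[str, float] = {}

    @contextmanager
    def measure(self, stage: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the stage raised.
            elapsed = time.perf_counter() - start
            self.seconds[stage] = self.seconds.get(stage, 0.0) + elapsed

    def percentages(self) -> dict[str, float]:
        """Share of total time per stage, in percent."""
        total = sum(self.seconds.values())
        if total == 0:
            return {stage: 0.0 for stage in self.seconds}
        return {stage: 100 * s / total for stage, s in self.seconds.items()}
```

Usage would look like `with timer.measure("recognition"): ...` around the stage call, followed by `timer.percentages()` once the page is done.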
