Skip to content

Types API

Core type definitions and data structures used throughout the pipeline.

Overview

The pipeline uses a unified type system for representing document elements:

  • BBox: Pixel-based bounding box with automatic format conversion
  • Block: Document block (text, table, figure, etc.) with metadata
  • Page: Processed page with blocks and text
  • Document: Complete document with multiple pages

BBox

Unified bounding box representation with automatic format conversion between 6+ formats.

Key Features:

  • Internal format: (x0, y0, x1, y1) - integer pixel coordinates (xyxy corners)
  • JSON output: [x, y, w, h] - position + size for human readability
  • Automatic conversion from YOLO, MinerU, PyMuPDF, PyPDF, olmOCR formats

Quick Example

from pipeline.types import BBox

# Create from different formats
bbox = BBox.from_xywh(100, 50, 200, 150)      # [x, y, w, h]
bbox = BBox.from_xyxy(100, 50, 300, 200)      # corners
bbox = BBox.from_pypdf_rect([100, 592, 300, 742], page_height=792)

# Convert to formats
xywh = bbox.to_xywh_list()     # [100, 50, 200, 150]
xyxy = bbox.to_list()          # [100, 50, 300, 200]

# Properties
print(bbox.width, bbox.height)  # 200, 150
print(bbox.area)                # 30000
print(bbox.center)              # (200.0, 125.0)

# Operations
cropped = bbox.crop(image, padding=5)
iou = bbox1.iou(bbox2)

BBox Class Reference

@dataclass(frozen=True)
class BBox:
    """Pixel-based bounding box with integer coordinates.

    Internal format: (x0, y0, x1, y1) - Top-left and bottom-right corners (xyxy)
    Origin: Top-left corner of image (0, 0)
    Coordinates: Integer pixel values (not normalized)
    """

    x0: int  # Left
    y0: int  # Top
    x1: int  # Right
    y1: int  # Bottom

Factory Methods

Method Format Example
from_xywh(x, y, w, h) Position + Size BBox.from_xywh(100, 50, 200, 150)
from_xyxy(x0, y0, x1, y1) Corners BBox.from_xyxy(100, 50, 300, 200)
from_list(coords, format) List BBox.from_list([100, 50, 200, 150], "xywh")
from_pymupdf_rect(rect) PyMuPDF Rect BBox.from_pymupdf_rect(rect)
from_mineru_bbox(bbox) MinerU list BBox.from_mineru_bbox([100, 50, 300, 200])
from_pypdf_rect(rect, height) PyPDF (Y-flip) BBox.from_pypdf_rect([100, 592, 300, 742], 792)
from_cxcywh(cx, cy, w, h) Center format BBox.from_cxcywh(200, 125, 200, 150)

Conversion Methods

Method Output Example
to_xywh() Tuple (100, 50, 200, 150)
to_xyxy() Tuple (100, 50, 300, 200)
to_xywh_list() List (JSON) [100, 50, 200, 150]
to_list() List (xyxy) [100, 50, 300, 200]
to_dict() Dict {"x0": 100, "y0": 50, "x1": 300, "y1": 200}
to_mineru_bbox() List [100, 50, 300, 200]
to_pypdf_rect(height) List (Y-flip) [100, 592, 300, 742]
to_olmocr_anchor(type) String "[100x50]text"

Properties

Property Type Description
width int Width of bbox
height int Height of bbox
area int Area (width × height)
center tuple[float, float] Center point (cx, cy)
left, top, right, bottom int Edge coordinates

Geometric Operations

Method Description Returns
intersect(other) Intersection area int
iou(other) Intersection over Union float (0-1)
overlap_ratio(other) Overlap relative to self float (0-1)
contains_point(x, y) Point inside bbox? bool
expand(padding) Expand by padding BBox
clip(max_w, max_h) Clip to bounds BBox
crop(image, padding) Crop from image np.ndarray

Block

Document block representing a detected layout element.

Quick Example

from pipeline.types import Block, BBox

# Create block
block = Block(
    type="text",
    bbox=BBox(100, 50, 500, 200),
    detection_confidence=0.95,
    order=0,
    text="Extracted text content",
)

# Serialize to JSON
data = block.to_dict()
# {"order": 0, "type": "text", "xywh": [100, 50, 400, 150], ...}

# Deserialize from JSON
block = Block.from_dict(data)

Block Class Reference

@dataclass
class Block:
    """Document block with bounding box."""

    # Core fields (always present)
    type: str              # "text", "title", "table", "image", etc.
    bbox: BBox             # Required, always present
    detection_confidence: float | None = None

    # Added by sorters
    order: int | None = None           # Reading order rank
    column_index: int | None = None    # Column index

    # Added by recognizers
    text: str | None = None            # Extracted text
    corrected_text: str | None = None  # VLM-corrected text
    correction_ratio: float | None = None
    corrected_by: str | None = None

    # Image/figure blocks
    image_path: str | None = None      # Path to extracted image
    description: str | None = None     # VLM description

    # Metadata (internal use)
    source: str | None = None          # Detector name
    index: int | None = None           # Internal index

Methods

Method Description
to_dict() Convert to JSON-serializable dict (bbox → xywh)
from_dict(data) Create Block from dict (xywh → BBox)

BlockType

Standardized block type constants for consistent processing.

class BlockType:
    # Content
    TEXT = "text"
    TITLE = "title"

    # Figures
    IMAGE = "image"
    IMAGE_BODY = "image_body"
    IMAGE_CAPTION = "image_caption"

    # Tables
    TABLE = "table"
    TABLE_BODY = "table_body"
    TABLE_CAPTION = "table_caption"

    # Equations
    EQUATION = "equation"
    INTERLINE_EQUATION = "interline_equation"
    INLINE_EQUATION = "inline_equation"

    # Code
    CODE = "code"
    ALGORITHM = "algorithm"

    # Lists
    LIST = "list"

    # Page Elements
    HEADER = "header"
    FOOTER = "footer"
    PAGE_NUMBER = "page_number"

Protocol Interfaces

Detector Protocol

All detectors must implement this interface:

from typing import Protocol
import numpy as np
from pipeline.types import Block

class Detector(Protocol):
    """Protocol for layout detectors."""

    def detect(self, image: np.ndarray) -> list[Block]:
        """Detect layout blocks in image.

        Args:
            image: Input image as numpy array (H, W, C)

        Returns:
            List of detected blocks
        """
        ...

Sorter Protocol

All sorters must implement this interface:

class Sorter(Protocol):
    """Protocol for reading order sorters."""

    def sort(
        self,
        blocks: list[Block],
        image: np.ndarray,
        **kwargs,
    ) -> list[Block]:
        """Sort blocks in reading order.

        Args:
            blocks: Detected blocks
            image: Original page image
            **kwargs: Additional arguments

        Returns:
            Sorted blocks with order field
        """
        ...

Recognizer Protocol

All recognizers must implement this interface:

class Recognizer(Protocol):
    """Protocol for text recognizers."""

    def process_blocks(
        self,
        image: np.ndarray,
        blocks: Sequence[Block],
    ) -> list[Block]:
        """Extract text from blocks.

        Args:
            image: Full page image
            blocks: Blocks to process

        Returns:
            Blocks with text field
        """
        ...

    def correct_text(self, text: str) -> str | dict[str, Any]:
        """Correct extracted text.

        Args:
            text: Raw extracted text

        Returns:
            Corrected text or dict with metadata
        """
        ...

Page and Document

Page

@dataclass
class Page:
    """Processed page with all metadata."""

    page_num: int
    width: int
    height: int
    blocks: list[Block]
    text: str                      # Rendered text
    corrected_text: str | None     # Page-level corrected
    correction_ratio: float | None
    processing_time_seconds: float
    processed_at: str              # ISO timestamp
    status: str                    # "complete", "partial", "error"

Document

@dataclass
class Document:
    """Complete document with all pages."""

    name: str
    path: Path
    num_pages: int
    processed_pages: int
    pages: list[Page]
    output_directory: Path
    processed_at: str
    status_summary: dict[str, int]

See Also