BBox Format Reference¶
Coordinate Systems¶
All frameworks use Top-Left origin (0,0) except PyPDF.
Format Comparison¶
| Framework | Format | Example | Notes |
|---|---|---|---|
| This Project (Internal) | BBox(x0, y0, x1, y1) |
BBox(100, 50, 300, 200) |
Integer coordinates, xyxy corners |
| This Project (JSON) | [x, y, w, h] |
[100, 50, 200, 150] |
Position + Size (human-readable) |
| YOLO | [x1, y1, x2, y2] |
[100, 50, 300, 200] |
Top-Left + Bottom-Right |
| MinerU | [x0, y0, x1, y1] |
[100, 50, 300, 200] |
Top-Left + Bottom-Right |
| PyMuPDF | Rect(x0, y0, x1, y1) |
Rect(100, 50, 300, 200) |
Top-Left + Bottom-Right |
| PyPDF ⚠️ | [x0, y0, x1, y1] |
[100, 592, 300, 742] |
Bottom-Left origin |
| olmOCR | "[x, y]text" |
"[100x50]Chapter 1" |
Text format |
Visual Example¶
Same rectangle at visual position (100, 50) with size 200×150 on page height 792:
Computer Vision (Top-Left origin):
┌─────────────────┐
│ (0,0) │
│ │
│ [100,50] │ ← Region here
│ ┌────────┐ │
│ │ │ │
│ └────────┘ │
│ [300,200]
└─────────────────┘
PDF (Bottom-Left origin):
┌─────────────────┐
│ [300,742] ← Y is flipped!
│ ┌────────┐ │
│ │ │ │
│ [100,592] │ ← Same region
│ │
│ │
└─────────────────┘
(0,0)
Format Conversions¶
This Project ↔ MinerU/YOLO¶
# [x, y, w, h] → [x0, y0, x1, y1]
x0, y0, x1, y1 = x, y, x + w, y + h
# [x0, y0, x1, y1] → [x, y, w, h]
x, y, w, h = x0, y0, x1 - x0, y1 - y0
PyPDF ↔ This Project (Y-axis flip required!)¶
# PyPDF → This Project
y_top = page_height - y1_pypdf # 792 - 742 = 50
y_bottom = page_height - y0_pypdf # 792 - 592 = 200
# This Project → PyPDF
y0_pypdf = page_height - y_bottom # 792 - 200 = 592
y1_pypdf = page_height - y_top # 792 - 50 = 742
olmOCR Anchor Format¶
# Text region
anchor = f"[{x:.0f}x{y:.0f}]{text_content}"
# Example: "[100x50]Chapter 1"
# Image/Figure region
anchor = f"[Image {x0:.0f}x{y0:.0f} to {x1:.0f}x{y1:.0f}]"
# Example: "[Image 100x50 to 300x200]"
# Table region
anchor = f"[Table {x0:.0f}x{y0:.0f} to {x1:.0f}x{y1:.0f}]"
# Example: "[Table 100x450 to 500x600]"
BBox Class Usage¶
Our BBox class handles all conversions automatically with integer coordinates:
from pipeline.types import BBox
# Create from any format (accepts float, converts to int)
bbox = BBox.from_xywh(100, 50, 200, 150) # xywh format
bbox = BBox.from_xyxy(100, 50, 300, 200) # xyxy format (corners)
bbox = BBox.from_cxcywh(200, 125, 200, 150) # Center format (YOLO training)
bbox = BBox.from_mineru_bbox([100, 50, 300, 200])
bbox = BBox.from_pymupdf_rect(rect)
bbox = BBox.from_pypdf_rect([100, 592, 300, 742], page_height=792)
# Convert to any format
coords = bbox.to_xywh_list() # [100, 50, 200, 150] (for JSON)
coords = bbox.to_list() # [100, 50, 300, 200] (xyxy)
coords = bbox.to_dict() # {"x0": 100, "y0": 50, "x1": 300, "y1": 200}
cx, cy, w, h = bbox.to_cxcywh() # Center format
coords = bbox.to_mineru_bbox() # [100, 50, 300, 200]
coords = bbox.to_pypdf_rect(792) # [100, 592, 300, 742]
anchor = bbox.to_olmocr_anchor("image") # "[Image 100x50 to 300x200]"
# Geometric operations (integer results)
center_x, center_y = bbox.center # (float, float)
area = bbox.area # int
width = bbox.width # int
height = bbox.height # int
overlap = bbox1.intersect(bbox2) # int
iou = bbox1.iou(bbox2) # float
# Clear aliases
left = bbox.left # = bbox.x0
top = bbox.top # = bbox.y0
right = bbox.right # = bbox.x1
bottom = bbox.bottom # = bbox.y1
# NumPy convenience
cropped = bbox.crop(image, padding=5) # Direct image cropping
Region Structure¶
The Region dataclass represents detected document regions:
from pipeline.types import Block, BBox
block = Block(
type="text",
bbox=BBox(100, 50, 300, 200), # Required, always present
detection_confidence=0.95,
# Optional fields
order=0,
column_index=1,
text="Extracted text...",
)
# Serialize to JSON (bbox → xywh format for readability)
data = block.to_dict()
# {"order": 0, "type": "text", "xywh": [100, 50, 200, 150], "detection_confidence": 0.95, ...}
# Deserialize from JSON (supports xywh)
block = Block.from_dict(data)
Key Points¶
- Internal representation:
BBox(x0, y0, x1, y1)with integer coordinates (xyxy format) - JSON output:
[x, y, w, h](xywh format) for human readability - Most frameworks use the same internal format:
[x0, y0, x1, y1]with Top-Left origin - Only PyPDF is different: Bottom-Left origin requires Y-axis flip
- Region.bbox is always present: No longer optional (required field)