Sorters¶
Reading order analysis algorithms for document layout.
Overview¶
Sorters analyze the reading order of detected blocks, determining the sequence in which text should be read. This is crucial for multi-column documents and complex layouts.
Sorter Comparison¶
| Sorter | Algorithm | Multi-Column | Speed | Quality |
|---|---|---|---|---|
pymupdf |
Font analysis | ✅ | ⚡⚡⚡ | ⭐⭐⭐ |
mineru-xycut |
XY-Cut | ❌ | ⚡⚡⚡ | ⭐⭐ |
mineru-layoutreader |
LayoutLMv3 | ✅ | ⚡⚡ | ⭐⭐⭐⭐ |
mineru-vlm |
VLM reasoning | ✅ | ⚡ | ⭐⭐⭐⭐ |
olmocr-vlm |
VLM reasoning | ✅ | ⚡ | ⭐⭐⭐⭐ |
paddleocr-doclayout-v2 |
Pointer network | ✅ | ⚡⚡ | ⭐⭐⭐⭐ |
PyMuPDF Sorter¶
Algorithm: Font-based column analysis | Speed: Fast
Multi-column aware sorter using PyMuPDF's font and position analysis.
How It Works¶
- Analyzes font sizes and positions in the original PDF
- Detects column boundaries based on text block positions
- Groups blocks by column
- Sorts within each column (top-to-bottom)
- Assigns
column_indexto each block
Output Fields¶
| Field | Description |
|---|---|
order |
Global reading order (0-indexed) |
column_index |
Column assignment (0 = left, 1 = right, etc.) |
Usage¶
from pipeline.layout.ordering import create_sorter
sorter = create_sorter("pymupdf")
sorted_blocks = sorter.sort(blocks, image, pymupdf_page=page)
CLI¶
Best For¶
- Standard multi-column documents (newspapers, journals)
- Documents with consistent column layouts
XY-Cut Sorter (MinerU)¶
Algorithm: Recursive XY-Cut | Speed: Fast
Classic recursive XY-Cut algorithm that divides the page into regions.
How It Works¶
1. Find horizontal whitespace → split into rows
2. Find vertical whitespace → split into columns
3. Recursively apply until blocks are isolated
4. Order blocks by position
graph TD
A[Page] --> B[Split Horizontally]
B --> C[Top Half]
B --> D[Bottom Half]
C --> E[Split Vertically]
E --> F[Left Block]
E --> G[Right Block]
Output Fields¶
| Field | Description |
|---|---|
order |
Reading order (0-indexed) |
Usage¶
CLI¶
Best For¶
- Simple layouts with clear horizontal/vertical separations
- Single-column documents
- Fast processing requirements
Limitations¶
- May fail on complex multi-column layouts
- No column detection
LayoutReader Sorter (MinerU)¶
Algorithm: LayoutLMv3 | Speed: Medium
Machine learning-based sorter using LayoutLMv3 for reading order prediction.
How It Works¶
- Encodes block positions and types
- Uses LayoutLMv3 transformer to predict pairwise ordering
- Constructs global order from pairwise comparisons
Output Fields¶
| Field | Description |
|---|---|
order |
ML-predicted reading order |
Usage¶
CLI¶
Requirements¶
- MinerU 2.5+ (in
external/MinerU/) - ~2GB GPU memory
Best For¶
- Complex layouts
- Academic papers
- Multi-column with figures/tables
VLM Sorters¶
MinerU VLM Sorter¶
Algorithm: Vision Language Model | Speed: Slow
Uses MinerU's VLM for visual reasoning about reading order.
Coupling Requirement
The mineru-vlm sorter requires the mineru-vlm detector. They share the same VLM inference.
olmOCR VLM Sorter¶
Algorithm: Vision Language Model | Speed: Slow
Uses olmOCR's VLM for reading order analysis. More flexible than MinerU VLM.
CLI¶
Best For¶
- Very complex layouts
- Documents with unusual reading patterns
- When accuracy is more important than speed
PaddleOCR DocLayoutV2 Sorter¶
Algorithm: Pointer Network | Speed: Medium
Preserves the reading order from PaddleOCR's PP-DocLayoutV2 detector, which includes a built-in pointer network for ordering.
How It Works¶
- PP-DocLayoutV2 detector outputs blocks with order information
- Sorter preserves this ordering (passthrough)
- No additional computation required
Usage¶
CLI¶
python main.py --input doc.pdf \
--detector paddleocr-doclayout-v2 \
--sorter paddleocr-doclayout-v2
Best For¶
- When using PaddleOCR PP-DocLayoutV2 detector
- Efficient processing (no separate sorting step)
Detector/Sorter Compatibility¶
| Detector | Compatible Sorters |
|---|---|
doclayout-yolo |
All except mineru-vlm |
mineru-doclayout-yolo |
All except mineru-vlm |
paddleocr-doclayout-v2 |
All |
mineru-vlm |
mineru-vlm only |
# Validate compatibility
from pipeline.layout.ordering import validate_combination
validate_combination("doclayout-yolo", "mineru-xycut") # OK
validate_combination("doclayout-yolo", "mineru-vlm") # Raises error
Choosing a Sorter¶
Recommended Configurations¶
| Use Case | Sorter | Reason |
|---|---|---|
| Fast processing | mineru-xycut |
Fastest, simple algorithm |
| Multi-column docs | pymupdf |
Good column detection |
| Complex layouts | mineru-layoutreader |
ML-based, accurate |
| PaddleOCR pipeline | paddleocr-doclayout-v2 |
Built-in ordering |
| Maximum accuracy | olmocr-vlm |
VLM reasoning |
Decision Matrix¶
graph TD
A[Multi-column document?] -->|No| B[mineru-xycut]
A -->|Yes| C[Using PaddleOCR detector?]
C -->|Yes| D[paddleocr-doclayout-v2]
C -->|No| E[Need high accuracy?]
E -->|No| F[pymupdf]
E -->|Yes| G[mineru-layoutreader]
Performance Benchmarks¶
Tested on NVIDIA RTX 3090:
| Sorter | Blocks/sec | Accuracy | Notes |
|---|---|---|---|
mineru-xycut |
~1000 | 85% | Fastest |
pymupdf |
~500 | 90% | Good for columns |
paddleocr-doclayout-v2 |
~300 | 92% | With PP-DocLayoutV2 |
mineru-layoutreader |
~100 | 95% | ML-based |
olmocr-vlm |
~20 | 97% | VLM-based |
See Also¶
- Ordering API - API reference
- Detectors - Layout detection models
- Architecture Overview - Pipeline design