Error Handling Guidelines
This document defines the error handling policy for the VLM OCR Pipeline project.
Table of Contents
- 1. Custom Exception Hierarchy
- 2. When to Use Each Exception
- 3. Exception Handling Best Practices
- 4. Error Logging Standards
- 5. Error Recovery Strategies
- 6. Testing Error Handling
- 7. Migration Guide
1. Custom Exception Hierarchy
All custom exceptions inherit from PipelineError and are defined in pipeline/exceptions.py:
PipelineError (base)
├── ConfigurationError
│ ├── InvalidConfigError
│ └── MissingConfigError
├── APIError
│ ├── APIClientError
│ ├── APIAuthenticationError
│ ├── APIRateLimitError
│ └── APITimeoutError
├── ProcessingError
│ ├── PageProcessingError
│ ├── DetectionError
│ ├── RecognitionError
│ └── RenderingError
├── FileError
│ ├── FileLoadError
│ ├── FileSaveError
│ └── FileFormatError
└── DependencyError
Import all exceptions from:
from pipeline.exceptions import (
    PipelineError,
    ConfigurationError,
    InvalidConfigError,
    MissingConfigError,
    APIError,
    APIClientError,
    APIAuthenticationError,
    APIRateLimitError,
    APITimeoutError,
    ProcessingError,
    PageProcessingError,
    DetectionError,
    RecognitionError,
    RenderingError,
    FileError,
    FileLoadError,
    FileSaveError,
    FileFormatError,
    DependencyError,
)
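The contents of pipeline/exceptions.py are not shown in this document. As a rough sketch, assuming the classes are plain subclasses with no extra attributes, one branch of the hierarchy could be declared as below. The `classify` helper is hypothetical and only illustrates that handlers can be ordered from specific to general because every class derives from PipelineError.

```python
class PipelineError(Exception):
    """Base class for all pipeline errors."""


class APIError(PipelineError):
    """Errors raised while talking to external APIs."""


class APIRateLimitError(APIError):
    """The API rejected the request because of rate limiting."""


class APITimeoutError(APIError):
    """The API did not answer in time."""


def classify(exc: PipelineError) -> str:
    """Hypothetical helper: report which handler matches first."""
    try:
        raise exc
    except APIRateLimitError:
        return "rate-limit"  # most specific handler
    except APIError:
        return "api"  # any other API failure
    except PipelineError:
        return "pipeline"  # anything else raised by the pipeline


print(classify(APIRateLimitError("slow down")))  # rate-limit
print(classify(APITimeoutError("no answer")))    # api
print(classify(PipelineError("generic")))        # pipeline
```

Ordering handlers this way lets a caller treat rate limits specially while still handling every other pipeline failure in one place.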
2. When to Use Each Exception
ConfigurationError
Use when dealing with configuration files, settings, or initialization parameters.
InvalidConfigError:
# Example: Invalid tier name
if tier not in ["free", "tier1", "tier2", "tier3"]:
    raise InvalidConfigError(f"Invalid tier: {tier}. Must be one of: free, tier1, tier2, tier3")
# Example: Malformed YAML
try:
    config = yaml.safe_load(f)
except yaml.YAMLError as e:
    raise InvalidConfigError(f"Malformed YAML in {config_file}: {e}") from e
MissingConfigError:
# Example: Missing API key
if not api_key:
    raise MissingConfigError("OpenAI API key not found. Set OPENAI_API_KEY environment variable.")
# Example: Required config file not found
if not config_file.exists():
    raise MissingConfigError(f"Configuration file not found: {config_file}")
APIError
Use when interacting with external APIs (OpenAI, Gemini, PaddleOCR-VL, etc.).
APIClientError:
# Example: Client initialization failure
try:
    self.client = OpenAI(api_key=api_key, base_url=base_url)
except TypeError as e:
    raise APIClientError(f"Failed to initialize OpenAI client: {e}") from e
APIAuthenticationError:
# Example: Invalid API key (from external library)
try:
    response = self.client.chat.completions.create(...)
except AuthenticationError as e:
    raise APIAuthenticationError(f"OpenAI authentication failed: {e}") from e
APIRateLimitError:
# Example: Rate limit exceeded
try:
    response = self.client.chat.completions.create(...)
except RateLimitError as e:
    raise APIRateLimitError(f"OpenAI rate limit exceeded: {e}") from e
APITimeoutError:
# Example: Request timeout
try:
    response = requests.get(url, timeout=30)
except requests.Timeout as e:
    raise APITimeoutError(f"API request timed out: {e}") from e
ProcessingError
Use when processing documents through the pipeline stages.
PageProcessingError:
# Example: Page rendering failure
try:
    page_image = render_pdf_page(pdf_path, page_num)
except Exception as e:
    raise PageProcessingError(f"Failed to process page {page_num}: {e}") from e
DetectionError:
# Example: Layout detection failure
try:
    blocks = self.detector.detect(page_image)
except Exception as e:
    raise DetectionError(f"Layout detection failed: {e}") from e
RecognitionError:
# Example: Text recognition failure
try:
    text = self.recognizer.extract_text(block_image)
except Exception as e:
    raise RecognitionError(f"Text recognition failed for block: {e}") from e
RenderingError:
# Example: Markdown conversion error
try:
    markdown = block_to_markdown(block)
except Exception as e:
    raise RenderingError(f"Failed to render block to markdown: {e}") from e
FileError
Use for file I/O operations.
FileLoadError:
# Example: File not found
if not file_path.exists():
    raise FileLoadError(f"File not found: {file_path}")
# Example: PDF loading failure
try:
    images = convert_from_path(str(pdf_path))
except Exception as e:
    raise FileLoadError(f"Failed to load PDF: {e}") from e
FileSaveError:
# Example: Permission denied
try:
    with open(output_file, "w") as f:
        json.dump(data, f)
except PermissionError as e:
    raise FileSaveError(f"Permission denied writing to {output_file}: {e}") from e
# Example: Disk full
except OSError as e:
    raise FileSaveError(f"Failed to save file {output_file}: {e}") from e
FileFormatError:
# Example: Invalid PDF
if not is_valid_pdf(file_path):
    raise FileFormatError(f"Invalid PDF file: {file_path}")
# Example: Unsupported image format
if file_path.suffix not in [".png", ".jpg", ".jpeg"]:
    raise FileFormatError(f"Unsupported image format: {file_path.suffix}")
DependencyError
Use when optional dependencies are missing or incompatible.
# Example: Missing optional dependency
try:
    import fitz  # PyMuPDF
except ImportError as e:
    raise DependencyError("PyMuPDF is required for multi-column detection. Install with: uv pip install pymupdf") from e
# Example: Incompatible version (assumes: from packaging import version)
if version.parse(mineru.__version__) < version.parse("0.8.0"):
    raise DependencyError(f"MinerU version {mineru.__version__} is not supported. Requires >= 0.8.0")
3. Exception Handling Best Practices
3.1. Catch Specific Exceptions
✅ Good:
try:
    config = yaml.safe_load(f)
except yaml.YAMLError as e:
    raise InvalidConfigError(f"Malformed YAML: {e}") from e
except OSError as e:
    raise FileLoadError(f"Failed to read config file: {e}") from e
❌ Bad:
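A minimal sketch of the anti-pattern, for contrast with the good example. It uses the standard-library json module as a stand-in for YAML so that it runs without third-party packages, and the function name `parse_config_broad` is hypothetical.

```python
import json
import logging

logger = logging.getLogger(__name__)


def parse_config_broad(text):
    """Anti-pattern: every failure is logged and handled the same way."""
    try:
        return json.loads(text)
    except Exception as e:  # too broad: hides what actually went wrong
        logger.error("Error: %s", e)
        return {}


# A malformed document and a programming bug (passing None) both
# come back as an empty dict, so the caller cannot tell them apart.
print(parse_config_broad("{not valid json"))  # {}
print(parse_config_broad(None))               # {}
```

The second call fails with a TypeError inside json.loads, which is a bug in the caller rather than a bad config file, yet the broad handler reports it exactly like a parse error.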
3.2. Exception Chaining
Always use raise ... from e so that the original exception is recorded as the explicit cause:
✅ Good:
try:
    response = self.client.chat.completions.create(...)
except AuthenticationError as e:
    raise APIAuthenticationError(f"Authentication failed: {e}") from e
❌ Bad:
try:
    response = self.client.chat.completions.create(...)
except AuthenticationError as e:
    raise APIAuthenticationError(f"Authentication failed: {e}")  # No explicit cause recorded!
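The difference between the two forms can be inspected directly on the raised exception. The sketch below uses a local stand-in class and a KeyError in place of the real API error; the helper names are hypothetical.

```python
class APIAuthenticationError(Exception):
    """Stand-in for the project's exception of the same name."""


def chained():
    try:
        raise KeyError("api_key")
    except KeyError as e:
        raise APIAuthenticationError("Authentication failed") from e


def unchained():
    try:
        raise KeyError("api_key")
    except KeyError:
        raise APIAuthenticationError("Authentication failed")


def capture(func):
    """Run func and return the APIAuthenticationError it raises."""
    try:
        func()
    except APIAuthenticationError as exc:
        return exc
    return None


with_cause = capture(chained)
without_cause = capture(unchained)

print(type(with_cause.__cause__).__name__)        # KeyError
print(without_cause.__cause__)                    # None
print(type(without_cause.__context__).__name__)   # KeyError
```

Without `from e`, Python still attaches the original exception as `__context__`, but `__cause__` stays None and the traceback reads "During handling of the above exception, another exception occurred" instead of naming the original as the direct cause.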
3.3. When to Use Broad Exception Handlers
Broad exception handlers (except Exception) are allowed only in these cases:
- Top-level CLI entry points (with a # noqa: BLE001 comment)
- Optional dependency guards (with a # pragma: no cover comment)
- Fallback for unexpected errors (after catching specific errors):
    try:
        response = self.client.chat.completions.create(...)
    except AuthenticationError as e:
        raise APIAuthenticationError(f"Authentication failed: {e}") from e
    except RateLimitError as e:
        raise APIRateLimitError(f"Rate limit exceeded: {e}") from e
    except Exception as e:
        # Fallback for unexpected errors (document why!)
        logger.error("Unexpected API error: %s", e)
        return {"error": "api_error", "message": str(e)}
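The first two cases might look like the sketch below. It is illustrative only: `run_pipeline` and `main` are hypothetical names, and the simulated failure stands in for whatever the real pipeline might raise.

```python
import logging

logger = logging.getLogger(__name__)


def run_pipeline(argv):
    """Hypothetical pipeline entry point; fails on purpose for the demo."""
    raise RuntimeError("simulated unexpected failure")


def main(argv=None):
    """Top-level CLI entry point: the last place an error can be reported."""
    try:
        run_pipeline(argv)
    except Exception as e:  # noqa: BLE001
        logger.critical("Unhandled error: %s", e, exc_info=True)
        return 1  # non-zero exit code for the shell
    return 0


# Optional dependency guard: a broken install can fail with more than
# ImportError, so the guard is broad and excluded from coverage.
try:
    import fitz  # PyMuPDF
except Exception:  # pragma: no cover
    fitz = None

exit_code = main([])
print(exit_code)  # 1
```

In both cases the broad handler sits at a boundary where no caller could do anything more specific with the error.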
3.4. Re-raising Exceptions
When you want to log an error but still propagate it:
try:
    page_result = self._process_pdf_page(pdf_path, page_num)
except PageProcessingError as e:
    logger.error("Page %d processing failed: %s", page_num, e)
    raise  # Re-raise the same exception
4. Error Logging Standards
4.1. Logging Levels
Use appropriate logging levels:
| Level | When to Use | Example |
|---|---|---|
| DEBUG | Detailed diagnostic information | logger.debug("Processing block %d of %d", i, total) |
| INFO | General informational messages | logger.info("Loaded %d pages from %s", len(pages), pdf_path) |
| WARNING | Recoverable errors, fallback used | logger.warning("PyMuPDF not available, using basic sorter") |
| ERROR | Serious errors, operation failed | logger.error("Failed to process page %d: %s", page_num, e) |
| CRITICAL | Critical errors, system cannot continue | logger.critical("API key not found, cannot proceed") |
4.2. Log Message Format
Format: "Action failed: %s", error_details
✅ Good:
logger.error("Failed to load config file %s: %s", config_path, e)
logger.warning("Rate limit reached. Waiting %.2f seconds...", wait_time)
logger.info("Processed %d pages in %.2f seconds", page_count, elapsed)
❌ Bad:
logger.error(f"Failed to load config file {config_path}: {e}") # Don't use f-strings!
logger.error("Error!") # Not descriptive!
logger.error(str(e)) # Missing context!
Why avoid f-strings in logging?
- f-strings are evaluated before the log level check (performance overhead)
- %s formatting is only evaluated if the log level is enabled
- Better for structured logging
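The first two points can be observed with a small experiment. In this sketch the `Expensive` class and the logger name are made up for the demonstration; the class simply counts how often it is converted to a string.

```python
import logging


class Expensive:
    """Counts how often it is converted to a string."""

    def __init__(self):
        self.calls = 0

    def __str__(self):
        self.calls += 1
        return "expensive value"


logger = logging.getLogger("lazy_logging_demo")
logger.setLevel(logging.WARNING)  # DEBUG messages are disabled

lazy = Expensive()
logger.debug("Value: %s", lazy)   # arguments are never formatted

eager = Expensive()
logger.debug(f"Value: {eager}")   # f-string is built before debug() runs

print(lazy.calls, eager.calls)    # 0 1
```

With the level set to WARNING, the %s call returns before touching its arguments, while the f-string has already paid for the conversion by the time the logger discards the message.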
4.3. Exception Stack Traces
Use exc_info=True to include the full stack trace:
✅ Good:
try:
    result = process_page(page_num)
except PageProcessingError as e:
    logger.error("Error processing page %d: %s", page_num, e, exc_info=True)
When to use exc_info=True?
- For unexpected errors at top-level handlers
- When debugging complex issues
- For errors that should never happen
When NOT to use exc_info=True?
- For expected errors (rate limits, missing files)
- For informational warnings
- When stack trace would be too verbose
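The effect of the flag on the log output can be checked by writing to an in-memory stream. This is a small sketch with a made-up logger name and a ValueError standing in for a real pipeline error.

```python
import io
import logging

stream = io.StringIO()
logger = logging.getLogger("exc_info_demo")
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.ERROR)
logger.propagate = False  # keep the demo output out of the root logger

try:
    int("not a number")
except ValueError as e:
    logger.error("Expected failure: %s", e)                   # message only
    logger.error("Unexpected failure: %s", e, exc_info=True)  # message + traceback

output = stream.getvalue()
print(output.count("Traceback (most recent call last)"))  # 1
```

Inside an except block, `logger.exception(...)` is a shorthand for `logger.error(..., exc_info=True)`.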
5. Error Recovery Strategies
5.1. Graceful Degradation
Continue processing with reduced functionality:
try:
    import fitz  # PyMuPDF
except ImportError:
    logger.warning("PyMuPDF not available. Multi-column detection disabled.")
    fitz = None
# Later in code
if fitz is not None:
    # Use advanced multi-column detection
    layout = detect_multi_column_layout(page)
else:
    # Fallback to basic processing
    layout = None
5.2. Retry Logic
Retry on transient failures only:
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    retry=retry_if_exception_type(APITimeoutError),  # do not retry permanent failures
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10),
    reraise=True,
)
def api_call_with_retry(self):
    try:
        return self.client.chat.completions.create(...)
    except APITimeoutError:
        logger.warning("API timeout, retrying...")
        raise
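For readers unfamiliar with tenacity, roughly what such a decorator does can be sketched with the standard library alone. This is an illustration under assumptions, not the project's implementation: `call_with_retry`, its parameters, and the local `APITimeoutError` stand-in are hypothetical, and the delays differ from the tenacity settings above.

```python
import time


class APITimeoutError(Exception):
    """Stand-in for the project's exception of the same name."""


def call_with_retry(func, attempts=3, base_delay=1.0, max_delay=10.0, sleep=time.sleep):
    """Call func, retrying on APITimeoutError with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except APITimeoutError:
            if attempt == attempts:
                raise  # out of attempts: propagate the last error
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))


calls = []
delays = []


def flaky():
    """Fails twice with a timeout, then succeeds."""
    calls.append(1)
    if len(calls) < 3:
        raise APITimeoutError("timed out")
    return "ok"


# Record the delays instead of sleeping so the demo runs instantly.
result = call_with_retry(flaky, sleep=delays.append)
print(result, len(calls), delays)  # ok 3 [1.0, 2.0]
```

Any exception other than the timeout propagates immediately, which keeps permanent failures such as authentication errors from being retried.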
5.3. Fallback Values
Return safe defaults on error:
def load_config(config_path: Path) -> dict[str, Any]:
    """Load configuration file with fallback to empty dict."""
    try:
        with open(config_path) as f:
            # safe_load returns None for an empty file, so fall back to {}
            return yaml.safe_load(f) or {}
    except FileNotFoundError:
        logger.debug("Config file not found: %s", config_path)
        return {}
    except yaml.YAMLError as e:
        logger.warning("Failed to parse config file %s: %s", config_path, e)
        return {}
6. Testing Error Handling
Always test error paths:
def test_invalid_tier_raises_error():
    """Test that InvalidConfigError is raised for invalid tier."""
    with pytest.raises(InvalidConfigError, match="Invalid tier"):
        rate_limiter.set_tier_and_model("invalid_tier", "gemini-2.5-flash")

def test_missing_api_key_raises_error():
    """Test that MissingConfigError is raised when API key is missing."""
    with pytest.raises(MissingConfigError, match="API key not found"):
        OpenAIClient(api_key=None)

def test_file_not_found_raises_error():
    """Test that FileLoadError is raised for non-existent file."""
    with pytest.raises(FileLoadError, match="File not found"):
        load_pdf("/nonexistent/file.pdf")
7. Migration Guide
When migrating from except Exception to specific exceptions:
Step 1: Identify the error type
Look at what can go wrong in the try block:
# Before
try:
    config = yaml.safe_load(f)
except Exception as e:
    logger.error("Error: %s", e)
    return {}
Step 2: Choose the appropriate custom exception
- YAML parsing error → InvalidConfigError
- File I/O error → FileLoadError
Step 3: Add proper error context
# After
try:
    config = yaml.safe_load(f)
except yaml.YAMLError as e:
    raise InvalidConfigError(f"Malformed YAML in {config_file}: {e}") from e
except OSError as e:
    raise FileLoadError(f"Failed to read config file {config_file}: {e}") from e
Step 4: Update logging
# Caller code
try:
    config = load_config(config_path)
except InvalidConfigError as e:
    logger.error("Configuration error: %s", e)
    return {}
except FileLoadError as e:
    # FileLoadError covers any read failure, not only a missing file
    logger.warning("Could not read config file: %s", e)
    return {}
Last Updated: 2025-01-26
See Also:
- pipeline/exceptions.py - Custom exception definitions
- CLAUDE.md - Project coding standards
- .cursorrules - Detailed coding guidelines