Error Handling Guidelines¶

This document defines the error handling policy for the VLM OCR Pipeline project.

Table of Contents¶

1. Custom Exception Hierarchy
2. When to Use Each Exception
3. Exception Handling Best Practices
4. Error Logging Standards
5. Error Recovery Strategies
6. Testing Error Handling
7. Migration Guide

1. Custom Exception Hierarchy¶

All custom exceptions inherit from PipelineError and are defined in pipeline/exceptions.py:

PipelineError (base)
├── ConfigurationError
│   ├── InvalidConfigError
│   └── MissingConfigError
├── APIError
│   ├── APIClientError
│   ├── APIAuthenticationError
│   ├── APIRateLimitError
│   └── APITimeoutError
├── ProcessingError
│   ├── PageProcessingError
│   ├── DetectionError
│   ├── RecognitionError
│   └── RenderingError
├── FileError
│   ├── FileLoadError
│   ├── FileSaveError
│   └── FileFormatError
└── DependencyError

Import them from pipeline.exceptions:



from pipeline.exceptions import (
    PipelineError,
    ConfigurationError,
    InvalidConfigError,
    MissingConfigError,
    APIError,
    APIClientError,
    APIAuthenticationError,
    APIRateLimitError,
    APITimeoutError,
    ProcessingError,
    PageProcessingError,
    DetectionError,
    RecognitionError,
    RenderingError,
    FileError,
    FileLoadError,
    FileSaveError,
    FileFormatError,
    DependencyError,
)
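For orientation, the tree above could be declared as plain subclasses. This is a minimal sketch, assuming the classes carry no extra attributes or custom constructors; the real definitions in pipeline/exceptions.py may differ. The handle function is a hypothetical helper, included only to show that a caller can catch at the category level.

```python
# Hypothetical sketch of pipeline/exceptions.py. The real module may add
# attributes, docstrings, or custom constructors.
class PipelineError(Exception):
    """Base class for all pipeline errors."""


class ConfigurationError(PipelineError): ...
class InvalidConfigError(ConfigurationError): ...
class MissingConfigError(ConfigurationError): ...

class APIError(PipelineError): ...
class APIClientError(APIError): ...
class APIAuthenticationError(APIError): ...
class APIRateLimitError(APIError): ...
class APITimeoutError(APIError): ...

class ProcessingError(PipelineError): ...
class PageProcessingError(ProcessingError): ...
class DetectionError(ProcessingError): ...
class RecognitionError(ProcessingError): ...
class RenderingError(ProcessingError): ...

class FileError(PipelineError): ...
class FileLoadError(FileError): ...
class FileSaveError(FileError): ...
class FileFormatError(FileError): ...

class DependencyError(PipelineError): ...


def handle(exc: Exception) -> str:
    """Hypothetical helper: report which handler level catches exc."""
    try:
        raise exc
    except APIError:
        # Catches every API subtype (auth, rate limit, timeout, client)
        return "api"
    except PipelineError:
        # Catches any other pipeline error
        return "pipeline"


print(handle(APIRateLimitError("429")))   # api
print(handle(FileLoadError("missing")))   # pipeline
```

Catching a category class such as APIError is useful when the caller reacts the same way to every error in that branch; catch the leaf class when the reaction differs.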

2. When to Use Each Exception¶
ConfigurationError¶
Use when dealing with configuration files, settings, or initialization parameters.
InvalidConfigError:
# Example: Invalid tier name
if tier not in ["free", "tier1", "tier2", "tier3"]:
    raise InvalidConfigError(f"Invalid tier: {tier}. Must be one of: free, tier1, tier2, tier3")

# Example: Malformed YAML
try:
    config = yaml.safe_load(f)
except yaml.YAMLError as e:
    raise InvalidConfigError(f"Malformed YAML in {config_file}: {e}") from e

MissingConfigError:
# Example: Missing API key
if not api_key:
    raise MissingConfigError("OpenAI API key not found. Set OPENAI_API_KEY environment variable.")

# Example: Required config file not found
if not config_file.exists():
    raise MissingConfigError(f"Configuration file not found: {config_file}")

APIError¶
Use when interacting with external APIs (OpenAI, Gemini, PaddleOCR-VL, etc.).
APIClientError:
# Example: Client initialization failure
try:
    self.client = OpenAI(api_key=api_key, base_url=base_url)
except TypeError as e:
    raise APIClientError(f"Failed to initialize OpenAI client: {e}") from e

APIAuthenticationError:
# Example: Invalid API key (from external library)
try:
    response = self.client.chat.completions.create(...)
except AuthenticationError as e:
    raise APIAuthenticationError(f"OpenAI authentication failed: {e}") from e

APIRateLimitError:
# Example: Rate limit exceeded
try:
    response = self.client.chat.completions.create(...)
except RateLimitError as e:
    raise APIRateLimitError(f"OpenAI rate limit exceeded: {e}") from e

APITimeoutError:
# Example: Request timeout
try:
    response = requests.get(url, timeout=30)
except requests.Timeout as e:
    raise APITimeoutError(f"API request timed out: {e}") from e

ProcessingError¶
Use when processing documents through the pipeline stages.
PageProcessingError:
# Example: Page rendering failure
try:
    page_image = render_pdf_page(pdf_path, page_num)
except Exception as e:
    raise PageProcessingError(f"Failed to process page {page_num}: {e}") from e

DetectionError:
# Example: Layout detection failure
try:
    blocks = self.detector.detect(page_image)
except Exception as e:
    raise DetectionError(f"Layout detection failed: {e}") from e

RecognitionError:
# Example: Text recognition failure
try:
    text = self.recognizer.extract_text(block_image)
except Exception as e:
    raise RecognitionError(f"Text recognition failed for block: {e}") from e

RenderingError:
# Example: Markdown conversion error
try:
    markdown = block_to_markdown(block)
except Exception as e:
    raise RenderingError(f"Failed to render block to markdown: {e}") from e

FileError¶
Use for file I/O operations.
FileLoadError:
# Example: File not found
if not file_path.exists():
    raise FileLoadError(f"File not found: {file_path}")

# Example: PDF loading failure
try:
    images = convert_from_path(str(pdf_path))
except Exception as e:
    raise FileLoadError(f"Failed to load PDF: {e}") from e

FileSaveError:
# Example: Permission denied or disk full
try:
    with open(output_file, "w") as f:
        json.dump(data, f)
except PermissionError as e:
    raise FileSaveError(f"Permission denied writing to {output_file}: {e}") from e
except OSError as e:
    # Any other I/O failure, e.g. disk full
    raise FileSaveError(f"Failed to save file {output_file}: {e}") from e

FileFormatError:
# Example: Invalid PDF
if not is_valid_pdf(file_path):
    raise FileFormatError(f"Invalid PDF file: {file_path}")

# Example: Unsupported image format
if file_path.suffix.lower() not in [".png", ".jpg", ".jpeg"]:  # lower() so ".PNG" is accepted too
    raise FileFormatError(f"Unsupported image format: {file_path.suffix}")

DependencyError¶
Use when optional dependencies are missing or incompatible.
# Example: Missing optional dependency
try:
    import fitz  # PyMuPDF
except ImportError as e:
    raise DependencyError("PyMuPDF is required for multi-column detection. Install with: uv pip install pymupdf") from e

# Example: Incompatible version (assumes `from packaging import version`)
if version.parse(mineru.__version__) < version.parse("0.8.0"):
    raise DependencyError(f"MinerU version {mineru.__version__} is not supported. Requires >= 0.8.0")


3. Exception Handling Best Practices¶
3.1. Catch Specific Exceptions¶
✅ Good:
try:
    config = yaml.safe_load(f)
except yaml.YAMLError as e:
    raise InvalidConfigError(f"Malformed YAML: {e}") from e
except OSError as e:
    raise FileLoadError(f"Failed to read config file: {e}") from e

❌ Bad:
try:
    config = yaml.safe_load(f)
except Exception as e:  # Too broad!
    logger.error("Error: %s", e)

3.2. Exception Chaining¶
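Always use raise ... from e when wrapping an exception. It records the original exception as __cause__, so the traceback shows the underlying failure as the direct cause: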
✅ Good:
try:
    response = self.client.chat.completions.create(...)
except AuthenticationError as e:
    raise APIAuthenticationError(f"Authentication failed: {e}") from e

❌ Bad:
try:
    response = self.client.chat.completions.create(...)
except AuthenticationError as e:
    raise APIAuthenticationError(f"Authentication failed: {e}")  # No explicit cause recorded!
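To make the effect of from e concrete, here is a small self-contained sketch. APIAuthenticationError is redefined as a stand-in, and call_api is a hypothetical function that simulates a third-party failure with a built-in PermissionError; neither reflects the project's actual code.

```python
class APIAuthenticationError(Exception):
    """Stand-in for the project's exception so the snippet runs on its own."""


def call_api():
    # Hypothetical third-party call that fails
    raise PermissionError("401 Unauthorized")


def wrapped_call():
    try:
        call_api()
    except PermissionError as e:
        # `from e` stores the original exception on __cause__
        raise APIAuthenticationError(f"Authentication failed: {e}") from e


try:
    wrapped_call()
except APIAuthenticationError as err:
    caught = err

print(type(caught.__cause__).__name__)  # PermissionError
```

Without from e, Python still attaches the original exception as implicit context (__context__), but __cause__ stays None and the traceback describes the second error as happening "during handling" of the first rather than being caused by it.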

3.3. When to Use Broad Exception Handlers¶
Broad exception handlers (except Exception) are only allowed in these cases:


1. Top-level CLI entry points (with # noqa: BLE001 comment):

# main.py
try:
    result = pipeline.process(pdf_path)
except Exception as exc:  # noqa: BLE001 - retain broad logging for CLI
    logger.error("Unexpected error: %s", exc, exc_info=True)
    return 1

2. Optional dependency guards (with # pragma: no cover comment):

try:
    import fitz
except Exception:  # pragma: no cover - optional dependency guard
    fitz = None

3. Fallback for unexpected errors (after catching specific errors):

try:
    response = self.client.chat.completions.create(...)
except AuthenticationError as e:
    raise APIAuthenticationError(f"Authentication failed: {e}") from e
except RateLimitError as e:
    raise APIRateLimitError(f"Rate limit exceeded: {e}") from e
except Exception as e:
    # Fallback for unexpected errors (document why!)
    logger.error("Unexpected API error: %s", e)
    return {"error": "api_error", "message": str(e)}



3.4. Re-raising Exceptions¶
When you want to log an error but still propagate it:
try:
    page_result = self._process_pdf_page(pdf_path, page_num)
except PageProcessingError as e:
    logger.error("Page %d processing failed: %s", page_num, e)
    raise  # Re-raise the same exception
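A bare raise re-raises the exception currently being handled, with its original traceback intact. The sketch below illustrates this; PageProcessingError is a stand-in class and process_page is a hypothetical function that always fails.

```python
import logging
import traceback

logger = logging.getLogger("reraise_demo")


class PageProcessingError(Exception):
    """Stand-in for the project's exception so the snippet runs on its own."""


def process_page(page_num):
    # Hypothetical worker that always fails
    raise PageProcessingError(f"page {page_num} is corrupt")


def process_with_logging(page_num):
    try:
        process_page(page_num)
    except PageProcessingError as e:
        logger.error("Page %d processing failed: %s", page_num, e)
        raise  # same exception object, original traceback preserved


original = None
try:
    process_with_logging(3)
except PageProcessingError as e:
    original = e

# The traceback still reaches back to the function that raised first
frames = [frame.name for frame in traceback.extract_tb(original.__traceback__)]
print("process_page" in frames)  # True
```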


4. Error Logging Standards¶
4.1. Logging Levels¶
Use appropriate logging levels:



| Level | When to Use | Example |
|-------|-------------|---------|
| `DEBUG` | Detailed diagnostic information | `logger.debug("Processing block %d of %d", i, total)` |
| `INFO` | General informational messages | `logger.info("Loaded %d pages from %s", len(pages), pdf_path)` |
| `WARNING` | Recoverable errors, fallback used | `logger.warning("PyMuPDF not available, using basic sorter")` |
| `ERROR` | Serious errors, operation failed | `logger.error("Failed to process page %d: %s", page_num, e)` |
| `CRITICAL` | Critical errors, system cannot continue | `logger.critical("API key not found, cannot proceed")` |



4.2. Log Message Format¶
Format: "Action failed: %s", error_details
✅ Good:
logger.error("Failed to load config file %s: %s", config_path, e)
logger.warning("Rate limit reached. Waiting %.2f seconds...", wait_time)
logger.info("Processed %d pages in %.2f seconds", page_count, elapsed)

❌ Bad:
logger.error(f"Failed to load config file {config_path}: {e}")  # Don't use f-strings!
logger.error("Error!")  # Not descriptive!
logger.error(str(e))  # Missing context!

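Why avoid f-strings in logging?
- An f-string is built before the logging call runs, so the formatting cost is paid even when the level is disabled
- With %s arguments, the message is formatted only if the record is actually emitted
- The message template stays constant, which lets log aggregation tools group records by template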
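The difference can be observed directly. The following sketch uses a hypothetical Expensive class that counts how often it is converted to a string, and a logger whose level filters out DEBUG records.

```python
import logging


class Expensive:
    """Hypothetical object that counts its string conversions."""

    def __init__(self):
        self.str_calls = 0

    def __str__(self):
        self.str_calls += 1
        return "expensive value"


logger = logging.getLogger("lazy_demo")
logger.setLevel(logging.WARNING)  # DEBUG records are discarded

lazy = Expensive()
logger.debug("Value: %s", lazy)  # level check fails first, so no formatting

eager = Expensive()
logger.debug(f"Value: {eager}")  # f-string is built before debug() is called

print(lazy.str_calls, eager.str_calls)  # 0 1
```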
4.3. Exception Stack Traces¶
Use exc_info=True to include full stack trace:
✅ Good:
try:
    result = process_page(page_num)
except PageProcessingError as e:
    logger.error("Error processing page %d: %s", page_num, e, exc_info=True)

When to use exc_info=True?
- For unexpected errors at top-level handlers
- When debugging complex issues
- For errors that should never happen
When NOT to use exc_info=True?
- For expected errors (rate limits, missing files)
- For informational warnings
- When stack trace would be too verbose
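To see what exc_info=True adds to a record, the sketch below logs the same error twice into an in-memory stream. The logger name and the failing process_page function are assumptions made for the demonstration.

```python
import io
import logging

stream = io.StringIO()
logger = logging.getLogger("exc_info_demo")
logger.setLevel(logging.ERROR)
logger.propagate = False  # keep the demo output out of the root logger
logger.addHandler(logging.StreamHandler(stream))


def process_page(page_num):
    # Hypothetical worker that always fails
    raise ValueError(f"bad page {page_num}")


try:
    process_page(7)
except ValueError as e:
    logger.error("Without trace: %s", e)
    without_trace = stream.getvalue()
    logger.error("With trace: %s", e, exc_info=True)
    with_trace = stream.getvalue()[len(without_trace):]

print(without_trace)  # one line: the message only
print(with_trace)     # message followed by the full traceback
```

Inside an except block, logger.exception(...) is a shorthand for logger.error(..., exc_info=True).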

5. Error Recovery Strategies¶
5.1. Graceful Degradation¶
Continue processing with reduced functionality:
try:
    import fitz  # PyMuPDF
except ImportError:
    logger.warning("PyMuPDF not available. Multi-column detection disabled.")
    fitz = None

# Later in code
if fitz is not None:
    # Use advanced multi-column detection
    layout = detect_multi_column_layout(page)
else:
    # Fallback to basic processing
    layout = None

5.2. Retry Logic¶
Retry on transient failures:
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    retry=retry_if_exception_type(APITimeoutError),  # retry only transient timeouts
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10),
    reraise=True,
)
def api_call_with_retry(self):  # method on the API client class
    try:
        return self.client.chat.completions.create(...)
    except APITimeoutError:
        logger.warning("API timeout, retrying...")
        raise
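tenacity hides the retry loop itself. As a rough illustration of what such a loop does, here is a dependency-free sketch. The helper name retry_on_timeout, its parameters, and the stand-in APITimeoutError class are assumptions for illustration, and the delay schedule is a simple doubling rather than tenacity's exact wait_exponential behavior.

```python
import time


class APITimeoutError(Exception):
    """Stand-in for the project's exception so the snippet runs on its own."""


def retry_on_timeout(func, attempts=3, base_delay=1.0, max_delay=10.0, sleep=time.sleep):
    """Call func, retrying on APITimeoutError with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except APITimeoutError:
            if attempt == attempts:
                raise  # out of attempts: propagate the last error
            # Delay doubles each attempt and is capped at max_delay
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            sleep(delay)


# Usage: a hypothetical call that times out twice, then succeeds
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise APITimeoutError("timed out")
    return "ok"


delays = []
result = retry_on_timeout(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)  # ok [1.0, 2.0]
```

Only the timeout error is retried here; any other exception propagates immediately, which keeps permanent failures such as authentication errors from being retried.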

5.3. Fallback Values¶
Return safe defaults on error:
def load_config(config_path: Path) -> dict[str, Any]:
    """Load configuration file with fallback to empty dict."""
    try:
        with open(config_path) as f:
            return yaml.safe_load(f) or {}  # safe_load returns None for an empty file
    except FileNotFoundError:
        logger.debug("Config file not found: %s", config_path)
        return {}
    except yaml.YAMLError as e:
        logger.warning("Failed to parse config file %s: %s", config_path, e)
        return {}


6. Testing Error Handling¶
Always test error paths:
def test_invalid_tier_raises_error():
    """Test that InvalidConfigError is raised for invalid tier."""
    with pytest.raises(InvalidConfigError, match="Invalid tier"):
        rate_limiter.set_tier_and_model("invalid_tier", "gemini-2.5-flash")

def test_missing_api_key_raises_error():
    """Test that MissingConfigError is raised when API key is missing."""
    with pytest.raises(MissingConfigError, match="API key not found"):
        OpenAIClient(api_key=None)

def test_file_not_found_raises_error():
    """Test that FileLoadError is raised for non-existent file."""
    with pytest.raises(FileLoadError, match="File not found"):
        load_pdf("/nonexistent/file.pdf")


7. Migration Guide¶
When migrating from except Exception to specific exceptions:
Step 1: Identify the error type¶
Look at what can go wrong in the try block:
# Before
try:
    config = yaml.safe_load(f)
except Exception as e:
    logger.error("Error: %s", e)
    return {}

Step 2: Choose the appropriate custom exception¶

YAML parsing error → InvalidConfigError
File I/O error → FileLoadError

Step 3: Add proper error context¶
# After
try:
    config = yaml.safe_load(f)
except yaml.YAMLError as e:
    raise InvalidConfigError(f"Malformed YAML in {config_file}: {e}") from e
except OSError as e:
    raise FileLoadError(f"Failed to read config file {config_file}: {e}") from e

Step 4: Update logging¶
# Caller code
try:
    config = load_config(config_path)
except InvalidConfigError as e:
    logger.error("Configuration error: %s", e)
    return {}
except FileLoadError as e:
    logger.warning("Config file not found: %s", e)
    return {}


Last Updated: 2025-01-26
See Also:
- pipeline/exceptions.py - Custom exception definitions
- CLAUDE.md - Project coding standards
- .cursorrules - Detailed coding guidelines
