Introduction

PreOCR is an open-source Python library for OCR detection and document extraction. It determines whether a document needs OCR before you run expensive processing. It analyzes PDFs, Office documents (DOCX, PPTX, XLSX), images, and text files to detect whether they are already machine-readable, which lets you skip OCR for 50–70% of documents and cut costs.

Use PreOCR to filter documents before sending them to Tesseract, AWS Textract, Google Vision, Azure Document Intelligence, or MinerU. It works offline and on CPU only, with 92–95% accuracy (100% on validation benchmark).

Page-Level Analytics

Detect text density, image-only flags, and font rendering health on a per-page basis. This page-level granularity matters for PDFs that mix digital and scanned pages.
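
To make the idea of per-page routing concrete, here is a minimal sketch of how text density and an image-only flag could split a mixed PDF into pages for native extraction and pages for OCR. It is not PreOCR's API or its actual heuristic: `PageStats`, `page_needs_ocr`, `split_pages`, and both thresholds are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical per-page measurements; not PreOCR's actual schema."""
    page_number: int
    text_chars: int        # characters recoverable from the native text layer
    image_coverage: float  # fraction of the page area covered by images (0.0 to 1.0)


def page_needs_ocr(page: PageStats,
                   min_chars: int = 50,
                   min_image_coverage: float = 0.5) -> bool:
    """Flag a page for OCR when it has little native text and is mostly image.

    Both thresholds are illustrative assumptions, not tuned values.
    """
    sparse_text = page.text_chars < min_chars
    mostly_image = page.image_coverage >= min_image_coverage
    return sparse_text and mostly_image


def split_pages(pages: list[PageStats]) -> tuple[list[int], list[int]]:
    """Return (page numbers for native extraction, page numbers for OCR)."""
    native, ocr = [], []
    for page in pages:
        (ocr if page_needs_ocr(page) else native).append(page.page_number)
    return native, ocr


# A mixed document: one digital page, two scanned pages, one blank page.
pages = [
    PageStats(page_number=1, text_chars=1800, image_coverage=0.05),
    PageStats(page_number=2, text_chars=0, image_coverage=0.98),
    PageStats(page_number=3, text_chars=12, image_coverage=0.90),
    PageStats(page_number=4, text_chars=0, image_coverage=0.0),
]
native_pages, ocr_pages = split_pages(pages)
print("native:", native_pages, "ocr:", ocr_pages)
```

Requiring both conditions keeps blank pages (no text, no image) out of the OCR queue. A real detector would likely weigh more signals, such as the font rendering health mentioned above.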

Cost Optimization

Save 50–70% in OCR costs by routing digital documents through native extraction. Processing is fast (< 1s per file), and extraction output is structured and type-safe (Pydantic/JSON/Markdown).
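
The savings figure follows from simple arithmetic: if a fraction of pages skips OCR, the OCR bill shrinks by the same fraction. The sketch below works this through. The function name `estimate_ocr_savings` and the price per 1,000 pages are hypothetical placeholders, not real provider pricing; substitute your own numbers.

```python
def estimate_ocr_savings(total_pages: int,
                         skip_fraction: float,
                         price_per_1000_pages: float) -> dict:
    """Estimate OCR spend with and without pre-filtering.

    skip_fraction is the share of pages routed to native extraction
    (the text above cites 50-70%); the price is a placeholder assumption.
    """
    if not 0.0 <= skip_fraction <= 1.0:
        raise ValueError("skip_fraction must be between 0 and 1")
    baseline = total_pages / 1000 * price_per_1000_pages
    with_routing = baseline * (1.0 - skip_fraction)
    return {
        "baseline": round(baseline, 2),
        "with_routing": round(with_routing, 2),
        "saved": round(baseline - with_routing, 2),
    }


# Hypothetical workload: 100,000 pages, 60% skipped, 1.50 per 1,000 pages.
estimate = estimate_ocr_savings(100_000, skip_fraction=0.6, price_per_1000_pages=1.50)
print(estimate)
```

With these assumed inputs, a baseline of 150.00 drops to 60.00 once 60% of pages bypass OCR. The estimate ignores the compute cost of the pre-check itself.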