Optical Character Recognition (OCR) converts images of text into machine-readable data.
But running OCR on every document, even when unnecessary, is slow and expensive.
Most OCR pipelines blindly run OCR on all files.
That leads to:
- Wasted compute resources
- Slower document pipelines
- Higher OCR processing costs
PreOCR solves this problem.
What Is PreOCR?
PreOCR is a fast, CPU-only document classification and OCR detection tool built in Python.
Instead of running OCR directly, PreOCR answers one critical question:
Does this document actually need OCR?
It analyzes documents and PDFs to detect:
- Native text availability
- Image-based content
- Layout complexity
- OCR confidence signals
If OCR is unnecessary, PreOCR tells you to skip it.
GitHub: https://github.com/yuvaraj3855/preocr
PyPI: https://pypi.org/project/preocr/
Why PreOCR Matters
Most document processing workflows suffer from OCR overuse.
Without PreOCR
- OCR on digital PDFs
- OCR on searchable invoices
- OCR on research papers with extractable text
With PreOCR
- Intelligent OCR detection
- Native text extraction when possible
- 50-70% OCR cost reduction
- Faster document processing
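To see where a figure like that could come from, here is a back-of-the-envelope sketch. It assumes flat per-page OCR pricing, so the saving simply tracks the share of pages that skip OCR. The function name, the page count, the 60% skip rate, and the price are all hypothetical placeholders, not PreOCR measurements.

```python
def estimate_ocr_savings(total_pages: int, skip_fraction: float, price_per_page: float) -> dict:
    """Estimate OCR spend with and without a pre-filter (illustrative arithmetic only)."""
    if not 0.0 <= skip_fraction <= 1.0:
        raise ValueError("skip_fraction must be between 0 and 1")
    # Cost if every page goes through OCR.
    baseline_cost = total_pages * price_per_page
    # Pages that still need OCR after the pre-filter.
    pages_sent_to_ocr = round(total_pages * (1.0 - skip_fraction))
    filtered_cost = pages_sent_to_ocr * price_per_page
    return {
        "baseline_cost": baseline_cost,
        "filtered_cost": filtered_cost,
        "savings": baseline_cost - filtered_cost,
        "pages_sent_to_ocr": pages_sent_to_ocr,
    }


# Hypothetical workload: 100,000 pages, 60% already have native text,
# and an assumed OCR price of 0.0015 per page.
estimate = estimate_ocr_savings(100_000, 0.60, 0.0015)
print(estimate)
```

Plug in your own page volume and your OCR provider's actual pricing to get a number that means something for your pipeline.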
Ideal for:
- Document intelligence systems
- Invoice & receipt automation
- Legal & compliance archives
- AI document ingestion pipelines
How PreOCR Works
PreOCR uses a hybrid OCR-decision pipeline:
Fast heuristics
- Text presence detection
- Character density checks
- Page-level signals
Layout analysis (optional)
- Image coverage detection
- OpenCV-based refinement
- Scanned vs digital PDF detection
Confidence scoring
- Transparent OCR decision
- Reason codes for automation
Simple documents resolve in <1 second
PreOCR never performs OCR itself
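The staged checks above can be pictured with a toy decision function. This is only an illustrative sketch, not PreOCR's actual implementation: the `PageSignals` and `PageDecision` types, the thresholds, the confidence values, and the reason-code strings are all made up for the example, and the caller is assumed to have already measured the native text and image coverage for the page.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PageSignals:
    """Signals already measured for one page (hypothetical structure)."""
    text: str              # natively extracted text, "" if none was found
    image_coverage: float  # fraction of the page area covered by images, 0.0 to 1.0


@dataclass
class PageDecision:
    needs_ocr: bool
    confidence: float
    reasons: List[str] = field(default_factory=list)


MIN_CHARS = 50            # assumed threshold for "enough" native text
MAX_IMAGE_COVERAGE = 0.8  # assumed threshold for "mostly an image"


def decide_page(signals: PageSignals) -> PageDecision:
    """Cheap heuristics first; confidence values here are arbitrary placeholders."""
    # Text presence and density: count non-whitespace characters.
    chars = len("".join(signals.text.split()))
    if chars == 0:
        return PageDecision(True, 0.95, ["NO_NATIVE_TEXT"])
    # Sparse text on a page dominated by an image looks like a scan.
    if chars < MIN_CHARS and signals.image_coverage >= MAX_IMAGE_COVERAGE:
        return PageDecision(True, 0.80, ["SPARSE_TEXT", "HIGH_IMAGE_COVERAGE"])
    # Sparse text without a large image: probably a short digital page.
    if chars < MIN_CHARS:
        return PageDecision(False, 0.60, ["SPARSE_TEXT"])
    return PageDecision(False, 0.90, ["NATIVE_TEXT_PRESENT"])


scanned = decide_page(PageSignals(text="", image_coverage=1.0))
digital = decide_page(PageSignals(text="Invoice total due in 30 days. " * 10, image_coverage=0.1))
print(scanned)
print(digital)
```

Returning reason codes next to the boolean is what makes a decision like this easy to audit and automate downstream.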
Key Features
- Fast CPU-only processing
- ~92-95% detection accuracy
- Page-level OCR decisions
- Structured native extraction
- Batch processing support
- Confidence scores & reason codes
- Easy OCR engine integration
Installation
Install directly from PyPI:
pip install preocr
For advanced layout analysis:
pip install "preocr[layout-refinement]"
Requires libmagic for file-type detection:
- macOS: brew install libmagic
- Linux: sudo apt install libmagic1
Usage Examples
Check if a File Needs OCR
from preocr import needs_ocr
result = needs_ocr("document.pdf")
print(result)
Returns:
- OCR required (true/false)
- Confidence score
- Reason codes
Extract Native Structured Data
from preocr import extract_native_data
data = extract_native_data("document.pdf")
print(data.elements)
Extracts:
- Text blocks
- Tables
- Layout elements
- Regions
Batch Processing
from preocr import BatchProcessor
processor = BatchProcessor(max_workers=8)
results = processor.process_directory("documents/")
results.print_summary()
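If you are curious what parallel batch classification could look like in general, here's a rough sketch built on the standard library. It is not how `BatchProcessor` is implemented; `classify_directory` and the `decide` callback are hypothetical names, and the demo uses a stand-in rule based on the file name so it runs without any real PDFs or OCR tooling.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict


def classify_directory(
    directory: str,
    decide: Callable[[str], bool],
    max_workers: int = 8,
) -> Dict[str, bool]:
    """Apply `decide` to every .pdf file under `directory` using a worker pool."""
    paths = sorted(str(p) for p in Path(directory).rglob("*.pdf"))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so flags line up with paths.
        flags = list(pool.map(decide, paths))
    return dict(zip(paths, flags))


# Demo with empty placeholder files and a stand-in decision rule.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("scan_001.pdf", "report.pdf", "notes.txt"):
        (Path(tmp) / name).write_bytes(b"")
    demo = classify_directory(tmp, decide=lambda p: Path(p).name.startswith("scan"))
    summary = {Path(p).name: flag for p, flag in demo.items()}

print(summary)
```

Threads are a reasonable default when the per-file work is mostly I/O. For CPU-heavy checks, `ProcessPoolExecutor` offers the same `map` interface, though the callback then has to be a picklable top-level function rather than a lambda.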
Real-World Use Cases
Large Document Repositories
Skip OCR for searchable PDFs before sending files to cloud OCR engines.
Automated Pipelines
Use PreOCR as a pre-filter before expensive OCR services.
AI & LLM Ingestion
Improve document quality and reduce token noise in AI pipelines.
PreOCR vs Traditional OCR Pipelines
| Feature | Traditional OCR | PreOCR |
|---|---|---|
| Runs OCR blindly | Yes | No |
| Detects native text | No | Yes |
| Page-level decisions | No | Yes |
| Cost optimization | No | 50-70% |
| CPU-only | Often GPU | Yes |
OCR Engine Compatibility
PreOCR is OCR-engine agnostic by design.
It does not bundle or directly integrate with any OCR engine. Instead, it acts as a decision layer that determines whether OCR should be executed in the first place.
Once PreOCR flags a document or page as OCR-required, you can forward it to any OCR engine of your choice, such as:
- Tesseract OCR
- AWS Textract
- Google Vision OCR
- Azure OCR
- MinerU or other paid OCR services
Note: These integrations are not built in.
PreOCR simply outputs a clear OCR decision and reason codes that make downstream OCR orchestration easy.
This design keeps PreOCR:
- Lightweight
- Flexible
- Vendor-neutral
PreOCR augments OCR workflows; it does not replace OCR engines.
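Here is one way the decision layer could sit in front of an OCR engine. Treat it as a sketch under assumptions: `route_document` is a hypothetical helper, the decision is assumed to be a dict with a `needs_ocr` key, and the three callables are stand-ins you would replace with your own decision function, native text extractor, and OCR client.

```python
from typing import Any, Callable, Dict

DecisionDict = Dict[str, Any]  # assumed shape: {"needs_ocr": bool, ...}


def route_document(
    path: str,
    decide: Callable[[str], DecisionDict],
    extract_native: Callable[[str], str],
    run_ocr: Callable[[str], str],
) -> Dict[str, Any]:
    """Send a file to OCR only when the decision layer asks for it."""
    decision = decide(path)
    # Fail safe: if the decision carries no "needs_ocr" key, fall back to OCR.
    if decision.get("needs_ocr", True):
        return {"path": path, "route": "ocr", "text": run_ocr(path), "decision": decision}
    return {"path": path, "route": "native", "text": extract_native(path), "decision": decision}


# Stand-in callables for the demo; swap in real integrations.
def fake_decide(path: str) -> DecisionDict:
    return {"needs_ocr": path.startswith("scan"), "reason": "demo rule"}


def fake_extract(path: str) -> str:
    return f"native text of {path}"


def fake_ocr(path: str) -> str:
    return f"ocr text of {path}"


routed = [
    route_document(p, fake_decide, fake_extract, fake_ocr)
    for p in ("scan_01.pdf", "invoice.pdf")
]
for item in routed:
    print(item["path"], "->", item["route"])
```

Passing the engines in as plain callables keeps the router vendor-neutral: switching OCR providers means changing one argument, not the routing logic.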
Frequently Asked Questions
Does PreOCR replace OCR engines? No. PreOCR only decides whether OCR is required.
Can PreOCR detect scanned PDFs? Yes. It analyzes layout, image density, and text presence.
Is PreOCR open-source? Yes. Fully open-source on GitHub and PyPI.
Does it support page-wise OCR decisions? Yes. Page-level detection is supported.
Final Thoughts
PreOCR helps teams:
- Save processing time
- Reduce OCR costs
- Optimize document pipelines
- Build smarter document workflows
If you work with PDFs, OCR, or document intelligence, PreOCR should be your first step.
GitHub: https://github.com/yuvaraj3855/preocr
PyPI: https://pypi.org/project/preocr/
Community & Support
Join the community and get support:
Slack: https://preocr.slack.com/
Happy building!