Optical Character Recognition (OCR) converts images of text into machine-readable data.
But running OCR on every document — even when unnecessary — is slow and expensive.
👉 Most OCR pipelines blindly run OCR on all files.
That leads to:
- ❌ Wasted compute resources
- 🐌 Slower document pipelines
- 💰 Higher OCR processing costs
PreOCR solves this problem.
🧠 What Is PreOCR?
PreOCR is a fast, CPU-only document classification and OCR detection tool built in Python.
Instead of running OCR directly, PreOCR answers one critical question:
Does this document actually need OCR?
It analyzes documents and PDFs to detect:
- Native text availability
- Image-based content
- Layout complexity
- OCR confidence signals
👉 If OCR is unnecessary, PreOCR tells you to skip it.
🔗 GitHub: https://github.com/yuvaraj3855/preocr
📦 PyPI: https://pypi.org/project/preocr/
⚡ Why PreOCR Matters
Most document processing workflows suffer from OCR overuse.
❌ Without PreOCR
- OCR on digital PDFs
- OCR on searchable invoices
- OCR on research papers with extractable text
✅ With PreOCR
- Intelligent OCR detection
- Native text extraction when possible
- 50–70% OCR cost reduction
- Faster document processing
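A saving in that range follows from simple arithmetic: when OCR is billed per page, the fraction saved equals the share of pages that can skip OCR. The sketch below uses made-up inputs (volume, share of digital pages, and price are all hypothetical, not measurements) to show the calculation.

```python
def ocr_cost(pages: int, price_per_page: float) -> float:
    """Cost of sending `pages` pages to an OCR service billed per page."""
    return pages * price_per_page


# All three inputs are hypothetical and exist only to illustrate the arithmetic
total_pages = 100_000        # monthly page volume
digital_share = 0.60         # share of pages with usable native text
price_per_page = 0.0015      # price per OCR page

baseline = ocr_cost(total_pages, price_per_page)
pages_needing_ocr = round(total_pages * (1 - digital_share))
filtered = ocr_cost(pages_needing_ocr, price_per_page)

# Fraction of the baseline bill that is avoided by skipping digital pages
saving = 1 - filtered / baseline
print(f"baseline={baseline:.2f} filtered={filtered:.2f} saving={saving:.0%}")
```

The actual saving depends entirely on how many of your documents already carry a usable text layer.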
Ideal for:
- Document intelligence systems
- Invoice & receipt automation
- Legal & compliance archives
- AI document ingestion pipelines
🔍 How PreOCR Works
PreOCR uses a hybrid OCR-decision pipeline:
Fast heuristics
- Text presence detection
- Character density checks
- Page-level signals
Layout analysis (optional)
- Image coverage detection
- OpenCV-based refinement
- Scanned vs digital PDF detection
Confidence scoring
- Transparent OCR decision
- Reason codes for automation
⚡ Simple documents resolve in <1 second
🚫 PreOCR never performs OCR itself
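The stages above can be pictured with a small sketch. This is not PreOCR's actual implementation: the `PageSignals` fields, the thresholds, and the reason-code strings are all hypothetical, chosen only to show how cheap text and image signals can combine into a decision with a confidence score and a reason code.

```python
from dataclasses import dataclass


@dataclass
class PageSignals:
    """Cheap per-page measurements (hypothetical field names)."""
    extracted_chars: int    # characters found in the native text layer
    image_coverage: float   # fraction of the page area covered by images, 0..1


@dataclass
class Decision:
    needs_ocr: bool
    confidence: float
    reason_code: str


# Illustrative thresholds, not PreOCR's real values
MIN_CHARS = 50
HIGH_IMAGE_COVERAGE = 0.8


def decide(page: PageSignals) -> Decision:
    """Combine text presence and image coverage into one OCR decision."""
    has_text = page.extracted_chars >= MIN_CHARS
    mostly_image = page.image_coverage >= HIGH_IMAGE_COVERAGE

    if has_text and not mostly_image:
        return Decision(False, 0.95, "NATIVE_TEXT")
    if not has_text and mostly_image:
        return Decision(True, 0.95, "SCANNED_PAGE")
    if not has_text:
        return Decision(True, 0.70, "NO_TEXT_LAYER")
    # Text layer present but the page is mostly image: ambiguous, so this
    # sketch leans toward OCR and reports low confidence
    return Decision(True, 0.55, "MIXED_CONTENT")


print(decide(PageSignals(extracted_chars=1800, image_coverage=0.05)))
print(decide(PageSignals(extracted_chars=0, image_coverage=0.98)))
```

Returning a reason code alongside the boolean is what lets downstream automation treat a confident "scanned page" differently from an ambiguous mixed page.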
🌟 Key Features
- ⚡ Fast CPU-only processing
- 🎯 ~92–95% detection accuracy
- 📄 Page-level OCR decisions
- 🧩 Structured native extraction
- 🔁 Batch processing support
- 📊 Confidence scores & reason codes
- 🛠 Easy OCR engine integration
📦 Installation
Install directly from PyPI:
pip install preocr
For advanced layout analysis:
pip install "preocr[layout-refinement]"
ℹ️ Requires libmagic for file-type detection:
- macOS: brew install libmagic
- Linux: sudo apt install libmagic1
⚙️ Usage Examples
🔹 Check if a File Needs OCR
from preocr import needs_ocr
# Ask whether the file needs OCR; PreOCR only decides and never runs OCR itself
result = needs_ocr("document.pdf")
print(result)
Returns:
- OCR required (true/false)
- Confidence score
- Reason codes
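One way to act on that result is a small routing function. The sketch below assumes the result behaves like a dictionary with `needs_ocr`, `confidence`, and `reason_code` keys. Those key names and the 0.6 threshold are assumptions for illustration, so check what `needs_ocr` actually returns in your installed version before relying on them.

```python
def route(result: dict, min_confidence: float = 0.6) -> str:
    """Return "ocr" or "native" for an OCR decision.

    `result` is assumed to look like
    {"needs_ocr": bool, "confidence": float, "reason_code": str}.
    """
    if result["needs_ocr"]:
        return "ocr"
    # A low-confidence "skip OCR" is treated as uncertain and sent to OCR anyway
    if result["confidence"] < min_confidence:
        return "ocr"
    return "native"


# Hand-written stand-ins for real results (not actual PreOCR output)
digital = {"needs_ocr": False, "confidence": 0.93, "reason_code": "NATIVE_TEXT"}
unsure = {"needs_ocr": False, "confidence": 0.40, "reason_code": "MIXED_CONTENT"}

print(route(digital))
print(route(unsure))
```

Falling back to OCR when confidence is low trades some cost for safety; a pipeline that prefers speed could log those cases for review instead.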
🔹 Extract Native Structured Data
from preocr import extract_native_data
# Read structured content straight from the file's native text layer (no OCR)
data = extract_native_data("document.pdf")
print(data.elements)
Extracts:
- Text blocks
- Tables
- Layout elements
- Regions
🔹 Batch Processing
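from preocr import BatchProcessor
# Classify every file in the directory with up to 8 workers
processor = BatchProcessor(max_workers=8)
results = processor.process_directory("documents/")
# Print an aggregate summary of the OCR decisions
results.print_summary()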
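For intuition about what batch classification involves, here is a standard-library sketch that fans a per-file check out over a thread pool and tallies the outcomes. It does not use or reproduce `BatchProcessor`: `classify` is a hypothetical stand-in that decides by file extension alone, so the example runs without PreOCR installed.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def classify(path: Path) -> bool:
    """Hypothetical stand-in: treat image files as needing OCR."""
    return path.suffix.lower() in {".png", ".jpg", ".jpeg", ".tiff"}


def classify_many(paths: list[Path], max_workers: int = 8) -> dict[Path, bool]:
    """Run `classify` over all paths in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        flags = list(pool.map(classify, paths))
    return dict(zip(paths, flags))


def summarize(results: dict[Path, bool]) -> Counter:
    """Count how many files need OCR and how many can skip it."""
    return Counter("needs_ocr" if flag else "skip_ocr" for flag in results.values())


files = [Path("invoice.pdf"), Path("scan_001.png"), Path("paper.pdf")]
summary = summarize(classify_many(files))
print(summary["needs_ocr"], "need OCR;", summary["skip_ocr"], "can skip")
```

In a real pipeline the stand-in would be replaced by an actual per-file check, and the summary is what tells you how much OCR volume the pre-filter removed.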
🧪 Real-World Use Cases
📄 Large Document Repositories
Skip OCR for searchable PDFs before sending files to cloud OCR engines.
⚙️ Automated Pipelines
Use PreOCR as a pre-filter before expensive OCR services.
🧠 AI & LLM Ingestion
Feed native text to AI pipelines wherever it exists instead of OCR output, which reduces recognition errors and token noise.
🔍 PreOCR vs Traditional OCR Pipelines
| Feature | Traditional OCR | PreOCR |
|---|---|---|
| Runs OCR blindly | ❌ Yes | ✅ No |
| Detects native text | ❌ No | ✅ Yes |
| Page-level decisions | ❌ No | ✅ Yes |
| Cost optimization | ❌ No | ✅ 50–70% |
| CPU-only | ❌ Often GPU | ✅ Yes |
🤝 OCR Engines Compatibility
PreOCR is OCR-engine agnostic by design.
It does not bundle or directly integrate with any OCR engine. Instead, it acts as a decision layer that determines whether OCR should be executed in the first place.
Once PreOCR flags a document or page as OCR-required, you can forward it to any OCR engine of your choice, such as:
- Tesseract OCR
- AWS Textract
- Google Vision OCR
- Azure OCR
- MinerU, or other open-source or paid OCR services
⚠️ Note: These integrations are not built-in.
PreOCR simply outputs a clear OCR decision and reason codes that make downstream OCR orchestration easy.
This design keeps PreOCR:
- Lightweight
- Flexible
- Vendor-neutral
PreOCR augments OCR workflows — it does not replace OCR engines.
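Because the decision is separate from the engine, the hand-off can be a plain function call. The sketch below is one possible shape, not something PreOCR ships: `OcrEngine`, `process`, and the two stub functions are hypothetical names, and a real client for Tesseract, Textract, or another service would sit behind the `engine` callable.

```python
from typing import Callable

# An OCR engine is modelled as any callable that maps a file path to text
OcrEngine = Callable[[str], str]


def process(path: str, needs_ocr: bool, engine: OcrEngine,
            read_native: Callable[[str], str]) -> str:
    """Send the file to `engine` only when the decision says OCR is required."""
    return engine(path) if needs_ocr else read_native(path)


# Stubs standing in for a real OCR client and a real native-text reader
def fake_engine(path: str) -> str:
    return f"[ocr text from {path}]"


def fake_native_reader(path: str) -> str:
    return f"[native text from {path}]"


print(process("scan.pdf", True, fake_engine, fake_native_reader))
print(process("report.pdf", False, fake_engine, fake_native_reader))
```

Swapping vendors then means passing a different callable, with no change to the decision step.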
❓ Frequently Asked Questions
Does PreOCR replace OCR engines?
No. PreOCR only decides whether OCR is required.
Can PreOCR detect scanned PDFs?
Yes. It analyzes layout, image density, and text presence.
Is PreOCR open-source?
Yes. Fully open-source on GitHub and PyPI.
Does it support page-wise OCR decisions?
Yes. Page-level detection is supported.
🚀 Final Thoughts
PreOCR helps teams:
- ✨ Save processing time
- ✨ Reduce OCR costs
- ✨ Optimize document pipelines
- ✨ Build smarter document workflows
If you work with PDFs, OCR, or document intelligence — PreOCR should be your first step.
👉 GitHub: https://github.com/yuvaraj3855/preocr
📦 PyPI: https://pypi.org/project/preocr/
💬 Community & Support
Join the community and get support:
💬 Slack: https://preocr.slack.com/
Happy building 🚀