Open-Source Python OCR Library

Stop Blindly Using OCR

PreOCR is an open-source Python OCR and document classification library that decides when you actually need vision OCR. Save up to 90% of GPU/CPU cycles by extracting native text directly, with sub-millisecond routing decisions, and running heavy OCR only when required.

• Cut OCR costs by up to 90% by avoiding unnecessary vision runs
• Keep text fidelity at 100% by using native PDF text whenever possible
• Ship fast with a Python SDK and simple HTTP API for your backend

pip install preocr



Analysis Mode: Autonomous_Router_v5.2

Document AI Routing Engine

Optimizing OCR & Native Text Extraction for RAG Pipelines

Fast-Track (Effort: Minimal)

Direct extraction of native vector layers. Zero vision compute required. Ideal for digital-first documents and system-generated PDFs.

Compute Effort: 2%

Cognitive Vision (Effort: Full Cognitive)

Advanced fallback for scanned images, poor-quality photos, or handwriting. Triggers VLM analysis for complex layout recovery.

Compute Effort: 94%

Decision Latency: 0.02ms (per-page analysis)

CPU Effort Saved: 84% avg (vs traditional OCR)

RAG Accuracy: 99.9% (native fidelity)

Precision Layout Understanding

Unstructured Layout Mapping • Real-Time Bounding Box Analysis


Document AI Use Cases at Scale

Healthcare, Invoices, Enterprise PDFs & RAG Pipelines

COMPLIANCE_MODE

Healthcare

Extract patient data from medical records and forms while maintaining privacy and speed.

Privacy Score: 100%

LOAD_FACTOR: 2% COMPUTE

BATCH_FINANCE

Invoices

Automate accounts payable by instantly extracting vendor, date, and total amounts.

Tax Accuracy: 99.9%

LOAD_FACTOR: 5% COMPUTE

DOC_ORCHESTRATION

Enterprise PDFs

Process thousands of contracts, reports, and whitepapers for search indexing.

Throughput: 50k/hr

LOAD_FACTOR: 12% COMPUTE

NEURAL_INGEST

AI/RAG Pipelines

Feed clean, structured text into your LLMs for better Retrieval Augmented Generation.

Token Quality: High

LOAD_FACTOR: 8% COMPUTE

10.2PB Processed

0.02ms Latency

94% CPU Saved

Analysis_Mode: Comparative_Benchmark_v1.0.2

PreOCR Benchmarks

Performance Matrix vs Industry Standards

ENGINE_MODULE	PREOCR_CORE (Target_Lead)	UNSTRUCTURED.IO	DOCUGAMI
Speed (LATENCY_CRITICAL)	✅ < 1s (internal benchmarks)	⚠️ 5-10 seconds	⚠️ 10-20 seconds
Cost Optimization (COMPUTE_EFFICIENCY)	✅ Optimized focus (skips OCR for 50-70%)	❌ Not publicly documented	❌ Not publicly documented
Page-Level Processing (GRANULAR_ROUTING)	✅ Native routing focus	❌ Entire-file focus	❌ Entire-file focus
Type Safety (SCHEMA_INTEGRITY)	✅ Pydantic-first architecture	⚠️ Basic dictionary	⚠️ Basic dictionary
Confidence Scores (VAL_CONF_V3)	✅ Per-element + overall	❌ No	✅ Yes
Forms Extraction (FORM_LOGIC)	✅ Optimized support	❌ No	✅ Yes
Markdown Output (RAG_INGEST_READY)	✅ LLM-ready Markdown	✅ Yes	⚠️ XML-only focus
Open Source (LICENSE_PERMISSIVE)	✅ Apache 2.0	✅ Partial / core	❌ Commercial

Benchmarking Sample: 10k Mixed-Format PDFs (v1.0.0_STABLE, OCT_2024_RUN)

Legal Accuracy: Validated by Pydantic (100%_PASS, SCHEMA_STRICT)

LEGAL_NOTICE: ACTIVE

BENCHMARK_DISCLAIMER: Performance metrics are derived from internal testing using standard document datasets (Oct 2024). Actual processing speeds, accuracy scores, and cost savings may vary depending on document complexity, infrastructure configuration, and third-party API latency. References to external entities (Unstructured.io, Docugami) are provided for illustrative comparative purposes based on publicly available documentation and may not reflect real-time updates to their respective platforms.

Pipeline_Phase: Contributor_Onboarding

How to Contribute

Scaling Open Source Extraction Intelligence

v1.0.1 | SOURCE_FORK_REQ

Fork the Repo

Create your own copy of the PreOCR repository on GitHub to initiate your local development stream.

Setup Impact: 15%

COMMIT_READY | STATUS: OK

v1.0.2 | ISSUE_TRIAGE

Find an Issue

Scan the backlog for issues labeled 'good first issue'. Identify the target module for optimization.

Triage Accuracy: 45%

COMMIT_READY | STATUS: OK

v1.0.3 | FEATURE_BRANCH

Create Branch

Develop your features or fixes in a clean, isolated branch. Maintain a semantic commit history.

Code Isolation: 70%

COMMIT_READY | STATUS: OK

v1.0.4 | CI_VAL_ACTIVE

Test Changes

Execute unit tests and ensure Pydantic models maintain 100% schema integrity across versions.

Test Coverage: 92%

COMMIT_READY | STATUS: OK

v1.0.5 | MERGE_READY

Open PR

Submit your PR for review. Join the core contributor list and help scale the future of PreOCR.

Merge Prob.: 100%

COMMIT_READY | STATUS: OK

Network_Status: Synchronized

Built by the Community

Scaling the Neural Knowledge Base

COMMUNITY_NODE_SYNC: ACTIVE

Ready to make an impact?

Join the growing list of contributors building the future of open source document processing. Every PR helps make extraction faster, more accurate, and more accessible globally.

Network Growth: 94%

Inflow Velocity: High

Good First Issues | Contribute on GitHub | Join Slack

ID: PR_STREAM_READY|MODE: OPEN_SOURCE_SCALE

Frequently Asked Questions

PreOCR & Document AI FAQ

Answers to the most common questions about routing, OCR, and Python integration

How can I reduce OCR costs in large-scale document processing?

You can reduce OCR costs by first classifying documents to determine whether OCR is actually required. In high-volume pipelines processing thousands or millions of documents per month, many PDFs already contain embedded machine-readable text and do not need OCR. A smart pre-processing layer like PreOCR detects scanned vs digital documents and runs OCR only when necessary, reducing API costs, compute usage, and overall processing time.

Do all PDFs require OCR?

No. Many PDFs are digitally generated and already contain selectable, machine-readable text. OCR is only required for scanned or image-based documents. Running OCR on text-based PDFs increases cost and latency without improving extraction quality.

How do I detect if a PDF is scanned or digitally generated?

A scanned PDF typically contains only page images and no embedded text layer, while a digitally generated PDF includes selectable text. Document analysis tools like PreOCR inspect file structure and text layers to determine whether OCR is required before processing.
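
As a rough illustration of the idea (not PreOCR's actual method), the sketch below looks for font resources and image XObjects in raw PDF bytes using only the Python standard library. The function name `classify_pdf_bytes` and its three labels are assumptions made for this example. Compressed object streams can hide these markers, so the function answers "unknown" in that case rather than guessing.

```python
import re


def classify_pdf_bytes(data: bytes) -> str:
    """Crude guess: 'digital', 'scanned', or 'unknown' (hypothetical labels)."""
    if not data.startswith(b"%PDF-"):
        return "unknown"  # not a PDF header, nothing to inspect

    # Font resources usually mean the file carries a text layer.
    has_font = re.search(rb"/Font\b", data) is not None
    # Image XObjects are how scanned pages are normally stored.
    has_image = re.search(rb"/Subtype\s*/Image\b", data) is not None
    # Object streams (PDF 1.5+) may compress the dictionaries we look for.
    has_object_stream = re.search(rb"/Type\s*/ObjStm\b", data) is not None

    if has_font:
        return "digital"
    if has_object_stream:
        return "unknown"  # markers could be hidden inside compressed objects
    if has_image:
        return "scanned"
    return "unknown"
```

A production check would parse the file properly and measure extractable text per page; this version only shows why a quick structural look can often avoid an OCR call.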

How can I optimize OCR in RAG pipelines?

In Retrieval-Augmented Generation (RAG) pipelines, OCR should only be applied to documents that truly require it. Adding a document classification step before OCR reduces token usage, infrastructure cost, and processing latency. PreOCR acts as a decision layer in document AI pipelines to ensure OCR is executed only when necessary.
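
A minimal sketch of such a decision layer at page level, assuming each page already exposes its embedded text and a rendered image. The `Page` type, the `extract_for_rag` function, and the `min_chars` threshold are all hypothetical names chosen for illustration, not part of PreOCR's API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Page:
    number: int
    native_text: str  # text from the embedded layer, "" if there is none
    image: bytes      # rendered page image, only used when OCR is needed


def extract_for_rag(
    pages: List[Page],
    run_ocr: Callable[[bytes], str],
    min_chars: int = 20,  # assumed threshold for "has a usable text layer"
) -> Tuple[List[Tuple[int, str, str]], int]:
    """Return (page number, source, text) triples and the number of OCR calls."""
    results = []
    ocr_calls = 0
    for page in pages:
        text = page.native_text.strip()
        if len(text) >= min_chars:
            # Enough embedded text: skip OCR for this page.
            results.append((page.number, "native", text))
        else:
            # Little or no embedded text: fall back to the OCR callable.
            ocr_calls += 1
            results.append((page.number, "ocr", run_ocr(page.image)))
    return results, ocr_calls
```

Routing per page rather than per file means a mostly digital document with one scanned appendix pays for OCR on that appendix only.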

What is PreOCR and how is it different from traditional OCR engines?

PreOCR is not an OCR engine. It is a smart document classification and OCR detection layer that determines whether OCR is needed before sending documents to engines such as Tesseract, AWS Textract, Google Vision, or Azure OCR. This approach prevents unnecessary OCR execution and improves efficiency in enterprise document processing systems.

Can PreOCR work with any OCR engine?

Yes. PreOCR is OCR-engine agnostic and integrates with any OCR provider, including open-source engines and cloud-based APIs. It simply determines whether OCR should run, allowing you to connect it to your preferred OCR system.
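
One way to picture this engine-agnostic design is a router that takes the OCR engine as a plain callable. The sketch below is an assumption-laden illustration: `make_router`, `needs_ocr`, and `extract_native` are hypothetical names, and any real engine would be wrapped in a function with the same shape.

```python
from typing import Callable

# Any OCR backend can be plugged in as long as it maps bytes to text.
OcrEngine = Callable[[bytes], str]


def make_router(
    needs_ocr: Callable[[bytes], bool],      # hypothetical decision function
    extract_native: Callable[[bytes], str],  # hypothetical native-text reader
    ocr_engine: OcrEngine,
) -> Callable[[bytes], str]:
    """Build a function that sends each document down exactly one path."""

    def route(document: bytes) -> str:
        if needs_ocr(document):
            return ocr_engine(document)
        return extract_native(document)

    return route
```

Swapping providers then means passing a different `ocr_engine` callable, with no change to the routing logic.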

How much can OCR costs be reduced using document pre-classification?

OCR costs can often be reduced by 30% to 70% depending on document mix and volume. In large-scale AI systems, a significant portion of PDFs are digitally generated and do not require OCR. By pre-classifying documents and skipping unnecessary OCR calls, organizations can significantly lower API expenses and compute overhead.
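
The arithmetic behind that range is simple: if a share of pages skips OCR, spend falls by the same share. The sketch below assumes a flat per-page price. The function name `ocr_spend`, the price, and the page counts in the usage lines are made-up illustration values, not measured figures.

```python
def ocr_spend(pages: int, price_per_page: float, skip_fraction: float = 0.0) -> float:
    """Estimated OCR spend when `skip_fraction` of pages bypass OCR."""
    if pages < 0 or price_per_page < 0:
        raise ValueError("pages and price_per_page must be non-negative")
    if not 0.0 <= skip_fraction <= 1.0:
        raise ValueError("skip_fraction must be between 0 and 1")
    return pages * (1.0 - skip_fraction) * price_per_page


# Hypothetical workload: one million pages at an assumed $0.0015 per page.
baseline = ocr_spend(1_000_000, 0.0015)
routed = ocr_spend(1_000_000, 0.0015, skip_fraction=0.6)
saving = 1.0 - routed / baseline  # equals the skip fraction under flat pricing
```

Real savings depend on the document mix and on pricing tiers, which is why the answer above gives a range rather than a single number.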

Is OCR needed for text-based PDFs in AI pipelines?

No, OCR is not needed for text-based PDFs that already contain embedded machine-readable text. In AI and RAG workflows, applying OCR to such documents increases cost and latency without improving results. A document decision layer can detect whether the PDF is scanned or digital and apply OCR only when required.
