High severity · Docling (PDF/OCR pipeline)

Handwritten content in scanned PDFs and images is extracted as garbled, incomplete, or missing text. Cursive notes, for example, are transcribed with numerous spelling errors, wrong words, and layout problems. Often only fields that resemble print (e.g., Customer ID) extract correctly while the rest fail, as reported in GitHub issue [#2395](https://github.com/docling-project/docling/issues/2395).

Root cause

Docling relies on general-purpose OCR engines such as EasyOCR (the default) and Tesseract. These engines perform well on printed text but are inaccurate on cursive or messy handwriting because they lack handwriting-specific models and training data. On handwritten notes they produce garbled output (e.g., 'educational' → 'eclucational'), as shown in benchmarks and GitHub issues.
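One way to put a number on how garbled a transcription is would be a character error rate (CER): the edit distance between a hand-checked reference and the OCR output, divided by the reference length. The sketch below is a minimal, standard-library illustration and is not part of Docling; the function names `edit_distance` and `character_error_rate` are hypothetical, and the only sample data is the 'educational' → 'eclucational' pair quoted above.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance, computed with a rolling one-row DP table."""
    # previous[j] = distance between the reference prefix seen so far
    # and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]  # turning i reference chars into "" costs i deletions
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete ref_char
                current[j - 1] + 1,      # insert hyp_char
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # 'd' misread as 'cl': one substitution plus one insertion.
    cer = character_error_rate("educational", "eclucational")
    print(f"{cer:.3f}")  # prints 0.182
```

Running this over a handful of manually transcribed pages, once per OCR engine, would give a rough basis for comparing engines on a given handwriting sample. CER is only one possible metric and says nothing about layout errors.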

docling, ocr, easyocr, tesseract, handwriting, accuracy
