Why is getting data out of PDF documents so hard?
PDF documents are everywhere. Unfortunately, while they are useful and pleasant for people, programmatic extraction of data from PDFs is incredibly challenging. This post talks about the history and structure of the PDF format, why it’s so ill-suited for data interchange, and what needs to happen in order to make PDF data recovery a reliable part of modern programming and business processes.
PDF documents are everywhere
Tax forms and invoices, resumés and recipes, insurance forms and financial disclosures, contracts and research, reports and more are all distributed as PDF documents. There’s no getting away from them, and they are often the sole source of data essential to your team and organization.
PDF was designed for human consumption
Viewing, reading, and (occasionally) printing PDF documents is very reliable and very pleasant. The original use case for PDF was to make it possible for documents to be reliably distributed and published without resorting to paper and with complete confidence that the reader will see exactly what the author intended. This is a huge benefit over other display and publishing technologies (like the Web), where document authors have little control over exactly how something will be rendered.
PDF is a terrible vehicle for data interchange (but we use it anyway)
While PDF solved a real problem of document reproduction and distribution, its most common use has come to be conveying data. Unfortunately, since PDF was designed to reproduce and replace paper documents with perfect fidelity, its “data model” is perhaps the least useful possible vis-à-vis data: a sequence of literal pages upon which individual characters, vector paths, and images are painted. The closest analogue to the PDF “data model” in modern computing environments is the HTML5 `<canvas>` element, or the equivalent raster contexts in programming languages like Java or C#. These are fundamentally one-way media.
So, even though every PDF document is generated from some authoritative data source(s) — textual content, bitmap images and vector illustrations, tabular and financial data, and so on — its original disposition towards the human concerns of reproducible display aesthetics means that all of the relationships among the data that give it meaning are lost. What’s left are individual characters, lines, imagery, and the whitespace that constituted the PDF author’s visual design and layout.
From an information-theoretic perspective, PDF generation is a lossy encoding, similar to many image compression algorithms, where an original bitmap can never be recovered from its compressed likeness. An even better analogy is a cipher, where the formatting and layout rules used by the creator of each PDF are equivalent to a secret key provided to a cipher to encrypt some data, and then are lost forever.
Practical consequences of distributing data via PDF documents
There are dozens of ways in which the PDF generation process is lossy, but two of the starkest examples are also the most consequential with regard to identifying and extracting data elements:
- PDFs carry no structural or logical model for the text they present; all text is stored and displayed as disconnected characters, not words or lines or sentences or blocks or paragraphs or columns. Of course, you can forget anything like an information model: a group of characters might be an address or a date or an invoice number, but nothing in the PDF will tell you that.
- Similarly, tabular or other structured data is rendered in PDF without any underlying model: rows and columns and headers and field labels are rendered using the same primitives as any other text, and the paths sometimes drawn to indicate tabular structure visually use the same primitives as any other vector illustration.
Any system that needs data represented in PDF documents cannot simply access it, as one might if given a database, or a CSV or Excel or Word file. No, a program tasked with extracting data from PDF documents must reconstruct it, inferring the structure and relationships between the marks on each virtual PDF page in order to hopefully produce a consistent result.
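To make the reconstruction problem concrete, here is a toy sketch (with invented glyph positions, not any real PDF library’s output) of the kind of inference an extraction program must perform: given only individually positioned characters, it has to guess where lines and words begin and end from geometry alone.

```python
# A toy illustration of PDF data "reconstruction": the input is loose,
# individually positioned glyphs; words must be inferred from geometry.
# All positions below are invented for the example.

from dataclasses import dataclass

@dataclass
class Glyph:
    char: str
    x: float      # horizontal position, in points
    y: float      # baseline position (PDF origin is bottom-left)
    width: float  # advance width of the glyph

def reconstruct_words(glyphs, gap_threshold=2.0, line_tolerance=1.0):
    """Group loose glyphs into words by inferring lines (shared baseline)
    and splitting on horizontal gaps wider than gap_threshold."""
    # Bucket glyphs into lines by near-equal baseline y.
    lines = {}
    for g in glyphs:
        key = round(g.y / line_tolerance)
        lines.setdefault(key, []).append(g)

    words = []
    for key in sorted(lines, reverse=True):      # top of page first
        line = sorted(lines[key], key=lambda g: g.x)
        current = line[0].char
        for prev, g in zip(line, line[1:]):
            if g.x - (prev.x + prev.width) > gap_threshold:
                words.append(current)            # wide gap => word boundary
                current = g.char
            else:
                current += g.char
        words.append(current)
    return words

# "Total due" painted as five + three separate, unrelated glyphs:
glyphs = [
    Glyph("T", 10, 700, 6), Glyph("o", 16, 700, 6), Glyph("t", 22, 700, 4),
    Glyph("a", 26, 700, 6), Glyph("l", 32, 700, 3),
    Glyph("d", 45, 700, 6), Glyph("u", 51, 700, 6), Glyph("e", 57, 700, 6),
]
print(reconstruct_words(glyphs))   # ['Total', 'due']
```

Even this simplistic heuristic has tunable thresholds that will misfire on kerned fonts, rotated text, or multi-column layouts — which is exactly why extraction results are so often inconsistent.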
This is such a challenging problem that most organizations either:
- Give up, file PDF documents away into a CMS or shared folder, and rely on indexed search to find useful information.
- Rely upon manual data entry to reconstitute the data that’s needed: literally having a person (usually many, many people) laboriously find, read, and type valuable data into a structured repository (often a database).
Reliably extracting data from PDF documents is possible
These problems are being solved today, in limited ways. Many different tools and services (including PDFDATA.io) make it possible to extract content from PDF documents (like unstructured text and embedded bitmap images), but very few even attempt to produce structured data, suitable for re-integration into spreadsheets, databases, and other datastores.
PDFDATA.io is changing that. The first step is its page-templates feature, which allows you to define fixed spatial templates that extract named data elements from particular locations on each PDF page. With a quality template, you really can reverse the “lossy” transformation that resulted in a PDF document, recovering the structured data used to produce it.
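The idea behind fixed spatial templates can be sketched in a few lines. (The field names, coordinates, and function below are hypothetical illustrations of the technique, not PDFDATA.io’s actual API.)

```python
# A minimal sketch of fixed spatial template extraction: each template
# field names a rectangle on the page, and any text positioned inside
# that rectangle is captured under that field's name.

def extract_fields(page_text_items, template):
    """page_text_items: list of (text, x, y) tuples from some PDF reader.
    template: dict mapping field name -> (x0, y0, x1, y1) page region."""
    result = {name: [] for name in template}
    for text, x, y in page_text_items:
        for name, (x0, y0, x1, y1) in template.items():
            if x0 <= x <= x1 and y0 <= y <= y1:
                result[name].append(text)
    return {name: " ".join(parts) for name, parts in result.items()}

# An invoice whose layout never varies: the invoice number is always
# printed top-right, and the total always bottom-right.
template = {
    "invoice_number": (400, 720, 580, 760),
    "total": (400, 40, 580, 80),
}
page = [
    ("INV-2017-042", 420, 735),
    ("Acme Corp", 50, 735),
    ("$1,240.00", 430, 55),
]
print(extract_fields(page, template))
# {'invoice_number': 'INV-2017-042', 'total': '$1,240.00'}
```

The whole approach rests on one assumption — that the layout is stable across documents — which is also its limitation, as the next paragraph explains.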
This approach doesn’t work with all PDF documents, unfortunately. Many classes of documents are simply too variable for page-templates’ fixed spatial identifications to work reliably. We’re just getting started, though. In due time, PDFDATA.io will be offering a general-purpose way to recover structured data from PDF documents. Stay tuned.