PDF documents are everywhere. Unfortunately, while they are useful and pleasant for people, programmatic extraction of data from PDFs is incredibly challenging. This post talks about the history and structure of the PDF format, why it's so ill-suited for data interchange, and what needs to happen in order to make PDF data recovery a reliable part of modern programming and business processes.
Tax forms and invoices, resumés and recipes, insurance forms and financial disclosures, contracts and research, reports and more are all distributed as PDF documents. There's no getting away from them, and they are often the sole source of data essential to your team and organization.
Viewing, reading, and (occasionally) printing PDF documents is reliable and pleasant. The original use case for PDF was to make it possible for documents to be distributed and published without resorting to paper, with complete confidence that readers will see exactly what the author intended. This is a huge benefit over other display and publishing technologies (like the Web), where document authors have little control over exactly how something will be rendered.
While PDF solved a real problem of document reproduction and distribution, its most common use has come to be conveying data. Unfortunately, since PDF was designed to reproduce and replace paper documents with perfect fidelity, its "data model" is perhaps the least useful possible vis-à-vis data: a sequence of literal pages upon which individual characters, vector paths, and images are painted. The closest analogue to the PDF "data model" in modern computing environments is the HTML5 <canvas>, or equivalent raster contexts in programming languages like Java or C#. These are fundamentally one-way media.
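To make the canvas analogy concrete, here is a minimal sketch of what a low-level PDF parser actually hands back from a page. It uses the pdfminer.six library and a placeholder filename, both assumptions for illustration; the point is the shape of the data: individual glyphs, each with a font and a bounding box, and nothing connecting them into words, fields, or records.

```python
# A minimal sketch, assuming the pdfminer.six library and a placeholder
# "invoice.pdf". Walking a page's layout tree yields individual
# characters at (x, y) positions -- not words, paragraphs, or tables.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextBox, LTTextLine

for page in extract_pages("invoice.pdf"):
    for element in page:
        if isinstance(element, LTTextBox):
            for line in element:
                if isinstance(line, LTTextLine):
                    for obj in line:
                        if isinstance(obj, LTChar):
                            # Each glyph stands alone: a character, a font,
                            # and a bounding box. Nothing records which word,
                            # field, or table cell it belongs to.
                            print(obj.get_text(), obj.fontname, obj.bbox)
```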
So, even though every PDF document is generated from some authoritative data sources: textual content, bitmap images and vector illustrations, tabular and financial data, and so on. But its original disposition toward the human concerns of reproducible display aesthetics means that all of the relationships that give the data meaning are lost. What's left are individual characters, lines, imagery, and the whitespace that constituted the PDF author's visual design and layout.
From an information-theoretic perspective, PDF generation is a lossy encoding, similar to many image compression algorithms, where the original bitmap can never be recovered from its compressed likeness. An even better analogy is a cipher: the formatting and layout rules used by each PDF's creator are equivalent to a secret key used to encrypt some data, a key that is then lost forever.
There are dozens of ways in which the PDF generation process is lossy, but two of the starkest examples are also the most consequential with regard to identifying and extracting data elements:

- Textual content is not stored as words, sentences, or paragraphs. Each character is positioned on the page independently, so even word boundaries must be inferred from the spacing between glyphs.
- Data structures are not represented at all. A "table" in a PDF is nothing more than a set of ruled lines and independently positioned runs of text; the rows, columns, and header relationships exist only in the reader's eye.
Any system that needs data represented in PDF documents cannot simply access it, as it could if given a database, or a CSV, Excel, or Word file. No, a program tasked with extracting data from PDF documents must reconstruct it, inferring the structure of and relationships between the marks on each virtual PDF page in the hope of producing a consistent result.
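To give a flavor of that inference, the sketch below tackles one of its smallest subproblems: deciding where one word ends and the next begins, using nothing but the horizontal gaps between glyph bounding boxes. The GAP threshold and the input format are illustrative assumptions; real documents need per-font, per-size tuning, which is exactly what makes this kind of reconstruction fragile.

```python
# A sketch of glyph-to-word reconstruction. GAP is an assumed threshold;
# in practice it varies with font, size, and justification.
GAP = 1.0  # max horizontal gap (in points) between glyphs of one word

def group_into_words(chars):
    """chars: (text, x0, x1) tuples for one line of glyphs."""
    words, current, last_x1 = [], [], None
    for text, x0, x1 in sorted(chars, key=lambda c: c[1]):
        if last_x1 is not None and x0 - last_x1 > GAP:
            words.append("".join(current))  # gap too wide: a word boundary
            current = []
        current.append(text)
        last_x1 = x1
    if current:
        words.append("".join(current))
    return words

# Glyphs painted for the text "Total 42", as a PDF might position them:
print(group_into_words([("T", 10, 16), ("o", 16, 21), ("t", 21, 24),
                        ("a", 24, 29), ("l", 29, 31), ("4", 40, 45),
                        ("2", 45, 50)]))
# -> ['Total', '42']
```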
This is such a challenging problem that most organizations either:

- give up on the data locked inside their PDF documents entirely, or
- fall back on manual re-keying, which is slow, expensive, and error-prone.
These problems are being solved today, though only in limited ways. Many different tools and services (including PDFDATA.io) make it possible to extract content from PDF documents (like unstructured text and embedded bitmap images), but very few even attempt to produce structured data suitable for re-integration into spreadsheets, databases, and other datastores.
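For instance, flat-text extraction (the kind of content extraction most such tools offer) is a one-liner with an off-the-shelf parser; pdfminer.six appears here purely as an illustration, since no particular library is implied above:

```python
# Unstructured content extraction: easy to do, but structure-free.
from pdfminer.high_level import extract_text

text = extract_text("report.pdf")  # placeholder filename
print(text)  # words and line breaks, but no fields, tables, or records
```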
PDFDATA.io is changing that. The first step is its page-templates operation, which allows you to define fixed spatial templates that extract named data elements from particular locations on each PDF page. With a quality template, you really can reverse the "lossy" transformation that resulted in a PDF document, recovering the structured data used to produce it.
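The sketch below illustrates the fixed-spatial-template idea locally with pdfminer.six rather than through the PDFDATA.io API; the template format, coordinates, and field names are all assumptions for illustration, not the actual page-templates syntax.

```python
# A minimal sketch of a fixed spatial template: named page regions that
# collect whatever glyphs fall inside them. Coordinates and field names
# are invented for illustration.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextBox, LTTextLine

# Named regions in page coordinates (x0, y0, x1, y1), e.g. for an invoice.
TEMPLATE = {
    "invoice_number": (430, 720, 560, 740),
    "total_due":      (430, 120, 560, 140),
}

def chars_in_region(page, region):
    rx0, ry0, rx1, ry1 = region
    for element in page:
        if isinstance(element, LTTextBox):
            for line in element:
                if isinstance(line, LTTextLine):
                    for ch in line:
                        if (isinstance(ch, LTChar)
                                and rx0 <= ch.x0 and ch.x1 <= rx1
                                and ry0 <= ch.y0 and ch.y1 <= ry1):
                            yield ch

def apply_template(pdf_path, template):
    page = next(iter(extract_pages(pdf_path)))  # first page only, for brevity
    return {
        # Sort into reading order (top-down, then left-right) and join.
        name: "".join(c.get_text()
                      for c in sorted(chars_in_region(page, region),
                                      key=lambda c: (-c.y0, c.x0)))
        for name, region in template.items()
    }

print(apply_template("invoice.pdf", TEMPLATE))
```

Hardcoded coordinates are what make this approach both precise and brittle: a quality template extracts exactly the right marks, but any drift in layout breaks it, which is the limitation described next.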
This approach doesn't work with all PDF documents, unfortunately. Many classes of documents are simply too variable for page-templates' fixed spatial identifications to work reliably. We're just getting started, though. In due time, PDFDATA.io will be offering a general-purpose way to recover structured data from PDF documents. Stay tuned.