If a PDF document appears to contain text when you view it on-screen, that text may actually be just an image of text, especially if the PDF was generated by scanning a paper document. Knowing whether a PDF is image-based matters if you plan to extract text or structured data from it, or from other PDFs from the same source: image-based documents need OCR preprocessing to add a layer of real characters to the PDF before text or structured data recovery will succeed.
In this post, you'll learn how to tell in seconds whether a PDF document is image-based or contains extractable text.
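As I discuss elsewhere, the "data model" provided by PDF documents is an impoverished one if your goal is to extract text or structured data from them. In short, every PDF is a sequence of instructions describing how to render each page to a (relatively low-level) graphics context. To produce a page that looks like a document you'd recognize (an invoice, a tax form, a resumé), PDF offers only a few options for rendering text:
- Drawing characters with a font that uses a standard encoding such as UTF-8, ISO-8859-1, or similar.
- Drawing characters with an image-based font embedded in the document, often with a custom encoding that isn't included in the PDF.
- Drawing an image of the text, as in a scanned page.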
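To make the "sequence of instructions" idea concrete, here is a rough sketch that counts two kinds of drawing operators in a page content stream: text-showing operators (`Tj`, `TJ`) and external-object paints (`Do`, which is how images are typically placed). The sample stream is hand-written and hypothetical, and `count_operators` is a name made up for this illustration. Real content streams are usually compressed and far messier, so treat this as a picture of the idea and not as a working PDF parser.

```python
import re

# Hypothetical, hand-written page content stream.
# Real streams are usually compressed and must be decoded first.
SAMPLE_STREAM = b"""
BT
/F1 12 Tf
72 720 Td
(Invoice #1001) Tj
ET
q
200 0 0 100 72 500 cm
/Im1 Do
Q
"""

def count_operators(stream: bytes) -> dict:
    """Count text-showing operators (Tj, TJ) and XObject paints (Do)."""
    # Blank out literal strings first so operator-like words inside (...)
    # are not counted. This simple pattern handles escaped characters
    # but not nested parentheses.
    without_strings = re.sub(rb"\((?:\\.|[^\\()])*\)", b"()", stream)
    tokens = without_strings.split()
    return {
        "text_show": sum(t in (b"Tj", b"TJ") for t in tokens),
        "xobject_paint": sum(t == b"Do" for t in tokens),
    }

print(count_operators(SAMPLE_STREAM))
```

A page that only ever paints images and never shows text is a strong hint of a scan, though a page that does show text may still use a font whose character codes can't be mapped back to real characters.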
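Let's run through how to tell when each method has been used in rendering a PDF document. The only tools you'll need are a PDF viewer and a text editor:
1. Open the PDF in your viewer and try to select some of the text on a page.
2. Copy the selection (Ctrl-C or Cmd-C, depending on your operating system).
3. Paste it into your text editor (Ctrl-V or Cmd-V). If you see the same characters that you selected, great! You have a PDF that contains extractable text.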
In summary, if you can select and then copy-and-paste text from a PDF document and the pasted result contains the same characters as what you selected, then you have a PDF ready for text and structured data extraction.
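If you'd rather not compare the two by eye, the "same characters" check can be sketched in a few lines of Python. This assumes you type what you see on-screen into `seen` by hand and paste the clipboard contents into `pasted`. Both function names are made up for this example, and the normalization rules (fold ligatures, collapse whitespace) are my assumption about what should count as "the same".

```python
import unicodedata

def normalize(text: str) -> str:
    """Fold compatibility forms (e.g. ligatures) and collapse whitespace."""
    folded = unicodedata.normalize("NFKC", text)
    return " ".join(folded.split())

def paste_matches(seen: str, pasted: str) -> bool:
    """True if the pasted text has the same characters as the text seen on-screen."""
    seen_norm = normalize(seen)
    # An empty selection proves nothing, so it never counts as a match.
    return bool(seen_norm) and seen_norm == normalize(pasted)

# Line breaks and doubled spaces introduced by the viewer are ignored.
print(paste_matches("Total due: $42.00", "Total  due:\n$42.00"))
```

Whitespace is collapsed because PDF viewers often insert line breaks or extra spaces when copying, which doesn't mean the text is unusable.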
Following the same procedure as above, if the text that you paste into your text editor is "junk" — nonsense words, strange glyphs, or perhaps even whitespace — then your PDF document does not contain extractable text.
In this case, text is being rendered using an image-based font embedded in the document. Such fonts often have custom encodings that aren't included in the PDF, and so the character codes used to render text cannot be used to extract that text.
Unfortunately, this case is equivalent to fully image-based PDF documents: an OCR preprocessing step will be needed before any text or structured data extraction will be possible.
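Some kinds of junk can be flagged automatically. The sketch below is a heuristic of my own, not an established test: it measures what fraction of the pasted characters are control characters, private-use code points, unassigned code points, or the Unicode replacement character. The 0.2 threshold is an arbitrary assumption. It will not catch nonsense words made of ordinary letters, so a human glance is still the more reliable check.

```python
import unicodedata

# Unicode categories that rarely appear in ordinary prose:
# Cc = control, Co = private use, Cn = unassigned, Cs = surrogate.
SUSPECT_CATEGORIES = {"Cc", "Co", "Cn", "Cs"}

def suspect_ratio(pasted: str) -> float:
    """Fraction of non-whitespace characters that look like encoding junk."""
    chars = [c for c in pasted if not c.isspace()]
    if not chars:
        return 1.0  # nothing but whitespace counts as junk
    suspect = sum(
        1 for c in chars
        if c == "\ufffd" or unicodedata.category(c) in SUSPECT_CATEGORIES
    )
    return suspect / len(chars)

def looks_like_junk(pasted: str, threshold: float = 0.2) -> bool:
    """True if enough of the pasted text is suspect (threshold is an assumption)."""
    return suspect_ratio(pasted) >= threshold

print(looks_like_junk("Invoice total: 42"))
print(looks_like_junk("\ue001\ue002\ue003 \x01"))
```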
Following the same procedure as above, if you cannot even select text in a source PDF, then the document is almost surely image-based. This is extremely common for scanned PDF documents, where PDF is just being used as a container for a series of page scans.
Scanned PDF documents need an OCR preprocessing step before any text or structured data extraction will be possible.
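Putting the three outcomes side by side, the decision comes down to a single rule, sketched here in Python. The `PasteOutcome` and `needs_ocr` names are invented for this illustration and aren't part of any tool.

```python
from enum import Enum

class PasteOutcome(Enum):
    """The three possible results of the select, copy, and paste test."""
    SAME_TEXT = "pasted text matches what was selected"
    JUNK_TEXT = "pasted text is junk"
    NOT_SELECTABLE = "text cannot be selected"

def needs_ocr(outcome: PasteOutcome) -> bool:
    """Only a faithful paste means the PDF is ready for extraction as-is."""
    return outcome is not PasteOutcome.SAME_TEXT

for outcome in PasteOutcome:
    print(f"{outcome.value}: OCR needed = {needs_ocr(outcome)}")
```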
Reliably extracting text content or structured data from image-based PDFs (whether the result of scanning paper documents, or PDFs that use image-based fonts as discussed above) requires first applying an OCR (Optical Character Recognition) process to those PDFs. An OCR pass will add a layer of real text on top of the original page images, which a tool or service like PDFDATA.io can then use to extract bodies of text, or use as the basis for structured data recovery.
(The output of an OCR process over PDF that produces a text-enriched PDF is sometimes called a "searchable PDF", since it can be readily indexed and found by e.g. a CMS or search engine.)
We don't yet offer an integrated OCR step via PDFDATA.io, but if you are planning on using our services to drive your text and structured data extraction, ask us which OCR tool or vendor would be most appropriate for your situation. There are many different OCR tools and vendors, each with different tradeoffs in the types of documents for which it is best suited.