Turn PDF documents into data sources

pdfQL is our query language built for PDFs that turns them into a source of reliable structured data.

PDFDATA makes using pdfQL easy, with simple in-browser tools anyone can use, and a set of APIs for developers to automate workflows.

pdfQL: No Magic, Just the Right Tool

The best parts of SQL, for PDFs

  • Uses a declarative model: describe the shape of data you want, not how to find it
  • Indexes data primitives to make queries fast
  • Offers arbitrary predicates and familiar relations (equivalent to AND, OR, NOT, …) to filter or expand results with unlimited flexibility
  • Requires no training on or knowledge of the (1000+ page) PDF specification

Data how you want it

You get to name every data element in your pdfQL queries, and thus every data element produced by running them. Match those names to those used in your existing databases, spreadsheets, and other systems for easy-peasy ingestion.

Flip one query flag, and you can receive extracted data in JSON or CSV to match your existing tools and skills.
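To make the naming idea concrete, here is a minimal Python sketch of what ingestion could look like when output field names match your database columns. Everything specific in it is an assumption for illustration: the CSV sample, the field names (`invoice_number`, `invoice_date`, `total_due`), and the `invoices` table are hypothetical, not real pdfQL output or a PDFDATA API.

```python
import csv
import io
import sqlite3

# Hypothetical CSV standing in for a query's output. The column names are
# assumptions chosen to match the table below, not real pdfQL output.
SAMPLE_CSV = """invoice_number,invoice_date,total_due
INV-001,2024-01-15,120.50
INV-002,2024-02-01,89.99
"""


def load_rows(conn, csv_text):
    """Insert CSV rows into the invoices table, keyed by header name."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Named placeholders pick values out of each row dict by column name,
    # so no renaming or remapping step is needed.
    conn.executemany(
        "INSERT INTO invoices (invoice_number, invoice_date, total_due) "
        "VALUES (:invoice_number, :invoice_date, :total_due)",
        rows,
    )
    return len(rows)


# An in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices ("
    "invoice_number TEXT, invoice_date TEXT, total_due REAL)"
)
inserted = load_rows(conn, SAMPLE_CSV)
```

Because the names already line up, the load step is a single insert with no per-field mapping code.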

Queries, not fragile, costly programs

Stop using general-purpose programming languages to build fragile "parsers" on top of badly extracted PDF text. That approach can't express even simple spatial relationships or style expectations.

pdfQL knows everything about your source documents: on-page positions; font styles, sizes, and colors; and the spatial relationships among content, lines, and boxes.

Expert personal support

While PDFDATA and pdfQL are in this early-bird phase, all paid plans include premium pdfQL authoring services. Yes, we'll write and maintain your pdfQL queries for you!

What does pdfQL look like?

pdfQL provides a vocabulary tailor-made for describing the data you expect to find in a PDF document. It lets you naturally refer to the aspects of PDF formatting that uniquely identify that data (text formatting, color, lines, images, and relative positions on a page) while ignoring formatting and style that are irrelevant or that often vary between documents carrying the same sorts of data.

A screenshot of PDFDATA's pdfQL editor, showing a simple sample query.

Ready to turn PDFs into data sources?

