pdfQL is coming
Turn PDFs into data sources
Get at the data and content trapped in PDF documents with a simple, featureful API and query language built by experts with decades of experience extracting structured data from PDFs.
pdfQL rolls up the core principles of familiar tools (including SQL and regular expressions) to make PDF documents a source of reliable structured data.
A basic pdfQL example!
pdfQL: the best parts of SQL, for PDFs
- Uses a declarative model: describe the shape of data you want, not how to find it
- Indexes data primitives to make queries fast
- Offers arbitrary predicates and familiar relations (AND, OR, NOT) to filter results with unlimited flexibility
Queries, not fragile, costly programs
Stop using general-purpose programming languages to build fragile "parsers" for badly extracted PDF text. Those programs can't express even simple spatial relationships and style expectations.
pdfQL queries are declarative, and they know everything about your source documents: on-page positions; font styles, sizes, and colors; and spatial relationships between content, lines, and boxes.
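To make the contrast concrete, here is a minimal Python sketch of the kind of question that needs position and style information. It is not pdfQL syntax and not pdfQL's API: the `Span` type, its fields, the `value_right_of` helper, and the sample values are all hypothetical, invented only for illustration. It finds "the value to the right of a bold label on the same line", a relationship that flat extracted text cannot express.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Span:
    """Hypothetical positioned text fragment (not a pdfQL type)."""
    text: str
    x: float          # left edge, in points
    y: float          # baseline, in points
    font_size: float
    bold: bool


def value_right_of(spans: Sequence[Span], label: str,
                   y_tolerance: float = 2.0) -> Optional[str]:
    """Return the text of the nearest span to the right of a bold label
    that sits on (roughly) the same baseline, or None if there is none."""
    # Style condition: only a bold span counts as the label.
    anchors = [s for s in spans if s.bold and s.text == label]
    if not anchors:
        return None
    anchor = anchors[0]
    # Spatial condition: to the right of the label, on the same line.
    candidates = [
        s for s in spans
        if s.x > anchor.x and abs(s.y - anchor.y) <= y_tolerance
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.x).text


# Made-up sample page content.
spans = [
    Span("Invoice No.", 72.0, 700.0, 10.0, True),
    Span("INV-0042", 160.0, 700.5, 10.0, False),
    Span("Total", 300.0, 400.0, 8.0, False),   # non-bold mention in body text
    Span("Total", 72.0, 120.0, 10.0, True),
    Span("$1,250.00", 160.0, 120.0, 12.0, False),
]

print(value_right_of(spans, "Total"))
print(value_right_of(spans, "Invoice No."))
```

Even this small case takes an explicit data model and hand-written filtering. A declarative query language aims to let you state the relationship itself and leave the searching to the engine.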
Data how you want it
You get to name every data element in your pdfQL queries, and thus every data element produced by running them. Match those names to those used in your existing databases and systems for easy-peasy ingestion.
Flip one query flag to receive extracted data as JSON or CSV, whichever is easiest to use with your existing tools and skills.
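As a rough illustration of why naming matters, the Python sketch below serializes the same named records as JSON and as CSV using only the standard library. The field names (`invoice_number`, `total`) and the sample values are hypothetical, and this is not pdfQL's output code or its exact output format. The point is only that the names chosen once become both the JSON keys and the CSV column headers.

```python
import csv
import io
import json

# Hypothetical extracted records; the keys stand in for names chosen in a query.
records = [
    {"invoice_number": "INV-0042", "total": "1250.00"},
    {"invoice_number": "INV-0043", "total": "310.50"},
]


def to_json(rows):
    """Serialize rows as a JSON array of objects keyed by field name."""
    return json.dumps(rows, indent=2)


def to_csv(rows):
    """Serialize rows as CSV with a header row built from the field names."""
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


print(to_json(records))
print(to_csv(records))
```

If the names match the columns of an existing table, either output can be loaded without a renaming step.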