As announced in June 2019 (“SafeDocs: DARPA Does PDF”), the PDF Association is now serving as an industry partner in the Defense Advanced Research Projects Agency (DARPA)-funded Safe Documents (SafeDocs) program. The goal of this fundamental research program is to develop novel parser methodologies for ensuring safety in digital content, whether document formats (such as PDF), specialized imaging formats (such as NITF) or streaming data protocols (such as DNS or MAVLink). The philosophy underlying SafeDocs' approach is that of ‘language-theoretic security’ (LangSec), which posits that “the only path to trustworthy software that takes untrusted inputs is treating all valid or expected inputs as a formal language, and the respective input-handling routines as a recognizer for that language. The recognition must be feasible, and the recognizer must match the language in required computation power.”
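To make the LangSec principle concrete, here is a minimal sketch in Python. It is not taken from any SafeDocs tool: the toy record language `(KEY=DIGITS;)*`, the function names `recognize` and `process`, and the choice of a hand-written finite automaton are all assumptions made for illustration. The language is regular, so a finite automaton is a recognizer of matching computational power, and the input is fully recognized before any of it is acted upon.

```python
from enum import Enum, auto
from typing import Dict


class State(Enum):
    KEY_START = auto()    # expecting the first letter of a key, or end of input
    KEY = auto()          # inside a key
    VALUE_START = auto()  # just saw '=', expecting the first digit
    VALUE = auto()        # inside a value


def _is_upper(ch: str) -> bool:
    return ch.isascii() and ch.isupper()


def _is_digit(ch: str) -> bool:
    return ch.isascii() and ch.isdigit()


def recognize(data: str) -> bool:
    """Return True only if data is in the toy language (KEY=DIGITS;)*."""
    state = State.KEY_START
    for ch in data:
        if state is State.KEY_START:
            if not _is_upper(ch):
                return False
            state = State.KEY
        elif state is State.KEY:
            if ch == "=":
                state = State.VALUE_START
            elif not _is_upper(ch):
                return False
        elif state is State.VALUE_START:
            if not _is_digit(ch):
                return False
            state = State.VALUE
        else:  # State.VALUE
            if ch == ";":
                state = State.KEY_START
            elif not _is_digit(ch):
                return False
    # Accept only if the input ended on a record boundary.
    return state is State.KEY_START


def process(data: str) -> Dict[str, int]:
    """Act on the input only after the whole of it has been recognized."""
    if not recognize(data):
        raise ValueError("input rejected: not in the expected language")
    records = (rec.split("=") for rec in data.split(";") if rec)
    return {key: int(value) for key, value in records}
```

The design point is the separation: `process` never touches input that `recognize` has not accepted in full, so malformed or unexpected data cannot reach the processing logic. Real formats such as PDF are far more complex than this toy language, which is part of what makes the research problem hard.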
Through the PDF Association's active involvement in the SafeDocs program, researchers from BAE Systems, Galois Inc., SRI International, Northrop Grumman Systems Corp., Lockheed Martin Corp., Kudu Dynamics, NASA JPL and the many university and cyber-security research teams supporting their efforts have rapidly deepened their understanding of PDF, both in terms of the file format itself and the realities of how industry and users work with PDF. Under SafeDocs, each research team is approaching the same high-level objectives from a different direction while also working collaboratively. The challenge for us, as their technology and industry guide, has been to support the many different kinds of microscopes being applied to PDF!
From the PDF industry perspective, a number of visible positive outcomes have already occurred in the short time since SafeDocs commenced:
Researchers initially focused on the GovDocs1 corpus (containing 239K PDF files), primarily because it is a freely available, well-studied real-world corpus collected from US government websites (.gov). However, from a PDF industry perspective, GovDocs1 is over a decade old and has limited PDF technical diversity due to the way it was sourced. SafeDocs researchers are now scaling their technologies to include additional corpora, including Common Crawl and an exciting new PDF-centric issue-tracker corpus under development by NASA JPL with guidance from the PDF Association. The Sixth LangSec IEEE S&P Workshop at the IEEE Security & Privacy Symposium 2020 will be held on May 21, and will include many SafeDocs researchers presenting their latest work on corpora, topological difference testing and other methodologies.
Researchers are demonstrating early progress towards stretching the underlying principles of linguistics, formal language theory, topology and applied category theory, type theory and various other advanced disciplines. New tooling is being rapidly prototyped supporting novel ideas, so that diverse combinations of theory and practice can be easily assessed for feasibility. PDF is being used to push the boundaries of these domains, with research outcomes destined to be broadly applicable to other digital formats.
Some components of the PDF Association's original plan of work, such as surveying industry for security processes and practices around PDF development, and technical benchmarking and metrics of “hidden corpora”, will become increasingly relevant once core research problems are addressed and SafeDocs technologies have matured.
From the PDF industry perspective, there is still a long way to go before the research outcomes in development are applicable to the day-to-day business of engineering reliable and interoperable PDF technologies (such as parser construction toolkits). In true research style, no one is entirely sure where these efforts will take us. Along the way, we believe there is real potential to gain important insights and leverage intermediate outcomes to improve interoperability, reliability and security across the PDF industry.
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001119C0079. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA). Approved for public release.
Peter Wyatt is the PDF Association’s CTO and an independent technology consultant with deep file format and parsing expertise. A developer and researcher, he has been actively working on PDF technologies for more than 20 years. He is Project co-Leader of ISO 32000 (the core PDF standard), co-chairs the PDF Association PDF TWG and is the PDF Association’s Principal Scientist leading …