PDF Association: At PDF Days Online 2021, you will be giving a presentation titled “Making sense of PDF structures in the wild at scale”. What is it about?
Tim Allison: Our team has been supporting secure parser developers working on the Defense Advanced Research Projects Agency's (DARPA) SafeDocs program. As part of that project, we have built a search and discovery system from open source tools that lets parser developers and specification writers analyze features extracted from millions of PDFs and find patterns in them. These patterns include, for example, correlations between structural elements and creator tools, along with many other critical features of PDFs at scale.
PDF Association: Who is your presentation aimed at?
Tim Allison: Anyone interested in making sense of PDFs as they are generated in the wild. As mentioned above, this includes anyone developing PDF processing software or writing specifications to improve such software. Our primary corpus derives from Common Crawl (https://commoncrawl.org/), and it offers a reasonable view into PDFs on the web.
PDF Association: What will the people who attend your presentation be able to take away from it?
Tim Allison: Lessons learned about scaling both the gathering of millions of PDFs and the extraction of features from them, as well as a sense of how useful analyzing PDFs at scale can be for anyone involved in PDF processing.
PDF Association: PDF Days Online 2021 has become the leading PDF event. What makes PDF Days unique in your mind?
Tim Allison: The opportunity to meet so many key industry stakeholders and to learn from the people developing the future of PDF.
PDF Association: Thank you! We look forward to seeing you at PDF Days Online 2021.