In PDF documents, as in HTML, content semantics are expressed via tags, hence “tagged PDF”. Tagged PDF allows for semantically accurate extraction and reuse of text and annotations, enabling accessibility, reflow and other applications.
Tagged PDF is an optional feature of the PDF file format, so not every PDF file is tagged. However, modern tools such as Apple’s office suite automatically add tags when exporting to PDF, and Google’s Chrome now creates tagged PDF as well. Older tools may require an explicit option to be enabled when exporting.
On the one hand, the fact that tags are optional means that PDF is extraordinarily flexible in accommodating every type of content from every source imaginable, even when the original source lacks semantics. On the other hand, tags require a knowledgeable document author and capable software to achieve good results.
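Because tagging is optional, it can be useful to check whether a given file claims to be tagged at all. The sketch below is a rough heuristic, not a conforming PDF parser: it scans the raw bytes for the two catalog entries a tagged PDF carries, a `/StructTreeRoot` and a `/MarkInfo` dictionary with `/Marked true`. The function name `looks_tagged` is hypothetical, and the check assumes the document catalog is stored uncompressed with `/MarkInfo` written as a direct dictionary. Files that keep the catalog in a compressed object stream will be reported as untagged even if they are not.

```python
import re

# Heuristic pattern: a /MarkInfo dictionary written inline that contains /Marked true.
# Assumption: /MarkInfo is a direct dictionary, not an indirect reference.
_MARKED_TRUE = re.compile(rb"/MarkInfo\s*<<[^>]*/Marked\s+true")


def looks_tagged(pdf_bytes: bytes) -> bool:
    """Return True if the raw bytes suggest the file is a tagged PDF.

    This is a byte-level heuristic only. It does not parse the file,
    resolve indirect objects, or decompress object streams.
    """
    has_struct_root = b"/StructTreeRoot" in pdf_bytes
    is_marked = _MARKED_TRUE.search(pdf_bytes) is not None
    return has_struct_root and is_marked


if __name__ == "__main__":
    # Tiny hand-written catalog fragments, used only for illustration.
    tagged = (
        b"1 0 obj\n<< /Type /Catalog /MarkInfo << /Marked true >> "
        b"/StructTreeRoot 5 0 R >>\nendobj"
    )
    untagged = b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj"
    print(looks_tagged(tagged), looks_tagged(untagged))
```

A positive result only means the file declares itself tagged. It says nothing about whether the tags are accurate or complete, which is where author knowledge and capable software come in.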
This article offers an overview of PDF industry activities pertaining to tagged PDF as realized in eight working groups of the PDF Association.
The PDF Association operates several types of working groups, each with a specific focus:
Working Groups (WG) operate on the basis of interest without a specific remit.
These groups are open to all PDF Association members, and in the case of LWGs, invited non-members who wish to participate in the development of specific resources.
Develops and supports the base PDF specification (ISO 32000)
Chaired by Martin Bailey and ISO 32000 Project co-Leader Peter Wyatt, the PDF Technical Working Group (TWG) provides resources and a community for developers working with ISO 32000, the core PDF specification. The core PDF specification defines the syntactic mechanisms by which tags are added to PDF documents and linked to graphical content.
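As a rough illustration of that linkage: in a page's content stream, tagged content is wrapped in marked-content sequences such as `/P <</MCID 0>> BDC ... EMC`, and structure elements in the document's structure tree refer back to those marked-content identifiers (MCIDs). The sketch below lists the tag and MCID pairs found in an already-decoded content stream. The function name `marked_content_ids` and the sample stream are hypothetical. The pattern handles only the simplest form, where the property list is an inline dictionary containing nothing but `/MCID`.

```python
import re

# Matches the opening of a marked-content sequence of the simplest form:
#   /Tag <</MCID n>> BDC
# Assumption: the property list is inline and holds only the /MCID entry.
_MCID_PATTERN = re.compile(rb"/(\w+)\s*<<\s*/MCID\s+(\d+)\s*>>\s*BDC")


def marked_content_ids(content_stream: bytes) -> list[tuple[str, int]]:
    """Return (tag, mcid) pairs in the order they appear in the stream.

    The stream is assumed to be decoded already (for example, after
    Flate decompression). This is a sketch, not a content-stream parser.
    """
    return [
        (match.group(1).decode("ascii"), int(match.group(2)))
        for match in _MCID_PATTERN.finditer(content_stream)
    ]


# Hand-written sample: a heading followed by a paragraph.
sample = (
    b"/H1 <</MCID 0>> BDC BT /F1 18 Tf 72 720 Td (Title) Tj ET EMC\n"
    b"/P <</MCID 1>> BDC BT /F1 12 Tf 72 690 Td (Body text) Tj ET EMC\n"
)

if __name__ == "__main__":
    print(marked_content_ids(sample))
```

Real-world streams often carry additional properties or reference named property lists, so production code should rely on a full PDF library rather than pattern matching.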
Today, the PDF TWG meets periodically to review and respond to comments on ISO 32000-2:2020, the latest edition of the PDF standard, via its public pdf-issues GitHub repo. Industry-agreed resolutions as adopted by the PDF TWG are posted for reference.
Although this TWG's focus is not limited to tagged PDF, all comments or concerns regarding the relevant text in ISO 32000 (clauses 14.6, 14.7 and 14.8, plus Annex L) should be raised in the pdf-issues repo on GitHub.
Develops and supports the specification for accessible PDF (ISO 14289)
This TWG focuses on supporting and enhancing the definition of PDF/UA as the strict PDF subset syntax required to support the screen readers, magnifiers, highlighters and other assistive technology (AT) that make content accessible to those with various types of disabilities. PDF/UA-1 was published as ISO 14289-1:2014 and was based on ISO 32000-1:2008 (PDF 1.7). PDF/UA-2, which supports PDF 2.0, is still under development in ISO TC 171 SC 2 WG 9 “PDF universal accessibility”.
Chaired by Klaas Posselt and Matthew Hardy, the PDF/UA TWG provides resources for developers, marketers and policy makers who need to understand the International Standard for universally accessible PDF. As of April 2021 this group is principally engaged in supporting development of PDF/UA-2, the next-generation PDF accessibility standard based on PDF 2.0. In addition to its work on PDF/UA-2, the TWG recently published updates to its well-known PDF/UA Reference Suite and Matterhorn Protocol.
Engaged in developing atomic “pass” and “fail” examples based on PDF/UA
Chaired by Markus Erle, the PDF Accessibility Liaison Working Group (LWG) was originally formed to continue the work started at the December 2018 PDF Techniques Accessibility Summit: producing industry-supported example PDF files demonstrating granular techniques for achieving accessible PDF in the context of PDF/UA-1. The group will begin to develop PDF 2.0 examples supporting PDF/UA-2 once that specification is complete.
As of April 2021 the PDF Accessibility LWG meets every two weeks to review candidate "pass" and "fail" PDF files and develop appropriate metadata. A website interface to provide access to the example files is under development.
Developing guidance on reusing PDF in a web context.
Replacing the now-retired ResponsivePDF group, this TWG, chaired by Roman Toda, explores the complexities of generating valid HTML from Tagged PDF content.
At present the Deriving HTML from PDF TWG is working on several projects that research issues around content repurposing, extraction, reflow and responsiveness in the PDF context. Its most recent publication was Deriving HTML from PDF. This group works in close coordination with the PDF Reuse TWG and PDF Forms TWG.
Developing a specification for reusable PDF aligned with PDF/UA.
Formed in 2020 and chaired by Matthew Hardy, the group has as its initial project the completion of a specification for “well-tagged PDF”, that is, PDF documents that leverage tagged PDF to enable reliable reuse of document content across various devices. The end product is intended to complement the derivation algorithm developed by the ResponsivePDF TWG as specified in Deriving HTML from PDF.
To ensure continuity between specifications for reuse and accessibility, this group works in close coordination with the development of PDF/UA-2 in the PDF/UA TWG.
Making tagged PDF natural in LaTeX.
The LaTeX Project, developer of the typesetting system used by academic and commercial STEM authors worldwide, recently announced a project to introduce full support for tagged PDF in LaTeX as required by accessibility standards such as PDF/UA.
At the Project’s request the PDF Association established this LWG, chaired by Board member (and LaTeX user) Boris Doubrov, to allow LaTeX Project contributors to collaborate with members of the PDF Association to help drive tagged PDF support in LaTeX. In addition, LaTeX developers are sharing their content structuring needs with PDF experts to help advance understanding of user needs and use cases.
Participation in this LWG is open to all PDF Association members, and non-members by invitation.
Dealing with real-world demands on remediators.
Chaired by Paul Rayius and Patrick Scouten, the Accessibility Service Bureau Working Group (ASBWG) addresses the specific interests of service providers and helps establish best practices for accessibility considerations in PDF content authoring and remediation. This group has two major objectives:
Participation is open to all PDF Association members.
Helping foster awareness of standardized accessible PDF.
Recently restarted, the PDF/UA Marketing Working Group, currently chaired by Peter Shikli, is working on resources to help drive marketplace awareness and understanding of tagged PDF, and of PDF/UA in particular. Current projects include “mythbusting” about accessible PDF, developing advice on workarounds for common problems, and building a knowledge base and other content intended to help decision-makers, educators and software vendors find common ground on concepts and framings.