The growing fight against digital misinformation and disinformation has prompted several initiatives from leading publishers, news media, and technology companies. The New York Times (NYT) introduced the News Provenance Project in July 2019. In the same month, the BBC and its partners1 announced the Trusted News Initiative (TNI). Adobe, NYT, and Twitter started the Content Authenticity Initiative (CAI) in November 2019. Microsoft proposed AMP: Authenticity of Media by Provenance in January 2020. Project Origin, a collaboration between BBC, CBC, Microsoft, and NYT, was announced in March 2020. The mission of these initiatives is to establish open industry standards for content attribution and authentication based on provenance data.
CAI’s initial focus is on images. The CAI white paper covers the workflow from the image-capturing device through editing tools to the publishing and sharing platforms. CAI authentication is based on (i) digital signatures and certificates from trusted entities and (ii) the use of tamper-evident cryptographic hashes.
The Project Origin technical paper and AMP2 target all media types and describe building a chain of provenance from the point of publishing to the point of presentation.3 AMP, like CAI, relies on digital signatures and cryptographic hashes for content authentication.
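The hash half of this approach can be illustrated with a minimal sketch. It assumes SHA-256 as the hash function and uses placeholder bytes in place of a real media file; neither detail comes from the CAI or AMP documents. The function names are hypothetical. A real system would also sign the recorded digest with a certificate-backed key, which is left out here because it needs a third-party cryptography library.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a lowercase hex string."""
    return hashlib.sha256(data).hexdigest()


def matches_recorded_hash(data: bytes, recorded_hex: str) -> bool:
    """Check content against a digest recorded at publication time."""
    # compare_digest performs a constant-time comparison of the two strings.
    return hmac.compare_digest(sha256_hex(data), recorded_hex.lower())


# Placeholder content standing in for an image or article (assumption).
original = b"example image bytes"
# Digest that would be stored in the provenance record at publication.
recorded = sha256_hex(original)

unchanged_ok = matches_recorded_hash(original, recorded)
tampered_ok = matches_recorded_hash(original + b"!", recorded)
print("unchanged content verifies:", unchanged_ok)
print("modified content verifies:", tampered_ok)
```

Any change to the content, even a single byte, produces a different digest, so the comparison fails for the modified copy.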
Each of these initiatives offers an option, but none makes use of proven internet archival technology4 for establishing provenance. Web ARChive (WARC) is an ISO standard5 with documentation maintained by the International Internet Preservation Consortium (IIPC). WARC is recognized by most national library systems and is the format preferred by the Library of Congress.6 Numerous countries accept WARC files for Legal Deposit required by local copyright laws.
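To give a feel for the format, the sketch below assembles a single WARC "resource" record by hand using only the Python standard library. The header field names come from the WARC specification; the target URI, payload, and capture time are made-up examples, and the function name is hypothetical. The `sha1:` label with a Base32 value follows a convention commonly seen in existing WARC files and should be treated as an assumption. A production archive would rely on an established WARC library instead of this kind of hand-rolled writer.

```python
import base64
import hashlib
import uuid
from datetime import datetime, timezone


def build_warc_resource_record(target_uri: str, payload: bytes,
                               content_type: str,
                               captured_at: datetime) -> bytes:
    """Assemble one WARC 'resource' record: header lines, blank line, block."""
    # Digest of the record block, Base32-encoded (convention, see lead-in).
    digest = base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")
    # WARC-Date is a UTC timestamp in ISO 8601 form.
    warc_date = captured_at.astimezone(timezone.utc).strftime(
        "%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {warc_date}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Block-Digest: sha1:{digest}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    # Header lines end with CRLF; an empty line separates header and block.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # Each record is terminated by two CRLF pairs.
    return head + payload + b"\r\n\r\n"


payload = b"<html>hello</html>"  # example content (assumption)
record = build_warc_resource_record(
    "https://example.com/",
    payload,
    "text/html",
    datetime(2020, 9, 15, 12, 30, 0, tzinfo=timezone.utc),
)
print(record.decode("utf-8"))
```

The capture date and the digest travel inside the record itself, which is what makes a WARC file useful as provenance evidence and not just as a copy of the page.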
The Internet Archive (IA) pioneered the infrastructure for archiving online content. IA created the ARC file format, the predecessor to WARC, in 1996 and launched the WayBack Machine (WBM) in 2001. WBM holds the history of close to half a trillion web pages today. Archived pages are accessible with internet browsers. WBM provides neither a cryptographic hash nor a digital signature for archived content. Authentication for the archived URL can be obtained by requesting an affidavit for a fee of $250. WBM affidavits have been used in court proceedings for more than a decade.7
Traditionally, as exemplified by the WBM, a web crawler captures the archives. Alternatively, the content can be captured the way a web browser loads it. The latter approach provides high-fidelity web archives that preserve animations and scripts for playback. The IIPC is pursuing the adoption of the latter approach to support current and future playback technologies. IIPC endorses the application pywb as the playback tool and is planning to support its development financially.8
CLink Media, Inc. (CLink) recently announced a system to support the provision of content and rights metadata for all media types. The CLink platform is built on the proven and scalable Digital Object Architecture (DOA) and on the interoperable metadata framework of the Linked Content Coalition.
Support for WARC-compliant archives in the CLink platform now provides a straightforward alternative or additional means for establishing content provenance while providing all the other benefits of a comprehensive digital archive.
A WARC-compliant URL containing the archive date and time becomes part of the metadata associated with the archived content. The archives are certified by cryptographic hashes and digital signatures. The cryptographic hashes are automatically generated by the DOA. CLink intends to support the same trust list and approach described by CAI and AMP for digital signature certification.
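As an illustration of a URL that carries the archive date and time, the sketch below uses the 14-digit `YYYYMMDDhhmmss` timestamp seen in Wayback-style replay URLs. Whether the CLink platform uses this exact layout is an assumption; the base address `https://archive.example.org/web` and both function names are hypothetical.

```python
from datetime import datetime, timezone
from typing import Tuple

# 14-digit UTC timestamp, e.g. 20200915123000 (Wayback-style convention).
TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"


def build_archive_url(base: str, captured_at: datetime,
                      original_url: str) -> str:
    """Combine archive base, capture time, and original URL into one URL."""
    stamp = captured_at.astimezone(timezone.utc).strftime(TIMESTAMP_FORMAT)
    return f"{base.rstrip('/')}/{stamp}/{original_url}"


def parse_archive_url(base: str, archive_url: str) -> Tuple[datetime, str]:
    """Recover the capture time and original URL from an archive URL."""
    rest = archive_url[len(base.rstrip("/")) + 1:]
    stamp, original_url = rest.split("/", 1)
    captured_at = datetime.strptime(stamp, TIMESTAMP_FORMAT).replace(
        tzinfo=timezone.utc)
    return captured_at, original_url


base = "https://archive.example.org/web"  # hypothetical archive host
captured = datetime(2020, 9, 15, 12, 30, 0, tzinfo=timezone.utc)
url = build_archive_url(base, captured, "https://example.com/post")
print(url)

parsed_time, parsed_url = parse_archive_url(base, url)
print(parsed_time.isoformat(), parsed_url)
```

Because the timestamp is part of the URL, storing that URL in the content's metadata records both where the archived copy lives and when it was captured.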
The figure below shows the content authentication functions in the CLink platform:
The connectors to Content Authenticity Initiatives in the diagram above are not yet in place because these initiatives are still in development; they will be added as the supporting platforms and open standards become available.
Consumer user experience (UX) is a key part of all the recent initiatives cited above. NYT has conducted extensive research studies in its News Provenance Project and found:
"Metadata that is recognizable and familiar – like date, location, source, and related photos – offer more to build trust than unfamiliar process or technological descriptions." (Technical Paper – IBC 2020)
Archives by themselves are a source of intuitive visual information. Archives in the context of comprehensive metadata constitute a dataset consumers can use to decide whether to trust the associated content.9
The CAI white paper also points out:
"An optimal UX for viewers will indicate this, through progressive disclosure, without overly complicating the experience." (CAI White Paper – August 2020)
Guided by those principles,10 CLink provides a simple interface (such as the one shown in the image above) that lets a consumer selectively display progressive provenance information for the content. A mouse-over of the image causes an “i” icon to appear. Selecting the “i” displays a menu of links to (1) registered metadata records synchronized to embedded metadata, (2) the archive, and (3) CAI or other content authentication data, when they are available.
The same interface for accessing authentication data was applied to this post as an example. The authentication information can be accessed by selecting the “i” icon appearing under the headline. Because the CLink platform provides tools for licensing and republishing content, the same interface is available for the republished post and the image therein.
The certified archive features for HTML pages and images will be available in Fall 2020 in the Premium Edition of the wpCLink plugin.
Interested readers can find out more by contacting us at [email protected].
1 AFP, CBC/Radio-Canada, The European Broadcasting Union (EBU), Facebook, Financial Times, First Draft, Google, The Hindu, Microsoft, Reuters, The Reuters Institute for the Study of Journalism, and The Wall Street Journal. The Associated Press and The Washington Post reportedly joined TNI at a later time.
2 Technical Paper – IBC 2020. Project Origin relies on AMP. See section heading “AMP service”.
3 Id. See the last paragraph on page 3.
4 See List of Web archiving initiatives at Wikipedia.
6 The Library of Congress – through the National Digital Information Infrastructure and Preservation Program – led the Memento Project aimed at making Web-archived content more readily discoverable. Memento is defined in RFC 7089.
7 See, for example, Quarles, J. L. III & Crudo, R. (2014, January/February), [Way]Back to the Future: Using the Wayback Machine in Patent Litigation; Internet Archive Wayback Machine® Helps Lawyers Go Back in Time to Strengthen Cases, Landslide vol. 6, no. 3, American Bar Association.
8 Kristinn Sigurðsson, The Future of Playback, International Internet Preservation Consortium, (June 16, 2020).
9 CLink – like Project Origin – intends to rely on findings from the New York Times News Provenance Project for its future UX designs.
10 Creator and Credit are displayed in the caption of the image pursuant to common publishing practices.