News/Media Alliance study finds pervasive unauthorized use of publisher content to power generative AI technologies


Yesterday, the News/Media Alliance published a White Paper and a technical analysis and submitted comments to the U.S. Copyright Office on the use of publisher content to power generative artificial intelligence (GAI) technologies. Together, the three publications document the pervasive, unauthorized use of publisher content by GAI developers, the impact this may have on the sustainability and availability of high-quality original content, and the legal implications of such use. GAI systems have been developed by copying massive amounts of the expressive material published by the Alliance’s members, almost always without authorization or compensation, to create new products and services that frequently compete with Alliance member publishers.

The Alliance recognizes the exciting potential of GAI models and applications to improve aspects of our lives and supports the principled development of these systems. But this development must not come at the expense of publishers and journalists who invest considerable time and resources producing material that keeps our communities informed, safe, and entertained, and holds our government officials and other decision makers in check. The Alliance and its members would welcome working with GAI developers to help build and grow these technologies in a sustainable and responsible manner.

While the Copyright Office submission and White Paper discuss the wider publisher landscape in the face of the GAI revolution, including relevant principles of copyright law, the accompanying technical analysis documents the extent to which GAI developers rely on high-quality journalistic content to power their models. In particular, the results show:

  • GAI developers have copied and used news, magazine and digital media content to train large language models (LLMs).
  • Popular curated datasets underlying LLMs significantly overweight publisher content, by a factor ranging from over 5 to almost 100, relative to the generic collection of web content scraped by Common Crawl.
  • Other studies show that news and digital media rank third among all categories of sources in Google’s C4 training set, which was used to develop Google’s GAI-powered products like Bard. Half of the top 10 sites represented in the dataset are news outlets.
  • The LLMs also copy and use publisher content in their outputs. The LLMs can reproduce the content on which they were trained, demonstrating that the models memorize and retain the expressive content of the training works.

Alliance President & CEO Danielle Coffey stated, “The research and analysis we've conducted shows that AI companies and developers are not only engaging in unauthorized copying of our members' content to train their products, but they are using it pervasively and to a greater extent than other sources. This shows they recognize our unique value, and yet most of these developers are not obtaining proper permissions through licensing agreements or compensating publishers for the use of this content. This diminishment of high-quality, human-created content harms not only publishers but the sustainability of AI models themselves and the availability of reliable, trustworthy information.”

The Copyright Office comments and the White Paper offer multiple recommendations to policymakers, including recognizing that unauthorized use of publishers' expressive content for commercial GAI training and development is likely to compete with and harm publisher businesses in a manner that infringes copyright; creating transparency requirements that mandate disclosure of the use of copyright-protected content in training; encouraging and facilitating effective licensing solutions; supporting international cooperation and harmonization on GAI regulations; and adopting legislation to remedy existing market imbalances that prevent publishers from negotiating fairly with dominant platforms over the use of their content.

Coffey continued, "Generative AI systems should be held responsible and accountable, just like any other business. This White Paper demonstrates that these systems rely on journalistic and creative content, which have the benefit of investment in quality on the front end, as well as publishers who are required by law to take responsibility for the content they share with the public. Continued unauthorized use will harm existing markets that acknowledge the value of archived and real-time quality content, and over time the GAI models themselves will deteriorate. You get out what you put in. It is critical that our copyright protections are properly enforced and that high standards of quality and accountability are the foundation of these and other new technologies."