Sources on AI use of Information

  1. Consent in Crisis: The Rapid Decline of the AI Data Commons

Abstract: General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data, assembled into corpora such as C4, Refined Web, and Dolma. To our knowledge, we conduct the first, large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora. Our audit of 14,000 web domains provides an expansive view of crawlable web data and how consent preferences to use it are changing over time. We observe a proliferation of AI-specific clauses to limit use, acute differences in restrictions on AI developers, as well as general inconsistencies between websites’ expressed intentions in their Terms of Service and their robots.txt. We diagnose these as symptoms of ineffective web protocols, not designed to cope with the widespread re-purposing of the internet for AI. Our longitudinal analyses show that in a single year (2023-2024) there has been a rapid crescendo of data restrictions from web sources, rendering ~5%+ of all tokens in C4, or 28%+ of the most actively maintained, critical sources in C4, fully restricted from use. For Terms of Service crawling restrictions, a full 45% of C4 is now restricted. If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems. We hope to illustrate the emerging crisis in data consent, foreclosing much of the open web, not only for commercial AI, but non-commercial AI and academic purposes.

AI is taking information out of a commons that was provisioned under a different set of rules and technology. See discussion on Y Combinator.

2. “ChatGPT-maker braces for fight with New York Times and authors on ‘fair use’ of copyrighted works” (AP, January ’24)

3. Partly handy as a collection of references: “How Generative AI Turns Copyright Upside Down” by a law professor. “While courts are litigating many copyright issues involving generative AI, from who owns AI-generated works to the fair use of training to infringement by AI outputs, the most fundamental changes generative AI will bring to copyright law don’t fit in any of those categories…”

4. New gated NBER paper by Josh Gans “examines this issue from an economics perspective”

Joy: AI companies have money. Could we be headed toward a world where OpenAI has some paid writers on staff? Relative to the money being raised by AI companies, replenishing the commons is cheap if done strategically. Jeff Bezos bought the Washington Post for about $250 million, a fraction of his tech fortune. Elon Musk bought Twitter. Sam Altman is rich enough to help keep the NYT churning out articles. Because there are several competing commercial models, however, the owners of LLM products face a commons problem: if Altman pays the NYT to keep operating, then Anthropic gets the benefit, too. Arguably, good writing is already under-provisioned, even aside from LLMs.

2 thoughts on “Sources on AI use of Information”

  1. Scott Buchanan, August 11, 2025 / 12:11 pm

    Well…maybe Sam Altman should pay writers for content that is only available to OpenAI. That would be fair and still get SOME human input into things.
