DK_en 3x06 - Legitimate Use My Ass

Photo by Sebastian Dumitru / Unsplash

Episode first aired on 10 July, 2025. Listen on Spreaker

When conceptual chaos meets the most blatant shilling, you know things will get much worse before they get better.

AUDIO: DK theme

Last week gave us two interesting rulings in the long-running dispute between AI advocates and content producers over whether the large-scale reuse of protected material constitutes a crime.

What emerges from the rulings, to quote the Great Helmsman Mao, is that there's great confusion under the heavens.

And to top it all off, on the same topic the Creative Commons foundation has presented its candidate for Most Idiotic Idea of the Century, with a good chance of winning the title.

Anthropic

Let's take a quick look at the rulings. The first came from San Francisco federal judge William Alsup, who ruled that Anthropic did not infringe copyright by using the entirety of the plaintiffs' works to train its AI, Claude.

The judge not only ruled that this falls under fair use, but compared the LLM's activity to “a reader who aspires to become a writer” and who therefore uses works “not to replicate and replace them” but to “overcome difficulties and create something different.”

The same judge, in the same ruling, also found that Anthropic had violated copyright by creating and maintaining its own digital archive of over 7 million pirated titles, even though Anthropic subsequently purchased most of them in print.

Judge Alsup noted that the fact that Anthropic subsequently purchased a copy of a book it had previously stolen from the Internet does not absolve it from the consequences of the theft, but may affect the amount of compensation.

For this second offense, the judge has set a new trial date in December.

According to the law, compensation could be as high as $150,000 per pirated work; even if we only awarded $10,000 per title, that would be $70 billion for seven million titles, which would immediately throw Anthropic into liquidation.

I don't even think we'll get to $1,000 per pirated title, because even a $7 billion fine would knock out one of America's champions in its heroic fight against China for global supremacy, according to the tropes currently in vogue in the Valley and in Washington.

Therefore it's likely that in the end, if there is a fine at all, it will be a few million, i.e., a dollar or two per stolen title or even less.
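
The back-of-the-envelope figures above are easy to check in a few lines of Python. This is only a sketch of the arithmetic: the seven million titles and the per-title amounts are the ones quoted in this episode, and `total_damages` is a helper name made up for illustration, not a legal formula.

```python
# Rough arithmetic on the damages figures quoted in this episode.
# total_damages is a hypothetical helper, not a legal formula.
PIRATED_TITLES = 7_000_000  # "over 7 million" titles, rounded down


def total_damages(per_title_usd: int, titles: int = PIRATED_TITLES) -> int:
    """Total exposure in dollars for a flat award per pirated title."""
    return per_title_usd * titles


# Statutory ceiling, the two hypotheticals, and a symbolic dollar per title.
for per_title in (150_000, 10_000, 1_000, 1):
    total = total_damages(per_title)
    print(f"${per_title:>7,} per title -> ${total:,} total")
```

At the $150,000 ceiling the total passes a trillion dollars, which is why nobody expects anything close to it.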

This only confirms the saying that if you have to steal, go big.

Meta

The second ruling concerns Meta, and it's an identical case. A group of authors, including Sarah Silverman and Ta-Nehisi Coates, sued Meta for copyright infringement for using their works to train its AI, Llama, without their permission.

San Francisco District Judge Vince Chhabria ruled in favor of Meta, saying that the authors had not presented enough evidence that Meta's AI would harm them by diluting the market for their works.

In essence, the authors' argument was: who will buy my books if all you have to do is ask Llama (Facebook's AI) to summarize them? This is worrying, because it takes for granted that if you ask an LLM to summarize a work, that summary is true. And we do know it's not.

How many times must we repeat that language models are not anchored to reality, and that their output is realistic but, until proven otherwise, not factual?

And anyway, since we live in desperate but not serious times, the judge hastened to clarify that:

...this decision does not endorse the notion that Meta's use of copyrighted material to train its language model is lawful, [...] but only endorses the idea that the plaintiffs presented the wrong argument and failed to produce evidence in support of the correct argument.

CC Signals

It's not over yet. Creative Commons has just opened a public consultation on its own project, called “CC Signals”. I will explain the idea in their own words:

Now that artificial intelligence (AI) is transforming how knowledge is created, shared, and reused, we are at a crossroads that will define the future of access to knowledge and shared creativity. One path leads to data extraction and the erosion of openness; the other leads to a walled and paywalled internet. CC Signals offer another path, grounded in the nuanced values of the commons expressed by the collective.
CC Signals will allow data holders to signal their preferences for how their content can be reused by machines, based on a limited but meaningful set of options shaped in the public interest. This is both a technical and legal tool and a social proposal: a call for a new pact between those who share data and those who use it to train artificial intelligence models.

What inspiring words; such sharing, such social spirit, such heavenly harmony. Let's all run into the woods, strip ourselves naked, and sing “Kumbaya” while dancing in a circle.

Right now on GitHub, where feedback on this proposal is being collected, the most upvoted comment says, among other things:

Inviting web bots of language models to negotiate license terms for CC works is like a flock of sheep holding a summit to draw up ethical guidelines on how wolves can best enjoy sheep meat, discussing whether rosemary or thyme best complements their sacrifice, already seasoned, of course, by the attached consent forms. We need a fence, not a cookbook.

And he's not wrong. In recent weeks and months, every website operator has seen the number of visits from scrapers from every AI company skyrocket, almost always with requests so aggressive that they constitute actual Denial-of-Service attacks, not to mention the additional costs of traffic spikes.

In the midst of all this, Creative Commons has come up with a proposal to present the “preferences” of rights holders to companies that have already fed their language models with the entire content of the Internet, and have done so while completely ignoring the one already existing signal: the robots.txt file, which is supposed to limit indiscriminate access by bots to various areas of a website.
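
For anyone who has never looked inside one, here is roughly what that signal looks like and how a well-behaved crawler is supposed to read it. This is a minimal sketch using Python's standard `urllib.robotparser`; the rules, the example.org URLs, and the choice of `GPTBot` as the blocked user agent are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one AI crawler entirely,
# keep everyone else out of /private/ only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent: str, url: str) -> bool:
    """What a crawler that honors robots.txt would conclude."""
    return parser.can_fetch(user_agent, url)


print(may_fetch("GPTBot", "https://example.org/blog/post"))
print(may_fetch("SomeOtherBot", "https://example.org/blog/post"))
print(may_fetch("SomeOtherBot", "https://example.org/private/x"))
```

Note that the check runs on the crawler's side. Nothing in the file can force a scraper to perform it, which is exactly the problem.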

I can only imagine the success of signals such as

Attribution: “You must attribute the source appropriately based on the method, means, and context of your use.”

or

Direct contribution: You must provide monetary or in-kind support to the declaring party for the development and maintenance of the assets, based on a good faith assessment that takes into account your use of the assets and your financial means.

or even the most surreal:

Contribution to the ecosystem: You must provide monetary or in-kind support to the ecosystem from which you are benefiting, based on a good faith assessment that takes into account your use of the assets and your financial means.

These are very lofty concepts, but they come up against the fact that the lords of language models see everything available on the Internet as sheep to fatten their wolves.

And as the GitHub commenter said, we need fences, not recipe books.

The fundamental problem with CC Signals, in my opinion, is that it starts from the wrong premises:

If everyone denies access, everyone loses.

No, dear friends, the only ones who lose are the techno-feudalists who, under the guise of training, demand free access to all content in order to produce their commercial tools.

Artificial intelligence is transforming the way knowledge is created, shared, and reused.

LLMs, by definition, do not create or share knowledge. They merely regurgitate what others have produced, spewing out formally plausible texts with no guaranteed connection to reality.

The idea that Artificial Intelligence is inherently good, that it is an inevitable technology, and that it will inevitably lead to progress and abundance, to use the buzzwords, is not a legitimate point of view.

It is the slogan, the sales pitch of Altman and his ilk.
A pitch that, incidentally, is disproved as soon as you try to use the supposed intelligence of these language models.

But how is it possible that Creative Commons, which with its licenses has effectively made the free web possible, comes up with such a proposal?

Simple.

Creative Commons, like the Electronic Frontier Foundation, is nothing more than a lobbyist for Big Tech, and its strategy is to present the demands of industry as social demands.

If you remember, both were strongly in favor of NFTs, Bitcoin, and the Metaverse. Then they changed their minds.
But whoever pays the piper calls the tune, and the budgets of EFF and CC are kept afloat by GAFAM, not by neighborhood committees.

Conclusion

The US rulings confront us with a difficult, uncomfortable reality.

The techno-feudal lords, with the “training” of their chatbots, have found a loophole to take free ownership of all content shared on the internet, under any license.

On the one hand, I am furious about the rulings in favor of Big Tech; on the other, I recognize that, given how we have defined fair use, it is not possible to exclude AI training.

Fair use makes a simple distinction: if the use of some material is transformative, then it is fair use. If, on the other hand, the use is derivative, there is no fair use and the rights of use must be obtained by paying for them.

In simple terms: if I write a play about how Leonardo painted the Mona Lisa, I can do so freely and owe nothing to the Louvre. But if I want to sell postcards with the Mona Lisa on them, I have to pay for the rights.

The problem, then, would be to prove that language models replicate and do not transform. This is very difficult for two reasons:

  1. no one knows how a language model works in detail, including those who built it
  2. it is easy to prove that there is transformative use by showing that an LLM does not memorize training data.

I know, there are tons of papers that show that an LLM can reproduce entire portions of training data.

But these are never complete works, and the slight differences between the original and the replica can easily be likened to what any reader does when quoting a paragraph or a page from memory, which they are absolutely within their rights to do.

And as Judge Alsup shows us, the idea that an LLM is something that “learns” in the human sense of the word (and is therefore comparable to a student who reads a text and “makes it their own” without having to pay royalties) has now gone mainstream.

As if the armies of Big Tech lawyers weren't enough, we also have to deal with judges who anthropomorphize LLMs, and frankly, I don't see that ending well.

I believe that the battle to determine whether indiscriminate scraping for “AI training” purposes is a violation of copyright is already lost. The only possibility I see is to ban it outright, which would mean putting a tombstone on the thriving speculative bubble known as “artificial intelligence, language model version.” Yes, I like my science fiction.

What we are witnessing is the perversion of a mechanism designed to guarantee free access to knowledge, to preserve libraries and individual use, to the benefit of those who aspire to become monopolists of knowledge.

Because the only endgame of language models is to become the gatekeepers, the mandatory access point to all knowledge. Because that's the only way they can become profitable.

It doesn't matter if language models are as reliable as a drunk parrot, because after years of propaganda, the public has internalized that you ask ChatGPT a question and it gives you THE ANSWER.

As a society, we will have to relearn at our own expense what an authoritative source is. And we will pay dearly.

Unless.

Unless we arm our websites with tarpits that poison AI with gigabytes and gigabytes of fictitious text, press our representatives in Europe to eliminate scraping from practices deemed fair use, refuse to use language models, and above all object to their use on us.

Did Copilot take notes automatically during the interview? Ask for a full transcript and revoke consent for its use from that point onwards.

Is the school pushing for Google Workspace with Gemini? Teach lessons on the blackboard, assign exercises from the book, and give written tests in class. Let's leave the box-ticking to the Americans.

Does your boss want you to use Copilot? Fine. Write down every mistake, every stupid thing, whether it comes from Copilot or is found in your boss's or CEO's emails. Keep receipts. And find yourself another boss, or another job, because the company is on its last legs.

Anyone who wants to can use AI to study. But every piece of AI-flavored nonsense is worth three points: one for the mistake, one for the lack of critical thinking, and one for trying to fool the teacher.

It takes effort, but it's the only way. We must do everything within the bounds of the law to defend the quality and health of our epistemological space from the pollution represented by language models.

And we must do so while there is still time. Because after that, all we will have left are torches.