IP, Copyright, and LLMs

Brad Hutchings
12 min read · May 22, 2024


There is a growing sense that copyright infringement might take down artificial intelligence (AI). The narrative is that companies like OpenAI illegally used material protected by copyright to train large language models (LLMs). Further, and this is very important to the narrative, the training was not authorized by the copyright owners. Let’s do a thorough dissection of these claims and their implications.

First, let’s set some bounds on the discussion:

  • I am not a lawyer, nor am I any kind of expert in law. My training is in computer science. I have an MS from UC Irvine, with a concentration in algorithms and data structures.
  • I am writing about LLMs which generally take written human language as input (prompts) and return written human language as the output (answers). I am not writing about image, sound, or video generation technologies.
  • I am writing about United States federal copyright law, both as legislated by Congress and as limited by the courts. I am writing about traditions such as Fair Use, which the “tech resistance” was all-in on a generation ago when y’all were panicked about Disney owning and controlling the culture. Just sayin’…
  • There are a myriad of legitimate complaints about LLMs, from disinformation to inappropriateness to “hallucinating” to incorrect reasoning to what I call “artificial certitude”. There are a myriad of illegitimate complaints as well. I’m not addressing any of those here. This is not a general defense of or apology for LLMs.
  • This is my informed and uninformed opinion. There are plenty of mistakes, errors, and omissions in my argument, just as there are in both sides of the issue. I’m happy if this helps you find, fix, and/or exploit any of the holes. If I can be of any paid assistance to you in doing so, drop me an email.

What is an LLM?

An LLM is a statistical model, typically trained on petabytes (thousands of terabytes) of text spread across billions of source documents. Multiply by a factor of 10 for the largest models, like OpenAI’s GPT-4. Divide by a factor of 10 for really good models that can run on your pro/beefy laptop, like Mistral-7B-Instruct.

I have the Mistral-7B-Instruct model running on my laptop, available for me to use privately. Computer nerds can do this now, and many of us are working to make a plethora of models available for private use by regular users on regular laptops. I mention this because, in the IP maximalist view of LLMs and copyright, it’s not just OpenAI in their crosshairs. It’s people like me who will never care about plaintiffs’ particular IP and never even run across it in our private interactions. Know your enemy.

The Mistral-7B-Instruct model is about 14 gigabytes (GB) on disk. The company claims it is trained on a massive corpus of English-language data scraped from the public web, but doesn’t give any details. Let’s conservatively assume 100 terabytes (TB) of data. All of that data is compressed at a ratio of about 10,000:1. Typically with text data, lossless compressors like zip and 7-Zip yield about 5:1 compression in the better cases. Something else is likely going on here. I call it super lossy compression, and I motivate that idea in this article.
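If you want to check the arithmetic, here is a quick back-of-envelope sketch. The 100 TB corpus size is my assumption from above, not a published figure, and the exact ratio lands closer to 7,000:1 with these numbers, the same order of magnitude either way.

```python
# Back-of-envelope check of the compression ratio discussed above.
# The 100 TB corpus size is an assumption; Mistral has not published it.
corpus_bytes = 100e12          # assumed training corpus: 100 TB
model_bytes = 14e9             # Mistral-7B-Instruct weights on disk: ~14 GB

ratio = corpus_bytes / model_bytes
print(f"effective compression ratio: about {ratio:,.0f}:1")    # ~7,000:1

# Compare with a good lossless text compressor (zip/7-Zip class, ~5:1):
lossless_bytes = corpus_bytes / 5
print(f"losslessly compressed corpus: about {lossless_bytes / 1e12:.0f} TB")
# ~20 TB, still three orders of magnitude larger than the 14 GB model file.
# That gap is the "super lossy" part.
```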

Side note: estimates of the dollar cost for computation for Mistral-7B range from $80K to a few million dollars. Sam Altman has said that GPT-4 cost more than $100M to train. There is a lot of data in the training sets.

A particularly important conclusion to draw is that the training data is not actually in the LLM. This is (almost) basic information theory, as popularized in the field of computer science by Claude Shannon. There is just not enough space to hold a perfect or even mildly degraded copy of the whole thing.

What are LLM Developers Trying to Accomplish?

The ultimate long-range goal of LLM developers is to provide an omniscient oracle. We call this Artificial General Intelligence (AGI). It knows just about everything and can apply that knowledge to give good answers and make good decisions.

In reality, where storage space and decision time are limited, LLM makers make huge compromises and settle for lesser, more specific goals. They’d like the model to be generally good at using language, have wide or application-specific cultural knowledge, and do well on industry-crafted metrics that purport to show how good LLMs are at tasks they’re not good at.

Example: In making up stories about well-known public domain fictional characters and tech products, I have found that the LLM on my laptop spins fun tales. ChatGPT also spins fun tales, but it feels like a bit much to me. My favorite prompt is “Make up a story about Paul Bunyan and Eeyore saving the forest with 7zip.” Yes, the original Eeyore character is finally in the public domain. Nobody’s rights are violated with my frivolity.

In practice, developers pick their massive training corpuses to be good enough at language, knowledgeable enough about culture, and performant on metrics, subject to budget and intended deployment constraints. Notice that for most applications, this doesn’t include specific knowledge, narrative, or word choice from any particular book.

Can LLMs be Untrained?

So you didn’t want your work included in some LLM. Could it just be removed? The short answer is “not without completely retraining the model.” See the estimated costs of doing so above. We can assume models are retrained and new versions released periodically, so it’s possible to come up with a system whereby your content is excluded from version next on request. OpenAI is working on such a system, which it calls Media Manager. It might or might not solve the problem.

But couldn’t a work be removed from an existing model, like removing a book from retail or the shelf of a library? Here is the long answer, greatly simplified. When a work is added to a model, the work is ingested and weights in the underlying neural network(s) of the model are adjusted. Billions of these weights (a subset of parameters) are adjusted, usually very slightly. Well, couldn’t we just un-adjust them? No, because the amount the weights are moved is affected by how early in the process the work is added, assuming all works are added sequentially (not quite true). Could we make an adjustment after the fact by adjusting for the opposite of what re-adding the work would add? Sure, but that isn’t a provably sound mathematical operation, and in practice it likely isn’t sound at all.
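Here is a toy numerical sketch of why the naive “subtract the update back out” idea fails. The one-parameter model, the three made-up “works”, and the learning rate are mine, purely for illustration; real LLM training is vastly bigger, but the order-dependence is the same.

```python
# Toy illustration (not any real LLM training loop): sequential gradient
# updates on a one-parameter model, showing that removing one work's update
# after the fact does not recover the weights you would have had without it.

def grad(w, x, y):
    # Gradient of squared error for the model y ~ w * x.
    return 2 * (w * x - y) * x

works = [(1.0, 2.0), (3.0, 5.0), (2.0, 4.5)]   # made-up works A, B, C, in order
lr = 0.05

# Train on A, B, C sequentially, remembering each update.
w_full, updates = 0.0, []
for x, y in works:
    u = lr * grad(w_full, x, y)
    updates.append(u)
    w_full -= u

# Counterfactual: what the weights would be if B had never been ingested.
w_without_b = 0.0
for x, y in (works[0], works[2]):
    w_without_b -= lr * grad(w_without_b, x, y)

# Naive "untraining": add B's recorded update back in.
w_naive = w_full + updates[1]

print(w_full, w_without_b, w_naive)   # three different numbers
# w_naive != w_without_b because C's update was computed *after* B had already
# moved the weights; every later adjustment depends on every earlier one.
```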

So, no, you can’t untrain a base LLM on your work once that work has helped train it. Not until the next version.

What is the Complaint?

The general complaint from intellectual property (IP) owners, attorneys, and advocates is that works protected by copyright were used without permission to train LLMs. They were not formally and explicitly asked. They did not have an opportunity to say no. True dat, all dat. This opens a can of worms. I will eat a few worms, one at a time.

[Image caption: Johnny is adorable and loving, but has been known to eat some worms.]

What Rights Does Copyright Provide?

The first worm is the actual rights that copyright provides. From the United States Copyright Office’s What is Copyright page, here’s their list pared down to written works:

  • Reproduce the work in copies.
  • Prepare derivative works based on the work.
  • Distribute copies to the public by sale or other transfer of ownership or by rental, lease, or lending.
  • Perform the work publicly.
  • Display the work publicly.

“Copyright also provides the owner of copyright the right to authorize others to exercise these exclusive rights, subject to certain statutory limitations.”

In United States federal copyright law, there are no general “moral rights” like in the French / European tradition. VARA does not apply here. As there is no actual copy present, it should be pretty clear that a copyright claim against an LLM developer rests on the LLM being a derivative work.

As a copyright owner, you have no enumerated right to restrict anyone from writing a review or critique of your work. Similarly, you have no enumerated right to restrict anyone’s use of your work in training an LLM if no illicit copy is made, retained, or distributed in that process. The training process is neither a performance nor a display.

Side note: the Digital Millennium Copyright Act (DMCA) may also apply to works that are distributed with a non-trivial mechanism that prevents copying of the digital version of the content.

Could an LLM Possibly Contain a Copy?

Prove this true, and you win your case. Game over. The challenge is, there is no body and very limited circumstantial evidence that there was even a murder. It’s more like a disappearance with very plausible deniability, to continue the metaphor.

For example, most LLM developers would (or should) rather work with well written summaries and reviews of books than the unabridged books themselves. See above about wanting linguistic and cultural coverage to fit in tight space constraints. If you write books, your particular work isn’t very important to them. In fact, it’s probably a huge pain in the @$$ for them.

Is there even a recoverable copy? The New York Times complaint against OpenAI alleges in counts 80 and 81 that OpenAI’s models exhibit a behavior they call “memorization”.

The higher the lossy compression ratio, the less likely this behavior can be demonstrated. That’s basic information theory. OpenAI may have exposed themselves by not being super lossy enough, but smaller LLMs intended for deployment with limited resources are far less likely to exhibit this behavior.

Does an LLM Create Derivative Works?

This one is going to surprise you. Derivative works are created, and they are often delightful. But the LLM doesn’t do so on its own. These derivative works can only be created in response to a prompt from the user. You’ve got a tough hill to climb anyway claiming that the LLM mechanically generating words based on probabilities is creative. Dismissing the push start and genius in the prompt makes the climb steeper.

But let’s go back to information theory and do a little mathematical thought experiment. There is a concept in cryptography called a one-time pad. I’ll motivate it with an example. You and a friend pick your favorite Metallica CD, clearly St. Anger since you are real fans. You rip the CD onto your respective computers and use its digital representation to encode short messages between you. If you have a 2 kilobyte message, you take the next 2K of St. Anger digital and “add” each byte in order to the bytes of your message. Your friend receives the encoded message, gets his next 2K of St. Anger digital and “subtracts” each byte to recover the message.
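For readers who want to see the byte arithmetic, here is a minimal sketch of that “add”/“subtract” scheme. The keystream stands in for the ripped CD audio, and the offset and message are made up.

```python
# Minimal sketch of the byte-wise "add"/"subtract" scheme described above.
# The keystream stands in for ripped CD audio; offset and message are made up.

def encode(message: bytes, keystream: bytes, offset: int) -> bytes:
    pad = keystream[offset:offset + len(message)]
    return bytes((m + k) % 256 for m, k in zip(message, pad))

def decode(ciphertext: bytes, keystream: bytes, offset: int) -> bytes:
    pad = keystream[offset:offset + len(ciphertext)]
    return bytes((c - k) % 256 for c, k in zip(ciphertext, pad))

keystream = bytes(range(256)) * 1024       # stand-in for the shared bitstream

secret = b"meet at the usual place"
ct = encode(secret, keystream, offset=4096)
assert decode(ct, keystream, offset=4096) == secret

# Without the offset, the ciphertext alone tells you nothing about the message,
# which is exactly what the third-party scenario below exploits.
```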

While this is clearly a bad plan for spies, it’s also a worse plan than you can imagine for the two of you. A third party comes along with their own St. Anger CD and very nefarious illegal content. They encode the content with bytes starting from a random location in the St. Anger digital bitstream and send it to both of you. Now you and your friend are in possession of very nefarious illegal content that neither of you created. All the third party has to do is give the authorities a St. Anger CD and a byte offset. You don’t even know what you’re in possession of because you don’t have the byte offset!

You can think of an LLM as similar to the St. Anger CD. If I ask it to (algorithmically) generate a story about characters and storylines from books that are still protected by copyright, a clearly derivative work will almost always be produced. But I’m the one who actually produced it, not the LLM. Without my prompt, the LLM can’t even get started.

Side note: The IP maximalist side does not and would not favor granting copyright to the LLM for a non-derivative work it was prompted to create. For example: “Make up a story about a lumberjack named Jack Lumber and a melancholy grey donkey named Donkey Oatie saving the forest with 7zip.” My private LLM just knocked this one out of the park. Do I own that idea if I publish a generated story?

To more accurately answer the question titling this section, I inspired the creation of the derivative work. I am responsible for it being created, not the LLM, and not the company that created the LLM. The LLM ran an algorithm to generate the words. The LLM is, at worst, an accessory to the crime, and even that’s probably not in your best interests as a copyright holder.

Is the Derivative Work a Violation of Copyright?

If I publish the derivative work of materials protected by copyright that I created with the help of an LLM, I am clearly in violation of copyright statutes. I might get sued.

It’s a good time to bring up Fair Use, a legal defense that tempers overzealous copyright enforcement in the United States. Courts are supposed to apply a four factor test: (1) the purpose and character of the use, (2) the nature of the copyrighted work, (3) the amount and substantiality of the portion used, and (4) the effect of the use on the potential market for or value of the work. The Authors Guild v. Google case appears, on its surface, to be a relevant precedent for IP and copyright actions against OpenAI relating to written words. A district court found that Google did not violate copyright by putting its Google Books service online. Google’s use was protected by Fair Use. The Supreme Court refused to hear an appeal. Similar cases had been folded in along the way. The legal process took 10 years.

What if the Derivative Work is for my Private Use?

If I don’t publish a derivative work based on your material protected by copyright, can you sue me? You can sue anybody. But I’ve done no damage to you. I haven’t negatively affected the market for your work, except my own demand for it. I’m not buying your books again if you sue me. Duh.

Writing my own fan-fiction stories, with or without an LLM, for my own private use, is clearly Fair Use. In fact, it’s an activity we all learned on the playgrounds in elementary school, role playing as our favorite cartoon, TV, movie, and even literary characters as we burnt off energy during recess.

You can’t cut off availability of LLMs because someone might use them to produce unauthorized fan fiction any more than you can restrict the sale of sharpening stones because someone might fabricate a sword. Or, even more to the point, ban guns because someone might kill someone else. Legally and politically, despite Herculean efforts, that argument has not actually worked in the United States. If it doesn’t work for guns, there is no chance it will work for speech.

Let’s be very clear. Copyright is a restriction on unfettered speech. It is a societal agreement that individuals should enjoy a monopoly on speech they create, with rights assigned and enforced for a limited time. We should always be skeptical of (especially overly-) aggressive claims of ownership of works, characters, concepts, and culture. The last time we blanket extended copyright was in 1998 with the Sonny Bono Copyright Term Extension Act. Sonny didn’t even vote for it because a tree got in his way 10 months prior.

Were Illegal Copies Made or Used During the Training of an LLM?

Probably. There is no room for that in version next, next + 1, etc. for any LLM creator, from OpenAI down to the smallest open source models. The training processes have to be tight, even auditable. The permission layer has to be tighter. Opt-out might not be enough. A fortunate property of LLMs is that they are very expensive in dollars and GPU time to encode. Verifying copyright permission on training material will be expensive, but a relatively minor expense by comparison.

Remember the key goals for most LLMs: language fluency, cultural knowledge, high scores on standard metrics. Those have to be maximized while fitting into dollar and storage budgets.

None of those require your particular copyright protected work. Absent essentially injunctive legislative relief, judicial review will take years, maybe a decade. The early iterations of LLMs that did actually infringe on your copyright will be long forgotten. More than likely, the extent of actual market damage will be easy to see as well. It will be zero. Books will still sell. Authors will still get paid. Statutory damages will seem petty and meaninglessly punitive, considering that training processes have been cleaned up. As the cases are resolved, you might even hope to have your works included in training, so that your contribution to the culture isn’t lost and forgotten.

Side note: OpenAI, in its public statement about Media Manager, mentions the robots.txt file that has served for decades now as the informal contract between content producers and scrapers of the web. If we grant the pro-IP side leeway because they never imagined their content would be scraped for training of an AI system, we have to take a harsh view if they’ve made any effort at search engine optimization (SEO) on those same pages. SEO yells, “Pick me! Pick me! Make me more important than other content!” They’re trying to include your content on what they envision will be the new front door to the Internet. You don’t get it both ways.
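For the curious, here is a small sketch of how that informal contract looks in practice, using Python’s standard library robots.txt parser. The rules and the example URL are made up for illustration; GPTBot is the crawler user agent OpenAI has documented.

```python
# Sketch: a robots.txt that courts search engines while opting out of one AI
# crawler, checked with Python's standard library parser. The rules and the
# example URL are made up; GPTBot is OpenAI's documented crawler user agent.
from urllib import robotparser

rules = [
    "User-agent: GPTBot",    # opt this AI crawler out
    "Disallow: /",
    "",
    "User-agent: *",         # everyone else, including search engines, welcome
    "Allow: /",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("GPTBot", "https://example.com/my-article"))     # False
print(rp.can_fetch("Googlebot", "https://example.com/my-article"))  # True
```

A site owner who publishes the first block while aggressively optimizing pages for the second is making exactly the choice this side note describes.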

This is the only part of my general e/acc bias I’m going to reveal. My specific biases aren’t so supportive, as you should have gleaned from reading. I’m done eating worms for now. I have run out of hot sauce. I hope you can see that there is another side to the IP in AI debate, at least in LLMs for language content. I hope you can see that it is a very difficult debate.

Finally, as I write this article, I am still looking for my next professional adventure. If my thinking matches a need in your organization, I’d love to hear from you.

Drop by Brad-GPT.com and reach out to me on the form. Or, drop me a note at brad@Brad-GPT.com.

#WrittenByMe


Written by Brad Hutchings

Founder of DemoMachine.net. Write it downer of things. Guitar player. Comedian. Future guitar playing comedian.