No-training Creative Commons

Tantek Çelik proposes a “CC-NT” license, for “no-training”:

This seems like an obvious thing to me. If you can write a license that forbids “commercial use”, then you should be able to write a license that forbids use in “training models”, which respectful / well-written crawlers should (hopefully) respect, in as much as they respect existing CC licenses.

I like this. There are fair use and copyright issues to sort out in the courts, but in the meantime we should be using robots.txt and Creative Commons wherever possible. On my blog, I allow any crawling and any use with attribution. Others might prefer to block AI bots and restrict to non-commercial use, or even allow commercial use but not for AI training.
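For anyone curious what the "block AI bots" option looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt rules and the example.com URL are made up for illustration, and `GPTBot` stands in for whichever AI crawler you might want to exclude. It only shows how a well-behaved crawler would read the rules. Nothing here forces a crawler to obey them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def can_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/2024/some-post.html"  # placeholder URL
    for agent in ("GPTBot", "Googlebot"):
        print(agent, can_fetch(agent, url))
```

A blog that allows any crawling, like mine, would simply leave out the `GPTBot` block.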

There was a great episode of Decoder last week with The Browser Company’s Josh Miller. Nilay Patel and Josh talk about the open web, browsers of course, and AI. One comment near the end from Nilay stood out to me, where he said AI training gives “nothing” back to writers on the web.

Wait, nothing? Integrating my blog posts into a model with essentially all the world’s information, so that people can ask it questions and have my writing also included with the answers… That’s “nothing”? Personally, I don’t make money directly from my blog. There are countless benefits to blogging. In the age of AI, one of those benefits is now letting me contribute in a small way to something bigger, in the same way that someone finds an answer in one of my blog posts when they search on Google.

The trade-off is different for everyone. Subscription and ad-based publishers are rightly concerned. They should make deals with AI companies, or in some cases block bots outright. Some people will block or use CC-NT on principle alone. No problem. For me, I hope my writing reaches as far as it can, and so letting it get slurped up by our future AI overlords is not just acceptable, I want it to happen. It’s not nothing.

Sam (Satyajit) Grover

CC was designed so the content creator could indicate their intention of diluting the default (copyright). “CC-NT” doesn’t make sense because the default covers it.

Manton Reece

@samgrover My reading of CC-NT would be something unique to default copyright: you can use the text for any purpose you want, including commercial use, but you can’t train LLMs with it. Not sure if that’s exactly what Tantek intended, though.

Sam (Satyajit) Grover

Maybe the intention is to create an NT clause that could be applied to an existing license, e.g. one could do CC-BY-NT to allow companies to use with attribution but not for training, and CC-BY-NC-NT would allow a researcher to use for any purpose except training. That makes more sense.

Sam (Satyajit) Grover

@dvdlite Right, but the default I’m referring to is one where a creator doesn’t use a CC license, or any other. In that case copyright offers all the protections including “no training”, IMHO. Of course, the companies using it for their profit disagree with that, and we’ll see how the lawsuits go.

Manton Reece

@samgrover Yep, that sounds right to me.

Sam (Satyajit) Grover

@dvdlite Ah, yea, with that stance, they really ought to provide an NT option.

Manton Reece @manton