AI crawling reprise
Jeremy Keith has a good collection of links and quotes about AI crawling. I continue to disagree with this specific part of the commentary, though:
If you’re using the products powered by these attacks, you’re part of the problem. Don’t pretend it’s cute to ask ChatGPT for something. Don’t pretend it’s somehow being technologically open-minded to continuously search for nails to hit with the latest “AI” hammers.
I don’t think we should paint all AI tools with the same brush. Some of the crawlers behind these tools might be well-behaved and some might not be. Note that this is a separate question from the legality of AI training. The context is mostly Wikipedia, whose content is freely licensed.
Simon Willison adds about the Wikipedia data:
There’s really no excuse for crawling Wikipedia (“65% of our most expensive traffic comes from bots”) when they offer a comprehensive collection of bulk download options.
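Those dumps are publicly listed at dumps.wikimedia.org. As a minimal sketch, assuming the standard enwiki filename from that index (the full English dump is tens of gigabytes, so this is illustrative only):

```python
# A minimal sketch: fetch Wikipedia's bulk article dump instead of crawling pages.
# The filename pattern is assumed from the public dumps.wikimedia.org index;
# check the index for the current files before relying on it.
import urllib.request

DUMP_URL = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"
urllib.request.urlretrieve(DUMP_URL, "enwiki-latest-pages-articles.xml.bz2")
```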
Ben Werdmuller also sees these as bad actors:
Here the issue is vendors being bad actors: creating an enormous amount of traffic for resource-strapped services without any of the benefits they might see from a real user’s financial support.
The argument I’m hearing from some folks is that because they consider AI to be bad, everything it touches must also be bad. All crawling, whether it respects robots.txt or not. All tools, because using them contributes to the success of LLMs.
I’d like to have more concrete answers, such as: do ChatGPT and Claude respect robots.txt or not? I assume they do, because they document their user agent strings. If they do, it doesn’t seem fair to punish ChatGPT because there is some other rogue AI crawler that is misbehaving.
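OpenAI and Anthropic both publish the user agent tokens for their crawlers, GPTBot and ClaudeBot. As a minimal sketch, here’s how you could check what your own robots.txt tells those agents using Python’s standard library (the site URL is a placeholder):

```python
# A minimal sketch: ask your own robots.txt whether the documented AI crawler
# user agents (GPTBot for OpenAI, ClaudeBot for Anthropic) are allowed to fetch
# the site's home page. Replace SITE with your own domain.
from urllib import robotparser

SITE = "https://example.com"  # placeholder

parser = robotparser.RobotFileParser()
parser.set_url(f"{SITE}/robots.txt")
parser.read()

for agent in ("GPTBot", "ClaudeBot"):
    allowed = parser.can_fetch(agent, f"{SITE}/")
    print(f"{agent}: {'allowed' if allowed else 'disallowed'} at {SITE}/")
```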
AI is powerful and potentially dangerous. Because of this, most users will gravitate toward “brands” that are respected and accountable. In other words, users will prefer Apple Intelligence, ChatGPT, or Claude, where we know there has been some level of safety work, with only fringe users downloading and running models from other sources.
These mainstream AI tools should be contributing back. We know OpenAI has a deal with Reddit, but they should also be making a recurring donation to Wikipedia. This would further differentiate the well-behaved bots from the ones skirting the edges of fairness.
Meta appears to have used its old “move fast and break things” playbook to train Llama, using pirated books. From The Atlantic:
Meta employees turned their attention to Library Genesis, or LibGen, one of the largest of the pirated libraries that circulate online. It currently contains more than 7.5 million books and 81 million research papers. Eventually, the team at Meta got permission from “MZ”—an apparent reference to Meta CEO Mark Zuckerberg—to download and use the data set.
Another thing that puzzles me: if AI bots are so abusive, why haven’t I felt this on Micro.blog? We host blogs. If bots were destroying the open web, I would expect to notice it on my own servers.
If you dislike generative AI on philosophical grounds, or because of specific negative side effects such as energy use, that is totally fine. But then let’s stick to those arguments. I’m not yet convinced that legitimate AI crawling is going to destroy blogs or even Wikipedia.