The public web and consent
This blog exists on the public web, and that means you as a user have a lot of control over both how you consume my work and what you can do with it. I publish to the web, so I suppose the canonical way of reading my work is to do it in a web browser, but that’s just the tip of the iceberg.
You could also choose to block my ad or to mess with my CSS. You could choose to read entirely in your RSS reader and never come to the site at all. You can save this to a read-later service and read it on their site or in their app. You can download a local copy of anything on the site and do whatever you want with it. Search engines can index it and show my site to people looking stuff up on Google. And yes, LLMs can scrape my site to use it as food for their training.
I bet that everyone reading this was nodding along like, “yes, this is what’s so great about the open web!” up until that last one — then we probably had a split in opinions. I’ve seen other people I like and respect weigh in on this topic and people have pretty strong opinions going in opposite directions. I, meanwhile, find myself unable to place myself perfectly on either side of the argument.
On the one hand, I do believe that posting things on the public web using open standards comes with the understanding that my work will be freely available to anyone to consume in any way they want. For my entire life, this has been the appeal of the web, and people like me have been cheering software that builds on this public and open nature to let users use web content in ways the authors never intended. Hell, Listen Later (full disclosure, Listen Later is a current sponsor of Comfort Zone, but not this blog) lets you turn this blog post into a podcast, so you don’t even need to read the text, you can listen to it in Overcast if you want. That was never my intention when creating this site, but I think it’s pretty rad that it’s possible and I don’t have any ill will towards people who do that (this is often how I proofread my posts, actually). So when companies like OpenAI, Anthropic, or Apple use my site to train their models, I can see that as an extension of all the other ways I let people use my site how they want.
On the other hand, I sympathize with the ire some other people have about how it feels to have their work included in training these LLMs. As I’ve said many times before, just because something is technically legal doesn’t mean that you have to like it. Maybe you dislike LLMs at a fundamental level and the idea that your work is helping enable them makes you queasy. Or maybe you’re concerned with the plagiarism these tools enable and worry about them replacing your work as search engines like Google, Arc Search, and Perplexity plagiarize your work and present it as their own.
My sheepish attempt to explain my current feelings
I think that publishing my writing to the open, public web means I am empowering others to do what they want with that writing. That means the non-controversial stuff like reading in RSS apps and turning my posts into podcasts, but it also means tech companies using my work to build their own products. To my eyes, this is the nature of publishing on the web, and the same freedom I give people to do things I like with my content also gives people the ability to do things I don’t necessarily like.
On the LLM front, I’m not particularly bothered by my writing being used to help train GPT, Claude, or Apple Intelligence. While I appreciate others feel differently, I just don’t see these tools as replacing me in any real way. On the other hand, tools like Arc Search, Perplexity, and Google’s AI answers are trying to replace me and present my work as their own. That’s plagiarism and copyright infringement, and I think those products can fuck right off.
The big sticking point here is consent. I don’t think it’s moral for a company to go “we’ve trained on your data already, but if you don’t want us to do that again, you can put a rule in your robots.txt file.” That sucks, and anyone at the companies building these LLMs who says this is fine knows that they’re lying to justify their bad behavior.
But let’s say you are a good server admin who doesn't want to participate in LLM training and you update your robots.txt file to block every LLM bot you know of. As we said above, you can’t block new bots since they don’t announce themselves until after they’ve done all their initial training, and also that file isn’t law, it’s just a suggestion. As we’ve seen in recent days, Perplexity doesn’t give a shit about that file and will plow right through any guardrails you’ve tried to set up.
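To make the "just a suggestion" part concrete, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt contents, the `is_allowed` helper, the example URL, and the made-up `SomeNewBot` crawler are all assumptions for illustration, not anything this site actually serves. It shows that a per-bot blocklist only covers the names you already know, and that the check only happens if the crawler chooses to run it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block two known LLM crawlers by name,
# allow everyone else. The bot list is illustrative, not exhaustive.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return what a polite crawler would conclude from robots.txt.

    Nothing here is enforced by the server; a crawler that never
    calls this (or ignores the answer) can still fetch the page.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    post = "https://example.com/some-post"  # placeholder URL
    print(is_allowed("GPTBot", post))      # a bot named in the file
    print(is_allowed("SomeNewBot", post))  # a bot you have never heard of
```

The named bot is turned away, while the unknown one falls through to the wildcard rule and is let in. Swapping the wildcard to `Disallow: /` would cover unknown bots too, but it would also turn away search engines, and it still relies entirely on the crawler asking the question in the first place.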
Let’s pivot for a second to RSS. This blog has a full-text RSS feed, so you can easily use that feed to collect everything I post into another app without ever visiting my site. You could read my site in your reader app or you could create a script that automatically creates a backup of everything I post for your own personal needs.
But if I decided that I didn’t want to allow that, I could change how my feed works and truncate my posts so that you had to come to my site if you wanted to see my writing. Unlike the robots.txt solution for LLM bots, changing my RSS feed actually blocks the behavior I’m trying to prevent. It’s also a measure that blocks any future tools from accessing my full post content, even if I don’t know what those tools are. Similarly, I want a way for web admins to be able to block all LLM bot access with one command. Don’t make me fill out a new restriction for each bot, let me say “this site is off limits for all LLM bots,” and have that be enforced with no reasonable workaround for sleazy companies to circumvent. I don’t know how this would be done technically, but I think it should exist.
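For the RSS half of that comparison, here is a rough Python sketch of what truncating a feed could look like. The `truncate_for_feed` function, its parameters, and the example link are hypothetical and not how this blog's feed is actually generated. The point is that the cut happens on the publisher's side before the feed is served, so every consumer gets the short version whether or not it plays nice.

```python
def truncate_for_feed(body: str, link: str, max_words: int = 50) -> str:
    """Return a feed-safe excerpt of a post body.

    Posts at or under max_words are returned unchanged. Longer posts
    are cut to the first max_words words, followed by a pointer back
    to the full post on the site.
    """
    words = body.split()
    if len(words) <= max_words:
        return body
    excerpt = " ".join(words[:max_words])
    return f"{excerpt}... Continue reading: {link}"


if __name__ == "__main__":
    post = "This blog exists on the public web and that means a lot"
    print(truncate_for_feed(post, "https://example.com/post", max_words=5))
```

Because the full text never leaves the server in the feed, there is nothing for a feed reader, known or unknown, to opt out of honoring. That is the property the robots.txt approach lacks.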
In the meantime, I think it's reasonable for different people to feel differently about this whole situation. The web has been around for a long time, and LLMs are something very new that we need to figure out how to handle. The only people I reject are those who dismiss anyone who feels differently from them as irrational.