
Yes, Apple is also training on public web data

Posted by Matt Birchler

From Apple’s Machine Learning Research blog: Introducing Apple’s on-Device and Server Foundation Models

We train our foundation models on licensed data, including data selected to enhance specific features, as well as publicly available data collected by our web-crawler, AppleBot. Web publishers have the option to opt out of the use of their web content for Apple Intelligence training with a data usage control.

It would seem Apple is training its models much like OpenAI and Google do: some licensed content, plus (likely) quite a bit of data scraped from the web. It’s good that they’re doing so much on-device, and it’s great that they’ve indicated the servers running this stuff are powered entirely by clean energy (I believe this was said in last night’s The Talk Show, which hasn’t been released yet), but if you had concerns about LLMs being trained on public data, you’ll still be annoyed by Apple’s methods.

If you own a website and want to keep Apple from training on your content going forward, the linked post has instructions on how to block Apple’s bot.
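For a rough idea of what that opt-out looks like, here is a minimal sketch that checks a robots.txt file with Python’s standard-library parser. It assumes the training opt-out is keyed on a user agent named `Applebot-Extended`, which is not stated in the quote above, so check Apple’s own instructions for the exact token. The `is_allowed` helper and the example.com URLs are hypothetical names used only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumption: "Applebot-Extended" is the user agent that controls use of
# content for training. Confirm the exact name against Apple's documentation.
ROBOTS_TXT = """\
User-agent: Applebot-Extended
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if these robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/some-post"
    for agent in ("Applebot-Extended", "Applebot", "SomeOtherBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, page))
```

With these rules, only the training-related agent is turned away, while other crawlers can still read the site. Keep in mind that Python’s parser does simple substring matching on agent names, so real crawlers may interpret edge cases differently.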