As artificial intelligence begins to upend much of internet (and everyday) life, many are raising ethical questions about how companies developing AI source the data used to train their software. To address these concerns, both OpenAI and Google have given publishers an option to exclude their content from being used to train AI bots.
Web Publisher Concerns about AI Bots
Publishers are right to be concerned about the role that their content has in training AI, and they might be concerned for a few different reasons.
Content Copyright and Earnings
Creators and publishers have a right to earn revenue from the content that they make available. Whoever has the copyright should benefit from the use of their content. This raises two specific concerns for publishers.
First, companies developing artificial intelligence programs are using publishers’ content without compensating them. Training AI programs was not a common use of content until recently, but it is a use nonetheless. Publishers should, therefore, have control over whether they allow it (and perhaps whether they charge for it).
"unlawfully copied and processed millions of images protected by copyright"
- Getty Images lawsuit
This is exactly what Getty Images, one of the largest online photo and video providers, has charged Stability AI with. Getty Images claims that 12 million of its images were used “without permission…or compensation.” The lawsuit includes multiple examples of images that feature a blurred Getty Images watermark.
In a separate filing, Getty Images claims Stability AI "unlawfully copied and processed millions of images protected by copyright", with examples of files produced with AI-altered Getty logos.
Photo comparison featured on The Verge
Publisher Industry Changes Brought by AI
Some publishers may view AI as a threat within their industry. Even if they accept that their business model will have to eventually change because of AI’s capabilities, they may not want to accelerate the software’s development.
While preventing AI companies from accessing a single publisher's content might have a negligible effect on AI development, some publishers might still object to their content being used, as a matter of principle.
Protecting Unique Content
A few publishers may hope to keep their content unique by preventing AI from copying it (or producing something similar). This isn’t a new challenge for online publishers, as scrapers have long been used to gather data from websites. It is, however, a concern that could be especially relevant in highly specialized niches or for news platforms.
Options to Opt Out of AI Training
Without regulation, publishers must manually opt out of each AI company’s development. The two main ones to opt out of are OpenAI (creator of ChatGPT) and Google (which has Bard and Vertex AI).
Some within the online publishing industry see this as a nominal option, with one executive stating: “It’s a symbolic gesture…I think it was kind of a wasted effort on my part. It’s an inevitability that this stuff is ingested and crawled and learned from.”
Nonetheless, publishers do now have the option to opt out.
How to Opt Out of ChatGPT
Certain sites don’t have to worry about OpenAI’s crawler gathering information from their content.
The company says it doesn't gather data from content that’s behind a paywall or a form requesting personal information. It also doesn’t crawl sites that aren’t aligned with OpenAI’s content guidelines. All of these are filtered out automatically.
Publishers whose content isn’t excluded automatically (which includes most publishers) can block the GPTBot by adding a few lines to their website’s robots.txt file.
The GPTBot is identified within a robots.txt file as:
User-agent token: GPTBot
Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
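If you want to see whether the GPTBot is already visiting your site, one option is to look for its token in the user-agent strings recorded in your server logs. The following is a minimal sketch of that check; the function name `is_gptbot` is hypothetical, and the check is only a substring match, so it cannot tell a genuine crawler from a client that spoofs the string.

```python
GPTBOT_TOKEN = "GPTBot"  # user-agent token quoted in the article


def is_gptbot(user_agent: str) -> bool:
    """Return True if a User-Agent header value contains the GPTBot token.

    This is a plain case-insensitive substring check. User-agent strings
    can be spoofed, so treat the result as a hint, not as proof.
    """
    return GPTBOT_TOKEN.lower() in user_agent.lower()


if __name__ == "__main__":
    gptbot_ua = (
        "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
        "compatible; GPTBot/1.0; +https://openai.com/gptbot)"
    )
    print(is_gptbot(gptbot_ua))
    print(is_gptbot("Mozilla/5.0 (X11; Linux x86_64) Firefox/118.0"))
```

Run as a script, this prints True for the GPTBot string and False for the ordinary browser string.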
To block the GPTBot altogether, add the following to your site’s robots.txt file:
User-agent: GPTBot
Disallow: /
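Before publishing the rule, you may want to confirm that it does what you expect. One way to do that, sketched here with Python's standard `urllib.robotparser` module, is to parse the rule text and ask whether a given crawler token may fetch a URL. The helper name `is_allowed` and the `example.com` URL are placeholders, and the result only reflects how this parser reads the file, not how any particular crawler behaves in practice.

```python
from urllib.robotparser import RobotFileParser

# The rule from the article: block GPTBot from the whole site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent_token: str, url: str) -> bool:
    """Return True if the crawler token may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent_token, url)


if __name__ == "__main__":
    # GPTBot is blocked everywhere; a crawler with no matching rule is not.
    print(is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/article"))
    print(is_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/article"))
```

The first call reports that GPTBot is not allowed, while the second shows that a crawler not named in the file is unaffected by this rule.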
To selectively block the GPTBot from specific content, use the following example to select which folders can and can’t be accessed:
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
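To see how a selective rule set like this plays out path by path, a small check along the following lines may help. It is a sketch using Python's standard `urllib.robotparser`; the function name `gptbot_access`, the sample paths, and the `example.com` host are all illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# The selective rules from the article.
SELECTIVE_RULES = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""


def gptbot_access(paths):
    """Map each path to True (GPTBot may fetch it) or False (blocked)."""
    parser = RobotFileParser()
    parser.parse(SELECTIVE_RULES.splitlines())
    return {
        path: parser.can_fetch("GPTBot", "https://example.com" + path)
        for path in paths
    }


if __name__ == "__main__":
    sample_paths = ["/directory-1/post", "/directory-2/post", "/about"]
    for path, allowed in gptbot_access(sample_paths).items():
        print(path, "allowed" if allowed else "blocked")
```

Note that a path matched by neither line, such as `/about` here, is allowed by default. To block everything except one folder, the rules would need a broader Disallow line in addition to the Allow line.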
This works much like blocking Google's or another search engine’s crawler from a whole website or from specific folders.
How to Opt Out of Google Bard
Google made an opt-out available for its Bard AI and Vertex AI in September 2023. The opt-out is executed in much the same way as OpenAI’s opt-out.
To block Google’s AI crawler, add the following code to your site’s robots.txt file:
User-agent: Google-Extended
Disallow: /
As with OpenAI’s bot, you could also give Google some but not total access:
User-agent: Google-Extended
Allow: /directory-1/
Disallow: /directory-2/
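Since both opt-outs use the same robots.txt pattern, the two full-site blocks can be combined in one file. The sketch below generates that text for a list of tokens and then checks it with Python's standard `urllib.robotparser`. The helper name `build_opt_out_rules` and the `example.com` URL are hypothetical, and the check only shows how a standard parser interprets the file, not how Google or OpenAI treat it internally.

```python
from urllib.robotparser import RobotFileParser

# Crawler tokens named in the article.
AI_TOKENS = ["GPTBot", "Google-Extended"]


def build_opt_out_rules(tokens):
    """Build robots.txt text that blocks each token from the whole site."""
    blocks = [f"User-agent: {token}\nDisallow: /" for token in tokens]
    # A blank line separates the rule groups for different crawlers.
    return "\n\n".join(blocks) + "\n"


def can_fetch(robots_txt, token, url="https://example.com/article"):
    """Return True if the token may fetch the URL under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, url)


if __name__ == "__main__":
    rules = build_opt_out_rules(AI_TOKENS)
    print(rules)
    for token in AI_TOKENS + ["Googlebot"]:
        print(token, "allowed" if can_fetch(rules, token) else "blocked")
```

Under this parser, both AI tokens come out blocked while `Googlebot`, which the generated file does not name, remains allowed.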
"They treat it all as one big search product."
- Matt Rogerson, The Guardian
Before opting out, webmasters and publishers should be aware that this will likely mean a site isn’t crawled for search indexing either. As Matt Rogerson of The Guardian put it, these are “bundled scrapers.” He explained: “They treat it all as one big search product. They’re like, ‘No, you don’t get the granularity choice. We give you the opportunity to opt out.’ But obviously, we don’t want to opt out of all web crawling.”
Block AI Training Bots from Your Content
This solution isn’t perfect. So far it addresses only two AI developers (not Microsoft, for example), and companies in this field have already scraped vast amounts of data. As Google has written, “As AI applications expand, web publishers will face the increasing complexity of managing different uses at scale.”
Even so, if you’re an online publisher and concerned about how your content could be used for AI training, these are two simple actions you can take to keep your website's content from being used to train OpenAI’s ChatGPT, Google’s Bard, and Google’s Vertex AI.