If you’ve ever shared anything online—a tweet, a blog post, a review, or an Instagram selfie—chances are it’s been used to train today’s generative AI models. Large language models, such as the ones behind ChatGPT, and image generators rely on vast amounts of our data. Even beyond chatbots, this data feeds a wide range of machine-learning applications.
Tech giants have scoured the web extensively to gather the data they deem necessary for developing generative AI, often disregarding content creators, copyright laws, and privacy concerns. Moreover, companies with access to extensive user-generated content are increasingly exploring opportunities to capitalize on the AI boom by selling or licensing such data. Reddit, for instance, falls into this category.
However, amidst the growing number of lawsuits and investigations surrounding generative AI and its opaque data practices, there have been some incremental steps towards granting individuals more control over their online content. Some companies now offer options for individuals and business clients to opt out of having their content utilized for AI training or commercial purposes. Here’s what you need to know about what you can—and can’t—do.
ChatGPT
- Web users: Go to Settings and uncheck “Improve the model for everyone.”
- Logged-in web users: Select ChatGPT, Settings, Data Controls, and turn off Chat History & Training.
- Mobile app users: In Settings, choose Data Controls, and turn off Chat History & Training.
Dall-E 3
- Use OpenAI’s form to request the removal of images from “future training datasets.”
Quora
- Visit the Settings page, go to Privacy, and turn off “Allow large language models to be trained on your content.”
Perplexity
- Click on your account name, navigate to the Account section, and turn off the AI Data Retention toggle.
WordPress
- In your website’s dashboard, click Settings, then General, scroll to the Privacy section, and select the Prevent third-party sharing box.
Your Website
- Update your website’s robots.txt file to exclude AI crawlers. Add a disallow command to prevent scraping by AI bots.
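A minimal robots.txt along these lines blocks some widely known AI crawlers. The bot names below are the user agents these companies have published for their crawlers; the exact list you need depends on which scrapers you want to exclude, and compliance is voluntary on the crawler’s part.

```
# Block OpenAI's web crawler
User-agent: GPTBot
Disallow: /

# Block Google's AI-training crawler (separate from Googlebot search indexing)
User-agent: Google-Extended
Disallow: /

# Block Common Crawl's bot, whose datasets are widely used for AI training
User-agent: CCBot
Disallow: /
```

Place the file at the root of your domain (e.g., example.com/robots.txt); rules elsewhere are ignored.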
Adobe
- If you store your files in Adobe’s Creative Cloud, the company may use them to train its machine-learning algorithm.
- For personal Adobe accounts, opt out by opening Adobe’s privacy page, scrolling down to the Content analysis section, and toggling off the option.
- Business and school account holders need to contact their administrator, as the opt-out process is not available at the individual level.
Setting realistic expectations is crucial. Many AI companies have already amassed vast datasets from the web, meaning that most of what you’ve posted is likely already in their systems. Furthermore, these companies tend to be secretive about their data acquisition methods and usage policies. Niloofar Mireshghallah, a researcher focusing on AI privacy at the University of Washington, emphasizes the lack of transparency in these processes, describing them as “very black-box.”
Navigating the opt-out process can be complex, with companies often making it challenging to exercise this option. Many users may not have a clear understanding of the permissions they’ve granted or how their data is being utilized. Additionally, various legal considerations, such as copyright laws and privacy regulations, further complicate matters. Despite these challenges, some companies are beginning to offer opt-out mechanisms for future data scraping or sharing activities, though data collection is typically enabled by default, leaving it to users to actively turn it off.
Thorin Klosowski, a security and privacy activist at the Electronic Frontier Foundation, highlights how companies often introduce friction into the opt-out process, banking on the fact that many users won’t actively seek out these options. He contrasts this with an opt-in approach, where users actively choose to participate, emphasizing the importance of informed consent.
While the majority of this guide focuses on opt-out options for text-based content, artists have also been leveraging platforms like “Have I Been Trained?” to signal that their images should not be used for training purposes. Run by the startup Spawning, this service enables individuals to check whether their creations have been scraped and to opt out of future data collection. Jordan Meyer, cofounder and CEO of Spawning, underscores the platform’s flexibility, which allows users to opt out of any media type via its browser extension.
In addition to text-based content, AI companies also leverage audio data for training purposes. Rev, a voice transcription service, utilizes both human freelancers and AI to transcribe audio, using this data perpetually and