OpenAI’s GPT Store caught using copyrighted materials

For the past few months, Morten Blichfeldt Andersen has dedicated countless hours to meticulously searching through OpenAI’s GPT Store. Since its inception in January, this marketplace for custom AI bots has become populated with an array of both practical and whimsically unique tools. From cartoon generators capable of producing New Yorker-style illustrations and anime scenes to programming and writing aids that streamline the creation of code and text, the variety is vast. Among the eclectic mix are also a colour analysis tool, a spider identification bot, and a dating coach named RizzGPT. Yet, Andersen’s focus has been singular: identifying bots that utilize his employer’s copyrighted textbooks without authorization.

Andersen is the publishing director at Praxis, a Danish textbook publisher that has not only embraced AI but also developed its own specialized chatbots. He now finds himself on the front line of a relentless effort to identify copyright infringements in the GPT Store.

“I’ve been actively searching for infringements and filing reports,” Andersen shares. “But it seems to be just the beginning.” He suspects that the majority of these infractions are the work of students who upload content from textbooks to create custom bots for sharing amongst their peers, indicating that what he has discovered so far might only be a small fraction of the violations present in the GPT Store.

Bots that potentially use copyrighted content are often easy to spot from their descriptions, as highlighted by a recent report that criticized the GPT Store for being cluttered with “spam.” Using copyrighted material without permission is acceptable in certain contexts, but it often leads to legal action by rights holders. Several GPTs claim to mimic the styles of well-known authors, which suggests they might draw on copyrighted materials: one bot is designed to write in the manner of George R.R. Martin, another emulates Margaret Atwood, and a third purports to capture the essence of Stephen King.

Attempts to uncover the underlying instructions (known as the “system prompt”) and reference material that these bots use for their responses have revealed that some can reproduce material from copyrighted works verbatim, suggesting direct access to these texts.

An OpenAI spokesperson has stated that the company responds to takedown requests for bots created with copyrighted content but has not said how often it takes such action. The company also uses a mix of automated systems, human review, and user reports to identify and evaluate potential policy violations, including unauthorized use of third-party content.

The presence of copyright issues in the GPT Store adds another layer to OpenAI’s existing legal challenges. The company is already facing lawsuits over allegations of using copyrighted material without permission for training its AI models, including actions by prominent news outlets and a collective of fiction and nonfiction authors.

Chatbots in the GPT Store, while based on OpenAI’s technology, are developed by external creators for specific purposes. Developers can augment a bot’s capabilities by uploading additional files that it consults when answering, a technique known as retrieval-augmented generation (RAG). Andersen is convinced that the RAG files for many bots contain copyrighted materials uploaded without consent.
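To make the idea of retrieval-augmented generation concrete, here is a minimal Python sketch. It is a toy illustration only, not how OpenAI's GPTs work internally: the function names, the sample passages, and the word-overlap scoring are all assumptions chosen for brevity (real systems typically use embedding-based search). It shows the basic pattern: pick the uploaded passage most relevant to a question, then place it in the prompt handed to the language model.

```python
import re
from collections import Counter

# Hypothetical stand-ins for files a bot creator might upload.
UPLOADED_PASSAGES = [
    "Photosynthesis converts light energy into chemical energy in plants.",
    "The French Revolution began in 1789.",
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score(query, passage):
    """Count how many words the query and the passage share."""
    query_counts = Counter(tokenize(query))
    passage_counts = Counter(tokenize(passage))
    return sum(min(query_counts[w], passage_counts[w]) for w in query_counts)

def retrieve(query, passages, k=1):
    """Return the k passages with the highest word overlap with the query."""
    ranked = sorted(passages, key=lambda p: score(query, p), reverse=True)
    return ranked[:k]

def build_prompt(query, passages, k=1):
    """Assemble the text a language model would receive: excerpts plus question."""
    context = "\n".join(retrieve(query, passages, k))
    return f"Use the following excerpts to answer.\n{context}\n\nQuestion: {query}"

if __name__ == "__main__":
    print(build_prompt("When did the French Revolution begin?", UPLOADED_PASSAGES))
```

Because the retrieved excerpt is pasted into the prompt, a bot built this way can end up quoting its uploaded files closely, which is why the contents of those files matter to rights holders.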

The terms of service for the GPT Store clearly prohibit the use of third-party content without the necessary permissions, yet verifying whether developers have used copyrighted material is a challenge for rights holders. This has led Andersen to search by keyword and engage with suspected bots directly to determine whether they draw on Praxis’s copyrighted works.
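One way a rights holder might quantify whether a bot's reply echoes a protected text is to measure how many of the reply's word sequences also appear verbatim in the source. The Python sketch below is a hypothetical illustration of that idea, not Andersen's actual method; the function names and the choice of five-word sequences are assumptions.

```python
import re

def word_ngrams(text, n):
    """Return the set of n-word sequences in the text, ignoring case and punctuation."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def verbatim_overlap(reply, source, n=5):
    """Fraction of the reply's n-word sequences that also occur in the source.

    A value near 1.0 means the reply is largely copied; 0.0 means no
    n-word sequence matches (or the reply is shorter than n words).
    """
    reply_grams = word_ngrams(reply, n)
    if not reply_grams:
        return 0.0
    shared = reply_grams & word_ngrams(source, n)
    return len(shared) / len(reply_grams)

if __name__ == "__main__":
    source = "the quick brown fox jumps over the lazy dog near the river"
    print(verbatim_overlap("The quick brown fox jumps over the lazy dog.", source))
    print(verbatim_overlap("Cats sleep most of the day in warm sunny spots.", source))
```

A high score is only a signal worth investigating, not proof of infringement: short common phrases, quotations, and public-domain text can all produce matches.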

The broader legal battles regarding the use of copyrighted material to train AI may be prolonged, but disputes over content in the GPT Store could have more immediate consequences. According to copyright law, platforms like the GPT Store that allow user-uploaded content are subject to specific regulations that enable copyright holders to file complaints against unauthorized use of their intellectual property.

Upon discovering instances of infringement, Andersen filed DMCA takedown notices, which went unanswered until he sought assistance from the Danish Rights Alliance, a group dedicated to protecting the rights of creative professionals in Denmark. This organization has been proactive in addressing copyright violations in the AI domain, leading to the removal of infringing bots from the GPT Store.

There’s a call for more efficient mechanisms that enable rights holders to search for and identify unauthorized uses of their content. Meanwhile, startups are emerging with solutions designed to help AI companies detect and manage potential copyright infringements in their outputs.

Some experts argue that the principle of fair use could protect the development of GPTs that rely on copyrighted works for educational and research purposes. However, without clear visibility into what developers are uploading, rights holders face the daunting task of manually investigating each suspicious bot, akin to searching for a needle in a haystack.