OpenAI mystery: How OpenAI uses YouTube videos to train Sora

The AI community widely speculates that OpenAI has used a substantial volume of YouTube videos to train its models, including the recently unveiled Sora. The practice is something of an open secret, despite Google’s strict rules against scraping or downloading videos from YouTube for commercial purposes. Google actively thwarts large-scale data retrieval attempts, a challenge lamented on forums like GitHub and Reddit, where users report exceedingly slow download times for even a single video.

Training AI models demands immense amounts of diverse data, including text, images, and video. If the speculation is correct, OpenAI would have had to work around these obstacles to access or download a significant amount of content from YouTube. Asked about this, OpenAI said, “Sora’s training included material from licensed sources as well as publicly available content from the internet,” sidestepping direct questions about the scale of YouTube video downloads and Google’s limitations.

The surge in generative AI technology has led to an intense scramble for high-quality data. The legal and ethical norms and best practices in this arena remain murky, and accessing YouTube videos in ways that may violate its policies is not necessarily illegal. The doctrine of “fair use” and extensive case law provide some protection for using online content in various ways, and debate continues over whether using copyrighted material for AI training is lawful.

In an environment hungry for data, AI companies, including OpenAI, keep their data acquisition methods secret. Comparisons have been drawn to the e-commerce sector, where companies routinely scrape competitors’ pricing information even though many terms of service formally prohibit it. This tacit mutual tolerance reflects how unsettled the ethics and legality of data scraping remain in the burgeoning field of AI.

The practice of disclosing training data sources in research papers has dwindled as competition heats up, leaving many questions unanswered. When The Wall Street Journal asked whether OpenAI had used YouTube videos to train Sora, CTO Mira Murati responded, “I’m not actually sure about that.” Pressed further for details on training data sources, Murati remained evasive, stating, “I’m not going to go into the details,” a response that highlights the secretive and competitive landscape of AI development.