The artificial intelligence speech company ElevenLabs has shared a glimpse of its future plans, revealing that it intends to bring AI-generated sound effects to AI-generated video for the first time. Known for its advanced text-to-speech and synthetic voice capabilities, ElevenLabs is now expanding its offerings by layering artificially generated sound effects onto videos made with OpenAI’s Sora.
OpenAI unveiled its groundbreaking Sora text-to-video AI model last week, demonstrating some of the most realistic, consistent, and lengthy AI-generated videos yet seen. In response, ElevenLabs announced its intention to enhance these videos with a rich array of sound effects, including footsteps, waves, and ambient noise, although its text-to-sfx model is not yet publicly available. The company expressed its enthusiasm on X, stating: “We were blown away by the Sora announcement but felt it needed something… What if you could describe a sound and generate it with AI?”
Established in 2022, ElevenLabs quickly emerged as a leader in highly realistic synthetic voices, generating speech nearly indistinguishable from a natural human voice. The U.K.-based startup achieved unicorn status earlier this year following an $80 million Series B funding round, a milestone buoyed by its tool for synchronizing AI speech in videos for automatic translation, aimed at the global dubbing market.
While text-to-sfx models already exist in the industry, including offerings such as myEdit, AudioGen, and Stability AI’s Stable Audio, ElevenLabs’ sounds stand out for their exceptional naturalness, though how much editing they required remains unclear.
ElevenLabs has not specified a launch date for its text-to-sfx technology but has initiated a waitlist for interested users, inviting them to submit prompts for sound creation. This innovation hints at a future where AI tools could autonomously analyze video content and apply appropriate sound effects and music at precise moments, transitioning from current text-to-music approaches to more integrated multimodal capabilities.
The vision for generative AI is to enable the creation of comprehensive content from a simple prompt. Although this goal remains largely aspirational at present, advancements in technologies like text-to-sfx, AI-enhanced video, and synthetic voice are gradually bringing this ambitious dream within reach.