DeepMind, Google’s AI research lab, is developing a new AI technology designed to create soundtracks for videos. The technology, called V2A (video-to-audio), is being positioned as an essential piece of the emerging AI-generated media pipeline: despite significant advances in video-generating AI models, including DeepMind’s own, these models still cannot produce sound effects synchronized with the footage they generate.
“Video generation models are advancing at an incredible pace, but many current systems can only generate silent output,” DeepMind stated in an official blog post. “V2A technology [could] become a promising approach for bringing generated movies to life.”
DeepMind’s V2A technology takes a description of a soundtrack (for example, “jellyfish pulsating under water, marine life, ocean”) paired with a video and produces music, sound effects, and dialogue that match the video’s characters and tone. The generated audio is watermarked using SynthID, DeepMind’s deepfakes-combating technology. The AI model powering V2A, a diffusion model, was trained on a combination of sounds, dialogue transcripts, and video clips.
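Conceptually, the workflow is: a video (and, optionally, a text prompt) goes in, and a synchronized, watermarked audio track comes out. The minimal sketch below illustrates that interface only; the class and function names are hypothetical placeholders, not DeepMind’s actual (unreleased) API, and the body returns placeholder audio rather than running a real diffusion model.

```python
# Hypothetical sketch of a V2A-style interface, as described above.
# All names here are illustrative assumptions, not DeepMind's API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class VideoClip:
    frames: list        # raw pixel data for each frame
    frame_rate: float   # frames per second

@dataclass
class GeneratedAudio:
    waveform: list      # audio samples aligned to the video timeline
    sample_rate: int
    watermarked: bool   # SynthID-style watermark applied to the output

def generate_soundtrack(video: VideoClip, prompt: Optional[str] = None) -> GeneratedAudio:
    """Conceptually, a diffusion model conditioned on the video's raw pixels
    (and optionally a text prompt such as "jellyfish pulsating under water,
    marine life, ocean") would denoise random noise into an audio track
    synchronized with the footage. This body is only a placeholder."""
    duration_s = len(video.frames) / video.frame_rate
    sample_rate = 48_000
    samples = [0.0] * int(duration_s * sample_rate)  # silent placeholder audio
    return GeneratedAudio(waveform=samples, sample_rate=sample_rate, watermarked=True)

# Usage: pass a clip alone, or a clip plus a soundtrack description.
clip = VideoClip(frames=[None] * 120, frame_rate=24.0)
audio = generate_soundtrack(clip, prompt="jellyfish pulsating under water, marine life, ocean")
print(len(audio.waveform) / audio.sample_rate, "seconds of generated audio")
```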
“By training on video, audio, and additional annotations, our technology learns to associate specific audio events with various visual scenes, while responding to the information provided in the annotations or transcripts,” explained DeepMind. The company has not disclosed whether any of the training data was copyrighted, or whether the data’s creators were informed that their work was used; DeepMind has been asked for clarification on both points.
AI-powered sound-generating tools are not new: Stability AI released one recently, and ElevenLabs launched one in May. Nor are models that generate sound effects for video. A Microsoft project can produce talking and singing videos from still images, and platforms like Pika and GenreX have trained models that take a video and predict which music or effects fit a given scene.
However, DeepMind claims that V2A technology is unique because it can understand the raw pixels of a video and automatically sync generated sounds with it, even without a description. Still, V2A has limitations. The model was not trained extensively on videos containing artifacts or distortions, so audio quality drops noticeably for such footage. And in general the generated audio is not entirely convincing; one critique described it as a “smorgasbord of stereotypical sounds.”
To prevent misuse, DeepMind has decided against releasing the technology to the public in the near future. “To ensure our V2A technology has a positive impact on the creative community, we’re gathering diverse perspectives and insights from leading creators and filmmakers and using this feedback to guide our ongoing research and development,” DeepMind noted. Before considering broader public access, the technology will undergo rigorous safety assessments and testing.
DeepMind envisions V2A technology as particularly beneficial for archivists and individuals working with historical footage. However, the broader implications of generative AI in the film and TV industry pose significant challenges. Ensuring that such tools do not eliminate jobs or entire professions will require strong labour protections and careful consideration of the technology’s impact.
In summary, while DeepMind’s V2A technology holds promise for enhancing AI-generated media, its development and deployment must be handled with caution to avoid unintended negative consequences.