Our groundbreaking speech generation technologies help people around the world interact with more natural, conversational and intuitive digital assistants and AI tools.
Language is central to human connection. It helps people around the world exchange information and ideas, express emotions and create mutual understanding. As our technology to create natural, dynamic voices continues to improve, we are unlocking richer, more engaging digital experiences.
In recent years, we have pushed the boundaries of audio generation, developing models that can produce high-quality, natural speech from a range of inputs such as text, pacing controls, and specific voices. This technology powers single-speaker audio in many Google products and experiments (including Gemini Live, Project Astra, Journey Voices and YouTube Auto-Sync) and helps people around the world interact with more natural, conversational and intuitive digital assistants and AI tools.
Working with partners at Google, we recently helped develop two new features that can generate long, multi-speaker dialogues to make complex content more accessible:
- NotebookLM Audio Overviews transforms uploaded documents into engaging and lively dialogues. With a click, two AI hosts summarize user material, make connections between topics, and banter back and forth.
- Illuminate creates formal, AI-generated discussions of research papers to make knowledge more accessible and digestible.
Here we provide an overview of our latest research on speech generation, which underlies all of these products and experimental tools.
Pioneering audio generation techniques
We have been investing in audio generation research for years, exploring new ways to create more natural dialogue in our products and experimental tools. In our previous research on SoundStorm, we first demonstrated the ability to generate 30-second segments of natural dialogue between multiple speakers.
This extended our earlier work on SoundStream and AudioLM, which allowed us to apply many text-based language modeling techniques to the problem of audio generation.
SoundStream is a neural audio codec that efficiently compresses and decompresses an audio input without compromising its quality. As part of the training process, SoundStream learns how to map audio to a series of acoustic tokens. These tokens capture all the information needed to reconstruct the audio material with high fidelity, including properties such as prosody and timbre.
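To make the idea of acoustic tokens concrete, here is a minimal sketch of residual vector quantization, the quantization scheme described in the SoundStream paper. The hand-written codebooks, the 2-D "frame", and the names `rvq_encode` and `rvq_decode` are illustrative assumptions: the real codec learns its codebooks end to end and quantizes neural encoder features, not raw coordinate pairs.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Return the index of the codebook entry closest to x (squared distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], x)),
    )


def rvq_encode(codebooks: Sequence[Sequence[Vector]], x: Vector) -> List[int]:
    """Quantize x in stages; each stage encodes what the previous stage missed."""
    residual = list(x)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(codebooks: Sequence[Sequence[Vector]], tokens: Sequence[int]) -> List[float]:
    """Reconstruct the vector by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, tokens):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy codebooks (assumed values): a coarse stage followed by a fine stage.
CODEBOOKS = [
    [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0), (1.0, -1.0)],
    [(0.0, 0.0), (0.25, 0.0), (0.0, 0.25), (-0.25, -0.25)],
]

frame = (1.2, 0.9)
tokens = rvq_encode(CODEBOOKS, frame)           # one token per stage
reconstruction = rvq_decode(CODEBOOKS, tokens)  # approximate copy of frame
print(tokens, reconstruction)
```

Each extra stage shrinks the reconstruction error, which is why a short list of small integers per frame can stand in for the original signal.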
AudioLM treats audio generation as a language modeling task to generate the acoustic tokens of codecs such as SoundStream. Therefore, the AudioLM framework makes no assumptions about the type or composition of the audio produced and can flexibly handle a variety of sounds without the need for architectural adjustments – making it a good candidate for modeling multi-speaker dialogues.
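The language-modeling view can be illustrated with a deliberately tiny stand-in: a bigram model over integer token IDs that predicts each token from the one before it. This is only a sketch of the "next-token prediction over codec tokens" idea. AudioLM itself uses Transformer language models, and the token values, training sequences, and function names below are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def fit_bigram(sequences: Sequence[Sequence[int]]) -> Dict[int, Counter]:
    """Count how often each token follows each other token."""
    counts: Dict[int, Counter] = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts: Dict[int, Counter], prompt: Sequence[int], steps: int) -> List[int]:
    """Greedy autoregressive decoding: repeatedly append the likeliest next token."""
    if not prompt:
        raise ValueError("prompt must contain at least one token")
    out = list(prompt)
    for _ in range(steps):
        followers = counts.get(out[-1])
        if not followers:
            break  # no continuation was seen for this token during training
        # Ties are broken in favor of the smallest token ID, for determinism.
        out.append(max(sorted(followers), key=followers.__getitem__))
    return out


# Hypothetical "acoustic token" sequences used as training data.
training = [[5, 9, 2, 5, 9, 7], [5, 9, 2, 5]]
model = fit_bigram(training)
generated = generate(model, prompt=[5], steps=5)
print(generated)
```

Nothing in the loop depends on what the tokens represent, which mirrors the point above: the same machinery applies whether the tokens encode one voice, several voices, or other sounds.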
Building on this research, our latest speech generation technology can generate 2 minutes of dialogue with improved naturalness, speaker consistency and acoustic quality, given a dialogue script and speaker turn markers. The model performs this task in under 3 seconds on a single Tensor Processing Unit (TPU) v5e chip, in one inference pass. This means it generates audio more than 40 times faster than real time.
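The "40 times faster than real time" figure follows directly from the two numbers in the paragraph above. A short check, where the helper name `real_time_factor` is our own and not part of any library:

```python
def real_time_factor(audio_seconds: float, compute_seconds: float) -> float:
    """Seconds of audio produced per second of compute."""
    if compute_seconds <= 0:
        raise ValueError("compute_seconds must be positive")
    return audio_seconds / compute_seconds


AUDIO_SECONDS = 2 * 60             # 2 minutes of dialogue
COMPUTE_SECONDS_UPPER_BOUND = 3.0  # "under 3 seconds"

# Because compute time is below 3 s, the true factor is above this bound.
speedup_lower_bound = real_time_factor(AUDIO_SECONDS, COMPUTE_SECONDS_UPPER_BOUND)
print(speedup_lower_bound)
```

Since 3 seconds is an upper bound on compute time, 40x is a lower bound on the speedup.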
Scaling our audio generation models
Scaling our single-speaker generation models to multi-speaker models then became a matter of data and model capacity. To help our latest speech generation model produce longer speech segments, we developed an even more efficient speech codec that compresses audio into a sequence of tokens at just 600 bits per second without compromising the quality of the output.
The tokens generated by our codec have a hierarchical structure and are grouped by time frame. The first tokens within a group capture phonetic and prosodic information, while the last tokens encode fine acoustic details.
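As an illustration of this layout, the sketch below stores one small list of tokens per time frame, ordered from coarse to fine, and converts between that grouped form and a flat sequence. The number of levels per frame (three here) and the token values are assumptions chosen for readability; the text does not state the codec's actual configuration.

```python
from typing import List


def flatten_frames(frames: List[List[int]]) -> List[int]:
    """Concatenate per-frame token groups into one flat sequence, frame by frame."""
    return [token for frame in frames for token in frame]


def group_by_frame(sequence: List[int], levels: int) -> List[List[int]]:
    """Split a flat sequence back into groups of `levels` tokens per frame."""
    if levels <= 0 or len(sequence) % levels != 0:
        raise ValueError("sequence length must be a positive multiple of levels")
    return [sequence[i:i + levels] for i in range(0, len(sequence), levels)]


def level_stream(frames: List[List[int]], level: int) -> List[int]:
    """Collect the token at one hierarchy level from every frame."""
    return [frame[level] for frame in frames]


# Three frames, three levels each (hypothetical values).
frames = [[12, 7, 3], [12, 9, 1], [40, 2, 8]]

sequence = flatten_frames(frames)
coarse = level_stream(frames, 0)   # first tokens: phonetic and prosodic content
fine = level_stream(frames, -1)    # last tokens: fine acoustic detail
print(sequence, coarse, fine)
```

Keeping the coarse tokens first within each group means a model reading the sequence in order sees the broad content of a frame before its fine detail.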
Even with our new speech codec, producing a two-minute dialogue requires generating over 5,000 tokens. To model these long sequences, we developed a specialized Transformer architecture that can efficiently handle hierarchies of information, matching the structure of our acoustic tokens.
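A back-of-the-envelope calculation ties the quoted figures together. The inputs (600 bits per second, two minutes, more than 5,000 tokens) come from the text; the derived per-token and per-second values are our own arithmetic and are bounds, not published codec parameters.

```python
BITRATE_BPS = 600       # codec bitrate stated above
DIALOGUE_SECONDS = 120  # two minutes of dialogue
MIN_TOKENS = 5000       # "over 5,000 tokens"

# Total information budget for the whole dialogue at this bitrate.
total_bits = BITRATE_BPS * DIALOGUE_SECONDS

# With more than MIN_TOKENS tokens, the average token carries at most this many bits.
max_avg_bits_per_token = total_bits / MIN_TOKENS

# The model must emit at least this many tokens per second of audio.
min_tokens_per_second = MIN_TOKENS / DIALOGUE_SECONDS

print(total_bits, max_avg_bits_per_token, round(min_tokens_per_second, 2))
```

In other words, the entire two-minute dialogue fits in about 9 kilobytes of tokens, yet the sequence is still thousands of steps long, which is what motivates an architecture built for long, hierarchical sequences.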
This technique allows us to efficiently generate acoustic tokens corresponding to the dialogue within a single autoregressive inference pass. Once generated, these tokens can be decoded back into an audio waveform using our speech codec.
To teach our model to produce realistic exchanges between multiple speakers, we pre-trained it on hundreds of thousands of hours of speech data. We then fine-tuned it on a much smaller dialogue dataset with high acoustic quality and accurate speaker annotations, consisting of unscripted conversations from a range of voice actors and realistic disfluencies: the ums and aahs of a real conversation. This step taught the model to switch reliably between speakers during a generated dialogue and to output only studio-quality audio with realistic pauses, tone and timing.
Consistent with our AI principles and our commitment to developing and deploying AI technologies responsibly, we are integrating our SynthID technology to watermark non-transient, AI-generated audio content from these models, to help protect against potential misuse of this technology.
New speech experiences ahead
We're now focusing on improving the speech intelligibility and acoustic quality of our model, and adding more granular controls for features like prosody. At the same time, we are exploring how best to combine these advances with other modalities such as video.
The potential applications for advanced speech generation are enormous, especially when combined with our Gemini family of models. From improving learning experiences to making content more widely accessible, we look forward to continuing to push the boundaries of what is possible with voice-based technologies.
Acknowledgments
Authors of this work: Zalán Borsos, Matt Sharifi, Brian McWilliams, Yunpeng Li, Damien Vincent, Félix de Chaumont Quitry, Martin Sundermeyer, Eugene Kharitonov, Alex Tudor, Victor Ungureanu, Karolis Misiunas, Sertan Girgin, Jonas Rothfuss, Jake Walker and Marco Tagliasacchi.
We thank Leland Rechis, Ralph Leith, Paul Middleton, Poly Pata, Minh Truong, and RJ Skerry-Ryan for their critical efforts on dialogue data.
We are very grateful to our colleagues in Labs, Illuminate, Cloud, Speech, and YouTube for their excellent work in turning these models into products.
We also thank Françoise Beaufays, Krishna Bharat, Tom Hume, Simon Tokumine and James Zhao for their assistance with the project.