Pushing the boundaries of audio creation

Authors

Zalán Borsos, Matt Sharifi and Marco Tagliasacchi

An illustration showing speech patterns, iterative progress in dialogue generation, and a relaxed conversation between two voices.

Our groundbreaking speech generation technologies help people around the world interact with more natural, conversational and intuitive digital assistants and AI tools.

Language is central to human connection. It helps people around the world exchange information and ideas, express emotions and create mutual understanding. As our technology to create natural, dynamic voices continues to improve, we are unlocking richer, more engaging digital experiences.

In recent years, we have pushed the boundaries of audio generation, developing models that can produce high-quality, natural speech from a range of inputs, such as text, pacing controls and particular voices. This technology powers single-speaker audio in many Google products and experiments – including Gemini Live, Project Astra, Journey Voices and YouTube Auto-Sync – and helps people around the world interact with more natural, conversational and intuitive digital assistants and AI tools.

Working with partners at Google, we recently helped develop two new features that can generate long, multi-speaker dialogues to make complex content more accessible:

  • NotebookLM Audio Overviews transforms uploaded documents into engaging and lively dialogues. With a click, two AI hosts summarize user material, make connections between topics, and banter back and forth.
  • Illuminate creates formal, AI-generated discussions of research papers to make knowledge more accessible and digestible.

Here we provide an overview of our latest research on speech generation, which underlies all of these products and experimental tools.

Pioneering audio generation techniques

We have been investing in audio generation research for years, exploring new ways to create more natural dialogue in our products and experimental tools. In our previous research on SoundStorm, we first demonstrated the ability to generate 30-second segments of natural dialogue between multiple speakers.


This extended our earlier work on SoundStream and AudioLM, which allowed us to apply many text-based language modeling techniques to the problem of audio generation.

SoundStream is a neural audio codec that efficiently compresses and decompresses an audio input without compromising its quality. As part of the training process, SoundStream learns how to map audio to a series of acoustic tokens. These tokens capture all the information needed to reconstruct the audio material with high fidelity, including properties such as prosody and timbre.
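
To make that token-mapping idea concrete, here is a minimal toy sketch of a residual vector quantizer, the kind of mechanism neural codecs like SoundStream build on. It is not SoundStream's actual architecture: the frame embedding size, number of codebooks and codebook size (FRAME_DIM, NUM_QUANTIZERS, CODEBOOK_SIZE) are illustrative assumptions.

```python
# Toy residual vector quantizer: each audio frame embedding is mapped to a
# small stack of token IDs (coarse to fine), and decoded by summing codewords.
import numpy as np

rng = np.random.default_rng(0)

FRAME_DIM = 64        # assumed embedding size per audio frame
NUM_QUANTIZERS = 4    # assumed number of residual codebooks
CODEBOOK_SIZE = 1024  # assumed entries per codebook

codebooks = rng.normal(size=(NUM_QUANTIZERS, CODEBOOK_SIZE, FRAME_DIM))

def encode_frame(embedding: np.ndarray) -> list[int]:
    """Map one frame embedding to a stack of token IDs, coarse first."""
    tokens, residual = [], embedding
    for codebook in codebooks:
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        tokens.append(idx)
        residual = residual - codebook[idx]  # quantize what is left over
    return tokens

def decode_frame(tokens: list[int]) -> np.ndarray:
    """Reconstruct the frame embedding by summing the selected codewords."""
    return sum(codebooks[q][t] for q, t in enumerate(tokens))

frame = rng.normal(size=FRAME_DIM)
tokens = encode_frame(frame)
print(tokens)                                          # e.g. [512, 87, 903, 14]
print(np.linalg.norm(frame - decode_frame(tokens)))    # reconstruction error
```

Each additional codebook quantizes what the previous ones missed, which is why a short stack of tokens can preserve properties such as prosody and timbre.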

AudioLM treats audio generation as a language modeling task to generate the acoustic tokens of codecs such as SoundStream. Therefore, the AudioLM framework makes no assumptions about the type or composition of the audio produced and can flexibly handle a variety of sounds without the need for architectural adjustments – making it a good candidate for modeling multi-speaker dialogues.
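
As a rough sketch of that framing (not AudioLM's actual model), generation reduces to ordinary next-token prediction over a vocabulary of acoustic tokens. The uniform distribution below is a placeholder assumption standing in for the real conditioned Transformer, and VOCAB_SIZE is illustrative.

```python
# Sketch of the language-modeling framing: acoustic tokens play the role of
# words in a vocabulary, so generation is plain next-token prediction.
import numpy as np

rng = np.random.default_rng(1)
VOCAB_SIZE = 1024  # assumed acoustic-token vocabulary size

def next_token_probs(context: list[int]) -> np.ndarray:
    # Placeholder for the real model's conditional distribution over tokens.
    return np.full(VOCAB_SIZE, 1.0 / VOCAB_SIZE)

def generate(prompt: list[int], steps: int) -> list[int]:
    tokens = list(prompt)
    for _ in range(steps):
        tokens.append(int(rng.choice(VOCAB_SIZE, p=next_token_probs(tokens))))
    return tokens

acoustic_tokens = generate(prompt=[3, 17, 402], steps=8)
print(acoustic_tokens)
```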

Example of a multi-speaker dialogue generated by NotebookLM Audio Overview and based on some potato-themed documents.

Building on this research, our latest speech generation technology can generate 2 minutes of dialogue with improved naturalness, speaker consistency and acoustic quality when given a dialogue script and speaker change markers. The model also performs this task in under 3 seconds on a single Tensor Processing Unit (TPU) v5e chip, in one inference pass. This means the audio is generated more than 40 times faster than real time.
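
The real-time factor follows directly from those figures:

```python
# Real-time factor implied by the numbers above (values taken from the text).
audio_seconds = 2 * 60    # two minutes of generated dialogue
compute_seconds = 3       # upper bound for one inference pass on a TPU v5e
print(audio_seconds / compute_seconds)  # 40.0, i.e. more than 40x real time
```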

Scaling our audio generation models

Scaling our single-speaker generation models to multi-speaker models then became a matter of data and model capacity. To help our latest speech generation model produce longer speech segments, we developed an even more efficient speech codec that compresses audio into a sequence of tokens at just 600 bits per second without compromising the quality of the output.
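
For a sense of scale, a back-of-the-envelope calculation using the figures in this post:

```python
# Approximate size of a two-minute dialogue at the codec's bitrate.
bits_per_second = 600
seconds = 2 * 60
total_bits = bits_per_second * seconds
print(total_bits)                # 72,000 bits
print(total_bits / 8 / 1024)     # roughly 8.8 KiB for two minutes of audio
print(total_bits / 5000)         # under ~14.4 bits per token, given the
                                 # >5,000 tokens mentioned later in the post
```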


The tokens generated by our codec have a hierarchical structure and are grouped by time frame. The first tokens within a group capture phonetic and prosodic information, while the last tokens encode fine acoustic details.
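
A schematic of that layout is sketched below; the number of tokens per group and the split between coarse and fine levels are assumptions made purely for illustration.

```python
# Illustrative layout only: one token group per time frame, ordered coarse to
# fine. The group size and the coarse/fine split are assumed for the sketch.
from dataclasses import dataclass

@dataclass
class FrameTokens:
    coarse: list[int]  # phonetic and prosodic information (first in the group)
    fine: list[int]    # fine acoustic detail (last in the group)

dialogue_tokens = [
    FrameTokens(coarse=[512, 87], fine=[903, 14, 220, 6]),
    FrameTokens(coarse=[98, 341], fine=[77, 450, 12, 800]),
    # ... one group per time frame, across the whole dialogue
]
```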

Even with our new speech codec, producing a two-minute dialogue requires generating more than 5,000 tokens. To model these long sequences, we developed a specialized Transformer architecture that can efficiently process hierarchies of information, matching the structure of our acoustic tokens.

This technique allows us to efficiently generate acoustic tokens corresponding to dialogue within a single autoregressive inference pass. Once generated, these tokens can be decoded back into an audio waveform using our speech codec.
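
The control flow is sketched below, with the Transformer and the codec decoder stubbed out. The function names (sample_next_token, codec_decode), the token rate and the sample rate are assumptions, not real APIs; the token rate is chosen so two minutes corresponds to just over 5,000 tokens, as noted above.

```python
# Control-flow sketch only: a script goes in, acoustic tokens come out in one
# autoregressive pass, and the codec decoder maps them back to a waveform.
import numpy as np

rng = np.random.default_rng(2)
TOKENS_PER_SECOND = 42   # assumed; 42 * 120 s = 5,040 tokens for two minutes
SAMPLE_RATE = 24_000     # assumed output sample rate

def sample_next_token(script: str, history: list[int]) -> int:
    # Placeholder for the hierarchical Transformer's next-token step.
    return int(rng.integers(0, 1024))

def codec_decode(tokens: list[int]) -> np.ndarray:
    # Placeholder for the codec decoder; returns silence of the right length.
    seconds = len(tokens) / TOKENS_PER_SECOND
    return np.zeros(int(seconds * SAMPLE_RATE), dtype=np.float32)

def generate_dialogue(script: str, seconds: int = 120) -> np.ndarray:
    tokens: list[int] = []
    for _ in range(seconds * TOKENS_PER_SECOND):  # single autoregressive pass
        tokens.append(sample_next_token(script, tokens))
    return codec_decode(tokens)

waveform = generate_dialogue("<speaker:A> Hello there. <speaker:B> Hi!")
print(len(waveform) / SAMPLE_RATE)  # ~120 seconds of audio
```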

Animation showing how our speech generation model autoregressively generates a stream of audio tokens that are decoded back into a waveform consisting of a two-speaker dialogue.

To teach our model to produce realistic exchanges between multiple speakers, we pre-trained it on hundreds of thousands of hours of speech data. We then fine-tuned it on a much smaller dialogue dataset with high acoustic quality and precise speaker annotations, consisting of unscripted conversations from a range of voice actors, complete with realistic disfluencies – the "um"s and "aah"s of real conversation. This step taught the model to reliably switch between speakers during a generated dialogue and to output only studio-quality audio with realistic pauses, tone and timing.
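
As a purely hypothetical illustration of such input (the post does not specify the actual marker format), a dialogue script with speaker-change markers and scripted disfluencies might look like this:

```python
# Hypothetical input format for illustration only; the real speaker-change
# markers expected by the model are not described in the post.
script = (
    "<speaker:A> So, why are all of these documents about potatoes? "
    "<speaker:B> Ha, good question. Um, they all trace back to one recipe, "
    "sort of. "
    "<speaker:A> Sort of?"
)
# The model maps a script like this to acoustic tokens, keeping each voice
# consistent across turns and adding realistic pauses, tone and timing.
```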

Consistent with our AI principles and our commitment to the responsible development and deployment of AI technologies, we integrate our SynthID technology to watermark non-transient, AI-generated audio content from these models, helping to safeguard against potential misuse of this technology.


New speech experiences ahead

We're now focusing on improving the speech intelligibility and acoustic quality of our model, and adding more granular controls for features like prosody. At the same time, we are exploring how best to combine these advances with other modalities such as video.

The potential applications for advanced speech generation are enormous, especially when combined with our Gemini family of models. From improving learning experiences to making content more widely accessible, we look forward to continuing to push the boundaries of what is possible with voice-based technologies.

Acknowledgments

Authors of this work: Zalán Borsos, Matt Sharifi, Brian McWilliams, Yunpeng Li, Damien Vincent, Félix de Chaumont Quitry, Martin Sundermeyer, Eugene Kharitonov, Alex Tudor, Victor Ungureanu, Karolis Misiunas, Sertan Girgin, Jonas Rothfuss, Jake Walker and Marco Tagliasacchi.

We thank Leland Rechis, Ralph Leith, Paul Middleton, Poly Pata, Minh Truong, and RJ Skerry-Ryan for their critical efforts on dialogue data.

We are very grateful to our colleagues in Labs, Illuminate, Cloud, Speech and YouTube for their excellent work turning these models into products.

We also thank Françoise Beaufays, Krishna Bharat, Tom Hume, Simon Tokumine and James Zhao for their assistance with the project.
