Stability AI reveals ‘Stable Audio’ model for controllable audio generation
Stability AI has introduced “Stable Audio,” a latent diffusion model designed to revolutionise audio generation.
This breakthrough promises to be another leap forward for generative AI and combines text metadata, audio duration, and start time conditioning to offer unprecedented control over the content and length of generated audio—even enabling the creation of complete songs.
Audio diffusion models traditionally faced a significant limitation in generating audio of fixed durations, often leading to abrupt and incomplete musical phrases. This was primarily due to the models being trained on random audio chunks cropped from longer files and then forced into predetermined lengths.
Stable Audio tackles this long-standing limitation, enabling the generation of audio of a specified length, up to the size of the training window.
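The article describes conditioning on text metadata, duration, and start time. A minimal sketch of how such a conditioning payload might be assembled is shown below; the function and field names (`build_conditioning`, `seconds_start`, `seconds_total`) are illustrative assumptions, not Stability AI's actual API.

```python
# Hypothetical sketch of timing-conditioned generation input.
# Field names are assumptions for illustration, not Stable Audio's real API.

def build_conditioning(prompt: str, seconds_start: float, seconds_total: float,
                       max_window: float = 95.0) -> dict:
    """Package text and timing conditioning, clamping the request to the
    training window (95 s for the flagship model, per the article)."""
    if seconds_total > max_window:
        raise ValueError(
            f"Requested {seconds_total}s exceeds the {max_window}s training window"
        )
    return {
        "prompt": prompt,            # text metadata conditioning
        "seconds_start": seconds_start,  # where the clip starts
        "seconds_total": seconds_total,  # desired total length
    }
```

Because the model sees duration and start time during training, it can place intros, phrases, and endings appropriately instead of cutting off mid-phrase.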
One of the standout features of Stable Audio is its use of a heavily downsampled latent representation of audio, resulting in vastly accelerated inference compared with working on raw audio. Using modern diffusion sampling techniques, the flagship Stable Audio model can generate 95 seconds of stereo audio at a 44.1 kHz sample rate in under a second on an NVIDIA A100 GPU.
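To see why the latent representation matters, consider the raw data volume implied by the article's own figures. The arithmetic below uses only those numbers; the compression factor of the autoencoder is not specified in the article and is left out.

```python
# Raw waveform size for the clip described in the article:
# 95 seconds of stereo audio at 44.1 kHz.
sample_rate = 44_100   # samples per second per channel
seconds = 95
channels = 2

raw_samples = sample_rate * seconds * channels
print(raw_samples)  # 8,379,000 sample values per clip
```

Diffusing directly over millions of waveform samples is expensive; operating in a heavily downsampled latent space shrinks the sequence the diffusion model must denoise, which is what makes sub-second generation on an A100 plausible.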
A sound foundation
The core architecture of Stable Audio comprises