Nvidia launches Fugatto AI model that can generate music, voices and sound effects


Nvidia on Monday introduced a new artificial intelligence (AI) model that can generate and mix different types of audio. The tech giant calls the foundation model Fugatto, short for Foundational Generative Audio Transformer Opus 1. While audio-focused AI platforms such as Beatoven and Suno already exist, the company highlighted that Fugatto offers users granular control over the desired output. The AI model can generate or transform any combination of music, voices, and sounds based on specific prompts.

Nvidia introduces AI audio model Fugatto

In a blog post, the tech giant detailed its new foundation model. Nvidia said that Fugatto can generate musical fragments, remove or add instruments in an existing song, change the accent or emotion in a voice, and “even let people generate sounds that have never been heard before.”

The AI model accepts both text and audio files as input, and users can combine the two to refine their requests. Under the hood, the foundation model’s architecture builds on the company’s previous work in speech modelling, audio vocoding, and audio understanding. Its full version uses 2.5 billion parameters and was trained on a bank of Nvidia DGX systems.
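To make the text-plus-audio prompting concrete, here is a minimal sketch of what such an interface could look like. Nvidia has not released a public API for Fugatto, so the FugattoStub class, its generate method, and the inputs below are purely illustrative assumptions.

```python
# Hypothetical sketch only: Nvidia has not published a Fugatto API, so the
# class and method names here are invented for illustration.

class FugattoStub:
    """Stand-in for a generative audio model that takes text and audio."""

    def generate(self, text, audio=None):
        # A real model would synthesize audio conditioned on the text
        # instruction and the optional reference clip. This stub returns
        # one second of silence (16-bit mono PCM at 16 kHz) as a placeholder.
        return b"\x00\x00" * 16000

model = FugattoStub()

# Text-only prompt: generate a clip from a description alone.
clip = model.generate("a short, upbeat piano fragment")

# Text + audio prompt: refine an existing clip with an instruction,
# combining the two inputs as described above.
reference = b"\x00\x00" * 16000  # placeholder for a real input recording
edited = model.generate("remove the drums and add a cello line", audio=reference)
```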

Nvidia highlighted that the team that built Fugatto included people from around the world, including Brazil, China, India, Jordan, and South Korea. This collaboration between people of different backgrounds also contributed to the multi-accent and multilingual capabilities of the AI model, the company said.

Discussing the capabilities of the AI audio model, the tech giant highlighted that it can generate types of audio output it was not trained on. As an example, Nvidia said, “Fugatto can make a trumpet bark or a saxophone meow. Whatever users can describe, the model can create.”

Additionally, Fugatto can combine instructions it learned separately during training, using a technique called ComposableART. With it, users can ask the AI model to generate audio of a person speaking French with a sad feeling. Users can also control the degree of sadness and the heaviness of the accent with specific instructions.
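The weighted blending of instructions that ComposableART enables can be sketched with a generic classifier-free-guidance-style composition. The function below is an assumption about the general idea, not Nvidia’s published implementation, and all names in it are hypothetical.

```python
import numpy as np

# Illustrative sketch of weighted instruction composition. This is a generic
# guidance-style blend, not Nvidia's actual ComposableART code.

def compose(unconditional, conditionals, weights):
    """Blend per-instruction outputs around an unconditional baseline.

    unconditional: model output with no instruction (array)
    conditionals:  one output per instruction (list of arrays)
    weights:       strength of each instruction, e.g. the "degree of sadness"
    """
    out = unconditional.copy()
    for cond, weight in zip(conditionals, weights):
        out += weight * (cond - unconditional)  # push toward each instruction
    return out

# Toy example: a strong "speak French" instruction blended with a
# milder "sound sad" instruction.
uncond = np.zeros(4)
french = np.array([1.0, 0.0, 0.5, 0.0])
sad = np.array([0.0, 1.0, 0.0, 0.5])
print(compose(uncond, [french, sad], weights=[1.0, 0.6]))
```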

In addition, the foundation model can generate audio with temporal interpolation, or sounds that change over time. For example, users can generate a thunderstorm with claps of thunder that fade into the distance. Users can also experiment with these soundscapes, and the model can create them even if it has never processed such sounds before.
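As a rough illustration of a time-varying control curve, the sketch below applies a linear “close and loud” to “distant” ramp to a synthetic rumble. Fugatto interpolates instructions inside the model rather than applying a gain afterwards, so this is only an analogy for how a sound can be directed to change over time.

```python
import numpy as np

# Analogy only: a control curve that changes over the length of a clip,
# applied here as simple post-hoc gain on a synthetic "thunder" signal.

sample_rate = 16000
duration_s = 3.0
t = np.linspace(0.0, duration_s, int(sample_rate * duration_s))

# Stand-in thunder: a decaying burst of random noise.
rng = np.random.default_rng(0)
thunder = rng.normal(0.0, 1.0, t.shape) * np.exp(-t)

# Control curve interpolating from "close and loud" (1.0) to "distant" (0.1)
# over the length of the clip.
control = np.interp(t, [0.0, duration_s], [1.0, 0.1])

faded = thunder * control  # a sound that changes over time
```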

As of now, the company has not shared any plans to make the AI model available to users or enterprises.


