Microsoft Unveils VALL-E, Audio AI That Can Simulate Any Voice From 3-Second Prompts

January 10, 2023

Microsoft researchers recently announced VALL-E, a new text-to-speech AI model that can closely mimic a person’s voice when given a three-second audio sample. Once it has learned a specific voice, VALL-E can synthesise audio of that person saying anything, while attempting to retain the speaker’s emotional tone. VALL-E’s creators believe that, combined with other generative AI models like GPT-3, it could be used for high-quality text-to-speech applications, for speech editing in which a recording of a person is altered from a text transcript (making them say something they did not actually say), and for audio content creation.

Microsoft describes VALL-E as a “neural codec language model.” It is based on EnCodec, which Meta revealed in October 2022. VALL-E generates discrete audio codec codes from text and acoustic prompts, as opposed to other text-to-speech methods that typically synthesise speech by manipulating waveforms. It analyses how a person sounds, breaks that information down into discrete components (referred to as “tokens”) using EnCodec, and then draws on its training data to predict how that voice would sound speaking phrases beyond the three-second sample.
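As a rough illustration of what breaking audio into “discrete components” can mean, the sketch below maps short frames of a signal to the index of the nearest entry in a small codebook. It is a toy nearest-neighbour quantizer, not EnCodec or VALL-E: the codebook values, the frame size, the sample data, and the function names `frame_signal` and `quantize` are all assumptions made for illustration.

```python
import math


def frame_signal(samples, frame_size):
    """Split samples into consecutive, non-overlapping frames.

    Any leftover samples that do not fill a whole frame are dropped.
    """
    usable = len(samples) - len(samples) % frame_size
    return [samples[i:i + frame_size] for i in range(0, usable, frame_size)]


def quantize(frames, codebook):
    """Return, for each frame, the index of the closest codebook vector."""
    tokens = []
    for frame in frames:
        distances = [math.dist(frame, entry) for entry in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


# Hypothetical 4-entry codebook of 2-sample vectors.
# A real neural codec learns its codebooks from data; these are made up.
CODEBOOK = [(0.0, 0.0), (1.0, 1.0), (-1.0, -1.0), (1.0, -1.0)]

# Made-up numbers standing in for audio samples.
samples = [0.1, -0.1, 0.9, 1.1, -0.8, -1.2, 0.9, -0.7, 0.5]

tokens = quantize(frame_signal(samples, 2), CODEBOOK)
print(tokens)  # one integer token per 2-sample frame
```

The point of the sketch is only that a continuous signal ends up as a short sequence of integers, which is the kind of input a language model can be trained to predict.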

Microsoft trained VALL-E’s speech synthesis capabilities on Meta’s LibriLight audio library, which contains 60,000 hours of English-language speech from over 7,000 speakers, sourced primarily from LibriVox public domain audiobooks. For VALL-E to produce a good result, the voice in the three-second sample should closely resemble a voice in the training data.

The American technology giant offers dozens of audio examples of the AI model in action on the VALL-E example website. The “Speaker Prompt” is the three-second audio sample that VALL-E must try to emulate. The “Ground Truth” is an existing recording of that same speaker saying a specific phrase, included for comparison (much like the “control” in an experiment). The “Baseline” sample is generated by a traditional text-to-speech synthesis method, and the “VALL-E” sample is generated by the VALL-E model.

A block diagram of VALL-E, as shown on the example website by Microsoft researchers
Photo Credit: Microsoft

To get those results, the researchers supplied VALL-E with only the three-second “Speaker Prompt” sample and a text string (what they wanted the voice to say). Some VALL-E results sound computer-generated, but others could be mistaken for human speech, which is the model’s goal. Because of VALL-E’s potential to fuel wrongdoing and deceit, Microsoft has not made the VALL-E code available for others to explore. The researchers appear to be aware of the potential social harm that this technology may cause.

They write in the paper’s conclusion: “Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker. To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E. We will also put Microsoft AI Principles into practice when further developing the models.”
