Are you worried that artificial intelligence, as it advances so quickly, could endanger more human jobs? Here is some news that might deepen those worries: Microsoft researchers have unveiled a brand-new AI that can faithfully imitate a human voice from just a brief three-second audio sample.
Vall-E, a voice AI tool from Microsoft, was trained on 60,000 hours of speech (100 times more than current systems) from more than 7,000 speakers, most of them taken from LibriVox public domain audiobooks. It is also trained on “discrete codes derived from an off-the-shelf neural audio codec model.”
Microsoft have announced their AI "VALL-E"
Using a 3-second sample of human speech, it can generate super-high-quality text-to-text speech from the same voice. Even emotional range and acoustic environment of the sample data can be reproduced. Here are some examples. pic.twitter.com/ExoS2VWO6d
— Tuvok @ NaughtyDog (@TheCartelDel) January 7, 2023
According to Ars Technica, Vall-E is based on a technology called EnCodec, which Meta unveiled in October 2022. It functions by listening to a person’s voice, dissecting the data into its constituent parts, and applying its training to simulate how the voice would sound if it were speaking various phrases. Vall-E can mimic a speaker’s timbre and emotional tone even after only hearing a three-second sample.
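To make the idea of breaking a voice into discrete parts more concrete, here is a minimal toy sketch in Python. It is not EnCodec and not Microsoft's method: it is a plain nearest-neighbour vector quantizer, and the codebook values, frame size, and function names are all hypothetical, chosen only for illustration. It shows how a stream of audio samples can be turned into a short list of integer codes and then turned back into an approximate waveform.

```python
def frame_signal(samples, frame_size):
    """Split a list of samples into consecutive full frames (leftovers are dropped)."""
    return [samples[i:i + frame_size]
            for i in range(0, len(samples) - frame_size + 1, frame_size)]


def nearest_code(frame, codebook):
    """Return the index of the codebook vector closest to the frame."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)),
               key=lambda i: squared_distance(frame, codebook[i]))


def encode(samples, codebook):
    """Turn raw samples into a list of integer codes, one per frame."""
    frame_size = len(codebook[0])
    return [nearest_code(f, codebook) for f in frame_signal(samples, frame_size)]


def decode(codes, codebook):
    """Rebuild an approximate waveform by concatenating the codebook vectors."""
    out = []
    for c in codes:
        out.extend(codebook[c])
    return out


# Hypothetical 4-entry codebook of 2-sample frames (an assumption for this demo;
# real codecs learn much larger codebooks from data).
CODEBOOK = [
    [0.0, 0.0],
    [0.5, 0.5],
    [-0.5, -0.5],
    [1.0, -1.0],
]

samples = [0.1, -0.1, 0.4, 0.6, -0.6, -0.4, 0.9, -0.8]
codes = encode(samples, CODEBOOK)
print(codes)                    # [0, 1, 2, 3]
print(decode(codes, CODEBOOK))  # approximate reconstruction of the samples
```

In this toy version, eight samples are reduced to four small integers. A real neural codec learns its codebooks from data and works on far richer representations, but the underlying idea of replacing audio with a sequence of discrete codes is the same.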
According to the research paper, which is available on arXiv, the preprint server hosted by Cornell University, “Experiment results demonstrate that Vall-E significantly outperforms the state-of-the-art zero-shot TTS system [AI that recreates voices it has never heard] in terms of speech naturalness and speaker similarity. We also discover that VALL-E is capable of preserving the emotional state of the speaker as well as the acoustic context of the acoustic prompt in synthesis.”
Vall-E voice recreation samples can be heard on GitHub. Even though each is based on such a small audio sample, many are genuinely amazing, sounding almost exactly like the speaker. Some sound a little more robotic, closer to conventional text-to-voice software, but overall the AI is still quite impressive, and we can expect it to get better over time.
Concerns have been raised about Vall-E’s potential for abuse, as there are with all AIs. Political impersonation is one example, especially when it is used in conjunction with deepfakes. It might also trick people into thinking they are communicating with friends, family, or authorities, and into divulging private information. Additionally, some security systems rely on voice identification. As for the effect on employment, Vall-E would probably be less expensive than using voice actors.
The risks associated with Vall-E misuse, according to the researchers, can be reduced. “To determine whether an audio clip was created by Vall-E, a detection model can be created. When advancing the models, we’ll also take Microsoft AI Principles into consideration.”