Microsoft Voice AI: When Technology Goes Beyond Magic
Artificial intelligence tools are becoming increasingly important, and their importance will only continue to grow in the near future. Many jobs that were once performed by people can now be automated with the help of AI, reducing costs and increasing efficiency. This applies to many kinds of tasks across many industries, and ours is no exception. Artificial intelligence is having a growing impact on audio and sound design, where it can be used to analyze and synthesize sounds and to improve the quality and accuracy of the results.
For example, one of the most common applications of AI in sound design is the creation of synthetic sounds. AI can be used to create audio that mimics the human voice, musical instruments, and other sources that can be difficult and expensive to record or produce in a traditional way. In many cases, AI can analyze and synthesize such sounds much faster than a sound designer working by hand. This is precisely what we want to talk about today, because Microsoft has created an AI tool called Vall-E that is capable of emulating a person’s voice from an audio sample of just three seconds: a disruptive technology from any point of view.
Vall-E is essentially a “neural codec” language model that generates speech from a text prompt and a short audio sample of the target speaker. This means that, in a very short time, it can analyze a person’s voice, break that information into discrete components, and draw on the large amount of speech data it was trained on to predict what the voice would sound like if it uttered sentences other than those in the sample. It can mimic the timbre of the speaker’s voice and the emotional tone of the speech, and it can even reproduce the acoustic environment of the recording. For instance, if the sample is taken from a telephone conversation, the audio output will simulate the acoustic and frequency characteristics of the telephone.
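To make the idea of a “neural codec” more concrete, here is a minimal sketch in Python of how audio can be turned into discrete tokens that a language model could then learn to predict. It is only an illustration under simplifying assumptions, not Microsoft’s code: the function names (make_sine, frame_signal, encode, decode) and the tiny hand-written CODEBOOK are hypothetical, whereas a real neural codec learns its codebooks from data with a neural network.

```python
import math


def make_sine(freq_hz, sample_rate=8000, n_samples=64):
    """Generate a short test tone as a list of floats in [-1, 1]."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]


def frame_signal(samples, frame_size):
    """Split samples into non-overlapping frames, dropping any incomplete tail."""
    n_frames = len(samples) // frame_size
    return [samples[i * frame_size:(i + 1) * frame_size]
            for i in range(n_frames)]


def squared_distance(a, b):
    """Sum of squared differences between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode(frames, codebook):
    """Map each frame to the index (token) of its nearest codebook entry."""
    return [
        min(range(len(codebook)),
            key=lambda k: squared_distance(frame, codebook[k]))
        for frame in frames
    ]


def decode(tokens, codebook):
    """Rebuild an approximate signal by concatenating codebook entries."""
    samples = []
    for token in tokens:
        samples.extend(codebook[token])
    return samples


# Hypothetical toy codebook: five flat frames of four samples each.
# A real codec would learn many entries from large amounts of audio.
CODEBOOK = [
    [0.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
    [-0.5, -0.5, -0.5, -0.5],
    [1.0, 1.0, 1.0, 1.0],
    [-1.0, -1.0, -1.0, -1.0],
]

samples = make_sine(440.0)
frames = frame_signal(samples, frame_size=4)
tokens = encode(frames, CODEBOOK)          # audio -> discrete tokens
reconstructed = decode(tokens, CODEBOOK)   # tokens -> approximate audio

print(tokens)
```

Each frame of four samples becomes a single integer, so the 64-sample tone is reduced to 16 tokens. The reconstruction is deliberately crude here; the point is only that once audio is expressed as a sequence of tokens, generating speech can be treated much like generating text.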
Vall-E opens up a whole range of application possibilities, but like other systems that have dominated the social conversation in recent months, it carries risks of misuse. The most obvious is that it could enable scams, forgeries designed to manipulate public opinion, and other crimes, especially since many people are not aware that technologies like this exist. Another is that the voice will no longer work as a password. As technologies such as AI advance, biometric authentication faces more and more obstacles to remaining secure and reliable. Perhaps for this reason, Microsoft has not released the model’s code, aware of the potential damage the technology could cause, such as identity theft.
The use of these tools also raises ethical questions about intellectual property, authorship, and a person’s right to privacy over their own voice, as well as concerns about data protection and about the authenticity and trustworthiness of content produced with synthetic voices.
However, everything is changing, and that train cannot be stopped; security is reinventing itself as well. It will be possible, for example, to build a detection model that discriminates whether an audio clip was synthesized by Vall-E, just as it is possible to identify a deepfake or a fake photo.
Vall-E is still under development, though, and all indications are that there is a long way to go. To optimize the model, Microsoft plans to expand its training data to improve performance on prosody, speech style, and speaker similarity. It also says it is exploring ways to reduce missing or unclear words in the generated audio.
Regardless, audio-generating artificial intelligence tools such as Microsoft’s Vall-E can have important implications for the audio industry and sound design, especially in music production and audiovisual content creation. Such tools can be very useful for producing music and sound effects, as they allow sound designers and composers to create sounds that would otherwise be difficult to produce. They can also be used to create synthetic voices and effects for video games, movies, and TV series, which could make audiovisual content production faster and more efficient.
In the future, we will have to get used to the presence and use of this technology. Much will be transformed along the way. The key will be how we use it and for what purposes, as well as how much we give it what it lacks: our humanity, our creativity, and our pursuit of excellence. More than a threat, AI can be a complement.
If you need advice on this and other sound design issues or need expert help to take your audiovisual production to the next level of quality, don’t hesitate to contact Enhanced Media Sound Studio. We will be happy to create a masterpiece with you.