Zoom calls and meetings in the metaverse could all be improved in the future thanks to a set of AI models developed by engineers at Meta. According to the company, the models match sound with images and mimic the way humans experience sound in the real world.
Developed in collaboration with researchers at the University of Texas at Austin, the three models are known as Visual Acoustic Matching, Visually-Informed Dereverberation and VisualVoice. Meta has made the models available to developers.
“For example, there is a big difference between the sound of a concert in a large venue and the same concert in a living room. The geometry of the physical space, the materials and surfaces in the area, and the proximity of the listener to the sound source all affect how the audio is heard.”
Meta’s new audio AI models
The Visual Acoustic Matching model can take an audio clip recorded anywhere, along with an image of a room or other space, and transform the clip so that it sounds as if it had been recorded in that space.
One example use case is making everyone in a video chat sound as though they are in the same place. If one participant is at home, another in a coffee shop and a third in an office, their audio could be adjusted so that they all sound as if they were sitting in the same room.
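As a rough illustration of the underlying idea rather than Meta’s actual model, the sketch below convolves a “dry” clip with a room impulse response (RIR) to make it sound as if it were recorded in that room. The tone, the synthetic RIR and the 400ms decay are all invented stand-ins; Visual Acoustic Matching instead predicts the target acoustics from an image.

```python
# Illustrative sketch only: making a clip "sound like" a room via room impulse
# response (RIR) convolution. All signals here are synthetic stand-ins.
import numpy as np
from scipy.signal import fftconvolve

sample_rate = 16_000

# Dry source: one second of a 440 Hz tone standing in for "audio recorded anywhere".
t = np.arange(sample_rate) / sample_rate
dry = 0.5 * np.sin(2 * np.pi * 440.0 * t)

# Synthetic RIR: exponentially decaying noise, a crude stand-in for a reverberant room.
rng = np.random.default_rng(0)
rir_length = int(0.4 * sample_rate)                      # ~400 ms reverb tail
rir = rng.standard_normal(rir_length) * np.exp(-6.0 * np.arange(rir_length) / rir_length)
rir /= np.max(np.abs(rir))

# Convolving the dry clip with the RIR "places" it in the simulated room.
wet = fftconvolve(dry, rir)[: len(dry)]
wet /= np.max(np.abs(wet))                               # normalise to avoid clipping

print(f"dry RMS: {np.sqrt(np.mean(dry**2)):.3f}, wet RMS: {np.sqrt(np.mean(wet**2)):.3f}")
```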
Visually-Informed Dereverberation works in the opposite direction, using audio and visual cues from a space to strip the reverberation out of a recording. It could, for example, isolate the sound of a violin performance even if it was recorded in a large, echoing train station.
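Again as an illustration rather than Meta’s method, dereverberation can be sketched in its simplest form when the room impulse response is actually known: a regularised inverse filter in the frequency domain undoes the convolution the room applied. The signals and RIR below are synthetic stand-ins; the neural model has to infer the room’s effect from visual and audio cues instead.

```python
# Illustrative sketch only: dereverberation by regularised inverse filtering,
# assuming the room impulse response (RIR) is known. Everything is synthetic.
import numpy as np

rng = np.random.default_rng(1)
fs = 16_000

# Dry "violin" stand-in and a synthetic reverberant room.
t = np.arange(fs) / fs
dry = 0.5 * np.sin(2 * np.pi * 660.0 * t)
rir_len = int(0.3 * fs)
rir = rng.standard_normal(rir_len) * np.exp(-8.0 * np.linspace(0.0, 1.0, rir_len))

# Reverberant recording = dry signal convolved with the RIR.
wet = np.convolve(dry, rir)

# Regularised deconvolution: divide in the frequency domain, damping bins where
# the RIR has little energy so noise is not blown up.
n = len(wet)
H = np.fft.rfft(rir, n)
W = np.fft.rfft(wet, n)
eps = 1e-3 * np.max(np.abs(H)) ** 2
dry_est = np.fft.irfft(W * np.conj(H) / (np.abs(H) ** 2 + eps), n)[: len(dry)]

err = np.sqrt(np.mean((dry_est - dry) ** 2)) / np.sqrt(np.mean(dry ** 2))
print(f"relative RMS error after dereverberation: {err:.3f}")
```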
Finally, the VisualVoice model uses visual and audio cues to separate a speaker’s voice from background noise and other voices, allowing the listener to focus on a particular conversation. It could be used in large conference halls where many conversations are happening at once.
This focused audio could also be used to generate higher-quality subtitles and to help future machine learning systems make audio output easier to understand when multiple people are speaking, Meta explained.
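The general mask-and-reconstruct principle behind this kind of speech separation can be sketched without the visual stream at all. In the illustrative example below, an “oracle” ratio mask computed from known sources stands in for the mask that VisualVoice would learn from lip movements and facial appearance; the two tones standing in for the target voice and background chatter are invented.

```python
# Minimal sketch of mask-and-reconstruct speech separation. VisualVoice learns
# its mask from visual cues; here an oracle mask from known sources stands in.
import numpy as np
from scipy.signal import stft, istft

fs = 16_000
t = np.arange(fs) / fs
target = 0.6 * np.sin(2 * np.pi * 220.0 * t)     # stand-in for the voice we care about
noise = 0.6 * np.sin(2 * np.pi * 1800.0 * t)     # stand-in for background chatter
mixture = target + noise

# Move to the time-frequency domain, where speech and interference overlap less.
f, _, X_mix = stft(mixture, fs=fs, nperseg=512)
_, _, X_tgt = stft(target, fs=fs, nperseg=512)
_, _, X_noise = stft(noise, fs=fs, nperseg=512)

# Ideal ratio mask: per time-frequency bin, the fraction of energy belonging to the target.
mask = np.abs(X_tgt) / (np.abs(X_tgt) + np.abs(X_noise) + 1e-8)

# Apply the mask to the mixture and invert back to a waveform.
_, separated = istft(mask * X_mix, fs=fs, nperseg=512)

n = min(len(separated), len(target))
err = np.sqrt(np.mean((separated[:n] - target[:n]) ** 2))
print(f"RMS error between separated and clean target: {err:.4f}")
```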
Audio improvements in virtual experiences
Rob Goodman, a reader in music at the University of Hertfordshire and an expert in acoustic spaces, told Tech Monitor that this work takes the human need to understand where we are in the world and brings it into virtual settings.
“We have to think about how humans perceive sound in their environment,” says Goodman. “Humans want to know where a sound is coming from and how big or small the space is. When we hear a sound being made, we actually hear several different things: one is the sound source itself, and the other is what happens to that sound when it combines with the room around it.”
Properly capturing and mimicking that second aspect, he explains, can make virtual worlds and spaces feel more realistic and remove the disconnect people experience when the visuals do not quite match the audio.
An example of this would be a concert shown playing outdoors while the actual audio was recorded inside a cathedral, with considerable reverb. Reverb is not something you expect to hear on a beach, so the mismatch between sound and visuals is unexpected and jarring.
According to Goodman, the biggest shift is in how listener perception is taken into account when implementing these AI models. “We need to think carefully about the location of the listener,” he says. “A sound close to a person behaves differently from one a few metres away. Because of the speed of sound in air, even the slight delay in the time it takes for a sound to reach a person is very important.”
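The delay Goodman describes is easy to put numbers on: at roughly 343 metres per second, the approximate speed of sound in air at 20°C, sound from a source half a metre away arrives in about 1.5 milliseconds, while a source ten metres away takes nearly 30 milliseconds. A small illustrative calculation:

```python
# Back-of-the-envelope illustration of Goodman's point: the delay before a sound
# reaches the listener grows with distance, and a virtual scene that ignores it
# can feel subtly wrong. 343 m/s is the approximate speed of sound in air at 20 °C.
SPEED_OF_SOUND_M_PER_S = 343.0

def propagation_delay_ms(distance_m: float) -> float:
    """Time in milliseconds for sound to travel the given distance in air."""
    return distance_m / SPEED_OF_SOUND_M_PER_S * 1000.0

for distance in (0.5, 2.0, 10.0):
    print(f"{distance:>5.1f} m -> {propagation_delay_ms(distance):6.2f} ms")
```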
He said part of the problem with improving audio is a lack of investment in end-user equipment: users will “spend thousands of pounds on curved monitors, but never pay more than £20 for a pair of headphones”, he explained.
Professor Mark Plumbley, EPSRC Fellow in AI for Sound at the University of Surrey, is developing classifiers for different types of sound so that they can be removed from or highlighted in a recording. “If you want to create this realistic experience for people, you need to match the vision with the sound,” he says.
“It’s harder for a computer than you might think. When we’re listening, an effect called directional masking helps us focus on the sound coming from someone in front of us and ignore sound coming from the side.”
This is something we are used to doing in the real world, says Plumbley. “If you’re at a cocktail party with lots of conversations going on, you can focus on the conversation you’re interested in and block out the sounds coming from the side and elsewhere,” he says. “This is difficult to do in a virtual world.”
He says much of this work has been made possible by advances in machine learning, with better deep learning techniques now working across a variety of areas, such as speech and image AI. “Many of these are related to signal processing,” Plumbley adds.
“Sound, gravitational waves, time-series information from financial data: these are all signals that arrive over time. Until now, researchers have built individual methods for different types of signal to extract different things. We are now discovering that deep learning models can pick out those patterns themselves.”