Imagine the following scenario. A phone rings. An office worker answers it and hears his panicked boss tell him that she forgot to wire money to the new contractor before leaving for the day and needs him to do it. She gives him the wire transfer information, and with the money sent, the crisis is averted.
The worker sits back in his chair, takes a deep breath and watches his boss walk in the door. The voice on the other end of the call was not his boss. In fact, the caller wasn’t even human. The voice he heard was an audio deepfake, a machine-generated sample designed to sound exactly like his boss.
Attacks like this using recorded audio have already happened, and conversational audio deepfakes may not be far off.
Deepfakes, both audio and video, have become possible only with the development of sophisticated machine learning technologies in recent years. Deepfakes have brought with them a new level of uncertainty about digital media. To detect deepfakes, many researchers have turned to analyzing visual artifacts – small glitches and inconsistencies – found in deepfake videos.
Audio deepfakes are potentially an even bigger threat because people often communicate verbally without video – for example, through phone calls, radio and voice recordings. These voice-only communications greatly expand the possibilities for attackers to use deepfakes.
To detect audio deepfakes, we and our research colleagues at the University of Florida developed a technique that measures the acoustic and fluid dynamic differences between voice samples generated organically by human speakers and those generated synthetically by computers.
Organic vs. synthetic voices
People speak by forcing air through the various structures of the vocal tract, including the vocal folds, tongue and lips. By rearranging these structures, you change the acoustic properties of your vocal tract, allowing you to create over 200 distinct sounds, or phonemes. However, human anatomy fundamentally limits the acoustic behavior of these phonemes, resulting in a relatively small range of correct sounds for each.
In contrast, audio deepfakes are created by first letting a computer listen to recordings of a targeted victim speaking. Depending on the exact techniques used, the computer may need to hear as little as 10 to 20 seconds of audio. This audio is used to extract key information about the unique aspects of the victim’s voice.
The attacker selects a phrase for the deepfake to speak and then, using a modified text-to-speech algorithm, creates an audio sample that sounds like the victim saying the selected phrase. This process of creating a single deepfaked audio sample can be completed in a matter of seconds, potentially giving attackers enough flexibility to use the deepfaked voice in a live conversation.
Fake audio detection
The first step in differentiating human-produced speech from deepfake speech is understanding how to acoustically model the vocal tract. Fortunately, scientists have techniques to estimate what someone – or some creature such as a dinosaur – would sound like based on anatomical measurements of their vocal tract.
We did the opposite. By reversing many of these same techniques, we were able to extract an approximation of a speaker’s vocal tract during a segment of speech. This allowed us to effectively examine the anatomy of the speaker who produced the audio sample.
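One classical signal-processing route to such an approximation – a standard textbook technique, not necessarily the authors' exact method – is linear predictive coding (LPC): the reflection coefficients produced by the Levinson-Durbin recursion map onto a model of the vocal tract as a chain of short tubes, yielding relative cross-sectional areas along its length. A minimal sketch in numpy, run here on a toy two-formant signal rather than real speech:

```python
import numpy as np

def levinson_durbin(r, order):
    """Levinson-Durbin recursion: autocorrelation -> LPC and reflection coefficients."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    refl = np.zeros(order)
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        refl[i - 1] = k
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a, refl

def area_function(refl, end_area=1.0):
    """Concatenated-tube model: each reflection coefficient sets the area ratio
    between adjacent tube sections (sign convention varies across texts)."""
    areas = [end_area]
    for k in refl:
        areas.append(areas[-1] * (1.0 - k) / (1.0 + k))
    return np.array(areas)

# Toy "speech": white noise through an all-pole filter with two formant-like
# resonances (~700 Hz and ~1200 Hz at an 8 kHz sampling rate).
rng = np.random.default_rng(0)
fs = 8000
poles = []
for f, bw in [(700, 100), (1200, 120)]:
    mag = np.exp(-np.pi * bw / fs)
    ang = 2 * np.pi * f / fs
    poles += [mag * np.exp(1j * ang), mag * np.exp(-1j * ang)]
a_true = np.real(np.poly(poles))          # denominator coefficients, a_true[0] == 1

x = rng.standard_normal(4000)
y = np.zeros_like(x)
for n in range(len(x)):                   # direct-form IIR filter: y = x / A(z)
    y[n] = x[n] - sum(a_true[j] * y[n - j]
                      for j in range(1, len(a_true)) if n - j >= 0)

order = 8
r = np.array([y[:len(y) - lag] @ y[lag:] for lag in range(order + 1)])
lpc, refl = levinson_durbin(r, order)
areas = area_function(refl)               # relative cross-sections along the tract
```

Because the autocorrelation sequence of a real signal is positive semidefinite, the reflection coefficients all have magnitude below one and the resulting areas stay positive, which is what makes them interpretable as (relative) tube cross-sections.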
From here, we hypothesized that deepfake audio samples would not be constrained by the same anatomical limits that humans have. In other words, analyzing deepfake audio samples would yield estimated vocal tract shapes that do not exist in people.
Our test results not only confirmed our hypothesis but revealed something interesting. When extracting vocal tract estimates from deepfake audio, we found that the estimates were often comically wrong. For instance, it was common for deepfake audio to yield vocal tracts with the same relative diameter and consistency as a drinking straw, in contrast to human vocal tracts, which are much wider and more varied in shape.
This realization demonstrates that deepfake audio, even when convincing to human listeners, is far from indistinguishable from human-generated speech. By estimating the anatomy responsible for producing the observed speech, it is possible to determine whether the audio was generated by a person or a computer.
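As an illustration of the kind of check this enables – with made-up thresholds for illustration only, not the study's actual criteria – one could flag a sample whose estimated cross-sections are as narrow and uniform as a straw, since human tracts vary widely along their length:

```python
import numpy as np

# Illustrative threshold only -- real detectors would be calibrated on data.
MIN_RELATIVE_SPREAD = 0.2   # human vocal tracts vary in width along their length

def looks_human(areas):
    """Heuristic: a straw-like (nearly constant) area function is suspicious.
    Uses the coefficient of variation of the estimated cross-sections."""
    areas = np.asarray(areas, dtype=float)
    spread = areas.std() / areas.mean()
    return spread >= MIN_RELATIVE_SPREAD

human_like = np.array([1.0, 2.5, 0.8, 3.0, 1.5])   # varied, human-plausible
straw_like = np.full(5, 0.5)                        # uniform, straw-like
```

Here `looks_human(human_like)` passes while `looks_human(straw_like)` is flagged; a practical system would combine many such anatomical features rather than a single spread statistic.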
Why this matters
Today’s world is defined by the digital exchange of media and information. Everything from news to entertainment to conversations with loved ones usually happens through digital exchanges. Even in their infancy, deepfake video and audio undermine the trust people have in these exchanges, effectively limiting their usefulness.
If the digital world is to remain a critical resource for information in people’s lives, efficient and secure techniques for determining the source of an audio sample are vital.
Logan Blue, PhD student in Computer & Information Science & Engineering, University of Florida and Patrick Traynor, Professor of Computer & Information Science & Engineering, University of Florida
This article is republished from The Conversation under a Creative Commons license. Read the original article.