The Register

Boffins devise voice-altering tech to jam ‘vishing’ schemes

Researchers based in Israel and India have developed a defense against automated call scams.

ASRJam is a speech recognition jamming system that uses a sound modification algorithm called EchoGuard to apply natural audio perturbations to the voice of a person speaking on the phone. It’s capable of subtly distorting human speech in a way that baffles most speech recognition systems but not human listeners.

The tech is needed because recent advances in machine learning, text-to-speech (TTS), and automatic speech recognition (ASR) have made it quite easy to automatically make phone calls with the intent to scam or defraud.

These “vishing” attacks – like email-based phishing but conducted by voice instead of text – see criminals use TTS to create a realistic-sounding voice that speaks words they hope will lure victims. If the recipient of a call responds, the crook’s ASR system attempts to convert the vocal response to text, so a back-end model can decipher what was said, devise a reply, and sustain a conversation long enough to elicit sensitive information or prompt the victim to take an action.

Vishing attacks increased 442 percent between the first and second halves of 2024, according to CrowdStrike’s 2025 Global Threat Report. During the first half of that year, the US Federal Communications Commission declared robocalls that use AI-generated voices illegal.

As Crystal Morin, former intelligence analyst for the US Air Force and cybersecurity strategist at infosec vendor Sysdig, told The Register in December 2024, voice-based phishing is becoming harder to detect as AI models get better.

Freddie Grabovski (Ben-Gurion University of the Negev), Gilad Gressel (Amrita Vishwa Vidyapeetham), and Yisroel Mirsky (Ben-Gurion University of the Negev) have come up with a defense against vishing, described in a pre-print paper titled “ASRJam: Human-Friendly AI Speech Jamming to Prevent Automated Phone Scams.”

They argue that the ASR component of the scammers’ setups represents the weakest link.

“Our key insight is that by disrupting ASR performance, we can break the attack chain,” they explain in their paper. “To this end, we propose a proactive defense framework based on universal adversarial perturbations, carefully crafted noise added to the audio signal that confuses ASR systems while leaving human comprehension intact.”
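Stripped to its essence, the quoted idea is to mix one fixed, pre-computed waveform into every outgoing utterance while capping its amplitude so human listeners barely notice. The sketch below is purely illustrative: crafting the perturbation is the hard part and the paper’s actual contribution, so it’s stubbed out here, and the function and parameter names are ours, not the authors’.

```python
import numpy as np

def apply_universal_perturbation(audio, perturbation, epsilon=0.01):
    """Mix a fixed, pre-computed perturbation into a speech signal.

    `perturbation` stands in for a universal adversarial waveform
    (how it is crafted is not reproduced here); `epsilon` caps its
    amplitude so the speech stays intelligible to humans. Both
    arrays are assumed to be floats in [-1.0, 1.0].
    """
    # Tile or trim the perturbation to match the utterance length.
    reps = int(np.ceil(len(audio) / len(perturbation)))
    p = np.tile(perturbation, reps)[: len(audio)]
    # Bound the perturbation, then mix it into the clean audio.
    p = np.clip(p, -epsilon, epsilon)
    return np.clip(audio + p, -1.0, 1.0)
```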

The researchers say they believe they’re the first to propose a proactive defense against automated voice scams that’s practical enough to deploy.

ASRJam defends against vishing by running the EchoGuard algorithm in real time on end-user devices. The tool is invisible to attackers, making it more difficult to circumvent.

EchoGuard is also universal – it works, to varying degrees, against any ASR model – and zero-query, meaning it doesn’t require sample ASR output to generate an audio perturbation capable of breaking the model.

The authors say that while other ASR jamming techniques have been proposed over the past few years (AdvDDoS, Kenansville, and Kenku), none are suitable for live phone calls: “their perturbations, though often intelligible, are perceptually harsh and impractical for interactive scenarios.”

ASRJam is better, they argue, because EchoGuard modifies the voice in three ways: reverberation, microphone oscillation, and transient acoustic attenuation.

Together, these tricks – altering sound reflection characteristics, simulating changes in microphone position, and subtly shortening sounds – mean the method “strikes the best balance between clarity and pleasantness,” the researchers claim, based on a survey they conducted with an unspecified number of participants.
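The paper doesn’t publish EchoGuard’s parameters, but the three effect families it names are standard signal-processing operations. Here’s a rough NumPy sketch of what each might look like; every constant, rate, and function name below is an illustrative guess rather than the authors’ implementation.

```python
import numpy as np

def fake_reverb(x, sr, decay=0.3, delay_ms=40):
    """Reverberation: convolve with a tiny synthetic impulse
    response (a single echo tap) to alter reflection character."""
    d = int(sr * delay_ms / 1000)
    ir = np.zeros(d + 1)
    ir[0], ir[d] = 1.0, decay
    y = np.convolve(x, ir)[: len(x)]
    return y / (np.max(np.abs(y)) or 1.0)  # renormalize

def mic_oscillation(x, sr, rate_hz=0.5, depth=0.2):
    """Microphone oscillation: slow amplitude modulation, as if the
    speaker's distance from the mic were drifting."""
    t = np.arange(len(x)) / sr
    gain = 1.0 - depth * 0.5 * (1.0 + np.sin(2 * np.pi * rate_hz * t))
    return x * gain

def transient_attenuation(x, sr, every_s=1.0, dip_ms=30, gain=0.5):
    """Transient acoustic attenuation: briefly duck short slices
    of the signal, subtly 'shortening' sounds."""
    y = x.copy()
    step, dip = int(sr * every_s), int(sr * dip_ms / 1000)
    for start in range(0, len(y) - dip, step):
        y[start : start + dip] *= gain
    return y

def echoguard_like(x, sr):
    """Chain the three effects; the ordering here is arbitrary."""
    return transient_attenuation(mic_oscillation(fake_reverb(x, sr), sr), sr)
```

Any real system would need to tune such effects carefully – too little and the ASR copes, too much and the call becomes unpleasant, which is precisely the balance the survey above was measuring.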

They’ve published a website that includes an original speech sample and copies that have been processed with EchoGuard and other algorithms for comparison.

The researchers evaluated ASRJam/EchoGuard and the other techniques against three public datasets (Tedlium, SPGISpeech, and LibriSpeech) and six ASR models (DeepSpeech, Wav2Vec2, Vosk, Whisper, SpeechBrain, and IBM Watson).
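The headline metric in such an evaluation is an attack success rate per model and dataset. The article doesn’t spell out the paper’s exact criterion, so the sketch below assumes one plausible definition – a jammed utterance counts as a success when the transcript’s word error rate crosses a threshold – using the real `jiwer` library and a hypothetical `transcribe()` callable for the ASR under test.

```python
from jiwer import wer  # pip install jiwer

def attack_success_rate(samples, transcribe, threshold=0.5):
    """Fraction of jammed utterances whose transcript is badly corrupted.

    `samples` yields (reference_text, jammed_audio) pairs; `transcribe`
    wraps the ASR system under test (Whisper, Vosk, etc.). The 0.5 WER
    threshold is an assumption, not the paper's published criterion.
    """
    hits = total = 0
    for reference, audio in samples:
        hypothesis = transcribe(audio)
        if wer(reference, hypothesis) >= threshold:
            hits += 1
        total += 1
    return hits / total if total else 0.0
```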

“Across the board, EchoGuard consistently outperforms all baseline jammers,” the authors state in their paper. “Our method achieves the highest attack success rate on every ASR system tested, across all datasets, with only one minor exception: SpeechBrain (SB), where it is slightly outperformed by the others.”

The authors say they consider this acceptable since SpeechBrain isn’t common in real-world deployments and doesn’t perform especially well as a general-purpose ASR system.

They also note that all the automatic speech recog jamming techniques tested underperform against OpenAI’s Whisper model, which they suggest is better at filtering out adversarial noise because it was trained on a particularly large dataset that included plenty of noisy samples.

Nonetheless, EchoGuard does better than the other jammers against Whisper.

“Importantly, while the absolute attack success rate on Whisper may seem modest (e.g., 0.14 on LibriSpeech), this still implies that 1 in 6 transcriptions is significantly corrupted, a degradation level that could be sufficient to disrupt scam conversations, especially in the context of interactive dialogue where misrecognition of key terms or intents can derail an LLM’s generation,” they claim.

Lead researcher Grabovski told The Register that he believes future work will improve how ASRJam and EchoGuard perform against Whisper.

“ASRJam is currently a research project, but we’re actively working on improvements with the goal of commercializing it in the near future,” he said. ®
