AudioHijack: adversarial audio attacks on generative voice models transfer from open weights to Microsoft and Mistral production systems

Interesting new research on attacking large audio language models that you may have heard of. The attack is called AudioHijack, and the part worth paying attention to is that adversarial clips built against open models transferred to commercial Microsoft and Mistral systems sharing the same architecture. OpenAI and Anthropic are harder targets, but the team thinks shared open-source audio encoders are a viable path in, and they’re working on it.

The manipulations are shaped to sound like natural reverberation instead of added noise, so you can’t really hear them. The threat model only requires controlling the audio the model processes, not the user’s prompt. So: poisoned YouTube clips, music, voice notes, Zoom audio fed to transcription. The team also says they’ve gotten this working against live voice chats in real time (unpublished).

Six attack categories were demonstrated: refusing user requests, returning false info, inserting malicious links, swapping persona, claiming it can’t process audio, and triggering unauthorized tool use.

On the technical side, two things stood out to me. First, generative audio models tokenize the input, which kills the fine-grained gradient signal older adversarial audio work relied on, so they approximated it. Second, they explicitly hijack the attention mechanism by scoring how much attention the model pays to the adversarial audio vs. the user instruction and feeding that score back into the optimization.

Defenses are where it gets bleak. Few-shot prompting with examples of malicious instructions cut attack success by 7%. Self-reflection caught 28%. Monitoring internal attention patterns was the only thing that actually worked, and an attacker who knows about it can dial back the attention manipulation and take a small hit to success rate to evade it.

Microsoft acknowledged the work and pointed at developer-side mitigations. Mistral didn’t respond.

Text prompt injection at least leaves visible artifacts. Audio doesn’t, and we don’t really have a good story for this yet. Thoughts?

submitted by /u/snackymann to Technical Information Security Content & Discussion
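
For anyone who wants a concrete picture of the attention scoring and the attention-monitoring defense mentioned in the post, here is a minimal Python sketch. It is not the researchers' code. The function names, the way the score is defined (attention mass on audio positions divided by the mass on audio plus instruction positions), and the 0.8 threshold are all assumptions made for illustration. In a real model the rows would come from the model's attention weights. Here they are hand-written lists.

```python
from typing import Sequence


def attention_share(
    attn_rows: Sequence[Sequence[float]],
    audio_idx: Sequence[int],
    instr_idx: Sequence[int],
) -> float:
    """Share of attention mass on audio positions vs. instruction positions.

    Hypothetical score, not taken from the paper: each row is one attention
    distribution over input positions (for example, one generated token).
    Returns a value in [0, 1]; higher means the audio dominates.
    """
    audio_mass = 0.0
    instr_mass = 0.0
    for row in attn_rows:
        audio_mass += sum(row[i] for i in audio_idx)
        instr_mass += sum(row[i] for i in instr_idx)
    total = audio_mass + instr_mass
    if total == 0.0:
        # No attention on either span; treat as "not dominated by audio".
        return 0.0
    return audio_mass / total


def looks_hijacked(
    attn_rows: Sequence[Sequence[float]],
    audio_idx: Sequence[int],
    instr_idx: Sequence[int],
    threshold: float = 0.8,  # assumed value, would need tuning on real data
) -> bool:
    """Toy monitor: flag inputs whose audio attention share exceeds a threshold."""
    return attention_share(attn_rows, audio_idx, instr_idx) > threshold


if __name__ == "__main__":
    audio_positions = [0, 1, 2]   # hypothetical audio token positions
    instr_positions = [3, 4]      # hypothetical user-instruction positions

    benign = [[0.1, 0.1, 0.1, 0.4, 0.3]]
    hijacked = [[0.3, 0.3, 0.3, 0.05, 0.05]]

    print(looks_hijacked(benign, audio_positions, instr_positions))
    print(looks_hijacked(hijacked, audio_positions, instr_positions))
```

The same number works in both directions under these assumptions: an attacker would push it up during optimization, and a defender would alert when it gets too high. That also shows why the evasion described in the post is plausible, since an attacker who knows the threshold can aim for a share just below it and accept a lower success rate.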