Attention Eclipse: Manipulating Attention to Bypass LLM Safety-Alignment
Published in EMNLP, 2025
We propose Attention Eclipse, the first jailbreak technique that bypasses safety alignment in large language models by directly manipulating their attention mechanisms. It achieves a 91.2% attack success rate on LLaMA-2, cuts the cost of generating adversarial prompts by 66%, and transfers strongly across models.
This work reveals a previously unexplored vulnerability at the architectural level of LLMs and highlights the critical role of attention in model alignment and safety. Our findings suggest that alignment defenses must consider internal attention dynamics rather than relying solely on surface-level prompt filtering.
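To make "internal attention dynamics" concrete, below is a minimal sketch of how a defender might inspect the attention a prompt's tokens receive, using the Hugging Face Transformers `output_attentions` interface. The model name, the prompt, and the idea of flagging shifts in attention mass are illustrative assumptions for this sketch; this is not the paper's attack or released code.

```python
# Minimal sketch: extract per-token attention mass for a prompt, the kind of
# internal signal an attention-aware defense could monitor. Model choice and
# the monitoring idea are assumptions, not the paper's method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Explain how photosynthesis works."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # output_attentions=True returns one (batch, heads, seq, seq) tensor per layer
    outputs = model(**inputs, output_attentions=True)

# Average over layers and heads, then sum over query positions to get the
# total attention each token receives. A hypothetical defense could watch
# for attention mass collapsing away from safety-relevant spans.
attn = torch.stack(outputs.attentions)            # (layers, batch, heads, seq, seq)
received = attn.mean(dim=(0, 2))[0].sum(dim=0)    # (seq,) attention received per token
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, score in zip(tokens, received.tolist()):
    print(f"{tok:>12s}  {score:.3f}")
```

Reading out these maps is only the first step; the paper's point is that a defense restricted to surface-level prompt filtering never sees this layer of the computation at all.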
