Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
Published as an arXiv preprint, 2024
Abstract
Large Language Models (LLMs) are vulnerable to ‘jailbreaking’ prompts, a type of attack that can coax these models into generating harmful content. In this paper, we show that pruning up to 20% of the attention-layer parameters with WANDA (Sun et al.) markedly increases resistance to such attacks without fine-tuning and with negligible loss of performance on standard benchmarks. Intriguingly, we find that the safety gains after pruning correlate with the model's initial level of safety training. This suggests that WANDA pruning has a regularizing effect, which we reproduce with statistical significance for linear models. To evaluate safety systematically, we curate a dataset of 225 harmful tasks across five categories, inserted into ten different jailbreaking prompts. Our analysis shows that pruning helps LLMs concentrate attention on the task-relevant tokens within jailbreaking prompts. This approach to safety is orthogonal and complementary to existing adversarial defense methods.
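As a rough illustration (not the authors' code), WANDA scores each weight of a linear layer by its magnitude times the L2 norm of the corresponding input activation channel, then zeroes the lowest-scoring weights within each output row. The minimal PyTorch sketch below uses hypothetical names (`wanda_prune_linear`, `act_norm`); in the paper's setting, `sparsity=0.2` applied to attention-layer weight matrices corresponds to the "up to 20%" pruning mentioned in the abstract.

```python
import torch

def wanda_prune_linear(weight: torch.Tensor, act_norm: torch.Tensor, sparsity: float = 0.2) -> torch.Tensor:
    """Return a pruned copy of `weight` using a WANDA-style importance score.

    weight:   (out_features, in_features) matrix of a linear layer.
    act_norm: (in_features,) per-input-channel L2 norm of calibration activations.
    sparsity: fraction of weights to zero out within each output row.
    """
    # WANDA score: |weight| scaled by the activation norm of its input channel.
    score = weight.abs() * act_norm.unsqueeze(0)

    # Zero the lowest-scoring weights in each output row.
    k = int(weight.shape[1] * sparsity)
    pruned = weight.clone()
    if k > 0:
        _, idx = torch.topk(score, k, dim=1, largest=False)
        pruned.scatter_(1, idx, 0.0)
    return pruned
```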
Recommended citation: **Adib Hasan**, Ileana Rugina, & Alex Wang. (2024). Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning. https://arxiv.org/abs/2401.10862