RejectShield: Memory Makes The Poison

Abstract

TL;DR: Over-memorization in LVLMs significantly amplifies their vulnerability to visual poisoning attacks. We introduce RejectShield, a rejection-based defense that disrupts memorization and reduces attack success by up to 99%.

Large Vision-Language Models (LVLMs) excel across tasks, yet their safety and security remain underexplored. Among emerging threats, LVLM poisoning attacks are especially serious because they induce targeted hallucinations in fine-tuned LVLMs. Although these attacks are effective, their root cause remains poorly understood. The attack was originally justified as effective because of carefully crafted visual perturbations injected into the fine-tuning data, which subtly manipulate the model. Consequently, existing defenses rely on state-of-the-art (SOTA) purification methods, but these have proven ineffective so far.

In this work, we argue that this gap stems from a more fundamental issue: a limited understanding of LVLM vulnerabilities during fine-tuning. To address this, we systematically study the fine-tuning process and, for the first time, identify over-memorization as the key vulnerability: LVLMs tend to over-memorize fine-tuning concepts, directly leading to hallucinations in fine-tuned models. Our finding overturns the original justification: the dominant driver is over-memorization of injected concepts, not the visual perturbation. Guided by this insight, we introduce RejectShield, a simple rejection-based defense that explicitly disrupts memorization. Across eight settings spanning attack variants, attack goals, model families, and access regimes, RejectShield reduces attack success by up to 99% while largely preserving normal performance. Finally, we discuss broader implications of this memorization vulnerability, including evaluation methods that test concept replay and training practices that mitigate memorization pressure.

Over-Memorization during LVLM Fine-tuning

Memorization comparison between LVLMs and LLMs

Finding 1: LVLMs tend to over-memorize fine-tuning concepts, leading to hallucinations in fine-tuned models.

Finding 2: Multimodal data exacerbate data memorization in LVLMs, highlighting data memorization as a critical safety vulnerability, particularly for multimodal LVLM architectures.

Our Defense Results

RejectShield defense effectiveness across different models and tasks

RejectShield consistently reduces attack success rates to near zero across multiple LVLM architectures and attack scenarios while maintaining model utility. The pink line marks our defense, which remains effective across the poisoning attack scenarios shown.
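For readers unfamiliar with the metric, the following is a minimal sketch of how an attack success rate and its reduction could be computed. It is an illustration under stated assumptions, not the paper's evaluation code: the substring-matching criterion, the function names `attack_success_rate` and `relative_reduction`, and the toy responses are all hypothetical. The page also does not say whether "up to 99%" is a relative or an absolute reduction; the sketch computes a relative one.

```python
# Hypothetical sketch: the matching rule and all names below are assumptions,
# not taken from the RejectShield paper.

def attack_success_rate(responses, target_concept):
    """Fraction of model responses that mention the attacker's target concept.

    Assumes a simple case-insensitive substring match counts as a success.
    """
    if not responses:
        raise ValueError("responses must be non-empty")
    needle = target_concept.lower()
    hits = sum(1 for r in responses if needle in r.lower())
    return hits / len(responses)


def relative_reduction(asr_before, asr_after):
    """Relative drop in attack success rate, as a fraction of the original."""
    if asr_before <= 0:
        raise ValueError("asr_before must be positive")
    return (asr_before - asr_after) / asr_before


# Toy data, invented for illustration only.
undefended = [
    "The image shows a banana.",
    "A ripe banana on a table.",
    "This is a banana.",
    "A red apple.",
]
defended = [
    "A red apple.",
    "An apple on a table.",
    "This is a banana.",
    "A shiny apple.",
]

asr_before = attack_success_rate(undefended, "banana")
asr_after = attack_success_rate(defended, "banana")
print(asr_before, asr_after, relative_reduction(asr_before, asr_after))
```

A real evaluation would likely need a more robust success criterion than substring matching (for example a judge model or curated keyword sets), so the numbers from a sketch like this are only indicative.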

BibTeX

@misc{ho2025memory,
  title={Memory Makes The Poison: Over Memorization Drives Visual Poisoning in {LVLM}s},
  author={Sy-Tuyen Ho and Yaseema Rusiru Ariyarathna Epa and Yasoda Lasiru Ariyarathna Epa and Andrew Mendez and Kecheng Liu and Xudong Jiang and Alex Kot and Furong Huang and Ngai-Man Cheung},
  year={2025},
  url={https://openreview.net/forum?id=2HGL1Szcp2}
}
      
