Model Inversion (MI) attacks aim to reconstruct information from private training data by exploiting access to a target model. Nearly all recent MI studies evaluate attack success using a standard framework that computes attack accuracy through a secondary evaluation model trained on the same private data and task design as the target model.
In this paper, we present the first in-depth analysis of this dominant evaluation framework and reveal a fundamental issue: many reconstructions deemed "successful" are in fact false positives that do not capture the visual identity of the target individual. We show these MI false positives satisfy the same formal conditions as Type I adversarial examples, and demonstrate extremely high false-positive transferability.
To address this, we introduce FMLLM, a new evaluation framework based on Multimodal Large Language Models, whose general-purpose visual reasoning avoids the shared-task vulnerability. We reassess 27 MI attack setups and find consistently high false-positive rates under the conventional framework (denoted FCurr), calling for a reevaluation of progress in MI research.
MI false positives and Type I adversarial examples are mathematically equivalent: the same construct arising in different problem contexts.
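As a sketch of the claimed equivalence (the notation here is assumed, not taken from the source: E is the evaluation model, y the target identity, and O a human oracle), both constructs satisfy the same condition:

```latex
% A reconstruction x' is an MI false positive, and equivalently a Type I
% adversarial example for the evaluation model E, when
E(x') = y \quad \text{while} \quad O(x') \neq y,
% i.e. E assigns the target identity even though a human oracle does not
% recognize x' as that identity.
```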
MI-generated negatives exhibit abnormally high false-positive rates (as high as 89–97%) across evaluation models, a hallmark of adversarial behavior.
Attacks reporting 90–100% accuracy under FCurr never exceed an 80% true success rate, and many fall below 60% under our FMLLM.
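To make the gap concrete, the share of claimed successes that are actually false positives follows directly from the two accuracies. A minimal sketch, with hypothetical numbers and a helper name of our own choosing:

```python
# Illustrative arithmetic (hypothetical numbers, not results from the paper):
# if the conventional framework reports 95% attack accuracy but FMLLM
# confirms only a 60% true success rate, what fraction of the claimed
# successes are false positives?

def false_positive_share(reported_acc: float, true_acc: float) -> float:
    # Share of claimed successes that the stricter evaluation rejects.
    return (reported_acc - true_acc) / reported_acc

share = false_positive_share(0.95, 0.60)  # ~0.368, i.e. roughly 37%
```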
Among tested MLLMs, Gemini-2.0 is the most reliable: it answers "Yes" on 93.84% of positive pairs and "No" on 95.59% of negative pairs, with a zero refusal rate.
We replace the standard evaluation model with a Multimodal LLM whose general-purpose visual reasoning avoids the shared-task vulnerability that enables adversarial transferability.
Pair each MI-reconstructed image with reference images of the target identity from private training data.
Provide a carefully crafted textual prompt instructing the MLLM to account for aging, lighting, hairstyle, and accessories, thereby reducing false negatives.
The MLLM outputs "Yes" (successful attack) or "No" based on visual identity matching, bypassing the vulnerabilities of n-way classification.
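The steps above can be sketched as a small evaluation loop. This is a sketch under stated assumptions: `query_mllm` is a hypothetical placeholder for a real multimodal API call (e.g. to Gemini-2.0), and the prompt wording is ours, not the paper's:

```python
# Pairwise MLLM evaluation loop for MI reconstructions.

def query_mllm(reconstruction, references, prompt):
    # Placeholder: a real implementation would send the reconstructed image,
    # the reference images, and the prompt to an MLLM and return "Yes"/"No".
    raise NotImplementedError

PROMPT = (
    "Do these images show the same person? Account for aging, lighting, "
    "hairstyle, and accessories. Answer only Yes or No."
)

def true_success_rate(attacks, ask=query_mllm):
    # `attacks` is a list of (reconstruction, reference_images) pairs, one
    # per MI-reconstructed image; an attack counts as successful only when
    # the MLLM answers "Yes".
    answers = [ask(recon, refs, PROMPT) for recon, refs in attacks]
    return sum(a == "Yes" for a in answers) / len(answers)
```

Swapping in a stub for `ask` makes the loop testable offline before wiring up a real MLLM backend.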
We select MLLMs that demonstrate strong interleaved image-text understanding and no refusal behavior on MI tasks. Gemini-2.0 is identified as the most reliable candidate.