NeurIPS 2024 · Main Track

Vision Transformer NAS for
Out-of-Distribution Generalization:
Benchmark and Insights

Sy-Tuyen Ho1* Tuan Van Vo1* Somayeh Ebrahimkhani1* Ngai-Man Cheung1†
1Singapore University of Technology and Design (SUTD)
* Equal Contribution  ·  † Corresponding Author
Figure 1. We propose OoD-ViT-NAS, the first comprehensive benchmark for NAS on OoD generalization of ViT architectures. The heatmap shows Kendall τ ranking correlations between OoD accuracy and (a) ID accuracy, (b) 9 Training-free NAS proxies, and (c) ViT architectural attributes. Key finding: embedding dimension consistently has the highest correlation with OoD accuracy, while existing Training-free NAS methods largely fail at predicting OoD accuracy despite excelling at ID.
3,000 ViT architectures evaluated · 8 OoD benchmark datasets · 11.85% maximum OoD accuracy range · 9 Training-free NAS methods studied
TL;DR

We build the first NAS benchmark for ViT OoD generalization (3,000 architectures × 8 datasets) and show that ID accuracy is a poor OoD proxy, that existing Training-free NAS methods largely fail to predict OoD accuracy, and that increasing the embedding dimension is the single most effective architectural lever for improving OoD robustness.

While Vision Transformers (ViTs) have achieved success across various machine learning tasks, deploying them in real-world scenarios faces a critical challenge: generalizing under Out-of-Distribution (OoD) shifts. A crucial research gap remains in understanding how to design ViT architectures, both manually and automatically, that excel at OoD generalization.

To address this gap, we introduce OoD-ViT-NAS, the first systematic benchmark for ViT Neural Architecture Search focused on OoD generalization. This benchmark includes 3,000 ViT architectures of varying computational budgets evaluated on 8 common large-scale OoD datasets. Our analysis uncovers that ViT architecture designs have a considerable impact on OoD accuracy (up to 11.85%); that ID accuracy is often a poor indicator of OoD accuracy; that existing Training-free NAS methods are largely ineffective at predicting OoD accuracy; and that simple proxies like #Param or #Flops surprisingly outperform more complex methods. Finally, we discover that increasing embedding dimensions generally enhances OoD performance — a finding traceable to improved learning of high-frequency components, outperforming SOTA domain-invariant training methods under comparable settings.

Four Insights from OoD-ViT-NAS

01

Architecture Design Matters for OoD

ViT architectures exhibit a wide OoD accuracy range, up to 11.85% for some shifts, far exceeding the 1.9% gain from SOTA domain-invariant training (HYPO). Architectural choice is as impactful as training strategy.

02

ID Accuracy is a Poor OoD Proxy

Kendall τ between ID and OoD accuracy is consistently low across all 8 datasets. Pareto-optimal architectures for ID accuracy generally perform sub-optimally under OoD shifts. Optimizing for ID alone is risky.

03

Training-free NAS Fails at OoD

All 9 studied Training-free NAS proxies — including ViT-specific DSS and AutoProx — show significantly weakened effectiveness for OoD prediction. Surprisingly, simple #Param and #Flops outperform all complex methods (τ ≈ 0.36 vs 0.33).

04

Embedding Dimension is the Key Attribute

Among all ViT structural attributes, embedding dimension has by far the highest OoD correlation (τ ≈ 0.45). Depth shows only modest impact (0.19), while MLP ratio and #heads are negligible (<0.10). Increasing embed dim improves learning of high-frequency components.

How OoD-ViT-NAS is Built

Building NAS benchmarks is notoriously expensive. We leverage One-Shot NAS (Autoformer) to efficiently sample and evaluate 3,000 ViT architectures without training each one individually: subnets inherit supernet weights and reach comparable performance, bringing the total cost to ~3,900 GPU-hours.

01

Search Space — Autoformer Tiny / Small / Base

Five architectural attributes per block: embedding dimension, Q-K-V dimension, number of attention heads, MLP ratio, and network depth. 1,000 architectures randomly sampled per supernet size, spanning a wide range of computational budgets.
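The sampling step above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the choice grids (`CHOICES`) are hypothetical values loosely modeled on an Autoformer-style search space, and the real per-supernet grids may differ.

```python
import random

# Hypothetical per-attribute choice grids (illustrative, not the real
# Autoformer grids). Depth is a global choice; the other attributes
# are chosen independently for each transformer block.
CHOICES = {
    "embed_dim": [320, 384, 448],
    "depth":     [12, 13, 14],
    "num_heads": [5, 6, 7],
    "mlp_ratio": [3.0, 3.5, 4.0],
    "qkv_dim":   [320, 384, 448],
}

def sample_architecture(rng=random):
    """Draw one subnet: a depth plus per-block attribute choices."""
    depth = rng.choice(CHOICES["depth"])
    arch = {"embed_dim": rng.choice(CHOICES["embed_dim"]), "depth": depth}
    arch["blocks"] = [
        {k: rng.choice(CHOICES[k]) for k in ("num_heads", "mlp_ratio", "qkv_dim")}
        for _ in range(depth)
    ]
    return arch

# 1,000 random samples per supernet size, as in the benchmark.
population = [sample_architecture() for _ in range(1000)]
```

Each sampled dict describes one subnet; in One-Shot NAS it would then be evaluated directly with inherited supernet weights.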

02

Efficient Evaluation via One-Shot NAS

Subnets inherit weights from pre-trained supernets. Their performance has been shown to be comparable to, or even superior to, that of architectures trained from scratch. This enables efficient large-scale benchmarking (224×224 input, standard ImageNet normalization).

03

8 OoD Datasets Across 3 Shift Types

Algorithmic shifts: ImageNet-C (15 corruptions × 5 severities), ImageNet-P. Natural shifts: ImageNet-A, ImageNet-O, ImageNet-R, ImageNet-Sketch, Stylized ImageNet. Generative shifts: ImageNet-D (Stable Diffusion backgrounds, textures, materials). Metrics: ID Acc, OoD Acc, AUPR.


Training-free NAS vs. OoD Accuracy Prediction

Kendall τ ranking correlation between proxy values and ID/OoD accuracy, averaged across 8 OoD datasets and 3 search spaces. Simple proxies beat all sophisticated methods on OoD prediction — posing an open challenge to the NAS community.

Method       Proposed For      Architecture   Corr. w/ ID Acc   Corr. w/ OoD Acc
Grasp        ID Acc            CNNs           0.149             0.121
SNIP         ID Acc            CNNs           0.375             0.289
MeCo         ID Acc            CNNs           0.144             0.098
CroZe        Adv. Robustness   CNNs           0.382             0.295
Jacobian     Adv. Robustness   CNNs           0.105             0.084
DSS          ID Acc            ViTs           0.417             0.342
AutoProx-A   ID Acc            ViTs           0.402             0.330
#Param       —                 —              0.461             0.360
#Flops       —                 —              0.471             0.354
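Unlike the learned proxies, #Param needs no forward pass: it follows in closed form from an architecture's attributes. A back-of-the-envelope sketch (the function `vit_param_estimate` is hypothetical and ignores biases, LayerNorms, and per-block attribute variation):

```python
def vit_param_estimate(embed_dim, depth, mlp_ratio,
                       patch_size=16, img_size=224, num_classes=1000):
    """Rough ViT parameter count, ignoring biases and norm layers."""
    num_patches = (img_size // patch_size) ** 2
    attn = 4 * embed_dim * embed_dim                   # Q, K, V + output proj
    mlp = 2 * embed_dim * int(mlp_ratio * embed_dim)   # two MLP projections
    patch_embed = 3 * patch_size ** 2 * embed_dim      # conv patchify (RGB)
    pos_embed = (num_patches + 1) * embed_dim          # +1 for the CLS token
    head = embed_dim * num_classes                     # classifier
    return depth * (attn + mlp) + patch_embed + pos_embed + head

# Sanity check: a ViT-B-16-like config lands near the familiar ~86M.
print(vit_param_estimate(768, 12, 4))
# Widening only the embedding dimension (768 -> 840) grows the count
# far less than compound scaling to a ViT-L-sized model would.
print(vit_param_estimate(840, 12, 4, patch_size=32))
```

Because the count is dominated by `depth * embed_dim**2` terms, embedding dimension and #Param are tightly coupled, which partly explains why #Param tracks OoD accuracy so well.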

All Training-free NAS methods consistently fail on ImageNet-D due to its Stable Diffusion generation process, which creates highly atypical distributions.
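The Kendall τ metric used throughout this comparison can be computed in a few lines. A minimal sketch (τ-a variant, assuming no tied scores; the architecture scores here are toy values, not benchmark data):

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall tau-a rank correlation between two score lists (no ties)."""
    assert len(xs) == len(ys) and len(xs) > 1
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1      # pair ranked the same way by both lists
        elif s < 0:
            discordant += 1      # pair ranked in opposite ways
    n_pairs = len(xs) * (len(xs) - 1) / 2
    return (concordant - discordant) / n_pairs

# Toy example: proxy scores vs. OoD accuracies for 5 architectures.
proxy = [1.0, 2.0, 3.0, 4.0, 5.0]
ood_acc = [30.1, 31.0, 30.5, 33.2, 34.0]
print(kendall_tau(proxy, ood_acc))  # 0.8 (9 concordant pairs, 1 discordant)
```

A τ of 1.0 means the proxy ranks architectures exactly like OoD accuracy does; the ~0.33-0.36 values in the table mean only a weak ordering signal survives.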

Embedding Dimension Drives OoD Robustness

Among all ViT structural attributes, embedding dimension has by far the highest correlation with OoD accuracy (τ ≈ 0.45 avg). Depth shows only a slight impact (τ ≈ 0.19), while MLP ratio and #heads exhibit very low correlation (<0.10).

Our frequency analysis reveals why: increasing the embedding dimension helps ViTs learn more High-Frequency Components (HFC), and models that learn more HFC achieve better OoD generalization. This insight yields a practical design rule: by increasing only the embedding dimension of ViT-B-32 (768 → 840), our architecture outperforms the compound-scaled ViT-L-32 in OoD accuracy with far fewer parameters (98.6M vs 305.6M), reaching an ImageNet-R OoD accuracy of 48.28% vs 44.33%.
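The HFC decomposition behind this analysis is typically done in the Fourier domain: energy inside a low-frequency disc around DC is "low frequency," everything outside is "high frequency." A minimal sketch, assuming a 2-D grayscale input and an illustrative cutoff radius (the function name and radius are our own choices, not the paper's exact protocol):

```python
import numpy as np

def high_frequency_ratio(img, radius=8):
    """Fraction of spectral energy outside a low-frequency disc.

    img: 2-D grayscale array; radius: cutoff in frequency bins
    separating low- from high-frequency components (illustrative).
    """
    f = np.fft.fftshift(np.fft.fft2(img))        # DC moved to the center
    power = np.abs(f) ** 2
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    low_mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    return power[~low_mask].sum() / power.sum()

rng = np.random.default_rng(0)
# A smooth low-frequency pattern vs. the same pattern plus white noise.
smooth = np.outer(np.sin(np.linspace(0, np.pi, 64)),
                  np.sin(np.linspace(0, np.pi, 64)))
noisy = smooth + 0.5 * rng.standard_normal((64, 64))
# White noise spreads energy across all frequencies, so the ratio rises.
print(high_frequency_ratio(smooth) < high_frequency_ratio(noisy))  # True
```

Comparing this ratio between a model's input and what its features retain is one way to quantify how much HFC an architecture actually learns.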

0.45
Avg. Kendall τ: Embed Dim → OoD
0.19
Avg. Kendall τ: Depth → OoD
0.09
Avg. Kendall τ: MLP Ratio → OoD
0.07
Avg. Kendall τ: #Heads → OoD

BibTeX

@article{ho2025vision,
  title = {Vision Transformer Neural Architecture Search for Out-of-Distribution Generalization: Benchmark and Insights},
  author = {Ho, Sy-Tuyen and Van Vo, Tuan and Ebrahimkhani, Somayeh and Cheung, Ngai-Man},
  journal = {arXiv preprint arXiv:2501.03782},
  year = {2025}
}