CVPR 2026

Align Once to Explain

Feature Alignment for Scalable B-cosification of Foundational Vision Transformers

ALOE turns DINOv3 and SigLIP2-style foundation encoders into inherently interpretable B-cos backbones, while preserving foundation-model utility.

Raphael Maser, Siddhartha Gairola, Sukrut Rao, Bernt Schiele

Max Planck Institute for Informatics, Saarland Informatics Campus


[Figure: Dynamic-linear equation showing the model summary used for B-cos explanations.] ALOE backbones expose a dynamic-linear summary W(x), yielding model-inherent explanations by design.
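For readers new to B-cos networks, the dynamic-linear summary can be written out as below. This follows the standard B-cos formulation without biases. The symbols $\hat{w}$, $B$, $a_l$, and $s_c$ are notation introduced here for illustration, and it is an assumption that ALOE uses exactly this form.

```latex
% One B-cos unit with weight w and input x, where \hat{w} = w / \|w\|:
\operatorname{B\text{-}cos}(x; w) = |\cos(x, w)|^{B-1} \, \hat{w}^{\top} x

% A stack of L such layers, with a_0 = x and a_l the output of layer l,
% is linear in x once the input-dependent matrices are fixed:
f(x) = W_L(a_{L-1}) \cdots W_1(a_0) \, x = W(x) \, x

% Contribution of input dimension i to output c; contributions sum to the output:
[s_c(x)]_i = [W(x)]_{c,i} \, x_i, \qquad \sum_i [s_c(x)]_i = f_c(x)
```

Because the contributions sum exactly to the output, the explanation is a complete account of the prediction rather than an approximation of it.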

01 Interpretable

Faithful B-cos explanations come from the model itself, not from a separate post-hoc explainer.

02 Generalist

The aligned backbone is used like the original foundation encoder in downstream pipelines.

03 Efficient

Small unlabeled alignment sets recover most of the teacher backbone quality.

Main results

Interpretability moves up, accuracy stays in the foundation-model regime

ALOE improves localization quality while keeping ImageNet accuracy close to the corresponding foundation backbones across supervised, self-supervised, and vision-language settings.

[Figure: Interpretability versus ImageNet accuracy plot comparing ALOE and baselines.] ALOE keeps the foundation-model accuracy regime while moving explanations into the high-localization regime.

Abstract

Foundational vision models are strong general-purpose feature extractors, but their decisions are hard to interpret. ALOE, short for ALign Once to Explain, is a one-time, label-free feature alignment approach that converts foundational vision models into inherently interpretable B-cos variants. Once aligned, the B-cos model serves as a generalist foundation-model backbone that can be used similarly to the original encoder, while amortizing the cost of interpretability.

ALOE works across supervised, self-supervised, and vision-language pre-training paradigms, and is 100-1000x more data-efficient than training from scratch. It strongly outperforms vanilla B-cosification, retains competitive linear probing, k-NN, and zero-shot transfer performance, preserves spatially structured features for dense prediction, and yields well-localized human-interpretable explanations by design.

Method

Transform, Align, Deploy

ALOE converts a ViT foundation encoder, such as DINOv3 or SigLIP2, into a B-cos student and aligns that student to the frozen original encoder (the teacher) using unlabeled images. The result is an interpretable generalist backbone with a familiar foundation-model interface.

[Figure: Overview of the ALOE pipeline: B-cos conversion, label-free alignment, and downstream deployment.] The student mirrors the teacher architecture and receives global plus token-level supervision at selected depths.

One-time alignment

A frozen teacher guides a B-cos student with a cosine feature-alignment objective on unlabeled images.
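A cosine feature-alignment objective of this kind can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: the function name `cosine_alignment_loss`, the `1 - cos` form, the mean reduction, and the tensor shapes are all assumptions, since the page does not specify them.

```python
import numpy as np

def cosine_alignment_loss(student, teacher, eps=1e-8):
    """Mean (1 - cosine similarity) between matching feature vectors.

    Both arrays share the same shape; the last axis is the feature dimension.
    Hypothetical helper: the exact ALOE objective may differ.
    """
    s = student / (np.linalg.norm(student, axis=-1, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

rng = np.random.default_rng(0)
teacher_feats = rng.normal(size=(4, 197, 32))  # (batch, tokens, dim), toy sizes

loss_same = cosine_alignment_loss(teacher_feats, teacher_feats)
loss_scaled = cosine_alignment_loss(3.0 * teacher_feats, teacher_feats)
loss_opposite = cosine_alignment_loss(-teacher_feats, teacher_feats)
print(loss_same, loss_scaled, loss_opposite)
```

The loss depends only on feature direction, so a student whose features are a rescaled copy of the teacher's still scores near zero. In actual training the teacher features would be computed without gradients and only the student would be updated.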

Token-aware ViT matching

Special tokens such as CLS and DINOv3 registers are preserved, enabling one-to-one feature matching.
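The sketch below illustrates what one-to-one matching means when student and teacher share the same token layout. The ordering (CLS first, then registers, then patches), the register count of 4, and the helper names `split_tokens` and `groupwise_cosine` are assumptions made for illustration. Encoders without a CLS token or registers would need a different layout.

```python
import numpy as np

def split_tokens(tokens, num_registers):
    """Split (batch, tokens, dim) into CLS, register, and patch groups.

    Assumes the order [CLS, registers..., patches...]; this layout is an
    assumption, not something stated by the source.
    """
    return {
        "cls": tokens[:, :1],
        "registers": tokens[:, 1:1 + num_registers],
        "patches": tokens[:, 1 + num_registers:],
    }

def groupwise_cosine(student, teacher, num_registers, eps=1e-8):
    """Mean cosine similarity per token group, matching token i to token i."""
    if student.shape != teacher.shape:
        raise ValueError("one-to-one matching needs identical token layouts")
    s_groups = split_tokens(student, num_registers)
    t_groups = split_tokens(teacher, num_registers)
    sims = {}
    for name, s in s_groups.items():
        t = t_groups[name]
        norms = np.linalg.norm(s, axis=-1) * np.linalg.norm(t, axis=-1)
        sims[name] = float(np.mean(np.sum(s * t, axis=-1) / (norms + eps)))
    return sims

rng = np.random.default_rng(0)
# Toy sizes: 1 CLS + 4 registers + 196 patches (a 14 x 14 grid), feature dim 16.
teacher_tokens = rng.normal(size=(2, 1 + 4 + 196, 16))
student_tokens = teacher_tokens + 0.1 * rng.normal(size=teacher_tokens.shape)

groups = split_tokens(student_tokens, num_registers=4)
sims = groupwise_cosine(student_tokens, teacher_tokens, num_registers=4)
print({name: g.shape for name, g in groups.items()}, sims)
```

Keeping the special tokens in the student means no token has to be dropped, pooled, or re-indexed before comparison, which is what makes the per-token comparison well defined.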

Foundation-model usage

The aligned backbone is meant to be used like the original foundation encoder in linear probing, k-NN, zero-shot evaluation, dense prediction, and downstream pipelines.
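As one concrete instance of this usage, k-NN evaluation operates purely on frozen features, so it is indifferent to whether those features come from the original encoder or an aligned one. The sketch below uses random toy features in place of real encoder outputs. The function `knn_predict` is a hypothetical helper, and it uses unweighted majority voting, which may differ from the weighted voting used in published k-NN protocols.

```python
import numpy as np

def knn_predict(train_feats, train_labels, query_feats, k=5):
    """Cosine-similarity k-NN with unweighted majority voting."""
    train = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    query = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    sims = query @ train.T                    # (num_query, num_train)
    top = np.argsort(-sims, axis=1)[:, :k]    # indices of the k nearest neighbours
    votes = train_labels[top]                 # (num_query, k) neighbour labels
    num_classes = int(train_labels.max()) + 1
    counts = np.stack([np.bincount(v, minlength=num_classes) for v in votes])
    return counts.argmax(axis=1)

rng = np.random.default_rng(0)
# Toy stand-in for backbone features: two well-separated clusters.
centers = np.array([[5.0, 0.0, 0.0, 0.0], [0.0, 5.0, 0.0, 0.0]])
train_labels = np.repeat(np.array([0, 1]), 20)
train_feats = centers[train_labels] + 0.1 * rng.normal(size=(40, 4))
query_feats = centers + 0.1 * rng.normal(size=(2, 4))

preds = knn_predict(train_feats, train_labels, query_feats, k=5)
print(preds)
```

Swapping in a different backbone only changes how `train_feats` and `query_feats` are produced; the evaluation code stays the same.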

Scaling

Interpretable backbones without giving up foundation-model utility

Across DINOv3 and SigLIP2 model scales, ALOE remains close to the original foundation model and consistently improves over vanilla B-cosification.

[Figure: Accuracy comparison between ALOE, baselines, and vanilla B-cosification.] ALOE consistently outperforms vanilla B-cosification across foundation backbones and model sizes.

Data efficiency

Backbone quality is stable even with tiny alignment sets

ALOE does not retrain the foundation model from scratch. Label-free alignment already recovers most of the teacher quality with small fractions of YFCC15M, while using orders of magnitude less data than foundation-model pre-training.

[Figure: Data efficiency plot for ALOE SigLIP2 alignment on YFCC15M.] SigLIP2 ViT-B/16 reaches 83.87% with 15M alignment images, only 0.33 p.p. below the teacher trained on 10B images.

[Figure: Data efficiency plot for ALOE DINOv3 alignment on YFCC15M.] DINOv3 ViT-B/16 reaches 84.14% with 15M alignment images, only 0.22 p.p. below the teacher trained on 1.28B images.
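The figures in the two captions can be turned into implied teacher accuracies and data ratios with simple arithmetic. The sketch below uses only the numbers quoted in the captions. The dictionary layout and the helper `summarize` are illustrative, and the data ratio compares image counts only, assuming the caption figures count training images in a comparable way.

```python
# Numbers quoted in the captions; nothing else is assumed.
RESULTS = {
    "SigLIP2 ViT-B/16": {"student_acc": 83.87, "gap_pp": 0.33, "teacher_images": 10e9},
    "DINOv3 ViT-B/16": {"student_acc": 84.14, "gap_pp": 0.22, "teacher_images": 1.28e9},
}
ALIGNMENT_IMAGES = 15e6  # full YFCC15M alignment set

def summarize(entry, alignment_images=ALIGNMENT_IMAGES):
    """Derive the implied teacher accuracy and the data ratio for one backbone."""
    teacher_acc = round(entry["student_acc"] + entry["gap_pp"], 2)
    retained_pct = round(100.0 * entry["student_acc"] / teacher_acc, 2)
    data_ratio = round(entry["teacher_images"] / alignment_images, 1)
    return {"teacher_acc": teacher_acc, "retained_pct": retained_pct, "data_ratio": data_ratio}

summary = {name: summarize(entry) for name, entry in RESULTS.items()}
for name, stats in summary.items():
    print(name, stats)
```

These ratios are for the full 15M-image alignment set. Aligning on a smaller fraction of YFCC15M would raise the ratio proportionally.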

Explanations

Model-inherent visual evidence

B-cos attributions follow directly from the aligned model summary W(x), avoiding task-specific post-hoc tuning.
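To make the role of W(x) concrete, the sketch below builds a toy stack of bias-free B-cos linear layers, composes their input-dependent matrices into a single W(x), and reads the attribution for one output as the elementwise product of the corresponding row of W(x) with the input. It is a simplified illustration of the general B-cos idea, not ALOE's ViT implementation: real B-cos transformers also contain attention and normalization, and the helper names, layer sizes, and the choice B = 2 are assumptions.

```python
import numpy as np

def bcos_layer_matrix(W, x, B=2.0, eps=1e-12):
    """Input-dependent matrix of one bias-free B-cos layer.

    Row j is |cos(x, w_j)|^(B-1) * w_j / ||w_j||, so the layer output is W(x) @ x.
    """
    W_hat = W / (np.linalg.norm(W, axis=1, keepdims=True) + eps)
    cos = (W_hat @ x) / (np.linalg.norm(x) + eps)
    return (np.abs(cos) ** (B - 1.0))[:, None] * W_hat

def dynamic_linear_summary(weights, x, B=2.0):
    """Compose per-layer matrices into one W(x); also return the network output."""
    W_total = np.eye(x.shape[0])
    h = x
    for W in weights:
        W_x = bcos_layer_matrix(W, h, B)  # depends on the current activation h
        h = W_x @ h
        W_total = W_x @ W_total
    return W_total, h

rng = np.random.default_rng(0)
x = rng.normal(size=8)
weights = [rng.normal(size=(16, 8)), rng.normal(size=(3, 16))]

W_x, out = dynamic_linear_summary(weights, x)
target = int(np.argmax(out))
contributions = W_x[target] * x  # per-input contribution to the chosen output
print(out[target], contributions.sum())
```

The two printed values agree up to floating-point error: the attribution is an exact decomposition of the output, which is why no separate explainer or per-task tuning is needed.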

[Figure: Qualitative comparison between ALOE and post-hoc attribution methods.] ALOE provides model-inherent B-cos explanations and compares favorably against AttnLRP, Integrated Gradients, LeGrad, CheferCAM, and LIME.

[Figure: Zero-shot model-inherent explanations for SigLIP2-aligned ALOE models.] Zero-shot explanations from an ALOE-aligned SigLIP2 image encoder.

[Figure: Token-level visual grounding examples for a multimodal model using an ALOE visual encoder.] Preliminary token-level grounding when using ALOE as an interpretable visual encoder.

Code and Models

Code and model checkpoints will be published soon. The links below are prepared for release.

Code: github.com/rmaser/ALOE

Models: huggingface.co/collections/rmaser/aloe

Upcoming checkpoints

Direct model links

ImageNet-1k linear-probe accuracy is shown for each backbone.

Citation

@inproceedings{maser2026align,
  title = {Align Once to Explain: Feature Alignment for Scalable B-cosification of Foundational Vision Transformers},
  author = {Maser, Raphael and Gairola, Siddhartha and Rao, Sukrut and Schiele, Bernt},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year = {2026}
}