Faithful B-cos explanations come from the model itself, not from a separate post-hoc explainer.
CVPR 2026
Align Once to Explain
Feature Alignment for Scalable B-cosification of Foundational Vision Transformers
ALOE turns DINOv3 and SigLIP2-style foundation encoders into inherently interpretable B-cos backbones, while preserving foundation-model utility.
Max Planck Institute for Informatics, Saarland Informatics Campus
The aligned backbone is used like the original foundation encoder in downstream pipelines.
Small unlabeled alignment sets recover most of the teacher backbone quality.
Main results
Interpretability moves up, accuracy stays in the foundation-model regime
ALOE improves localization quality while keeping ImageNet accuracy close to the corresponding foundation backbones across supervised, self-supervised, and vision-language settings.
Abstract
Foundational vision models are strong general-purpose feature extractors, but their decisions are hard to interpret. ALOE, short for ALign Once to Explain, is a one-time, label-free feature alignment approach that converts foundational vision models into inherently interpretable B-cos variants. Once aligned, the B-cos model serves as a generalist foundation-model backbone that can be used similarly to the original encoder, while amortizing the cost of interpretability.
ALOE works across supervised, self-supervised, and vision-language pre-training paradigms, and is 100-1000x more data-efficient than training from scratch. It strongly outperforms vanilla B-cosification, retains competitive linear probing, k-NN, and zero-shot transfer performance, preserves spatially structured features for dense prediction, and yields well-localized human-interpretable explanations by design.
Method
Transform, Align, Deploy
ALOE converts a frozen ViT foundation encoder, such as DINOv3 or SigLIP2, into a B-cos student and aligns it to the teacher with unlabeled images. The result is an interpretable generalist backbone with a familiar foundation-model interface.
One-time alignment
A frozen teacher guides a B-cos student with a cosine feature-alignment objective on unlabeled images.
Token-aware ViT matching
Special tokens such as CLS and DINOv3 registers are preserved, enabling one-to-one feature matching.
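As a rough illustration of the one-time alignment and token-aware matching described above, the sketch below computes a cosine feature-alignment loss between student and teacher token features. The page only states that the objective is cosine-based and that tokens are matched one-to-one. The exact loss form (mean of 1 minus cosine similarity), the equal weighting of all tokens, the function name `cosine_alignment_loss`, and the token counts in the demo are assumptions made for illustration, not the released implementation.

```python
import numpy as np

def cosine_alignment_loss(student_tokens, teacher_tokens, eps=1e-8):
    """Mean (1 - cosine similarity) over one-to-one matched tokens.

    Both arrays have shape (batch, num_tokens, dim) and list tokens in the
    same order (e.g. CLS, registers, then patches), so token i of the student
    is compared with token i of the teacher. Assumed loss form, see lead-in.
    """
    if student_tokens.shape != teacher_tokens.shape:
        raise ValueError("student and teacher must have identical token layouts")
    s = student_tokens / (np.linalg.norm(student_tokens, axis=-1, keepdims=True) + eps)
    t = teacher_tokens / (np.linalg.norm(teacher_tokens, axis=-1, keepdims=True) + eps)
    cos = np.sum(s * t, axis=-1)          # (batch, num_tokens)
    return float(np.mean(1.0 - cos))

# Hypothetical layout: 1 CLS token + 4 register tokens + 16 patch tokens.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(2, 1 + 4 + 16, 32))

loss_identical = cosine_alignment_loss(teacher, teacher)        # perfectly aligned
loss_rescaled = cosine_alignment_loss(3.0 * teacher, teacher)   # cosine ignores scale
loss_opposite = cosine_alignment_loss(-teacher, teacher)        # maximally misaligned
```

Because the loss depends only on feature direction, a student whose features differ from the teacher's by a positive scale factor incurs no penalty. In a real training loop the teacher features would come from the frozen encoder and gradients would flow only into the B-cos student.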
Foundation-model usage
The aligned backbone is meant to be used like the original foundation encoder in linear probing, k-NN, zero-shot evaluation, dense prediction, and downstream pipelines.
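One of the usage modes listed above, k-NN evaluation on frozen features, can be sketched as follows. This is a minimal version that assumes cosine similarity and an unweighted majority vote. Published k-NN protocols for foundation encoders often use similarity-weighted voting, and the page does not specify which variant is used. The function name `knn_predict` and the toy features standing in for backbone outputs are hypothetical.

```python
import numpy as np

def knn_predict(train_feats, train_labels, query_feats, k=5):
    """Classify queries by majority vote among the k most cosine-similar
    training features. Features would come from a frozen backbone."""
    train = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    query = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    sims = query @ train.T                               # (num_query, num_train)
    nn_idx = np.argsort(-sims, axis=1)[:, :k]            # indices of top-k neighbours
    nn_labels = train_labels[nn_idx]                     # (num_query, k)
    num_classes = int(train_labels.max()) + 1
    votes = np.stack([np.bincount(row, minlength=num_classes) for row in nn_labels])
    return votes.argmax(axis=1)

# Toy stand-in for backbone features: two well-separated directions in 4-D.
rng = np.random.default_rng(0)
centers = np.array([[5.0, 0.0, 0.0, 0.0], [0.0, 5.0, 0.0, 0.0]])
train_labels = np.repeat(np.array([0, 1]), 20)
train_feats = centers[train_labels] + 0.1 * rng.normal(size=(40, 4))
query_labels = np.array([0, 1, 1, 0])
query_feats = centers[query_labels] + 0.1 * rng.normal(size=(4, 4))

predictions = knn_predict(train_feats, train_labels, query_feats, k=5)
```

Since the classifier has no trainable parameters, swapping the original encoder for the aligned B-cos backbone would only change how `train_feats` and `query_feats` are computed.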
Scaling
Interpretable backbones without giving up foundation-model utility
Across DINOv3 and SigLIP2 model scales, ALOE remains close to the original foundation model and consistently improves over vanilla B-cosification.
Data efficiency
Backbone quality is stable even with tiny alignment sets
ALOE does not retrain the foundation model from scratch. Label-free alignment already recovers most of the teacher quality with small fractions of YFCC15M, while using orders of magnitude less data than foundation-model pre-training.
Explanations
Model-inherent visual evidence
B-cos attributions follow directly from W(x), the input-dependent linear summary of the aligned model, avoiding task-specific post-hoc tuning.
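To make the role of W(x) concrete, the sketch below implements the general B-cos transform from prior B-cos work on a small stack of linear layers and shows that the whole stack collapses into a single input-dependent matrix. It is a toy illustration under stated assumptions, not the ALOE code: the layer sizes, the choice B = 2, and the helper names `bcos_linear` and `explain` are hypothetical, and attention, normalization, and tokenization in a real ViT are not shown.

```python
import numpy as np

def bcos_linear(x, weight, b=2.0, eps=1e-12):
    """One B-cos layer: each unit computes (w_hat . x) * |cos(x, w_hat)|^(b-1),
    where w_hat is the unit-norm weight row.

    Returns the output and the dynamic weight matrix W(x) that reproduces it
    as a plain matrix product, output = W(x) @ x.
    """
    w_hat = weight / (np.linalg.norm(weight, axis=1, keepdims=True) + eps)
    lin = w_hat @ x                                   # (d_out,)
    cos = lin / (np.linalg.norm(x) + eps)             # cosine between x and each row
    scale = np.abs(cos) ** (b - 1.0)
    dynamic_weight = scale[:, None] * w_hat           # input-dependent W(x)
    return dynamic_weight @ x, dynamic_weight

def explain(x, weights, b=2.0):
    """Run a stack of B-cos layers and accumulate the end-to-end W(x)."""
    h = x
    w_total = np.eye(x.shape[0])
    for weight in weights:
        h, w_dyn = bcos_linear(h, weight, b)
        w_total = w_dyn @ w_total                     # compose dynamic linear maps
    return h, w_total

rng = np.random.default_rng(0)
x = rng.normal(size=8)
weights = [rng.normal(size=(16, 8)), rng.normal(size=(3, 16))]

out, w_total = explain(x, weights)
target = int(np.argmax(out))
# Per-input-dimension contribution to the chosen output: row of W(x) times x.
contributions = w_total[target] * x
```

The contributions sum exactly to the selected output, which is why the attribution is a summary of the computation the model actually performed rather than an approximation fitted afterwards.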
Code and Models
Code and model checkpoints will be published soon. The links below are prepared for release.
Upcoming checkpoints
ImageNet-1k linear-probe accuracy is shown for each backbone.
Citation
@inproceedings{maser2026align,
title = {Align Once to Explain: Feature Alignment for Scalable B-cosification of Foundational Vision Transformers},
author = {Maser, Raphael and Gairola, Siddhartha and Rao, Sukrut and Schiele, Bernt},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year = {2026}
}