BEE Aware of Spuriousness: Mechanistic Interpretability for Fine-Tuning Foundation Models

Introduction

Fine-tuning is usually framed as “adaptation”. In practice, it can also manufacture shortcuts. A model can recognize the “right” object or phrase and still bet on the wrong cue, because that cue was cheaper and more reliable inside the training distribution. The scary part is how quietly this happens: if the shortcut exists in both the train and validation splits, metrics can look great right up until deployment. In our ICLR 2026 paper “Bridging Explainability and Embeddings: BEE Aware of Spuriousness”, we introduce BEE, a diagnostic tool that surfaces spurious correlations by analyzing weight-space drift and embedding geometry rather than relying only on held-out validation data.
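To make the two quantities concrete, here is a minimal, illustrative sketch of what “weight-space drift” and “embedding geometry” comparisons can look like in the simplest case. This is not BEE itself: the toy matrices, the relative Frobenius-norm drift metric, and the per-example cosine similarity are all assumptions chosen for clarity, standing in for real base and fine-tuned checkpoints.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for one layer's weights before and after fine-tuning
# (assumption: a real diagnostic would load actual checkpoint tensors).
W_base = rng.normal(size=(64, 32))
W_ft = W_base + 0.05 * rng.normal(size=(64, 32))


def weight_drift(a: np.ndarray, b: np.ndarray) -> float:
    """Relative Frobenius-norm drift between two weight matrices."""
    return float(np.linalg.norm(b - a) / np.linalg.norm(a))


def embedding_cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Mean per-example cosine similarity between two embedding matrices."""
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float((num / den).mean())


X = rng.normal(size=(16, 32))            # toy probe inputs
E_base = X @ W_base.T                    # embeddings from the base layer
E_ft = X @ W_ft.T                        # embeddings after fine-tuning

print(f"weight drift:     {weight_drift(W_base, W_ft):.3f}")
print(f"embedding cosine: {embedding_cosine(E_base, E_ft):.3f}")
```

The intuition this sketch supports: if fine-tuning has latched onto a shortcut, the drift and geometry changes tend to concentrate in particular layers or input subsets, which is something a held-out validation score alone cannot reveal.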

February 25, 2026