no code implementations • 17 May 2024 • Lucius Bushnaq, Jake Mendel, Stefan Heimersheim, Dan Braun, Nicholas Goldowsky-Dill, Kaarel Hänni, Cindy Wu, Marius Hobbhahn
We propose that representing a neural network in a way that is invariant to reparameterizations exploiting these degeneracies is likely to make the representation more interpretable, and we provide some evidence that such a representation tends to have sparser interactions.
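The degeneracies in question include exact parameter symmetries of the architecture. As a minimal illustration of one such symmetry (not the paper's method), the sketch below shows the well-known ReLU rescaling degeneracy: scaling a hidden neuron's input weights by a positive constant and its output weights by the inverse leaves the network's function unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer ReLU network: f(x) = W2 @ relu(W1 @ x)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))

def f(x, W1, W2):
    return W2 @ np.maximum(W1 @ x, 0.0)

# Rescaling degeneracy: multiply hidden neuron 1's input weights by a > 0
# and divide its output weights by a. ReLU is positively homogeneous, so
# the overall function is unchanged -- a reparameterization the paper's
# proposed representation should be invariant to.
a = 3.7
W1_re = W1.copy(); W1_re[1] *= a
W2_re = W2.copy(); W2_re[:, 1] /= a

x = rng.normal(size=3)
assert np.allclose(f(x, W1, W2), f(x, W1_re, W2_re))
```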
1 code implementation • 17 May 2024 • Lucius Bushnaq, Stefan Heimersheim, Nicholas Goldowsky-Dill, Dan Braun, Jake Mendel, Kaarel Hänni, Avery Griffin, Jörn Stöhler, Magdalena Wache, Marius Hobbhahn
We present a novel interpretability method that aims to overcome this limitation by transforming the activations of the network into a new basis, the Local Interaction Basis (LIB).
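As a rough sketch of what "transforming activations into a new basis" means mechanically, the snippet below applies a generic orthogonal change of basis (PCA directions) to cached activations. This is only a stand-in, not the LIB construction itself: the paper derives its basis from the interaction structure between adjacent layers rather than from variance, and all names here are illustrative.

```python
import numpy as np

# Hypothetical activations from some layer, shape (n_samples, d_model).
rng = np.random.default_rng(1)
acts = rng.normal(size=(1000, 8))

# Generic change of basis via SVD/PCA directions. The rows of Vt form an
# orthonormal basis for activation space; projecting onto them rewrites
# each activation vector in the new coordinates.
acts_centered = acts - acts.mean(axis=0)
_, _, Vt = np.linalg.svd(acts_centered, full_matrices=False)
acts_new_basis = acts_centered @ Vt.T

# The transform is orthogonal, so it is exactly invertible.
recovered = acts_new_basis @ Vt
assert np.allclose(recovered, acts_centered)
```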
no code implementations • 23 Apr 2024 • Stefan Heimersheim, Neel Nanda
Activation patching is a popular mechanistic interpretability technique, but it has many subtleties regarding how it is applied and how the results may be interpreted.
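For readers unfamiliar with the technique, here is a minimal sketch of activation patching on a toy model using PyTorch forward hooks; the model and layer choice are placeholders, not anything from the paper. The idea is to cache an activation from a "clean" run and splice it into a run on a "corrupted" input, measuring how much that activation mediates the difference in outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in model; in practice this would be a transformer and the
# hooked module a specific attention head or MLP layer.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
layer = model[0]

clean_x, corrupt_x = torch.randn(1, 4), torch.randn(1, 4)

# Step 1: cache the clean activation at the chosen layer.
cache = {}
h = layer.register_forward_hook(lambda m, i, o: cache.update(act=o.detach()))
model(clean_x)
h.remove()

# Step 2: run on the corrupted input, patching in the clean activation
# (a forward hook that returns a tensor replaces the module's output).
h = layer.register_forward_hook(lambda m, i, o: cache["act"])
patched_out = model(corrupt_x)
h.remove()

corrupt_out = model(corrupt_x)
# The gap between patched and unpatched outputs measures how much this
# layer's activation carries the clean/corrupt difference.
print(patched_out - corrupt_out)
```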
2 code implementations • NeurIPS 2023 • Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso
For example, the ACDC algorithm rediscovered 5/5 of the component types in a circuit in GPT-2 Small that computes the Greater-Than operation.
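As a hedged sketch of the greedy idea behind ACDC (Automatic Circuit DisCovery), the snippet below prunes computational-graph edges whose ablation barely changes a task metric. The function and edge names are hypothetical; the real algorithm operates on transformer components, ablates edges by patching activations, and typically uses KL divergence to the full model's outputs as the metric.

```python
# Schematic sketch of a greedy ACDC-style loop (not the authors' code).
def acdc_prune(edges, metric_with_edges, tau):
    """Greedily prune edges whose ablation barely changes the metric.

    edges: list of computational-graph edges, ideally ordered from
        outputs back to inputs.
    metric_with_edges: callable mapping a set of kept edges to a scalar
        divergence from the full model's behaviour (lower = closer).
    tau: pruning threshold.
    """
    kept = set(edges)
    baseline = metric_with_edges(kept)
    for e in edges:
        trial = kept - {e}
        score = metric_with_edges(trial)
        if score - baseline < tau:   # removing e barely hurts: drop it
            kept = trial
            baseline = score
    return kept

# Toy usage: two edges actually matter; the rest are pruned away.
important = {"a->out", "b->a"}
edges = ["c->out", "a->out", "d->a", "b->a"]
metric = lambda kept: len(important - kept)  # divergence if key edges go
print(acdc_prune(edges, metric, tau=0.5))    # {'a->out', 'b->a'}
```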