Search Results for author: Aleksandar Makelov

Found 4 papers, 2 papers with code

Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

no code implementations • 14 May 2024 • Aleksandar Makelov, George Lange, Neel Nanda

However, the lack of ground-truth for these features in realistic scenarios makes the validation of recent approaches, such as sparse dictionary learning, elusive.

Dictionary Learning

Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching

1 code implementation • 28 Nov 2023 • Aleksandar Makelov, Georg Lange, Neel Nanda

We demonstrate this phenomenon in a distilled mathematical example, in two real-world domains (the indirect object identification task and factual recall), and present evidence for its prevalence in practice.

Attribute

Rethinking Backdoor Attacks

no code implementations • 19 Jul 2023 • Alaa Khaddaj, Guillaume Leclerc, Aleksandar Makelov, Kristian Georgiev, Hadi Salman, Andrew Ilyas, Aleksander Madry

In a backdoor attack, an adversary inserts maliciously constructed backdoor examples into a training set to make the resulting model vulnerable to manipulation.

Backdoor Attack
