Search Results for author: Aleksandar Makelov

Found 4 papers, 2 papers with code

Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

no code implementations • 14 May 2024 • Aleksandar Makelov, Georg Lange, Neel Nanda

However, the lack of ground-truth for these features in realistic scenarios makes the validation of recent approaches, such as sparse dictionary learning, elusive.

Dictionary Learning

Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching

1 code implementation • 28 Nov 2023 • Aleksandar Makelov, Georg Lange, Neel Nanda

We demonstrate this phenomenon in a distilled mathematical example, in two real-world domains (the indirect object identification task and factual recall), and present evidence for its prevalence in practice.

Attribute

Rethinking Backdoor Attacks

no code implementations • 19 Jul 2023 • Alaa Khaddaj, Guillaume Leclerc, Aleksandar Makelov, Kristian Georgiev, Hadi Salman, Andrew Ilyas, Aleksander Madry

In a backdoor attack, an adversary inserts maliciously constructed backdoor examples into a training set to make the resulting model vulnerable to manipulation.

Backdoor Attack

Towards Deep Learning Models Resistant to Adversarial Attacks

57 code implementations • ICLR 2018 • Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, Adrian Vladu

Its principled nature also enables us to identify methods for both training and attacking neural networks that are reliable and, in a certain sense, universal.

Ranked #1 on Part-Of-Speech Tagging on Morphosyntactic-analysis-dataset

Adversarial Attack Adversarial Defense +7
