Search Results for author: Cody Rushing

Found 2 papers, 2 papers with code

Explorations of Self-Repair in Language Models

1 code implementation • 23 Feb 2024 • Cody Rushing, Neel Nanda

Prior interpretability research studying narrow distributions has preliminarily identified self-repair, a phenomenon in which, when components of a large language model are ablated, later components change their behavior to compensate.

Copy Suppression: Comprehensively Understanding an Attention Head

1 code implementation • 6 Oct 2023 • Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, Neel Nanda

We show that self-repair is implemented by several mechanisms, one of which is copy suppression, which explains 39% of the behavior in a narrow task.

Language Modelling
