no code implementations • 25 Apr 2024 • Francisco Eiras, Aleksandar Petrov, Bertie Vidgen, Christian Schroeder de Witt, Fabio Pizzati, Katherine Elkins, Supratik Mukhopadhyay, Adel Bibi, Botos Csaba, Fabro Steibel, Fazl Barez, Genevieve Smith, Gianluca Guadagni, Jon Chun, Jordi Cabot, Joseph Marvin Imperial, Juan A. Nolazco-Flores, Lori Landay, Matthew Jackson, Paul Röttger, Philip H. S. Torr, Trevor Darrell, Yong Suk Lee, Jakob Foerster
In the next few years, applications of Generative AI are expected to revolutionize a number of different areas, ranging from science & medicine to education.
no code implementations • 23 Feb 2024 • Clement Neo, Shay B. Cohen, Fazl Barez
In this paper, we investigate the interplay between attention heads and specialized "next-token" neurons in the Multilayer Perceptron that predict specific tokens.
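As a rough illustration of how such "next-token" neurons can be surfaced (a sketch, not the paper's method): project each MLP neuron's output weights through the unembedding and look for neurons whose logit contribution is dominated by a single token. The use of TransformerLens, GPT-2, the layer index, and the margin threshold are all illustrative assumptions.

```python
# Hedged sketch: flag candidate "next-token" neurons by projecting each MLP
# neuron's output weights onto the unembedding and checking whether one token's
# logit clearly dominates the rest.
import torch
from transformer_lens import HookedTransformer  # assumes TransformerLens is installed

model = HookedTransformer.from_pretrained("gpt2")
layer = 10  # illustrative layer choice

# W_out[layer]: [d_mlp, d_model]; W_U: [d_model, d_vocab]
neuron_logits = model.W_out[layer] @ model.W_U          # [d_mlp, d_vocab]
top_vals, top_ids = neuron_logits.topk(2, dim=-1)

# A neuron is a rough candidate if its top logit clearly beats the runner-up
# (the margin threshold is arbitrary, for illustration only).
margin = top_vals[:, 0] - top_vals[:, 1]
candidates = torch.nonzero(margin > 1.0).squeeze(-1)

for n in candidates[:10]:
    token = model.tokenizer.decode(int(top_ids[n, 0]))
    print(f"layer {layer}, neuron {int(n)} -> promotes token {token!r}")
```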
1 code implementation • 4 Feb 2024 • Philip Quirke, Clement Neo, Fazl Barez
To exhibit the reusability of verified modules, we insert the trained integer addition model into an untrained model and train the combined model to perform both addition and subtraction.
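The module-reuse step can be pictured with a toy sketch (an MLP stand-in under stated assumptions, not the paper's transformer setup): copy the verified module's weights into part of a larger untrained model, then train the combination on a mixture of addition and subtraction.

```python
# Minimal sketch of module reuse: insert a pre-trained "addition" module into a
# fresh model, then train the combined model on both addition and subtraction.
import torch
import torch.nn as nn

class TinyArithModel(nn.Module):
    def __init__(self, d_in=2, d_hidden=64):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_hidden)
        self.head = nn.Linear(d_hidden, 1)

    def forward(self, x):
        return self.head(torch.relu(self.encoder(x)))

pretrained_add = TinyArithModel(d_in=2)              # assume trained & verified on addition
combined = TinyArithModel(d_in=3, d_hidden=128)      # extra input flags the operation

# Insert the verified module's weights into part of the untrained model.
with torch.no_grad():
    combined.encoder.weight[:64, :2] = pretrained_add.encoder.weight
    combined.encoder.bias[:64] = pretrained_add.encoder.bias

# Train the combined model on a mixture of addition and subtraction examples.
opt = torch.optim.Adam(combined.parameters(), lr=1e-3)
for _ in range(100):
    a, b = torch.rand(256, 1), torch.rand(256, 1)
    op = torch.randint(0, 2, (256, 1)).float()       # 0 = add, 1 = subtract
    x = torch.cat([a, b, op], dim=-1)
    target = torch.where(op.bool(), a - b, a + b)
    loss = nn.functional.mse_loss(combined(x), target)
    opt.zero_grad(); loss.backward(); opt.step()
```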
1 code implementation • 10 Jan 2024 • Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, Yuntao Bai, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky, Paul Christiano, Samuel R. Bowman, Logan Graham, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, Ethan Perez
We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it).
1 code implementation • 3 Jan 2024 • Michelle Lo, Shay B. Cohen, Fazl Barez
This demonstrates that models exhibit polysemantic capacities and can blend old and new concepts in individual neurons.
no code implementations • 23 Dec 2023 • Fazl Barez, Philip Torr
As artificial intelligence (AI) systems become increasingly integrated into various domains, ensuring that they align with human values becomes critical.
no code implementations • 7 Nov 2023 • Michael Lan, Fazl Barez
While transformer models exhibit strong capabilities on linguistic tasks, their complex architectures make them difficult to interpret.
3 code implementations • 19 Oct 2023 • Philip Quirke, Fazl Barez
Understanding the inner workings of machine learning models like Transformers is vital for their safe and ethical use.
no code implementations • 12 Oct 2023 • Luke Marks, Amir Abdullah, Clement Neo, Rauno Arike, Philip Torr, Fazl Barez
Large language models (LLMs) fine-tuned by reinforcement learning from human feedback (RLHF) are becoming more widely deployed.
no code implementations • 9 Oct 2023 • Kayla Matteucci, Shahar Avin, Fazl Barez, Seán Ó hÉigeartaigh
Concerns around future dangers from advanced AI often centre on systems hypothesised to have intrinsic characteristics such as agent-like behaviour, strategic awareness, and long-range planning.
1 code implementation • 3 Oct 2023 • Albert Garde, Esben Kran, Fazl Barez
By granting access to state-of-the-art interpretability methods, DeepDecipher makes LLMs more transparent, trustworthy, and safe.
1 code implementation • 31 May 2023 • Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Shay Cohen, Fazl Barez
Conventional methods require examination of examples with strong neuron activation and manual identification of patterns to decipher the concepts a neuron responds to.
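A minimal sketch of that conventional max-activation workflow (the model, layer, neuron index, and toy corpus are illustrative assumptions):

```python
# Hedged sketch of the max-activation approach: collect the examples on which a
# chosen neuron fires most strongly, then inspect them by hand for a pattern.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
layer, neuron = 5, 123
corpus = ["The Eiffel Tower is in Paris.",
          "She adopted a small black cat.",
          "Interest rates rose again this quarter."]

records = []
for text in corpus:
    tokens = model.to_tokens(text)
    _, cache = model.run_with_cache(tokens)
    acts = cache[f"blocks.{layer}.mlp.hook_post"][0, :, neuron]  # [seq_len]
    best = int(acts.argmax())
    records.append((float(acts[best]), model.to_str_tokens(text)[best], text))

# Sort by activation strength and inspect the top examples manually.
for act, tok, text in sorted(records, reverse=True):
    print(f"{act:.2f}  token={tok!r}  in: {text}")
```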
1 code implementation • 27 May 2023 • Jason Hoelscher-Obermaier, Julia Persson, Esben Kran, Ioannis Konstas, Fazl Barez
We use this improved benchmark to evaluate recent model editing techniques and find that they suffer from low specificity.
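A specificity check of this kind can be sketched as follows (illustrative only, not the benchmark's code): after editing one fact, predictions on unrelated control prompts should be unchanged. Here `model_after` merely stands in for an edited copy of the model, and the control prompts are made up.

```python
# Hedged sketch of a specificity metric: fraction of unrelated prompts whose
# next-token prediction is unchanged by the edit.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model_before = AutoModelForCausalLM.from_pretrained("gpt2")
model_after = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in for an edited copy

control_prompts = [
    "The capital of France is",
    "Water boils at a temperature of",
    "The author of Hamlet is",
]

def next_token(model, prompt):
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    return int(logits.argmax())

unchanged = sum(next_token(model_before, p) == next_token(model_after, p)
                for p in control_prompts)
print(f"specificity on control prompts: {unchanged / len(control_prompts):.2f}")
```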
1 code implementation • 24 May 2023 • Antonio Valerio Miceli-Barone, Fazl Barez, Ioannis Konstas, Shay B. Cohen
Large Language Models (LLMs) have successfully been applied to code generation tasks, raising the question of how well these models understand programming.
no code implementations • 23 Apr 2023 • Fazl Barez, Hosein Hasanbeig, Alessandro Abate

We evaluate the satisfaction of these constraints via p-norms in state vector space.
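A minimal sketch of such a p-norm constraint score, assuming a simple box constraint on the state vector (illustrative, not the paper's implementation):

```python
# Hedged sketch: measure how far a state vector violates a box constraint
# [low, high], using a p-norm over the per-dimension violations.
import numpy as np

def constraint_violation(state, low, high, p=2):
    """p-norm of the per-dimension violation of the box constraint [low, high]."""
    below = np.maximum(low - state, 0.0)
    above = np.maximum(state - high, 0.0)
    return np.linalg.norm(below + above, ord=p)

state = np.array([0.9, -1.4, 0.2])
low, high = np.array([-1.0, -1.0, -1.0]), np.array([1.0, 1.0, 1.0])
print(constraint_violation(state, low, high, p=2))   # 0.4: only dimension 1 violates
```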
no code implementations • 22 Apr 2023 • Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Fazl Barez
Understanding the function of individual neurons within language models is essential for mechanistic interpretability research.
no code implementations • 16 Apr 2023 • Ondrej Bohdal, Timothy Hospedales, Philip H. S. Torr, Fazl Barez
Successful deployment of artificial intelligence (AI) in various settings has led to numerous positive outcomes for individuals and society.
1 code implementation • 20 Feb 2023 • Fazl Barez, Paul Bilokon, Arthur Gervais, Nikita Lisitsyn
This paper explores novel deep learning Transformer architectures for high-frequency Bitcoin-USDT log-return forecasting and compares them to traditional Long Short-Term Memory models.
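The forecasting setup's data-preparation step can be sketched as follows (synthetic prices and window length are illustrative assumptions): convert the price series to log-returns and build sliding windows that either sequence model could consume.

```python
# Hedged sketch: turn a high-frequency price series into log-returns and
# next-step supervised windows for a Transformer or LSTM forecaster.
import numpy as np

prices = 40000 + np.cumsum(np.random.randn(1000) * 5.0)   # synthetic BTC-USDT prices
log_returns = np.diff(np.log(prices))                      # r_t = log(p_t / p_{t-1})

window = 64
X = np.stack([log_returns[i:i + window] for i in range(len(log_returns) - window)])
y = log_returns[window:]                                   # next-step return to predict
print(X.shape, y.shape)                                    # (935, 64) (935,)
```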
1 code implementation • 16 Mar 2022 • Pengyi Li, Hongyao Tang, Tianpei Yang, Xiaotian Hao, Tong Sang, Yan Zheng, Jianye Hao, Matthew E. Taylor, Wenyuan Tao, Zhen Wang, Fazl Barez
However, we reveal that sub-optimal collaborative behaviors can also emerge with strong correlations, and that simply maximizing the MI can, surprisingly, hinder learning towards better collaboration.