Search Results for author: Peter Hase

Found 20 papers, 18 papers with code

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

1 code implementation • 15 Apr 2024 • Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut, Benjamin L. Edelman, Zhaowei Zhang, Mario Günther, Anton Korinek, Jose Hernandez-Orallo, Lewis Hammond, Eric Bigelow, Alexander Pan, Lauro Langosco, Tomasz Korbak, Heidi Zhang, Ruiqi Zhong, Seán Ó hÉigeartaigh, Gabriel Recchia, Giulio Corsi, Alan Chan, Markus Anderljung, Lilian Edwards, Yoshua Bengio, Danqi Chen, Samuel Albanie, Tegan Maharaj, Jakob Foerster, Florian Tramer, He He, Atoosa Kasirzadeh, Yejin Choi, David Krueger

This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs).

Paper
Code

Rethinking Machine Unlearning for Large Language Models

no code implementations • 13 Feb 2024 • Sijia Liu, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, Peter Hase, Xiaojun Xu, Yuguang Yao, Hang Li, Kush R. Varshney, Mohit Bansal, Sanmi Koyejo, Yang Liu

We explore machine unlearning (MU) in the domain of large language models (LLMs), referred to as LLM unlearning.

Machine Unlearning Management +2

Paper
Add Code

The Unreasonable Effectiveness of Easy Training Data for Hard Tasks

1 code implementation • 12 Jan 2024 • Peter Hase, Mohit Bansal, Peter Clark, Sarah Wiegreffe

In this paper, we present the surprising conclusion that current language models often generalize relatively well from easy to hard data, even performing as well as "oracle" models trained on hard data.

General Knowledge In-Context Learning +1

Paper
Code

Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks

1 code implementation • 29 Sep 2023 • Vaidehi Patil, Peter Hase, Mohit Bansal

Experimentally, we show that even state-of-the-art model editing methods such as ROME struggle to truly delete factual information from models like GPT-J, as our whitebox and blackbox attacks can recover "deleted" information from an edited model 38% of the time.

Model Editing

Paper
Code

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

no code implementations • 27 Jul 2023 • Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Biyik, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell

Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals.

reinforcement-learning

Paper
Add Code

Can Language Models Teach Weaker Agents? Teacher Explanations Improve Students via Personalization

1 code implementation • 15 Jun 2023 • Swarnadeep Saha, Peter Hase, Mohit Bansal

We first show that teacher LLMs can indeed intervene on student reasoning to improve their performance.

Paper
Code

Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models

1 code implementation • NeurIPS 2023 • Peter Hase, Mohit Bansal, Been Kim, Asma Ghandeharioun

This finding raises questions about how past work relies on Causal Tracing to select which model layers to edit.

Denoising knowledge editing

Paper
Code

Are Hard Examples also Harder to Explain? A Study with Human and Model-Generated Explanations

1 code implementation • 14 Nov 2022 • Swarnadeep Saha, Peter Hase, Nazneen Rajani, Mohit Bansal

We observe that (1) GPT-3 explanations are as grammatical as human explanations regardless of the hardness of the test samples, (2) for easy examples, GPT-3 generates highly supportive explanations but human explanations are more generalizable, and (3) for hard examples, human explanations are significantly better than GPT-3 explanations both in terms of label-supportiveness and generalizability judgements.

Paper
Code

Summarization Programs: Interpretable Abstractive Summarization with Neural Modular Trees

1 code implementation • 21 Sep 2022 • Swarnadeep Saha, Shiyue Zhang, Peter Hase, Mohit Bansal

We demonstrate that SP-Search effectively represents the generative process behind human summaries using modules that are typically faithful to their intended behavior.

Abstractive Text Summarization Sentence +1

Paper
Code

VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives

1 code implementation • 22 Jun 2022 • Zhuofan Ying, Peter Hase, Mohit Bansal

In this paper, we show that model FI supervision can meaningfully improve VQA model accuracy as well as performance on several Right-for-the-Right-Reason (RRR) metrics by optimizing for four key model objectives: (1) accurate predictions given limited but sufficient information (Sufficiency); (2) max-entropy predictions given no important information (Uncertainty); (3) invariance of predictions to changes in unimportant features (Invariance); and (4) alignment between model FI explanations and human FI explanations (Plausibility).

Feature Importance Question Answering +2

Paper
Code

GrIPS: Gradient-free, Edit-based Instruction Search for Prompting Large Language Models

2 code implementations • 14 Mar 2022 • Archiki Prasad, Peter Hase, Xiang Zhou, Mohit Bansal

Providing natural language instructions in prompts is a useful new paradigm for improving task performance of large language models in a zero-shot setting.

132

Paper
Code

Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs

1 code implementation • 26 Nov 2021 • Peter Hase, Mona Diab, Asli Celikyilmaz, Xian Li, Zornitsa Kozareva, Veselin Stoyanov, Mohit Bansal, Srinivasan Iyer

In this paper, we discuss approaches to detecting when models have beliefs about the world, and we improve on methods for updating model beliefs to be more truthful, with a focus on methods based on learned optimizers or hypernetworks.

Paper
Code

Low-Cost Algorithmic Recourse for Users With Uncertain Cost Functions

1 code implementation • 1 Nov 2021 • Prateek Yadav, Peter Hase, Mohit Bansal

Current approaches try to optimize for the cost incurred by users when adopting a recourse, but they assume that all users share the same cost function.

Fairness

Paper
Code

The Out-of-Distribution Problem in Explainability and Search Methods for Feature Importance Explanations

1 code implementation • NeurIPS 2021 • Peter Hase, Harry Xie, Mohit Bansal

In this paper, we study several under-explored dimensions of FI explanations, providing conceptual and empirical improvements for this form of explanation.

counterfactual Feature Importance +2

Paper
Code

When Can Models Learn From Explanations? A Formal Framework for Understanding the Roles of Explanation Data

1 code implementation • LNLS (ACL) 2022 • Peter Hase, Mohit Bansal

In order to carefully control important properties of the data and explanations, we introduce a synthetic dataset for experiments, and we also make use of three existing datasets with explanations: e-SNLI, TACRED, and SemEval.

Retrieval

Paper
Code

FastIF: Scalable Influence Functions for Efficient Model Interpretation and Debugging

1 code implementation • EMNLP 2021 • Han Guo, Nazneen Fatema Rajani, Peter Hase, Mohit Bansal, Caiming Xiong

With the availability of the fast influence functions, we demonstrate their usefulness in four applications.

Data Augmentation

Paper
Code

Leakage-Adjusted Simulatability: Can Models Generate Non-Trivial Explanations of Their Behavior in Natural Language?

1 code implementation • Findings of the Association for Computational Linguistics 2020 • Peter Hase, Shiyue Zhang, Harry Xie, Mohit Bansal

We provide code for the experiments in this paper at https://github. com/peterbhase/LAS-NL-Explanations

Explanation Generation

Paper
Code

Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior?

1 code implementation • ACL 2020 • Peter Hase, Mohit Bansal

Through two kinds of simulation tests involving text and tabular data, we evaluate five explanations methods: (1) LIME, (2) Anchor, (3) Decision Boundary, (4) a Prototype model, and (5) a Composite approach that combines explanations from each method.

counterfactual tabular-classification

Paper
Code

Interpretable Image Recognition with Hierarchical Prototypes

1 code implementation • 25 Jun 2019 • Peter Hase, Chaofan Chen, Oscar Li, Cynthia Rudin

Hence, we may find distinct explanations for the prediction an image receives at each level of the taxonomy.

General Classification

Paper
Code

Shall I Compare Thee to a Machine-Written Sonnet? An Approach to Algorithmic Sonnet Generation

2 code implementations • 13 Nov 2018 • John Benhardt, Peter Hase, Liuyi Zhu, Cynthia Rudin

We provide an approach for generating beautiful poetry.

Sonnet Generation

Paper
Code

Cannot find the paper you are looking for? You can Submit a new open access paper.