AI Risk Management Should Incorporate Both Safety and Security
no code implementations • 29 May 2024 • Xiangyu Qi, Yangsibo Huang, Yi Zeng, Edoardo Debenedetti, Jonas Geiping, Luxi He, Kaixuan Huang, Udari Madhushani, Vikash Sehwag, Weijia Shi, Boyi Wei, Tinghao Xie, Danqi Chen, Pin-Yu Chen, Jeffrey Ding, Ruoxi Jia, Jiaqi Ma, Arvind Narayanan, Weijie J Su, Mengdi Wang, Chaowei Xiao, Bo Li, Dawn Song, Peter Henderson, Prateek Mittal
The exposure of security vulnerabilities in safety-aligned language models, e.g., susceptibility to adversarial attacks, has shed light on the intricate interplay between AI safety and AI security.
What's in Your 'Safe' Data?: Identifying Benign Data that Breaks Safety
no code implementations • 1 Apr 2024 • Luxi He, Mengzhou Xia, Peter Henderson
Current Large Language Models (LLMs), even those tuned for safety and alignment, are susceptible to jailbreaking.
Aleatoric and Epistemic Discrimination: Fundamental Limits of Fairness Interventions
no code implementations • NeurIPS 2023 • Hao Wang, Luxi He, Rui Gao, Flavio P. Calmon
We categorize sources of discrimination in the ML pipeline into two classes: aleatoric discrimination, which is inherent in the data distribution, and epistemic discrimination, which is due to decisions made during model development.
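One way to make this split concrete is a minimal formal sketch; the notation below ($\mathrm{Acc}^{\star}_{\epsilon}$, $\mathrm{Unf}_P$, $\Delta_{\mathrm{epi}}$) is illustrative and not necessarily the paper's. Aleatoric discrimination caps what any fairness-constrained classifier can achieve on a given data distribution $P$ over features $X$ and label $Y$, while epistemic discrimination is how far a particular trained model $h$ falls short of that cap:
\[
\mathrm{Acc}^{\star}_{\epsilon}(P) \;=\; \sup_{h:\,\mathrm{Unf}_P(h)\le\epsilon} \Pr_P[h(X)=Y] \qquad \text{(aleatoric: best accuracy $P$ admits under unfairness budget $\epsilon$)}
\]
\[
\Delta_{\mathrm{epi}}(h) \;=\; \mathrm{Acc}^{\star}_{\epsilon}(P) \;-\; \Pr_P[h(X)=Y] \qquad \text{(epistemic: gap attributable to model-development choices)}
\]
Under this reading, fairness interventions applied during model development can only shrink $\Delta_{\mathrm{epi}}(h)$; they cannot push accuracy above the distribution-level limit $\mathrm{Acc}^{\star}_{\epsilon}(P)$.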