no code implementations • 14 Feb 2024 • Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Furong Huang, Dinesh Manocha, Amrit Singh Bedi, Mengdi Wang
Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data.
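For context, here is a minimal sketch of the Bradley-Terry preference objective that such a single reward model is typically trained with. The tiny MLP and synthetic embedding pairs are illustrative stand-ins, not the paper's setup; in real RLHF the scorer would be a language model with a scalar head.

```python
import torch
import torch.nn as nn

# Toy stand-in for a reward model: a small MLP scoring a fixed-size
# response embedding (a real reward model scores LM hidden states).
class RewardModel(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)  # scalar reward per example

torch.manual_seed(0)
model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Synthetic preference data: each pair is (chosen, rejected) embeddings.
chosen = torch.randn(256, 32) + 0.5   # hypothetical "preferred" responses
rejected = torch.randn(256, 32) - 0.5

for step in range(200):
    # Bradley-Terry objective: maximize log sigmoid(r_chosen - r_rejected),
    # the probability the model assigns to the observed human preference.
    loss = -torch.nn.functional.logsigmoid(model(chosen) - model(rejected)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final preference loss: {loss.item():.3f}")
```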
no code implementations • 8 Jan 2024 • Jiahao Qiu, Hui Yuan, Jinghong Zhang, Wentao Chen, Huazheng Wang, Mengdi Wang
To enhance the efficiency of this sequence-optimization process, we propose a tree search-based bandit learning method, which expands a tree starting from the initial sequence under the guidance of a bandit machine learning model.
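A minimal sketch of the general idea: UCB1-style bandit selection over a frontier of candidate sequences, expanding a node into its single-position mutations once its noisy fitness estimate has been sampled enough. The alphabet, fitness oracle, and expansion threshold below are hypothetical, not the paper's actual model or expansion rule.

```python
import math
import random

random.seed(0)
ALPHABET = "ACDE"  # hypothetical 4-letter alphabet standing in for amino acids

def fitness(seq):
    """Hypothetical noisy black-box objective: count of 'A' plus noise."""
    return seq.count("A") + random.gauss(0, 0.5)

def children(seq):
    """A node's children: all single-position substitutions of the sequence."""
    return [seq[:i] + a + seq[i+1:]
            for i in range(len(seq)) for a in ALPHABET if a != seq[i]]

def tree_search_bandit(root, budget=300, c=1.0, expand_after=3):
    stats = {root: [1, fitness(root)]}   # seq -> [n_pulls, sum_of_rewards]
    frontier = {root}
    for t in range(2, budget + 2):
        # UCB1 selection over the frontier: mean reward + exploration bonus.
        node = max(frontier,
                   key=lambda s: stats[s][1] / stats[s][0]
                                 + c * math.sqrt(math.log(t) / stats[s][0]))
        n, total = stats[node]
        stats[node] = [n + 1, total + fitness(node)]  # pull the arm again
        if stats[node][0] >= expand_after:
            # Estimate is trusted enough: expand this node's children.
            frontier.remove(node)
            for ch in children(node):
                if ch not in stats:
                    stats[ch] = [1, fitness(ch)]
                    frontier.add(ch)
    best = max(stats, key=lambda s: stats[s][1] / stats[s][0])
    return best, stats[best][1] / stats[best][0]

print(tree_search_bandit("CDEC"))
```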
no code implementations • 18 Oct 2022 • Shuo Xie, Jiahao Qiu, Ankita Pasad, Li Du, Qing Qu, Hongyuan Mei
We propose to select layers based on the variability of their hidden states given a task-specific corpus.
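A rough sketch of one way such a selection could be computed with a pretrained encoder: gather hidden states over a task corpus and rank layers by a simple variance-based variability proxy. The corpus, model choice, and variability measure here are assumptions, not the paper's exact procedure.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Hypothetical task-specific corpus; in practice this would be the
# downstream task's training sentences.
corpus = ["the movie was wonderful", "a dull, lifeless plot", "acting felt genuine"]

name = "bert-base-uncased"   # any pretrained encoder works here
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True).eval()

reps = []  # per layer: token representations gathered over the corpus
with torch.no_grad():
    for text in corpus:
        out = model(**tok(text, return_tensors="pt"))
        # out.hidden_states: tuple of (layers+1) tensors, (1, seq_len, hidden)
        for i, h in enumerate(out.hidden_states):
            if len(reps) <= i:
                reps.append([])
            reps[i].append(h.squeeze(0))

# Variability proxy (an assumption, not the paper's exact measure):
# total variance of token representations at each layer over the corpus.
variability = [torch.cat(r, dim=0).var(dim=0).sum().item() for r in reps]
ranked = sorted(range(len(variability)), key=lambda i: -variability[i])
print("layers ranked by hidden-state variability:", ranked)
```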