1 code implementation • 26 Feb 2024 • Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun, Zhi-Quan Luo
SGD performs worse than Adam by a significant margin on Transformers, but the reason remains unclear.
1 code implementation • 17 Dec 2023 • Ziniu Li, Tian Xu, Yang Yu
These methods, either explicitly or implicitly, learn a reward model from preference data and differ in the data used for policy optimization to unlock the generalization ability of the reward model.
1 code implementation • 16 Oct 2023 • Ziniu Li, Tian Xu, Yushun Zhang, Zhihang Lin, Yang Yu, Ruoyu Sun, Zhi-Quan Luo
Based on these properties, we develop ReMax, a new algorithm tailored for RLHF.
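At its core, ReMax is a REINFORCE-style gradient estimator that subtracts the reward of the greedy (argmax) response as a baseline to reduce variance. A minimal sketch on a toy softmax policy — the bandit setup, `reward_fn`, and all names here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def remax_grad(logits, reward_fn, rng):
    """One-sample ReMax-style gradient estimate for a softmax policy.

    REINFORCE with the reward of the greedy action used as a baseline,
    so the estimate is unbiased but lower-variance (toy setup).
    """
    probs = softmax(logits)
    a = int(rng.choice(len(logits), p=probs))     # sampled action
    baseline = reward_fn(int(np.argmax(logits)))  # greedy-action reward
    # grad of log pi(a) w.r.t. logits for a softmax: one_hot(a) - probs
    grad_logp = -probs
    grad_logp[a] += 1.0
    return (reward_fn(a) - baseline) * grad_logp

rewards = np.array([1.0, 0.2, 0.0])
g = remax_grad(np.zeros(3), lambda a: rewards[a], rng)
```

Because `grad_logp` always sums to zero, subtracting the baseline shifts no bias into the estimate; it only rescales the per-sample magnitude.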
1 code implementation • 11 Jun 2023 • Tian Xu, Ziniu Li, Yang Yu, Zhi-Quan Luo
Adversarial imitation learning (AIL), a subset of IL methods, is particularly promising, but its theoretical foundation in the presence of unknown transitions has yet to be fully developed.
no code implementations • 13 Mar 2023 • Ziniu Li, Ke Xu, Liu Liu, Lanqing Li, Deheng Ye, Peilin Zhao
To address this issue, we propose an alternative framework that involves a human supervising the RL models and providing additional feedback in the online deployment phase.
1 code implementation • 27 Jan 2023 • Ziniu Li, Tian Xu, Yang Yu, Zhi-Quan Luo
This paper considers a situation where, besides the small amount of expert data, a supplementary dataset is available, which can be collected cheaply from sub-optimal policies.
no code implementations • 3 Aug 2022 • Tian Xu, Ziniu Li, Yang Yu, Zhi-Quan Luo
Imitation learning learns a policy from expert trajectories.
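The simplest instance of learning a policy from expert trajectories is behavioral cloning: supervised learning on the expert's (state, action) pairs. A self-contained sketch with synthetic data (the expert rule and all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical expert data: states in R^2, discrete actions {0, 1};
# the expert picks action 1 iff the first state coordinate is positive.
states = rng.normal(size=(200, 2))
actions = (states[:, 0] > 0).astype(float)

# Behavioral cloning = supervised learning: fit a logistic policy
# pi(a=1|s) by minimizing cross-entropy on the expert pairs.
w = np.zeros(2)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-states @ w))
    w -= 0.5 * states.T @ (p - actions) / len(states)

accuracy = np.mean(((states @ w) > 0) == (actions > 0))
```

This ignores the sequential structure of the MDP, which is exactly why BC can suffer compounding errors at states the expert never visited.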
no code implementations • 22 Mar 2022 • Ziniu Li, Tian Xu, Yang Yu
In particular, we demonstrate that the sample complexity of the target Q-learning algorithm in [Lee and He, 2020] is $\widetilde{\mathcal O}(|\mathcal S|^2|\mathcal A|^2 (1-\gamma)^{-5}\varepsilon^{-2})$.
no code implementations • 5 Feb 2022 • Ziniu Li, Tian Xu, Yang Yu, Zhi-Quan Luo
First, we show that ValueDice could reduce to BC under the offline setting.
1 code implementation • ICLR 2022 • Ziniu Li, Yingru Li, Yushun Zhang, Tong Zhang, Zhi-Quan Luo
However, it is limited to the case where 1) a good feature is known in advance and 2) this feature is fixed during training; otherwise, RLSVI incurs a prohibitive computational cost to obtain posterior samples of the parameters of the $Q$-value function.
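With a fixed linear feature map, the posterior sampling step in RLSVI is just a draw from the Gaussian posterior of a ridge regression on value targets. A toy sketch of that single step — the data, names, and hyperparameters are illustrative assumptions, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def rlsvi_sample(Phi, targets, noise_std=1.0, prior_var=1.0, rng=rng):
    """Draw one posterior sample of linear Q-value weights, RLSVI-style.

    With a fixed feature matrix Phi (n x d), the weights have a Gaussian
    posterior under a ridge-regression model; RLSVI acts greedily w.r.t.
    a sampled Q-function to drive exploration. (Toy sketch.)
    """
    d = Phi.shape[1]
    precision = Phi.T @ Phi / noise_std**2 + np.eye(d) / prior_var
    cov = np.linalg.inv(precision)
    mean = cov @ Phi.T @ targets / noise_std**2
    return rng.multivariate_normal(mean, cov)

Phi = rng.normal(size=(50, 3))
targets = Phi @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
w_sample = rlsvi_sample(Phi, targets)
```

If the feature map changes during training (as it does when features are learned), this closed-form posterior is no longer available, which is the computational burden the entry refers to.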
no code implementations • 19 Jun 2021 • Tian Xu, Ziniu Li, Yang Yu, Zhi-Quan Luo
For some MDPs, we show that vanilla AIL has a worse sample complexity than BC.
no code implementations • NeurIPS 2020 • Tian Xu, Ziniu Li, Yang Yu
In this paper, we first analyze the value gap between the expert policy and imitated policies under two imitation methods, behavioral cloning and generative adversarial imitation.
no code implementations • 16 Nov 2019 • Tian Xu, Ziniu Li, Yang Yu
We also show that the framework yields a value discrepancy for GAIL on the order of $\mathcal{O}((1-\gamma)^{-1})$.