no code implementations • 16 Apr 2024 • Weitong Zhang, Zhiyuan Fan, Jiafan He, Quanquan Gu
To the best of our knowledge, Cert-LSVI-UCB is the first algorithm to achieve a constant, instance-dependent, high-probability regret bound in RL with linear function approximation for infinite runs without relying on prior distribution assumptions.
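As a rough illustration of the least-squares value iteration template this line of work builds on (the certified estimator and instance-dependent analysis of Cert-LSVI-UCB are not reproduced here), the following sketch assumes an illustrative feature map `phi`, a transition dataset, and a bonus coefficient `beta`:

```python
import numpy as np

def lsvi_ucb_q(phi, data, beta, lam=1.0):
    """One least-squares value iteration step with a UCB bonus.

    phi  : feature map, phi(s, a) -> np.ndarray of dimension d
    data : list of (s, a, r, v_next) transitions, where v_next is the
           estimated value of the next state from the previous iteration
    beta : confidence-width multiplier for the exploration bonus
    """
    d = phi(data[0][0], data[0][1]).shape[0]
    Lambda = lam * np.eye(d)                      # regularized Gram matrix
    target = np.zeros(d)
    for s, a, r, v_next in data:
        x = phi(s, a)
        Lambda += np.outer(x, x)
        target += x * (r + v_next)                # regression target r + V(s')
    theta = np.linalg.solve(Lambda, target)       # ridge solution
    Lambda_inv = np.linalg.inv(Lambda)

    def q_value(s, a):
        x = phi(s, a)
        bonus = beta * np.sqrt(x @ Lambda_inv @ x)   # elliptical UCB bonus
        return x @ theta + bonus                     # optimistic Q estimate

    return q_value
```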
no code implementations • 16 Apr 2024 • Qiwei Di, Jiafan He, Quanquan Gu
Learning from human feedback plays an important role in aligning generative models such as large language models (LLMs).
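A minimal sketch of one standard way to learn a reward model from pairwise human preferences, the Bradley-Terry model fit by logistic regression; the linear reward parameterization and optimization settings here are illustrative assumptions, not the paper's exact method:

```python
import numpy as np

def fit_bradley_terry(features_a, features_b, prefs, lr=0.1, steps=500):
    """Fit a linear reward model r(x) = <theta, x> from pairwise preferences.

    prefs[i] = 1 means response A was preferred to response B in pair i.
    Under the Bradley-Terry model, P(A > B) = sigmoid(r(A) - r(B)).
    """
    d = features_a.shape[1]
    theta = np.zeros(d)
    for _ in range(steps):
        diff = (features_a - features_b) @ theta
        p = 1.0 / (1.0 + np.exp(-diff))           # predicted P(A preferred)
        grad = (features_a - features_b).T @ (p - prefs) / len(prefs)
        theta -= lr * grad                        # gradient step on the logistic loss
    return theta
```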
no code implementations • 14 Feb 2024 • Chenlu Ye, Jiafan He, Quanquan Gu, Tong Zhang
We also prove a lower bound to show that the additive dependence on $C$ is optimal.
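One common mechanism behind additive-in-$C$ guarantees is uncertainty-weighted regression, which caps the influence any single (possibly corrupted) sample can have on the estimator; the sketch below is a generic illustration of that idea, with the weighting rule and the parameter `alpha` as assumptions:

```python
import numpy as np

def weighted_ridge(X, y, alpha, lam=1.0):
    """Uncertainty-weighted ridge regression.

    Each sample is down-weighted when its elliptical norm under the current
    Gram matrix is large, bounding how much a corrupted sample can shift
    the final estimate.
    """
    n, d = X.shape
    Lambda = lam * np.eye(d)
    b = np.zeros(d)
    for i in range(n):
        x = X[i]
        u = np.sqrt(x @ np.linalg.solve(Lambda, x))   # uncertainty of sample i
        w = min(1.0, alpha / u)                       # weight <= 1, small if uncertain
        Lambda += w * np.outer(x, x)
        b += w * x * y[i]
    return np.linalg.solve(Lambda, b)
```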
no code implementations • 14 Feb 2024 • Qiwei Di, Jiafan He, Dongruo Zhou, Quanquan Gu
Our algorithm achieves an $\tilde{\mathcal O}(dB_*\sqrt{K})$ regret bound, where $d$ is the dimension of the feature mapping in the linear transition kernel, $B_*$ is the upper bound of the total cumulative cost for the optimal policy, and $K$ is the number of episodes.
no code implementations • 14 Feb 2024 • Kaixuan Ji, Jiafan He, Quanquan Gu
Aligning large language models (LLMs) with human preferences plays a key role in building modern generative models and can be achieved by reinforcement learning from human feedback (RLHF).
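A label-efficient RLHF system typically asks a human to compare two responses only when the current reward model is uncertain about their ordering. The following sketch illustrates such an uncertainty-based query rule under an assumed linear reward model; it is not the paper's exact criterion:

```python
import numpy as np

def should_query(x_a, x_b, Lambda_inv, threshold):
    """Query the human only when the preference between two responses
    is statistically uncertain under the current linear reward model.

    Uncertainty is the elliptical norm of the feature difference; below
    the threshold, the model's own ranking is trusted instead.
    """
    diff = x_a - x_b
    uncertainty = np.sqrt(diff @ Lambda_inv @ diff)
    return uncertainty > threshold
```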
no code implementations • 26 Nov 2023 • Heyang Zhao, Jiafan He, Quanquan Gu
The exploration-exploitation dilemma has been a central challenge in reinforcement learning (RL) with complex model classes.
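A toy illustration of the optimism principle with a general model class: among all candidate models still consistent with the data, act according to the one promising the highest value. The confidence test here is a stand-in assumption:

```python
def optimistic_choice(models, value_of, fits_data):
    """Pick the most optimistic model consistent with the data.

    models    : iterable of candidate models (the 'model class')
    value_of  : model -> optimal value the model promises
    fits_data : model -> True if the model passes a statistical fit test
    """
    plausible = [m for m in models if fits_data(m)]   # confidence set
    return max(plausible, key=value_of)               # optimism: best promised value
```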
no code implementations • 2 Oct 2023 • Qiwei Di, Heyang Zhao, Jiafan He, Quanquan Gu
However, only a limited number of works on offline RL with non-linear function approximation provide instance-dependent regret guarantees.
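Instance-dependent offline RL analyses typically rest on pessimism: penalizing state-action pairs that are poorly covered by the dataset. A minimal linear-function-approximation sketch of a lower-confidence-bound estimate, with `phi`, `theta`, and `beta` as assumed inputs:

```python
import numpy as np

def pessimistic_q(phi, theta, Lambda_inv, beta):
    """Lower-confidence-bound Q estimate used in pessimistic offline RL.

    Subtracting the elliptical bonus penalizes state-action pairs that
    are poorly covered by the offline dataset.
    """
    def q_value(s, a):
        x = phi(s, a)
        penalty = beta * np.sqrt(x @ Lambda_inv @ x)  # data-coverage penalty
        return x @ theta - penalty                    # pessimistic estimate
    return q_value
```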
no code implementations • 15 May 2023 • Kaixuan Ji, Qingyue Zhao, Jiafan He, Weitong Zhang, Quanquan Gu
Recent studies have shown that episodic reinforcement learning (RL) is no harder than bandits when the total reward is bounded by $1$, and proved regret bounds that have a polylogarithmic dependence on the planning horizon $H$.
no code implementations • 15 May 2023 • Yue Wu, Jiafan He, Quanquan Gu
Recently, there has been remarkable progress in reinforcement learning (RL) with general function approximation.
no code implementations • 10 May 2023 • Yifei Min, Jiafan He, Tianhao Wang, Quanquan Gu
We study multi-agent reinforcement learning in the setting of episodic Markov decision processes, where multiple agents cooperate via communication through a central server.
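With linear function approximation, cooperation through a central server can be implemented by sharing sufficient statistics rather than raw trajectories, which keeps communication compact. The sketch below illustrates this pattern; the class and method names are hypothetical:

```python
import numpy as np

class CentralServer:
    """Aggregates sufficient statistics from cooperating agents."""

    def __init__(self, d, lam=1.0):
        self.Lambda = lam * np.eye(d)   # global Gram matrix
        self.b = np.zeros(d)            # global moment vector

    def upload(self, local_Lambda, local_b):
        """Receive one agent's local statistics."""
        self.Lambda += local_Lambda
        self.b += local_b

    def download(self):
        """Broadcast the shared estimate and its covariance."""
        theta = np.linalg.solve(self.Lambda, self.b)
        return theta, np.linalg.inv(self.Lambda)
```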
no code implementations • 16 Mar 2023 • Weitong Zhang, Jiafan He, Zhiyuan Fan, Quanquan Gu
We show that, when the misspecification level $\zeta$ is dominated by $\tilde O (\Delta / \sqrt{d})$ with $\Delta$ being the minimal sub-optimality gap and $d$ being the dimension of the contextual vectors, our algorithm enjoys the same gap-dependent regret bound $\tilde O (d^2/\Delta)$ as in the well-specified setting up to logarithmic factors.
no code implementations • 21 Feb 2023 • Heyang Zhao, Jiafan He, Dongruo Zhou, Tong Zhang, Quanquan Gu
We propose a variance-adaptive algorithm for linear mixture MDPs, which achieves a problem-dependent horizon-free regret bound that can gracefully reduce to a nearly constant regret for deterministic MDPs.
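Variance adaptivity is commonly obtained by weighting each regression sample with the inverse of its (clipped) estimated conditional variance, so that near-deterministic transitions contribute almost noiselessly; a generic sketch under that assumption:

```python
import numpy as np

def variance_weighted_ridge(X, y, sigma2, sigma2_min, lam=1.0):
    """Variance-adaptive ridge regression.

    Samples with low estimated variance get large weights, so a
    deterministic MDP yields nearly noiseless regression targets.
    """
    d = X.shape[1]
    w = 1.0 / np.maximum(sigma2, sigma2_min)      # clipped inverse-variance weights
    Lambda = lam * np.eye(d) + (X * w[:, None]).T @ X
    b = X.T @ (w * y)
    return np.linalg.solve(Lambda, b)
```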
no code implementations • 12 Dec 2022 • Jiafan He, Heyang Zhao, Dongruo Zhou, Quanquan Gu
We study reinforcement learning (RL) with linear function approximation.
no code implementations • 7 Jul 2022 • Jiafan He, Tianhao Wang, Yifei Min, Quanquan Gu
To the best of our knowledge, this is the first provably efficient algorithm that allows fully asynchronous communication for federated contextual linear bandits, while achieving the same regret guarantee as in the single-agent setting.
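Asynchronous schemes of this kind often let each agent decide locally when to contact the server, for example when its Gram matrix has accumulated enough new information since the last synchronization. The log-determinant trigger below is an illustrative instance of such a rule, not necessarily the paper's exact criterion:

```python
import numpy as np

def should_sync(local_Lambda, last_synced_Lambda, threshold):
    """Determinant-based trigger for asynchronous communication.

    The agent contacts the server only when the log-determinant of its
    local Gram matrix has grown enough; no global clock is required.
    """
    _, logdet_now = np.linalg.slogdet(local_Lambda)
    _, logdet_old = np.linalg.slogdet(last_synced_Lambda)
    return logdet_now - logdet_old > np.log(threshold)
```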
no code implementations • 13 May 2022 • Jiafan He, Dongruo Zhou, Tong Zhang, Quanquan Gu
We show that, for both the known-$C$ and unknown-$C$ cases, our algorithm with a proper choice of hyperparameters achieves a regret that nearly matches the lower bounds.
no code implementations • 28 Feb 2022 • Heyang Zhao, Dongruo Zhou, Jiafan He, Quanquan Gu
We study the problem of online generalized linear regression in the stochastic setting, where the label is generated from a generalized linear model with possibly unbounded additive noise.
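A minimal sketch of an online estimator for a generalized linear model: one gradient step on the squared prediction error per arriving sample, with the link function supplied by the caller. The learning-rate scheme and logistic link in the usage note are assumptions:

```python
import numpy as np

def online_glm_step(theta, x, y, link, link_deriv, lr):
    """One online gradient step for generalized linear regression.

    The label is modeled as y = link(<theta, x>) + noise; the update
    follows the gradient of the squared error at the current sample.
    """
    z = x @ theta
    residual = link(z) - y
    grad = residual * link_deriv(z) * x
    return theta - lr * grad

# Example with a logistic link (illustrative choice):
# sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
# theta = online_glm_step(theta, x, y, sigmoid,
#                         lambda z: sigmoid(z) * (1 - sigmoid(z)), lr=0.1)
```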
no code implementations • 25 Oct 2021 • Yifei Min, Jiafan He, Tianhao Wang, Quanquan Gu
To the best of our knowledge, this is the first algorithm with a sublinear regret guarantee for learning linear mixture SSP.
no code implementations • 19 Oct 2021 • Chonghua Liao, Jiafan He, Quanquan Gu
To the best of our knowledge, this is the first provable privacy-preserving RL algorithm with linear function approximation.
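Privacy-preserving bandit and RL algorithms typically perturb released sufficient statistics with calibrated noise before they leave the user's side; the Gaussian mechanism below is a generic illustration, with the noise scale `sigma` standing in for a privacy-budget calibration:

```python
import numpy as np

def privatize_statistics(Lambda, b, sigma, rng=None):
    """Release ridge-regression statistics with additive Gaussian noise,
    the standard mechanism behind (local) differential privacy.
    """
    if rng is None:
        rng = np.random.default_rng()
    d = Lambda.shape[0]
    noise = rng.normal(0.0, sigma, size=(d, d))
    Lambda_priv = Lambda + (noise + noise.T) / 2      # symmetrized matrix noise
    b_priv = b + rng.normal(0.0, sigma, size=d)
    return Lambda_priv, b_priv
```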
no code implementations • NeurIPS 2021 • Jiafan He, Dongruo Zhou, Quanquan Gu
The uniform-PAC guarantee is the strongest guarantee for reinforcement learning in the literature: it directly implies both PAC and high-probability regret bounds, making our algorithm superior to all existing algorithms with linear function approximation.
no code implementations • 22 Jun 2021 • Weitong Zhang, Jiafan He, Dongruo Zhou, Amy Zhang, Quanquan Gu
For the offline counterpart, ReLEX-LCB, we show that the algorithm can find the optimal policy if the representation class covers the state-action space, and that it achieves a gap-dependent sample complexity.
no code implementations • 17 Feb 2021 • Jiafan He, Dongruo Zhou, Quanquan Gu
In this paper, we study RL in episodic MDPs with adversarial reward and full information feedback, where the unknown transition probability function is a linear function of a given feature mapping, and the reward function can change arbitrarily episode by episode.
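Full-information adversarial-reward settings are commonly handled with a multiplicative-weights (exponentiated-gradient) policy update at each state; a minimal sketch of that update, not necessarily the paper's exact scheme:

```python
import numpy as np

def exp_weights_update(policy, q_values, eta):
    """Multiplicative-weights policy update for adversarial rewards.

    policy   : current action distribution at a state, shape (A,)
    q_values : estimated action values under the latest reward, shape (A,)
    eta      : learning rate
    """
    new_policy = policy * np.exp(eta * q_values)
    return new_policy / new_policy.sum()     # renormalize to a distribution
```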
no code implementations • 23 Nov 2020 • Jiafan He, Dongruo Zhou, Quanquan Gu
Reinforcement learning (RL) with linear function approximation has received increasing attention recently.
no code implementations • NeurIPS 2021 • Jiafan He, Dongruo Zhou, Quanquan Gu
We study the reinforcement learning problem for discounted Markov Decision Processes (MDPs) under the tabular setting.
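A common template in this setting is model-based value iteration with a count-based exploration bonus; the sketch below is a generic illustration with an unspecified bonus constant `c`, not the paper's exact algorithm:

```python
import numpy as np

def optimistic_value_iteration(P_hat, R, N, gamma, c, iters=200):
    """Model-based value iteration with a count-based bonus for a
    discounted tabular MDP.

    P_hat : empirical transition probabilities, shape (S, A, S)
    R     : empirical mean rewards, shape (S, A)
    N     : visit counts, shape (S, A)
    """
    S, A, _ = P_hat.shape
    V = np.zeros(S)
    bonus = c / np.sqrt(np.maximum(N, 1))        # larger bonus for rare pairs
    for _ in range(iters):
        Q = R + bonus + gamma * P_hat @ V        # optimistic Bellman backup
        V = np.minimum(Q.max(axis=1), 1.0 / (1 - gamma))  # clip to the value range
    return Q
```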
no code implementations • 23 Jun 2020 • Dongruo Zhou, Jiafan He, Quanquan Gu
We propose a novel algorithm that makes use of the feature mapping and obtains a $\tilde O(d\sqrt{T}/(1-\gamma)^2)$ regret, where $d$ is the dimension of the feature space, $T$ is the time horizon and $\gamma$ is the discount factor of the MDP.