NeurIPS 2020 • Lin Yang, Mohammad Hajiesmaili, Mohammad Sadegh Talebi, John C. S. Lui, Wing Shing Wong
We characterize the regret of ExpRb as a function of the corruption budget and show that for the case of a known corruption budget, the regret of ExpRb is tight.
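The source does not specify ExpRb's update rule, so as background only, here is a minimal sketch of the generic Exp3 exponential-weights bandit that algorithms of this family build on; it is not the ExpRb algorithm itself, and the learning rate `eta` and the reward setup are assumptions for illustration.

```python
import numpy as np

def exp3(rewards, eta):
    """Generic Exp3 exponential-weights bandit (illustrative sketch, not ExpRb).

    rewards: (T, K) array of per-arm rewards in [0, 1]; only the pulled
             arm's reward is revealed to the learner each round.
    eta: learning rate (an assumed tuning parameter here).
    Returns the learner's cumulative reward.
    """
    T, K = rewards.shape
    weights = np.ones(K)
    rng = np.random.default_rng(0)  # fixed seed for reproducibility
    total = 0.0
    for t in range(T):
        probs = weights / weights.sum()
        arm = rng.choice(K, p=probs)
        r = rewards[t, arm]
        total += r
        # importance-weighted estimate: nonzero only for the pulled arm
        est = np.zeros(K)
        est[arm] = r / probs[arm]
        weights *= np.exp(eta * est)
        weights /= weights.max()  # renormalize to avoid overflow
    return total
```

With one clearly best arm, the weights concentrate on it and the cumulative reward approaches the best-arm benchmark; corruption-robust variants modify how these importance-weighted estimates are formed.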
9 Sep 2020 • Mohammad Sadegh Talebi, Anders Jonsson, Odalric-Ambrym Maillard
We consider a regret minimization task under the average-reward criterion in an unknown Factored Markov Decision Process (FMDP).
ICML 2020 • Hippolyte Bourel, Odalric-Ambrym Maillard, Mohammad Sadegh Talebi
In pursuit of practical efficiency, we present UCRL3, along the lines of UCRL2 but with two key modifications: First, it uses state-of-the-art time-uniform concentration inequalities to compute confidence sets on the reward and (component-wise) transition distributions for each state-action pair.
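To make the idea of component-wise confidence sets concrete, here is a minimal sketch that builds per-component intervals around the empirical transition distribution of one state-action pair. It uses a simple Hoeffding-style radius as a placeholder; the time-uniform concentration inequalities used by UCRL3 are tighter, and `delta` is an assumed confidence parameter.

```python
import numpy as np

def transition_confidence(counts, delta):
    """Component-wise confidence intervals on next-state probabilities
    for a single state-action pair (illustrative Hoeffding-style sketch,
    not the time-uniform bounds of UCRL3).

    counts: visit counts to each next state under this state-action pair.
    delta: confidence parameter.
    Returns (empirical distribution, lower bounds, upper bounds).
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    p_hat = counts / max(n, 1.0)
    # one Hoeffding interval per component, union bound over components
    radius = np.sqrt(np.log(2 * len(counts) / delta) / (2 * max(n, 1.0)))
    lower = np.clip(p_hat - radius, 0.0, 1.0)
    upper = np.clip(p_hat + radius, 0.0, 1.0)
    return p_hat, lower, upper
```

An optimistic planner then searches over all transition vectors lying inside these component-wise boxes (intersected with the probability simplex), which is what makes smaller confidence sets translate directly into less over-exploration.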
9 Oct 2019 • Mahsa Asadi, Mohammad Sadegh Talebi, Hippolyte Bourel, Odalric-Ambrym Maillard
In the case of an unknown equivalence structure, we show through numerical experiments that C-UCRL combined with ApproxEquivalence outperforms UCRL2 in ergodic MDPs.
NeurIPS 2019 • Mohammad Sadegh Talebi, Odalric-Ambrym Maillard
We study the problem of learning the transition matrices of a set of Markov chains from a single stream of observations on each chain.
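The basic estimator underlying this problem is the empirical transition matrix built from a single trajectory. The sketch below shows it for one chain, with an assumed uniform fallback for states never visited in the stream; the paper's contribution concerns the sample-complexity analysis of such estimates, not this construction itself.

```python
import numpy as np

def estimate_transition_matrix(stream, n_states):
    """Empirical transition matrix of a Markov chain from one observation
    stream (a single trajectory, no resets).

    stream: sequence of visited states, e.g. [0, 1, 0, 2, ...].
    n_states: number of states of the chain.
    """
    counts = np.zeros((n_states, n_states))
    for s, s_next in zip(stream[:-1], stream[1:]):
        counts[s, s_next] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Assumed convention: states never visited get a uniform row,
    # so every row is a valid probability distribution.
    P_hat = np.where(row_sums > 0,
                     counts / np.maximum(row_sums, 1),
                     1.0 / n_states)
    return P_hat
```

Because all transitions come from one unbroken stream, the per-state sample sizes are random and coupled through the chain's mixing behavior, which is precisely what makes the estimation analysis nontrivial.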
5 Mar 2018 • Mohammad Sadegh Talebi, Odalric-Ambrym Maillard
We consider reinforcement learning in an unknown, discrete Markov Decision Process (MDP) under the average-reward criterion, where the learner interacts with the system in a single stream of observations, starting from an initial state and without any reset.