no code implementations • 25 Mar 2024 • Titouan Renard, Andreas Schlaginhaufen, Tingting Ni, Maryam Kamgarpour
Furthermore, with $\mathcal{O}(1/\varepsilon^{4})$ samples we prove that the optimal policy corresponding to the recovered reward is $\varepsilon$-close to the expert policy in total variation distance.
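For reference, the standard notion of total variation distance used in such guarantees can be stated as follows; this is the textbook definition, not a formula taken from the paper itself:

$$
d_{\mathrm{TV}}\big(\pi(\cdot \mid s),\, \pi_E(\cdot \mid s)\big) \;=\; \frac{1}{2} \sum_{a \in \mathcal{A}} \big| \pi(a \mid s) - \pi_E(a \mid s) \big|,
$$

where $\pi$ is the policy optimal for the recovered reward and $\pi_E$ is the expert policy. The $\varepsilon$-closeness claim then means this distance is at most $\varepsilon$ (in the appropriate sense, e.g. uniformly over states or in expectation under a visitation distribution, depending on the paper's exact statement).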
no code implementations • 1 Dec 2023 • Tingting Ni, Maryam Kamgarpour
In particular, in contrast to existing CMDP approaches that ensure policy feasibility only upon convergence, our algorithm guarantees the feasibility of the policies during the learning process and converges to the $\varepsilon$-optimal policy with a sample complexity of $\tilde{\mathcal{O}}(\varepsilon^{-6})$.
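For context, a constrained MDP (CMDP) is typically posed as the following optimization; this is the generic formulation, and the symbols ($r$, $c$, $b$) are standard notation rather than the paper's own:

$$
\max_{\pi} \; V_r(\pi) \quad \text{s.t.} \quad V_c(\pi) \le b,
$$

where $V_r(\pi)$ is the expected cumulative reward, $V_c(\pi)$ the expected cumulative constraint cost, and $b$ the constraint budget. A policy $\pi$ is *feasible* if it satisfies $V_c(\pi) \le b$; the contribution highlighted above is that feasibility holds for the iterates throughout learning, not only for the limit policy.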