no code implementations • 29 Sep 2021 • Damian Boborzi, Christoph-Nikolas Straehle, Jens Stefan Buchner, Lars Mikelsons
Our training objective minimizes the Kullback-Leibler divergence between the policy and expert state-transition trajectories and can be optimized in a non-adversarial fashion.
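
As a rough illustration of what a divergence objective of this kind measures, the sketch below estimates a KL divergence between two distributions over state transitions directly from samples and log-densities, with no discriminator involved. It is not the authors' method. Everything specific here is an assumption made for the example: the diagonal Gaussian densities, the 4-dimensional transition vector, the direction KL(policy || expert), and the helper names `gaussian_log_prob`, `mc_kl`, and `exact_kl`.

```python
import math
import random


def gaussian_log_prob(x, mean, std):
    """Log-density of a diagonal Gaussian evaluated at the vector x."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * s * s) - 0.5 * ((xi - m) / s) ** 2
        for xi, m, s in zip(x, mean, std)
    )


def mc_kl(samples, log_p, log_q):
    """Monte Carlo estimate of KL(p || q) using samples drawn from p."""
    return sum(log_p(x) - log_q(x) for x in samples) / len(samples)


def exact_kl(mean_p, std_p, mean_q, std_q):
    """Closed-form KL(p || q) between two diagonal Gaussians."""
    return sum(
        math.log(sq / sp) + (sp * sp + (mp - mq) ** 2) / (2.0 * sq * sq) - 0.5
        for mp, sp, mq, sq in zip(mean_p, std_p, mean_q, std_q)
    )


# Assumed toy setup: each transition is a concatenated (s, s') vector of size 4.
dim = 4
policy_mean, policy_std = [0.0] * dim, [1.0] * dim
expert_mean, expert_std = [0.5] * dim, [1.5] * dim

# Draw transitions from the (assumed) policy distribution with a fixed seed.
rng = random.Random(0)
transitions = [
    [rng.gauss(m, s) for m, s in zip(policy_mean, policy_std)]
    for _ in range(50_000)
]

kl_estimate = mc_kl(
    transitions,
    lambda x: gaussian_log_prob(x, policy_mean, policy_std),
    lambda x: gaussian_log_prob(x, expert_mean, expert_std),
)
kl_reference = exact_kl(policy_mean, policy_std, expert_mean, expert_std)

print(f"Monte Carlo KL estimate: {kl_estimate:.3f}")
print(f"Closed-form KL:          {kl_reference:.3f}")
```

The closed-form value is included only as a sanity check on the sample-based estimate. In a realistic setting the two densities would be learned models rather than fixed Gaussians, and the estimate would serve as a loss to be minimized with respect to the policy.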