13 Dec 2021 • Pierre Liotet, Francesco Vidaich, Alberto Maria Metelli, Marcello Restelli
This hyper-policy is trained to maximize the estimated future performance, efficiently reusing past data by means of importance sampling, at the cost of introducing a controlled bias.
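As a minimal sketch of the importance-sampling idea invoked here (not the paper's hyper-policy estimator): past data collected under one policy can be reweighted by likelihood ratios to estimate expected performance under another. The Gaussian policies and the reward function below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_pdf(x, mean, std=1.0):
    """Density of N(mean, std^2), used for the likelihood ratio."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def reward(a):
    # Hypothetical smooth reward over a 1-D action.
    return np.tanh(a)

# Past data: actions sampled from a behavior policy N(0, 1).
actions = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Importance weights re-target the samples to a new policy N(0.5, 1).
weights = gaussian_pdf(actions, mean=0.5) / gaussian_pdf(actions, mean=0.0)

# Off-policy estimate of the new policy's expected reward, reusing past data.
is_estimate = np.mean(weights * reward(actions))

# On-policy Monte Carlo estimate for comparison (fresh samples).
mc_estimate = np.mean(reward(rng.normal(loc=0.5, scale=1.0, size=100_000)))
```

The estimator is unbiased here, but in practice the likelihood ratios inflate variance, which is why methods of this kind trade a controlled bias (e.g. via weight regularization) for a lower-variance estimate.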