no code implementations • 1 Nov 2021 • Hsu Kao, Chen-Yu Wei, Vijay Subramanian
For the bandit setting, we propose a hierarchical bandit algorithm that achieves a near-optimal gap-independent regret of $\widetilde{\mathcal{O}}(\sqrt{ABT})$ and a near-optimal gap-dependent regret of $\mathcal{O}(\log(T))$, where $A$ and $B$ are the numbers of actions of the leader and the follower, respectively, and $T$ is the number of steps.
no code implementations • 25 Oct 2021 • Hsu Kao, Vijay Subramanian
Due to information asymmetry, finding optimal policies for Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs) is hard, with the complexity growing doubly exponentially in the horizon length.
Multi-agent Reinforcement Learning, Reinforcement Learning