1 code implementation • 1 Dec 2023 • Jiacheng Yang, Christina Giannoula, Jun Wu, Mostafa Elhoushi, James Gleeson, Gennady Pekhimenko
Minuet proposes to (i) replace the hash tables used in the Map step with a novel segmented sorting double-traversed binary search algorithm that makes heavy use of the on-chip memory hierarchy of GPUs, (ii) use a lightweight scheme to autotune the tile size in the Gather and Scatter operations of the GMaS step, so as to adapt the execution to the particular characteristics of each SC layer, dataset, and GPU architecture, and (iii) employ a padding-efficient GEMM grouping approach that reduces both memory padding and kernel launch overheads.
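A minimal CPU sketch of the general idea behind change (i): resolving kernel-map queries (output coordinate plus kernel offset) by binary search over sorted input coordinates instead of by hash-table lookup. This is a simplified stand-in, not Minuet's algorithm: it omits the segmentation, the double traversal, and the GPU on-chip memory use described above. The function name `build_kernel_map` and its `(offset_index, input_index, output_index)` output layout are hypothetical.

```python
from bisect import bisect_left
from itertools import product


def build_kernel_map(in_coords, out_coords, kernel_size=3):
    """Hypothetical helper: list (offset_index, input_index, output_index)
    triples for a sparse convolution, using sorted search, not a hash table."""
    if not in_coords:
        return []
    # Sort input coordinates once (tuples compare lexicographically),
    # remembering each coordinate's original index.
    order = sorted(range(len(in_coords)), key=lambda i: in_coords[i])
    sorted_coords = [in_coords[i] for i in order]
    # Enumerate all kernel offsets, e.g. 3x3 = 9 offsets in 2D.
    r = kernel_size // 2
    offsets = list(product(range(-r, r + 1), repeat=len(in_coords[0])))
    kmap = []
    for k, off in enumerate(offsets):
        for j, out in enumerate(out_coords):
            query = tuple(o + d for o, d in zip(out, off))
            # Binary search replaces the hash-table lookup.
            pos = bisect_left(sorted_coords, query)
            if pos < len(sorted_coords) and sorted_coords[pos] == query:
                kmap.append((k, order[pos], j))
    return kmap


coords = [(0, 0), (0, 1), (2, 2)]
print(build_kernel_map(coords, coords))
```

With a 3×3 kernel, offset index 4 is the center `(0, 0)`, so every active site maps to itself under that offset; the two neighboring sites `(0, 0)` and `(0, 1)` also map to each other under offsets 3 and 5. Sorting once and searching many times is what makes it natural to batch and reorder queries, which is the direction Minuet takes on the GPU.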
no code implementations • 15 Jul 2022 • James Gleeson, Daniel Snider, Yvonne Yang, Moshe Gabel, Eyal de Lara, Gennady Pekhimenko
We show that simulator kernel fusion speedups with a simple simulator are $11.3\times$ and increase by up to $1024\times$ as simulator complexity increases in terms of memory bandwidth requirements.
1 code implementation • 8 Feb 2021 • James Gleeson, Srivatsan Krishnan, Moshe Gabel, Vijay Janapa Reddi, Eyal de Lara, Gennady Pekhimenko
Deep reinforcement learning (RL) has made groundbreaking advancements in robotics, data center management and other applications.