1 code implementation • NeurIPS 2021 • R David Evans, Tor Aamodt
Parallel hardware devices (e.g., graphics processing units) have limited high-bandwidth memory capacity. This negatively impacts deep neural network (DNN) training by increasing runtime and/or decreasing accuracy when the model and/or batch size is reduced to fit within this capacity.
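As a rough illustration of why limited capacity forces these trade-offs, the sketch below estimates training memory as a function of batch size and finds the largest batch that fits a given budget. The layer sizes, parameter count, and 16 GB budget are hypothetical assumptions for illustration only, not figures or methods from the paper.

```python
# Hypothetical back-of-the-envelope memory estimate. All layer sizes,
# the parameter count, and the 16 GiB budget are illustrative assumptions.

BYTES_PER_FLOAT32 = 4
MEMORY_BUDGET_BYTES = 16 * 1024**3           # e.g., a 16 GiB GPU (assumption)

# Per-sample activation element counts for a made-up CNN.
activations_per_sample = [224*224*64, 112*112*128, 56*56*256, 28*28*512]
weight_elements = 25_000_000                  # hypothetical parameter count

def training_memory_bytes(batch_size):
    """Rough estimate: weights + gradients + activations stored for backprop."""
    weights = weight_elements * BYTES_PER_FLOAT32
    gradients = weights                        # one gradient value per weight
    activations = sum(activations_per_sample) * batch_size * BYTES_PER_FLOAT32
    return weights + gradients + activations

# Largest batch size that still fits the budget; beyond this, the batch
# (or the model) must shrink, affecting runtime and/or accuracy.
batch = 1
while training_memory_bytes(batch + 1) <= MEMORY_BUDGET_BYTES:
    batch += 1
print(f"Largest batch that fits: {batch} "
      f"({training_memory_bytes(batch) / 1024**3:.1f} GiB used)")
```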
13 code implementations • 19 Nov 2018 • Md Aamir Raihan, Negar Goli, Tor Aamodt
The efficacy of deep learning has made it one of the most important applications run in data centers today.
Mathematical Software • Hardware Architecture
13 code implementations • 18 Nov 2018 • Jonathan Lew, Deval Shah, Suchita Pati, Shaylin Cattell, Mengchi Zhang, Amruth Sandhupatla, Christopher Ng, Negar Goli, Matthew D. Sinclair, Timothy G. Rogers, Tor Aamodt
Most deep neural networks deployed today are trained using GPUs via high-level frameworks such as TensorFlow and PyTorch.
Distributed, Parallel, and Cluster Computing
9 code implementations • 16 Oct 2018 • Mahmoud Khairy, Jain Akshay, Tor Aamodt, Timothy G. Rogers
Our enhanced GPU model is able to describe the NVIDIA Volta architecture in sufficient detail to reduce error in memory system event counters by as much as 66X.
Hardware Architecture
no code implementations • 17 Nov 2015 • Patrick Judd, Jorge Albericio, Tayler Hetherington, Tor Aamodt, Natalie Enright Jerger, Raquel Urtasun, Andreas Moshovos
A diverse set of CNNs is analyzed, showing that, compared to a conventional implementation using a 32-bit floating-point representation for all layers, and with less than 1% loss in relative accuracy, the data footprint required by these networks can be reduced by an average of 74% and up to 92%.
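A minimal sketch of the footprint arithmetic behind a per-layer reduced-precision result of this kind; the layer names, value counts, and assigned bit widths below are invented for illustration and are not the configurations or the method reported in the paper.

```python
# Illustrative per-layer reduced precision (assumed example, not the paper's
# exact scheme): each layer stores values at its own bit width, and the total
# data footprint is compared against a uniform 32-bit floating-point baseline.

layers = [
    # (name, number of stored values, assigned bit width) -- all hypothetical
    ("conv1", 1_000_000, 10),
    ("conv2", 4_000_000, 8),
    ("conv3", 8_000_000, 6),
    ("fc",    16_000_000, 4),
]

baseline_bits = sum(n * 32 for _, n, _ in layers)        # 32-bit float everywhere
reduced_bits = sum(n * bits for _, n, bits in layers)    # per-layer precision

savings = 1 - reduced_bits / baseline_bits
print(f"Footprint reduced by {savings:.0%} relative to the 32-bit baseline")
```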