no code implementations • 5 Apr 2024 • Zishen Wan, Che-Kai Liu, Mohamed Ibrahim, Hanchen Yang, Samuel Spetalnick, Tushar Krishna, Arijit Raychowdhury
Disentangling attributes of various sensory signals is central to human-like perception and reasoning and a critical task for higher-order cognitive and neuro-symbolic AI systems.
no code implementations • 12 Mar 2024 • Geonhwa Jeong, Po-An Tsai, Abhimanyu R. Bambhaniya, Stephen W. Keckler, Tushar Krishna
Next, we develop a software framework, TASDER, to accelerate DNNs by searching for layer-wise, high-quality structured decompositions of both weight and activation tensors so that they can be accelerated by any system with structured sparse hardware support.
1 code implementation • 8 Mar 2024 • Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, Tuo Zhao
Key-value (KV) caching has become the de facto technique for accelerating generation speed in large language model (LLM) inference.
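As a rough illustration of that premise, the sketch below shows KV caching for single-head autoregressive attention (numpy, illustrative shapes); it is not the method proposed in this paper.

```python
# Minimal sketch of key-value (KV) caching in autoregressive decoding.
# Illustrative only: single head, toy dimensions, not this paper's technique.
import numpy as np

d = 64  # head dimension (illustrative)

def decode_step(q, k_new, v_new, k_cache, v_cache):
    """One decoding step: append the new token's K/V to the cache and attend
    over the whole cache instead of recomputing K/V for all previous tokens."""
    k_cache = np.vstack([k_cache, k_new])            # (t, d)
    v_cache = np.vstack([v_cache, v_new])            # (t, d)
    scores = (q @ k_cache.T) / np.sqrt(d)            # (1, t)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs @ v_cache, k_cache, v_cache         # output (1, d) plus updated caches

k_cache, v_cache = np.zeros((0, d)), np.zeros((0, d))
for _ in range(4):                                   # a few decoding steps
    q, k, v = (np.random.randn(1, d) for _ in range(3))
    out, k_cache, v_cache = decode_step(q, k, v, k_cache, v_cache)
```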
1 code implementation • 8 Mar 2024 • Akshat Ramachandran, Zishen Wan, Geonhwa Jeong, John Gustafson, Tushar Krishna
Traditional Deep Neural Network (DNN) quantization methods using integer, fixed-point, or floating-point data types struggle to capture diverse DNN parameter distributions at low precision, and often require large silicon overhead and intensive quantization-aware training.
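For context, the sketch below shows the kind of symmetric uniform integer quantization the abstract refers to as a baseline; the bit-width and per-tensor scaling are illustrative, not the paper's proposed data type.

```python
# Illustrative symmetric uniform integer quantization (the baseline class of methods
# the abstract critiques), not the paper's proposed scheme; numpy only.
import numpy as np

def quantize_int(w, bits=4):
    """Quantize a weight tensor to signed integers and return the dequantized copy."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax                        # per-tensor scale factor
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q * scale                                      # values actually used in compute

w = np.random.randn(256, 256) * 0.05
err = np.abs(w - quantize_int(w, bits=4)).mean()          # error grows at low precision
```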
1 code implementation • 7 Feb 2024 • Abhimanyu Rajeshkumar Bambhaniya, Amir Yazdanbakhsh, Suvinay Subramanian, Sheng-Chun Kao, Shivani Agrawal, Utku Evci, Tushar Krishna
In this work, we study the effectiveness of existing sparse training recipes at high-sparsity regions and argue that these methods fail to sustain the model quality on par with low-sparsity regions.
no code implementations • 2 Jan 2024 • Zishen Wan, Che-Kai Liu, Hanchen Yang, Chaojian Li, Haoran You, Yonggan Fu, Cheng Wan, Tushar Krishna, Yingyan Lin, Arijit Raychowdhury
The remarkable advancements in artificial intelligence (AI), primarily driven by deep neural networks, have significantly impacted various aspects of our lives.
no code implementations • 21 Jun 2023 • Payman Behnam, Jianming Tong, Alind Khare, Yangyu Chen, Yue Pan, Pranav Gadikar, Abhimanyu Rajeshkumar Bambhaniya, Tushar Krishna, Alexey Tumanov
For the stream of queries, SUSHI yields up to a 25% improvement in latency and a 0.98% increase in served accuracy.
1 code implementation • 23 May 2023 • Srinivas Sridharan, Taekyung Heo, Louis Feng, Zhaodong Wang, Matt Bergeron, Wenyin Fu, Shengbao Zheng, Brian Coutinho, Saeed Rashidi, Changhai Man, Tushar Krishna
Benchmarking and co-design are essential for driving optimizations and innovation around ML models, ML software, and next-generation hardware.
no code implementations • 11 Apr 2023 • William Won, Midhilesh Elavazhagan, Sudarshan Srinivasan, Ajaya Durg, Samvit Kaul, Swati Gupta, Tushar Krishna
To this end, this paper introduces TACOS, an automated synthesizer that generates topology-aware collective algorithms for common distributed machine learning collectives across arbitrary input network topologies.
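For context, the sketch below simulates the classic ring all-reduce, one hand-designed, topology-specific collective that synthesizers such as TACOS aim to generalize across arbitrary topologies; it is not the paper's synthesized algorithm.

```python
# Single-process simulation of ring all-reduce over P "nodes" (numpy).
# A fixed-topology baseline shown for illustration, not TACOS's output.
import numpy as np

def ring_allreduce(data):
    """data[p] is node p's full vector; each is split into P slices internally."""
    P = len(data)
    chunks = [np.array_split(d.copy(), P) for d in data]
    for t in range(P - 1):                         # reduce-scatter phase
        for p in range(P):
            s = (p - 1 - t) % P                    # slice received from the left neighbor
            chunks[p][s] = chunks[p][s] + chunks[(p - 1) % P][s]
    for t in range(P - 1):                         # all-gather phase
        for p in range(P):
            s = (p - t) % P                        # fully reduced slice propagates around the ring
            chunks[p][s] = chunks[(p - 1) % P][s].copy()
    return [np.concatenate(c) for c in chunks]

vecs = [np.random.randn(16) for _ in range(4)]
out = ring_allreduce(vecs)
assert np.allclose(out[0], sum(vecs)) and np.allclose(out[0], out[3])
```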
no code implementations • 7 Apr 2023 • Maruti K. Mudunuru, James A. Ang, Mahantesh Halappanavar, Simon D. Hammond, Maya B. Gokhale, James C. Hoe, Tushar Krishna, Sarat S. Sreepathi, Matthew R. Norman, Ivy B. Peng, Philip W. Jones
This paper discusses the topic of the 'AI Architectures and Co-design' session and associated outcomes.
3 code implementations • 24 Mar 2023 • William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, Tushar Krishna
In this paper, we extend the open-source ASTRA-sim infrastructure and endow it with the capabilities to model state-of-the-art and emerging distributed training models and platforms.
no code implementations • 17 Feb 2023 • Geonhwa Jeong, Sana Damani, Abhimanyu Rajeshkumar Bambhaniya, Eric Qin, Christopher J. Hughes, Sreenivas Subramoney, Hyesoon Kim, Tushar Krishna
Therefore, as DL workloads embrace sparsity to reduce the computations and memory size of models, it is also imperative for CPUs to add support for sparsity to avoid under-utilization of the dense matrix engine and inefficient usage of the caches and registers.
no code implementations • 30 Nov 2022 • Divya Kiran Kadiyala, Saeed Rashidi, Taekyung Heo, Abhimanyu Rajeshkumar Bambhaniya, Tushar Krishna, Alexandros Daglis
To facilitate the design space exploration of such massive DL training clusters, we introduce COMET, a holistic cluster design methodology and workflow to jointly study the impact of parallelization strategies and key cluster resource provisioning on the performance of distributed DL training.
no code implementations • 16 Nov 2022 • Hyoukjun Kwon, Krishnakumar Nair, Jamin Seo, Jason Yik, Debabrata Mohapatra, Dongyuan Zhan, Jinook Song, Peter Capak, Peizhao Zhang, Peter Vajda, Colby Banbury, Mark Mazumder, Liangzhen Lai, Ashish Sirasao, Tushar Krishna, Harshit Khaitan, Vikas Chandra, Vijay Janapa Reddi
We hope that our work will stimulate research and lead to the development of a new generation of ML systems for XR use cases.
1 code implementation • 7 Oct 2022 • Sheng-Chun Kao, Angshuman Parashar, Po-An Tsai, Tushar Krishna
Map Space Exploration is the problem of finding optimized mappings of a Deep Neural Network (DNN) model on an accelerator.
no code implementations • 15 Sep 2022 • Sheng-Chun Kao, Amir Yazdanbakhsh, Suvinay Subramanian, Shivani Agrawal, Utku Evci, Tushar Krishna
In this work, we focus on N:M sparsity and extensively study and evaluate various training recipes for N:M sparsity in terms of the trade-off between model accuracy and compute cost (FLOPs).
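As a point of reference, the sketch below applies a 2:4 (N:M) magnitude mask to a weight matrix; the training recipes studied in the paper are not reproduced here.

```python
# Minimal sketch of applying an N:M structured-sparsity mask (here 2:4), numpy only.
# Illustrative masking step, not the paper's training recipes.
import numpy as np

def nm_prune(w, n=2, m=4):
    """Keep the n largest-magnitude values in every contiguous group of m weights."""
    flat = w.reshape(-1, m)
    # indices of the (m - n) smallest-magnitude entries in each group
    drop = np.argsort(np.abs(flat), axis=1)[:, : m - n]
    mask = np.ones_like(flat)
    np.put_along_axis(mask, drop, 0.0, axis=1)
    return (flat * mask).reshape(w.shape)

w = np.random.randn(8, 16)
w_sparse = nm_prune(w)        # every group of 4 weights now has exactly 2 non-zeros
```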
no code implementations • 22 Jul 2022 • Tarannum Khan, Saeed Rashidi, Srinivas Sridharan, Pallavi Shurpali, Aditya Akella, Tushar Krishna
Our results indicate that previously proposed RoCE congestion control schemes have little impact on the end-to-end performance of training workloads, motivating the necessity of designing an optimized, yet low-overhead, congestion control scheme based on the characteristics of distributed training platforms and workloads.
2 code implementations • 26 Jan 2022 • Sheng-Chun Kao, Michael Pellauer, Angshuman Parashar, Tushar Krishna
The design of DNN accelerators includes two key parts: HW resource configuration and mapping strategy.
no code implementations • 26 Jan 2022 • Sheng-Chun Kao, Xiaoyu Huang, Tushar Krishna
Dataflow/mapping decides the compute and energy efficiency of DNN accelerators.
no code implementations • 9 Oct 2021 • Saeed Rashidi, William Won, Sudarshan Srinivasan, Srinivas Sridharan, Tushar Krishna
Distributed training is a solution to reduce DNN training time by splitting the task across multiple NPUs (e.g., GPU/TPU).
no code implementations • 5 Oct 2021 • Geonhwa Jeong, Eric Qin, Ananda Samajdar, Christopher J. Hughes, Sreenivas Subramoney, Hyesoon Kim, Tushar Krishna
As AI-based applications become pervasive, CPU vendors are starting to incorporate matrix engines within the datapath to boost efficiency.
no code implementations • 24 Sep 2021 • William Won, Saeed Rashidi, Sudarshan Srinivasan, Tushar Krishna
As model sizes in machine learning continue to scale, distributed training is necessary to accommodate model weights within each device and to reduce training time.
no code implementations • 15 Sep 2021 • Geonhwa Jeong, Gokcen Kestor, Prasanth Chatarasi, Angshuman Parashar, Po-An Tsai, Sivasankaran Rajamanickam, Roberto Gioiosa, Tushar Krishna
The algorithms and accelerator cost models are connected via a novel mapping abstraction that captures the map space of spatial accelerators which can be systematically pruned based on constraints from the hardware, workload, and mapper.
no code implementations • 16 Aug 2021 • Ananda Samajdar, Jan Moritz Joseph, Matthew Denton, Tushar Krishna
We design and train a custom network architecture called AIRCHITECT, which is capable of learning the architecture design space with up to 94.3% test accuracy and predicting optimal configurations that achieve, on average (GeoMean), 99.9% of the best possible performance on a test dataset with $10^5$ GEMM workloads.
no code implementations • 13 Jul 2021 • Sheng-Chun Kao, Suvinay Subramanian, Gaurav Agrawal, Amir Yazdanbakhsh, Tushar Krishna
In contrast, FLAT unblocks transformer models for inputs with up to 64K elements.
no code implementations • 19 Jun 2021 • Gordon E. Moon, Hyoukjun Kwon, Geonhwa Jeong, Prasanth Chatarasi, Sivasankaran Rajamanickam, Tushar Krishna
There is a growing interest in custom spatial accelerators for machine learning applications.
no code implementations • 28 Apr 2021 • Sheng-Chun Kao, Tushar Krishna
In particular, we focus on the problem of mapping jobs from several DNNs simultaneously on an accelerator.
no code implementations • 12 Jan 2021 • Ananda Samajdar, Michael Pellauer, Tushar Krishna
We demonstrate an instance of SARA with an accelerator we call SAGAR, which introduces a novel reconfigurable systolic array that can be configured to work as a distributed collection of smaller arrays of various sizes or as a single array with flexible aspect ratios.
1 code implementation • 4 Sep 2020 • Sheng-Chun Kao, Geonhwa Jeong, Tushar Krishna
We also augment the RL approach with a genetic algorithm for further fine-tuning.
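As an illustration of the idea, the toy sketch below fine-tunes a seed configuration (e.g., one proposed by an RL agent) with a simple genetic algorithm; the fitness function and parameter ranges are invented for illustration and are not the paper's tuner.

```python
# Toy genetic-algorithm fine-tuning of a hardware-resource configuration, starting
# from a seed (e.g., an RL proposal). Fitness and ranges are made up for illustration.
import random

def fitness(cfg):                       # stand-in objective: prefer ~balanced resources
    pes, buf = cfg
    return -abs(pes * buf - 4096) - 0.01 * (pes + buf)

def ga_finetune(seed, generations=50, pop_size=16, mut=0.3):
    pop = [seed] + [tuple(max(1, g + random.randint(-8, 8)) for g in seed)
                    for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]                           # selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            child = tuple(random.choice(g) for g in zip(a, b))   # uniform crossover
            if random.random() < mut:                            # mutation
                child = tuple(max(1, g + random.randint(-4, 4)) for g in child)
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = ga_finetune(seed=(64, 32))       # e.g., (num_PEs, buffer_size) from an RL agent
```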
no code implementations • 27 Aug 2020 • Parth Mannan, Ananda Samajdar, Tushar Krishna
The true impact of AI can only be fully realized if we can have AI agents continuously interacting with the real world and solving everyday problems.
no code implementations • 19 Aug 2020 • Afshin Abdi, Saeed Rashidi, Faramarz Fekri, Tushar Krishna
In this paper, we consider the parallel implementation of an already-trained deep model on multiple processing nodes (a.k.a.
no code implementations • 7 Jul 2020 • Jason Lowe-Power, Abdul Mutaal Ahmad, Ayaz Akram, Mohammad Alian, Rico Amslinger, Matteo Andreozzi, Adrià Armejach, Nils Asmussen, Brad Beckmann, Srikant Bharadwaj, Gabe Black, Gedare Bloom, Bobby R. Bruce, Daniel Rodrigues Carvalho, Jeronimo Castrillon, Lizhong Chen, Nicolas Derumigny, Stephan Diestelhorst, Wendy Elsasser, Carlos Escuin, Marjan Fariborz, Amin Farmahini-Farahani, Pouya Fotouhi, Ryan Gambord, Jayneel Gandhi, Dibakar Gope, Thomas Grass, Anthony Gutierrez, Bagus Hanindhito, Andreas Hansson, Swapnil Haria, Austin Harris, Timothy Hayes, Adrian Herrera, Matthew Horsnell, Syed Ali Raza Jafri, Radhika Jagtap, Hanhwi Jang, Reiley Jeyapaul, Timothy M. Jones, Matthias Jung, Subash Kannoth, Hamidreza Khaleghzadeh, Yuetsu Kodama, Tushar Krishna, Tommaso Marinelli, Christian Menard, Andrea Mondelli, Miquel Moreto, Tiago Mück, Omar Naji, Krishnendra Nathella, Hoa Nguyen, Nikos Nikoleris, Lena E. Olson, Marc Orr, Binh Pham, Pablo Prieto, Trivikram Reddy, Alec Roelke, Mahyar Samani, Andreas Sandberg, Javier Setoain, Boris Shingarov, Matthew D. Sinclair, Tuan Ta, Rahul Thakur, Giacomo Travaglini, Michael Upton, Nilay Vaish, Ilias Vougioukas, William Wang, Zhengrong Wang, Norbert Wehn, Christian Weis, David A. Wood, Hongil Yoon, Éder F. Zulian
The open-source and community-supported gem5 simulator is one of the most popular tools for computer architecture research.
Hardware Architecture
1 code implementation • 10 Jun 2020 • Francisco Muñoz-Martínez, José L. Abellán, Manuel E. Acacio, Tushar Krishna
The design of specialized architectures for accelerating the inference procedure of Deep Neural Networks (DNNs) is a booming area of research nowadays.
no code implementations • 6 Jun 2020 • Sheng-Chun Kao, Arun Ramamurthy, Tushar Krishna
We propose a new way for autonomous quantization and HW-aware tuning.
no code implementations • 6 Jun 2020 • Sheng-Chun Kao, Arun Ramamurthy, Reed Williams, Tushar Krishna
Designing resource-efficient Deep Neural Networks (DNNs) is critical to deploy deep learning solutions over edge platforms due to diverse performance, power, and memory budgets.
no code implementations • 18 Feb 2020 • Prasanth Chatarasi, Hyoukjun Kwon, Natesh Raina, Saurabh Malik, Vaisakh Haridas, Angshuman Parashar, Michael Pellauer, Tushar Krishna, Vivek Sarkar
Searching for the optimal mappings is challenging because of the large space of mappings, and this challenge gets exacerbated with new operators and diverse accelerator configurations. To address this challenge, we propose a decoupled off-chip/on-chip approach that decomposes the mapping space into off-chip and on-chip subspaces, and first optimizes the off-chip subspace followed by the on-chip subspace.
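As a toy illustration of the decoupled idea, the sketch below first searches tile sizes against a DRAM-traffic proxy (off-chip subspace) and then searches the loop order within the chosen tile (on-chip subspace); the cost proxies and parameter values are invented and are not the paper's cost model.

```python
# Two-stage, decoupled mapping search for a tiled GEMM. Cost proxies are illustrative.
from itertools import permutations

M, N, K, SRAM = 512, 512, 512, 16 * 1024      # GEMM dims, on-chip buffer (elements)

def dram_traffic(tm, tn, tk):                 # proxy: how often A, B, C tiles are refilled
    return (M * K) * (N // tn) + (K * N) * (M // tm) + 2 * M * N * (K // tk)

def fits(tm, tn, tk):                         # all three tiles must fit on chip
    return tm * tk + tk * tn + tm * tn <= SRAM

# Step 1: off-chip subspace -- choose tile sizes
tiles = [(tm, tn, tk) for tm in (32, 64, 128) for tn in (32, 64, 128)
         for tk in (32, 64, 128) if fits(tm, tn, tk)]
best_tile = min(tiles, key=lambda t: dram_traffic(*t))

# Step 2: on-chip subspace -- choose loop order within the chosen tile
def reuse_score(order, tile):                 # crude proxy: larger innermost loop, more reuse
    dims = dict(zip("mnk", tile))
    return dims[order[-1]]

best_order = max(permutations("mnk"), key=lambda o: reuse_score(o, best_tile))
```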
no code implementations • 10 Feb 2020 • Lei Yang, Zheyu Yan, Meng Li, Hyoukjun Kwon, Liangzhen Lai, Tushar Krishna, Vikas Chandra, Weiwen Jiang, Yiyu Shi
Neural Architecture Search (NAS) has demonstrated its power on various AI accelerating platforms such as Field Programmable Gate Arrays (FPGAs) and Graphic Processing Units (GPUs).
no code implementations • 13 Sep 2019 • Hyoukjun Kwon, Liangzhen Lai, Tushar Krishna, Vikas Chandra
The results suggest that HDAs are an alternative class of Pareto-optimal accelerators to RDAs, with particular strength in energy efficiency, and can be a better choice than RDAs depending on the use case.
Distributed, Parallel, and Cluster Computing
2 code implementations • 13 Aug 2019 • Sheng-Chun Kao, Chao-Han Huck Yang, Pin-Yu Chen, Xiaoli Ma, Tushar Krishna
In this work, we demonstrate the promise of applying reinforcement learning (RL) to optimize NoC runtime performance.
8 code implementations • 16 Oct 2018 • Ananda Samajdar, Yuhao Zhu, Paul Whatmough, Matthew Mattina, Tushar Krishna
Systolic Arrays are one of the most popular compute substrates within Deep Learning accelerators today, as they provide extremely high efficiency for running dense matrix multiplications.
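For a rough sense of why systolic arrays map well to GEMM, the sketch below gives a first-order cycle estimate for an output-stationary array; it is a deliberate simplification and not the paper's simulator model.

```python
# Rough, first-order cycle estimate for a dense GEMM on an R x C output-stationary
# systolic array. Simplified illustration only, not the paper's cost model.
import math

def gemm_cycles(M, N, K, rows=128, cols=128):
    """Tile the M x N output over the array; each tile streams K partial sums plus
    roughly (rows + cols) cycles of pipeline fill/drain latency."""
    tiles = math.ceil(M / rows) * math.ceil(N / cols)
    return tiles * (K + rows + cols - 2)

print(gemm_cycles(M=1024, N=1024, K=1024))   # e.g., a square GEMM on a 128x128 array
```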
Distributed, Parallel, and Cluster Computing • Hardware Architecture
no code implementations • 3 Aug 2018 • Ananda Samajdar, Parth Mannan, Kartikay Garg, Tushar Krishna
EvE can evolve the topology and weights of neural networks completely in hardware for the task at hand, without requiring hand-optimization or backpropagation training.
no code implementations • 4 May 2018 • Hyoukjun Kwon, Prasanth Chatarasi, Michael Pellauer, Angshuman Parashar, Vivek Sarkar, Tushar Krishna
The data partitioning and scheduling strategies used by DNN accelerators to leverage reuse and perform staging are known as dataflow, and they directly impact the performance and energy efficiency of DNN accelerator designs.
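To make the notion of dataflow concrete, the toy sketch below writes a 1-D convolution as a "weight-stationary" loop nest, where each weight stays resident while it is reused across all the outputs it contributes to; the example is illustrative and not the paper's formal dataflow description.

```python
# Toy "weight-stationary" loop nest for a 1-D convolution: the outer loop pins one
# weight while the inner loop streams inputs past it (temporal reuse of the weight).
# Illustrative only, not the paper's dataflow notation.
import numpy as np

def conv1d_weight_stationary(x, w):
    out = np.zeros(len(x) - len(w) + 1)
    for k, wk in enumerate(w):        # outer loop over weights: each weight is reused
        for i in range(len(out)):     # across every output it contributes to
            out[i] += wk * x[i + k]
    return out

x, w = np.random.randn(32), np.random.randn(3)
assert np.allclose(conv1d_weight_stationary(x, w), np.convolve(x, w[::-1], mode="valid"))
```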