no code implementations • 25 Apr 2024 • Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu
We present LayerSkip, an end-to-end solution to speed up inference of large language models (LLMs).
no code implementations • 22 Feb 2024 • Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, Vikas Chandra
The resultant models, denoted as MobileLLM-LS, demonstrate a further accuracy improvement of 0.7%/0.8% over MobileLLM 125M/350M.
no code implementations • 20 Feb 2024 • Yang Li, Yuan Shangguan, Yuhao Wang, Liangzhen Lai, Ernie Chang, Changsheng Zhao, Yangyang Shi, Vikas Chandra
This study delves into how weight parameters in speech recognition models influence the overall power consumption of these models.
no code implementations • 14 Sep 2023 • Yang Li, Liangzhen Lai, Yuan Shangguan, Forrest N. Iandola, Zhaoheng Ni, Ernie Chang, Yangyang Shi, Vikas Chandra
Instead, the bottleneck lies in the linear projection layers of multi-head attention and feedforward networks, constituting a substantial portion of the model size and contributing significantly to computation, memory, and power usage.
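As a rough illustration of that claim, the sketch below counts the parameters of the attention projections and feed-forward layers in a single standard transformer block; the hidden size and 4x expansion factor are illustrative assumptions, not values taken from the paper.

```python
# Rough parameter count for one standard transformer block, illustrating why
# the linear projection layers dominate model size. The hidden dimension and
# FFN expansion factor below are assumed, illustrative values.
d_model = 768          # hidden dimension (assumed)
d_ff = 4 * d_model     # feed-forward inner dimension (common 4x expansion)

# Multi-head attention: Q, K, V, and output projections, each d_model x d_model.
attn_proj_params = 4 * d_model * d_model

# Feed-forward network: two linear layers, d_model -> d_ff -> d_model.
ffn_params = 2 * d_model * d_ff

# Everything else in the block (layer-norm scales/shifts) is comparatively tiny.
other_params = 4 * d_model

total = attn_proj_params + ffn_params + other_params
print(f"attention projections: {attn_proj_params / total:.1%} of block parameters")
print(f"feed-forward layers:   {ffn_params / total:.1%} of block parameters")
```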
1 code implementation • 26 Jan 2023 • Maximilian Lam, Jeff Johnson, Wenjie Xiong, Kiwan Maeng, Udit Gupta, Yang Li, Liangzhen Lai, Ilias Leontiadis, Minsoo Rhu, Hsien-Hsin S. Lee, Vijay Janapa Reddi, Gu-Yeon Wei, David Brooks, G. Edward Suh
Together, for various on-device ML applications such as recommendation and language modeling, our system on a single V100 GPU can serve up to $100,000$ queries per second -- a $>100 \times$ throughput improvement over a CPU-based baseline -- while maintaining model accuracy.
no code implementations • 7 Dec 2022 • Seah Kim, Hyoukjun Kwon, Jinook Song, Jihyuck Jo, Yu-Hsin Chen, Liangzhen Lai, Vikas Chandra
Such dynamic behaviors introduce new challenges to the system software in an ML system since the overall system load is not completely predictable, unlike traditional ML workloads.
no code implementations • 16 Nov 2022 • Hyoukjun Kwon, Krishnakumar Nair, Jamin Seo, Jason Yik, Debabrata Mohapatra, Dongyuan Zhan, Jinook Song, Peter Capak, Peizhao Zhang, Peter Vajda, Colby Banbury, Mark Mazumder, Liangzhen Lai, Ashish Sirasao, Tushar Krishna, Harshit Khaitan, Vikas Chandra, Vijay Janapa Reddi
We hope that our work will stimulate research and lead to the development of a new generation of ML systems for XR use cases.
no code implementations • 2 Nov 2021 • Cole Hawkins, Haichuan Yang, Meng Li, Liangzhen Lai, Vikas Chandra
Low-rank tensor compression has been proposed as a promising approach to reduce the memory and compute requirements of neural networks for their deployment on edge devices.
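A minimal sketch of the underlying idea, using a plain rank-r factorization of a single linear layer (the paper considers more general low-rank tensor decompositions); the shapes and rank below are illustrative assumptions.

```python
# Minimal sketch of low-rank compression: replace a dense d_out x d_in linear
# layer with two rank-r factors. Shapes and rank are illustrative; the paper
# works with more general low-rank tensor decompositions.
import torch
import torch.nn as nn

d_in, d_out, rank = 1024, 1024, 64

dense = nn.Linear(d_in, d_out)                 # full d_out x d_in weight
low_rank = nn.Sequential(
    nn.Linear(d_in, rank, bias=False),         # d_in -> r
    nn.Linear(rank, d_out),                    # r    -> d_out
)

def num_params(m):
    return sum(p.numel() for p in m.parameters())

x = torch.randn(8, d_in)
y = low_rank(x)                                # same interface as dense(x)
print(f"dense: {num_params(dense):,} params, "
      f"low-rank: {num_params(low_rank):,} params "
      f"({num_params(low_rank) / num_params(dense):.1%} of dense)")
```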
1 code implementation • CVPR 2022 • Jiaqi Gu, Hyoukjun Kwon, Dilin Wang, Wei Ye, Meng Li, Yu-Hsin Chen, Liangzhen Lai, Vikas Chandra, David Z. Pan
Therefore, we propose HRViT, which enhances ViTs to learn semantically-rich and spatially-precise multi-scale representations by integrating high-resolution multi-branch architectures with ViTs.
Ranked #25 on Semantic Segmentation on Cityscapes val
no code implementations • 13 Feb 2020 • Meng Li, Yilei Li, Pierce Chuang, Liangzhen Lai, Vikas Chandra
Neural network accelerators are a key enabler for on-device AI inference, for which energy efficiency is an important metric.
no code implementations • 10 Feb 2020 • Lei Yang, Zheyu Yan, Meng Li, Hyoukjun Kwon, Liangzhen Lai, Tushar Krishna, Vikas Chandra, Weiwen Jiang, Yiyu Shi
Neural Architecture Search (NAS) has demonstrated its power on various AI acceleration platforms such as Field Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs).
no code implementations • 13 Sep 2019 • Hyoukjun Kwon, Liangzhen Lai, Tushar Krishna, Vikas Chandra
The results suggest that HDAs form an alternative class of Pareto-optimal accelerators to RDAs, with particular strength in energy efficiency, and can be the better choice depending on the use case.
Distributed, Parallel, and Cluster Computing
no code implementations • 20 Jun 2018 • Liangzhen Lai, Naveen Suda
Machine learning (ML), especially deep learning, is made possible by the availability of big data, enormous compute power and, often overlooked, development tools or frameworks.
2 code implementations • 2 Jun 2018 • Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, Vikas Chandra
Experiments show that accuracy can be increased by 30% for the CIFAR-10 dataset with only 5% globally shared data.
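A hedged sketch of the data-sharing strategy described above: each client trains on its label-skewed local shard plus a small globally shared subset. The helper below is a hypothetical illustration of the split, not the authors' code; only the 5% shared fraction comes from the sentence above.

```python
# Sketch of the data-sharing idea: every client receives its non-IID local
# shard plus the same small globally shared subset drawn from a common pool.
import numpy as np

def build_client_datasets(labels, num_clients, shared_fraction=0.05, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))
    n_shared = int(shared_fraction * len(labels))
    shared_idx, private_idx = idx[:n_shared], idx[n_shared:]

    # Non-IID split: sort the private portion by label so each client sees
    # only a few classes (a common way to simulate label skew).
    private_idx = private_idx[np.argsort(labels[private_idx], kind="stable")]
    shards = np.array_split(private_idx, num_clients)

    # Every client gets its skewed shard plus the globally shared subset.
    return [np.concatenate([shard, shared_idx]) for shard in shards]

# Usage with dummy labels standing in for CIFAR-10 (only indices matter here).
labels = np.random.randint(0, 10, size=50_000)
client_indices = build_client_datasets(labels, num_clients=10)
print([len(c) for c in client_indices])
```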
1 code implementation • 19 Jan 2018 • Liangzhen Lai, Naveen Suda, Vikas Chandra
Deep Neural Networks are becoming increasingly popular in always-on IoT edge devices performing data analytics right at the source, reducing latency as well as energy consumption for data communication.
no code implementations • 12 Jan 2018 • Liangzhen Lai, Naveen Suda, Vikas Chandra
Efficient and compact neural network models are essential for enabling the deployment on mobile and embedded devices.
no code implementations • 5 Dec 2017 • Hardik Sharma, Jongse Park, Naveen Suda, Liangzhen Lai, Benson Chau, Joon Kyung Kim, Vikas Chandra, Hadi Esmaeilzadeh
Compared to Stripes, BitFusion provides 2.6x speedup and 3.9x energy reduction at the 45 nm node when BitFusion's area and frequency are set to those of Stripes.
18 code implementations • 20 Nov 2017 • Yundong Zhang, Naveen Suda, Liangzhen Lai, Vikas Chandra
We train various neural network architectures for keyword spotting published in literature to compare their accuracy and memory/compute requirements.
Ranked #14 on Keyword Spotting on Google Speech Commands
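The memory side of the comparison described in the keyword-spotting entry above can be sketched as follows: parameter counts (a proxy for model memory) for two simplified stand-in models on MFCC input. The layer sizes are illustrative and are not the configurations evaluated in the paper.

```python
# Compare parameter counts of two illustrative keyword-spotting models on
# 49x10 MFCC features with 12 output classes. These are simplified stand-ins,
# not the paper's exact architectures.
import torch.nn as nn

num_classes, mfcc_shape = 12, (49, 10)

# Plain fully connected network on flattened MFCCs.
dnn = nn.Sequential(
    nn.Flatten(),
    nn.Linear(mfcc_shape[0] * mfcc_shape[1], 144), nn.ReLU(),
    nn.Linear(144, 144), nn.ReLU(),
    nn.Linear(144, num_classes),
)

# Depthwise-separable CNN: depthwise 3x3 conv followed by pointwise 1x1 conv.
ds_cnn = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=(10, 4), stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64), nn.ReLU(),  # depthwise
    nn.Conv2d(64, 64, kernel_size=1), nn.ReLU(),                        # pointwise
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, num_classes),
)

def count(m):
    return sum(p.numel() for p in m.parameters())

print(f"DNN:    {count(dnn):,} parameters")
print(f"DS-CNN: {count(ds_cnn):,} parameters")
```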
no code implementations • ICLR 2018 • Meng Li, Liangzhen Lai, Naveen Suda, Vikas Chandra, David Z. Pan
Massive amounts of data reside on users' local platforms, which usually cannot support deep neural network (DNN) training due to computation and storage resource constraints.
no code implementations • 8 Mar 2017 • Liangzhen Lai, Naveen Suda, Vikas Chandra
To alleviate these problems to some extent, prior research utilizes low-precision fixed-point numbers to represent the CNN weights and activations.
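A minimal sketch of what such a fixed-point representation can look like, using symmetric int8 quantization with a power-of-two scale; the bit width and rounding scheme are illustrative assumptions rather than the paper's exact method.

```python
# Symmetric fixed-point (int8) quantization of a weight tensor with a
# power-of-two scale, as a minimal illustration of low-precision weights.
import numpy as np

def quantize_fixed_point(w, num_bits=8):
    """Map float weights to signed integers with a power-of-two scale."""
    qmax = 2 ** (num_bits - 1) - 1                      # 127 for int8
    # Choose the number of fractional bits so the largest weight still fits.
    frac_bits = num_bits - 1 - int(np.ceil(np.log2(np.max(np.abs(w)) + 1e-12)))
    scale = 2.0 ** frac_bits
    q = np.clip(np.round(w * scale), -qmax - 1, qmax).astype(np.int8)
    return q, frac_bits

def dequantize(q, frac_bits):
    return q.astype(np.float32) / (2.0 ** frac_bits)

w = np.random.randn(64, 64).astype(np.float32) * 0.1
q, frac_bits = quantize_fixed_point(w)
error = np.max(np.abs(dequantize(q, frac_bits) - w))
print(f"fractional bits: {frac_bits}, max abs quantization error: {error:.5f}")
```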