1 code implementation • 25 Jan 2024 • Quentin Anthony, Jacob Hatef, Deepak Narayanan, Stella Biderman, Stas Bekman, Junqi Yin, Aamir Shafi, Hari Subramoni, Dhabaleswar Panda
While GPUs are responsible for training the vast majority of state-of-the-art deep learning models, the implications of their architecture are often overlooked when designing new deep learning (DL) models.
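A quick way to see this effect is to benchmark the same GEMM at dimensions that do and do not align with the GPU's tensor-core tiles. The sketch below is illustrative only (it assumes PyTorch and a CUDA GPU; the dimensions and iteration counts are arbitrary choices, not taken from the paper):

```python
# Hypothetical micro-benchmark: compare GEMM throughput when the matrix
# dimension is a multiple of 64 (maps cleanly onto tensor-core tiles)
# versus an off-by-one dimension that falls off the fast path.
import time
import torch

def gemm_tflops(dim: int, iters: int = 50) -> float:
    """Measure throughput of a square (dim x dim) fp16 GEMM in TFLOP/s."""
    a = torch.randn(dim, dim, device="cuda", dtype=torch.float16)
    b = torch.randn(dim, dim, device="cuda", dtype=torch.float16)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        _ = a @ b
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return 2 * dim**3 * iters / elapsed / 1e12  # 2*d^3 FLOPs per GEMM

for d in (4096, 4095):  # aligned vs. misaligned dimension
    print(f"dim={d}: {gemm_tflops(d):.1f} TFLOP/s")
```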
1 code implementation • 16 Jan 2024 • Jinghan Yao, Quentin Anthony, Aamir Shafi, Hari Subramoni, Dhabaleswar K. Panda
Unlike previous methods, our solution can be directly applied to pre-trained MoE models without any fine-tuning or accuracy degradation.
1 code implementation • 22 May 2023 • Jinghan Yao, Nawras Alnaasan, Tian Chen, Aamir Shafi, Hari Subramoni, Dhabaleswar K. Panda
Inference on these models, by design, exhibits a temporal dependency: the current token's probability distribution is conditioned on all preceding tokens.
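That dependency is what makes generation inherently sequential. A minimal greedy-decoding sketch makes it concrete (illustrative only, not the paper's code; `model` here is a stand-in for any callable mapping a token prefix to per-position logits):

```python
# Sketch of autoregressive decoding: each step's distribution depends on
# every previously generated token, so steps cannot be reordered.
import torch

@torch.no_grad()
def greedy_decode(model, prompt_ids: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    """model maps a (1, seq_len) token tensor to (1, seq_len, vocab) logits."""
    ids = prompt_ids
    for _ in range(max_new_tokens):
        logits = model(ids)                    # forward pass over the full prefix
        next_id = logits[:, -1, :].argmax(-1)  # step t is conditioned on tokens < t
        ids = torch.cat([ids, next_id[:, None]], dim=1)
    return ids
```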
no code implementations • 15 Mar 2023 • Quentin Anthony, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He, Aamir Shafi, Mustafa Abduljabbar, Hari Subramoni, Dhabaleswar Panda
However, such distributed DL parallelism strategies require a varied mixture of collective and point-to-point communication operations across a broad range of message sizes and scales.
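For a concrete picture of that mixture, the sketch below pairs a collective (a gradient all-reduce, as in data parallelism) with a point-to-point transfer (activations crossing a pipeline-stage boundary). It assumes mpi4py and two ranks; the buffer sizes are arbitrary:

```python
# Illustrative only; run with e.g. `mpirun -np 2 python demo.py`.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Collective: sum "gradients" across all ranks, as data parallelism would.
grads = np.full(1 << 20, rank, dtype=np.float32)
summed = np.empty_like(grads)
comm.Allreduce(grads, summed, op=MPI.SUM)

# Point-to-point: rank 0 ships an "activation" buffer to rank 1, as a
# pipeline-parallel stage boundary would.
if rank == 0:
    comm.Send(np.ones(4096, dtype=np.float32), dest=1, tag=0)
elif rank == 1:
    acts = np.empty(4096, dtype=np.float32)
    comm.Recv(acts, source=0, tag=0)
```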
no code implementations • 20 Oct 2021 • Nawras Alnaasan, Arpan Jain, Aamir Shafi, Hari Subramoni, Dhabaleswar K. Panda
However, there is currently no benchmark suite to evaluate the communication performance of mpi4py -- and of Python MPI codes in general -- on modern HPC systems.
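As a flavor of what such a suite measures, here is a minimal ping-pong latency loop in mpi4py (a sketch in the spirit of an OSU-style benchmark, not the actual suite; run with two ranks, e.g. `mpirun -np 2 python latency.py`):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
ITERS, SKIP = 1000, 100  # warm-up iterations are excluded from timing

for size in (1, 1024, 1 << 20):  # message sizes in bytes
    buf = np.zeros(size, dtype=np.uint8)
    comm.Barrier()
    for i in range(ITERS + SKIP):
        if i == SKIP:
            start = MPI.Wtime()
        if rank == 0:
            comm.Send(buf, dest=1, tag=1)
            comm.Recv(buf, source=1, tag=1)
        else:
            comm.Recv(buf, source=0, tag=1)
            comm.Send(buf, dest=0, tag=1)
    if rank == 0:
        # one-way latency is half the averaged round-trip time
        lat_us = (MPI.Wtime() - start) / ITERS / 2 * 1e6
        print(f"{size} bytes: {lat_us:.2f} us")
```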
1 code implementation • 21 Jan 2021 • Aamir Shafi, Jahanzeb Maqbool Hashmi, Hari Subramoni, Dhabaleswar K. Panda
This paper presents the design and implementation of a new communication backend for Dask -- called MPI4Dask -- that targets modern HPC clusters built with GPUs.
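The enabling primitive is CUDA-aware MPI, which lets GPU-resident buffers be passed directly to MPI calls without staging through host memory. The sketch below shows that primitive in isolation via mpi4py and CuPy (assumptions: a CUDA-aware MPI build and two ranks; this is not MPI4Dask's actual code, which wraps such transfers in Dask's asynchronous communication interface):

```python
# Device-to-device transfer of the kind a CUDA-aware backend performs.
import cupy as cp
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    payload = cp.arange(1 << 20, dtype=cp.float32)  # data resides on GPU 0
    comm.Send(payload, dest=1, tag=7)   # GPU buffer handed directly to MPI
elif rank == 1:
    payload = cp.empty(1 << 20, dtype=cp.float32)
    comm.Recv(payload, source=0, tag=7)  # lands in GPU 1 memory, no host copy
```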
no code implementations • 12 Nov 2019 • Ammar Ahmad Awan, Arpan Jain, Quentin Anthony, Hari Subramoni, Dhabaleswar K. Panda
Four major problems we focus on are: 1) defining a notion of a distributed model across processes, 2) implementing forward/back-propagation across process boundaries, which requires explicit communication, 3) obtaining parallel speedup on an inherently sequential task, and 4) achieving scalability without sacrificing model accuracy.
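Problem 2 is the least familiar of the four, so here is a hedged sketch of a single forward/backward hand-off between two pipeline stages, with the activation and its gradient exchanged explicitly (assumes mpi4py and PyTorch; layer and batch sizes are illustrative, not from the paper):

```python
# Two ranks, each owning one partition ("stage") of the model.
import torch
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
stage = torch.nn.Linear(512, 512)  # this rank's slice of the model

if rank == 0:
    x = torch.randn(32, 512)
    out = stage(x)
    comm.send(out.detach().numpy(), dest=1, tag=0)        # ship activations forward
    grad = torch.from_numpy(comm.recv(source=1, tag=1))   # receive dL/d(out) back
    out.backward(grad)                                    # resume backprop locally
elif rank == 1:
    acts = torch.from_numpy(comm.recv(source=0, tag=0)).requires_grad_(True)
    loss = stage(acts).sum()                              # toy loss for illustration
    loss.backward()
    comm.send(acts.grad.numpy(), dest=0, tag=1)           # send gradient upstream
```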