1 code implementation • 25 Jan 2024 • Quentin Anthony, Jacob Hatef, Deepak Narayanan, Stella Biderman, Stas Bekman, Junqi Yin, Aamir Shafi, Hari Subramoni, Dhabaleswar Panda
While GPUs are responsible for training the vast majority of state-of-the-art deep learning models, the implications of their architecture are often overlooked when designing new deep learning (DL) models.
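A quick way to see this effect is to benchmark the same GEMM at dimensions that do and do not align with the GPU's tensor-core tiles. The sketch below is illustrative only (it assumes PyTorch and a CUDA GPU; the dimensions and iteration counts are arbitrary choices, not taken from the paper):

```python
# Hypothetical micro-benchmark: compare GEMM throughput when the matrix
# dimension is a multiple of 64 (maps cleanly onto tensor-core tiles)
# versus an off-by-one dimension that falls off the fast path.
import time
import torch

def gemm_tflops(dim: int, iters: int = 50) -> float:
    """Measure throughput of a square (dim x dim) fp16 GEMM in TFLOP/s."""
    a = torch.randn(dim, dim, device="cuda", dtype=torch.float16)
    b = torch.randn(dim, dim, device="cuda", dtype=torch.float16)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        _ = a @ b
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return 2 * dim**3 * iters / elapsed / 1e12  # 2*d^3 FLOPs per GEMM

for d in (4096, 4095):  # aligned vs. misaligned dimension
    print(f"dim={d}: {gemm_tflops(d):.1f} TFLOP/s")
```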
1 code implementation • 16 Jan 2024 • Jinghan Yao, Quentin Anthony, Aamir Shafi, Hari Subramoni, Dhabaleswar K. Panda
Unlike previous methods, our solution can be directly applied to pre-trained MoE models without any fine-tuning or accuracy degradation.
1 code implementation • 22 May 2023 • Jinghan Yao, Nawras Alnaasan, Tian Chen, Aamir Shafi, Hari Subramoni, Dhabaleswar K. Panda
Inference on these models, by design, exhibits a temporal dependency: the current token's probability distribution is conditioned on all preceding tokens.
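That dependency is what makes generation inherently sequential. A minimal greedy-decoding sketch makes it concrete (illustrative only, not the paper's code; `model` here is a stand-in for any callable mapping a token prefix to per-position logits):

```python
# Sketch of autoregressive decoding: each step's distribution depends on
# every previously generated token, so steps cannot be reordered.
import torch

@torch.no_grad()
def greedy_decode(model, prompt_ids: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    """model maps a (1, seq_len) token tensor to (1, seq_len, vocab) logits."""
    ids = prompt_ids
    for _ in range(max_new_tokens):
        logits = model(ids)                    # forward pass over the full prefix
        next_id = logits[:, -1, :].argmax(-1)  # step t is conditioned on tokens < t
        ids = torch.cat([ids, next_id[:, None]], dim=1)
    return ids
```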
no code implementations • 15 Mar 2023 • Quentin Anthony, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He, Aamir Shafi, Mustafa Abduljabbar, Hari Subramoni, Dhabaleswar Panda
However, such distributed DL parallelism strategies require a varied mixture of collective and point-to-point communication operations across a broad range of message sizes and scales.
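For a concrete picture of that mixture, the sketch below pairs a collective (a gradient all-reduce, as in data parallelism) with a point-to-point transfer (activations crossing a pipeline-stage boundary). It assumes mpi4py and two ranks; the buffer sizes are arbitrary:

```python
# Illustrative only; run with e.g. `mpirun -np 2 python demo.py`.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Collective: sum "gradients" across all ranks, as data parallelism would.
grads = np.full(1 << 20, rank, dtype=np.float32)
summed = np.empty_like(grads)
comm.Allreduce(grads, summed, op=MPI.SUM)

# Point-to-point: rank 0 ships an "activation" buffer to rank 1, as a
# pipeline-parallel stage boundary would.
if rank == 0:
    comm.Send(np.ones(4096, dtype=np.float32), dest=1, tag=0)
elif rank == 1:
    acts = np.empty(4096, dtype=np.float32)
    comm.Recv(acts, source=0, tag=0)
```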
no code implementations • 20 Oct 2021 • Nawras Alnaasan, Arpan Jain, Aamir Shafi, Hari Subramoni, Dhabaleswar K. Panda
However, there is currently no benchmark suite to evaluate the communication performance of mpi4py -- and of Python MPI codes in general -- on modern HPC systems.
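As a flavor of what such a suite measures, here is a minimal ping-pong latency loop in mpi4py (a sketch in the spirit of an OSU-style benchmark, not the actual suite; run with two ranks, e.g. `mpirun -np 2 python latency.py`):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
ITERS, SKIP = 1000, 100  # warm-up iterations are excluded from timing

for size in (1, 1024, 1 << 20):  # message sizes in bytes
    buf = np.zeros(size, dtype=np.uint8)
    comm.Barrier()
    for i in range(ITERS + SKIP):
        if i == SKIP:
            start = MPI.Wtime()
        if rank == 0:
            comm.Send(buf, dest=1, tag=1)
            comm.Recv(buf, source=1, tag=1)
        else:
            comm.Recv(buf, source=0, tag=1)
            comm.Send(buf, dest=0, tag=1)
    if rank == 0:
        # one-way latency is half the averaged round-trip time
        lat_us = (MPI.Wtime() - start) / ITERS / 2 * 1e6
        print(f"{size} bytes: {lat_us:.2f} us")
```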
1 code implementation • 21 Jan 2021 • Aamir Shafi, Jahanzeb Maqbool Hashmi, Hari Subramoni, Dhabaleswar K. Panda
This paper presents the design and implementation of a new communication backend for Dask -- called MPI4Dask -- that targets modern HPC clusters built with GPUs.
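The enabling primitive is CUDA-aware MPI, which lets GPU-resident buffers be passed directly to MPI calls without staging through host memory. The sketch below shows that primitive in isolation via mpi4py and CuPy (assumptions: a CUDA-aware MPI build and two ranks; this is not MPI4Dask's actual code, which wraps such transfers in Dask's asynchronous communication interface):

```python
# Device-to-device transfer of the kind a CUDA-aware backend performs.
import cupy as cp
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    payload = cp.arange(1 << 20, dtype=cp.float32)  # data resides on GPU 0
    comm.Send(payload, dest=1, tag=7)   # GPU buffer handed directly to MPI
elif rank == 1:
    payload = cp.empty(1 << 20, dtype=cp.float32)
    comm.Recv(payload, source=0, tag=7)  # lands in GPU 1 memory, no host copy
```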
no code implementations • 12 Nov 2019 • Ammar Ahmad Awan, Arpan Jain, Quentin Anthony, Hari Subramoni, Dhabaleswar K. Panda
Four major problems we focus on are: 1) defining a notion of a distributed model across processes, 2) implementing forward/back-propagation across process boundaries, which requires explicit communication, 3) obtaining parallel speedup on an inherently sequential task, and 4) achieving scalability without sacrificing model accuracy.
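Problem 2 is the least familiar of the four, so here is a hedged sketch of a single forward/backward hand-off between two pipeline stages, with the activation and its gradient exchanged explicitly (assumes mpi4py and PyTorch; layer and batch sizes are illustrative, not from the paper):

```python
# Two ranks, each owning one partition ("stage") of the model.
import torch
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
stage = torch.nn.Linear(512, 512)  # this rank's slice of the model

if rank == 0:
    x = torch.randn(32, 512)
    out = stage(x)
    comm.send(out.detach().numpy(), dest=1, tag=0)        # ship activations forward
    grad = torch.from_numpy(comm.recv(source=1, tag=1))   # receive dL/d(out) back
    out.backward(grad)                                    # resume backprop locally
elif rank == 1:
    acts = torch.from_numpy(comm.recv(source=0, tag=0)).requires_grad_(True)
    loss = stage(acts).sum()                              # toy loss for illustration
    loss.backward()
    comm.send(acts.grad.numpy(), dest=0, tag=1)           # send gradient upstream
```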