no code implementations • 29 Oct 2023 • Zhixu Du, Shiyu Li, Yuhao Wu, Xiangyu Jiang, Jingwei Sun, Qilin Zheng, Yongkai Wu, Ang Li, Hai "Helen" Li, Yiran Chen
Specifically, SiDA substantially speeds up MoE inference, with up to 3.93X higher throughput, up to 75% lower latency, and up to 80% GPU memory savings, at a performance drop of as little as 1%.