no code implementations • 14 Feb 2024 • Junhan Kim, Kyungphil Park, Chungman Lee, Ho-young Kim, Joonyoung Kim, Yongkweon Jeon
Through extensive experiments on various language models and complexity analysis, we demonstrate that aespa is accurate and efficient in quantizing Transformer models.
1 code implementation • CVPR 2023 • Yongkweon Jeon, Chungman Lee, Ho-young Kim
We also propose a post-training quantization algorithm to enhance the performance of quantized models.
no code implementations • CVPR 2022 • Yongkweon Jeon, Chungman Lee, Eulrang Cho, Yeonju Ro
We thus propose a new post-training non-uniform quantization method, called Mr. BiQ, that allows low-bit-width quantization even for Transformer models.
no code implementations • 5 May 2021 • Dongsoo Lee, Se Jung Kwon, Byeongwook Kim, Jeongin Yun, Baeseong Park, Yongkweon Jeon
While model compression is increasingly important because of the large size of neural networks, compression-aware training is challenging because it requires sophisticated model modifications and longer training time. In this paper, we introduce regularization frequency (i.e., how often compression is performed during training) as a new regularization technique for a practical and efficient compression-aware training method.
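To make the idea concrete, here is a minimal sketch (not the paper's algorithm) of compression-aware training in which the regularization frequency is simply how often the weights are projected onto a compressed format; the toy regression problem, the magnitude-pruning projection, and the `compress_every` value are all illustrative assumptions.

```python
# Illustrative sketch only: plain SGD on a toy regression problem, with the
# weights periodically projected onto a compressed format (magnitude pruning).
# `compress_every` controls the regularization frequency described above.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 32))
true_w = rng.normal(size=32) * (rng.random(32) > 0.5)      # sparse target
y = X @ true_w

def compress(w, sparsity=0.5):
    """Project weights onto the compression format (zero the smallest half)."""
    out = w.copy()
    out[np.argsort(np.abs(w))[: int(len(w) * sparsity)]] = 0.0
    return out

w, lr, compress_every = np.zeros(32), 0.01, 50              # assumed values
for step in range(1, 2001):
    grad = X.T @ (X @ w - y) / len(X)
    w -= lr * grad
    if step % compress_every == 0:                          # the "frequency" knob
        w = compress(w)                                     # acts as a regularizer

print("final loss:", float(np.mean((X @ w - y) ** 2)))
```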
no code implementations • 5 May 2021 • Byeongwook Kim, Dongsoo Lee, Yeonju Ro, Yongkweon Jeon, Se Jung Kwon, Baeseong Park, Daehwan Oh
When the number of quantization bits is relatively low, however, non-convex optimization becomes unavoidable if model accuracy is to be improved.
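As a rough illustration of why the problem is non-convex, the sketch below alternates between the discrete integer codes and the continuous scale of a low-bit symmetric quantizer; this is a common heuristic, not necessarily the method used in the paper.

```python
# Rough illustration (not necessarily the paper's method): alternating
# optimization of a low-bit symmetric uniform quantizer. The rounding step
# makes the objective non-convex, so the loop is only a heuristic.
import numpy as np

def quantize_alternating(w, n_bits=2, n_iters=20):
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / max(qmax, 1)                 # naive initialization
    for _ in range(n_iters):
        q = np.clip(np.round(w / scale), -qmax - 1, qmax)  # fix scale, update codes
        scale = float(w @ q) / float(q @ q + 1e-12)        # fix codes, refit scale
    return scale * q

rng = np.random.default_rng(0)
w = rng.normal(size=1024)
print("MSE:", float(np.mean((w - quantize_alternating(w)) ** 2)))
```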
no code implementations • 1 Jan 2021 • Se Jung Kwon, Dongsoo Lee, Yongkweon Jeon, Byeongwook Kim, Bae Seong Park, Yeonju Ro
As a practical model compression technique, parameter quantization is especially effective for language models, which are associated with a large memory footprint.
no code implementations • Findings of the Association for Computational Linguistics 2020 • Insoo Chung, Byeongwook Kim, Yoonjung Choi, Se Jung Kwon, Yongkweon Jeon, Baeseong Park, Sangha Kim, Dongsoo Lee
Our analysis shows that, for a given number of quantization bits, each block of the Transformer contributes to translation quality and inference computation in a different manner.
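One practical consequence, illustrated below with made-up sensitivity numbers rather than figures from the paper, is that a fixed bit budget can be spread unevenly across blocks, giving more bits to the blocks whose quality degrades most when quantized.

```python
# Hedged illustration (not the paper's procedure): greedily hand out bits
# under a total budget, always giving the next bit to the block whose
# quality currently suffers most. Sensitivity values are hypothetical.
sensitivity = {            # assumed quality drop per removed bit
    "embedding": 0.9, "encoder_attn": 0.4,
    "encoder_ffn": 0.6, "decoder_attn": 0.7, "decoder_ffn": 0.5,
}
budget = 14                # total bits to distribute across the five blocks
bits = {name: 1 for name in sensitivity}        # start everything at 1 bit
while sum(bits.values()) < budget:
    name = max(sensitivity, key=lambda n: sensitivity[n] / bits[n])
    bits[name] += 1
print(bits)
```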
no code implementations • NeurIPS 2020 • Dongsoo Lee, Se Jung Kwon, Byeongwook Kim, Yongkweon Jeon, Baeseong Park, Jeongin Yun
Quantization based on binary codes is gaining attention because each quantized bit can be directly utilized for computations, using look-up tables, without dequantization.
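For reference, a minimal sketch of binary-coding quantization in this sense: a weight vector is approximated as a sum of scaled ±1 bit planes. The greedy residual fitting below is a standard baseline, not the paper's training scheme.

```python
# Minimal sketch of binary-coding quantization (greedy residual fitting):
# w is approximated as sum_i alpha_i * b_i with b_i in {-1, +1}, so each
# quantized bit plane can enter computations directly.
import numpy as np

def binary_code_quantize(w, n_bits=3):
    residual = w.copy()
    alphas, bit_planes = [], []
    for _ in range(n_bits):
        b = np.sign(residual)
        b[b == 0] = 1.0
        alpha = float(np.abs(residual).mean())   # least-squares optimal for +-1 codes
        alphas.append(alpha)
        bit_planes.append(b)
        residual -= alpha * b
    return np.array(alphas), np.stack(bit_planes)

rng = np.random.default_rng(0)
w = rng.normal(size=1024)
alphas, B = binary_code_quantize(w, n_bits=3)
w_hat = alphas @ B                               # reconstruction, only to check error
print("MSE:", float(np.mean((w - w_hat) ** 2)))
```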
no code implementations • 20 May 2020 • Yongkweon Jeon, Baeseong Park, Se Jung Kwon, Byeongwook Kim, Jeongin Yun, Dongsoo Lee
The success of quantization in practice hence relies on an efficient computation engine design, especially for matrix multiplication, the basic computation in most DNNs.
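To give a feel for what such an engine can look like, here is a hedged sketch of a lookup-table-based dot product for binary-coded weights: partial sums of each activation group are precomputed once for every ±1 pattern, and each weight's bits then only index the table. The group size `mu` and the layout are assumptions, not the paper's exact design.

```python
# Hedged sketch of a lookup-table-based multiply for binary-coded weights:
# the activation vector is split into groups of `mu` elements, all 2^mu
# signed partial sums are precomputed once, and every binary weight code
# then just indexes the table instead of doing multiplications.
import numpy as np

def lut_dot(x, bit_codes, mu=8):
    """x: (n,) activations; bit_codes: (m, n) entries in {-1, +1}.
    Returns y: (m,) with y[j] = x @ bit_codes[j], using table lookups."""
    n = len(x)
    assert n % mu == 0 and bit_codes.shape[1] == n
    n_groups = n // mu
    # Precompute partial sums for every +-1 pattern of each activation group.
    patterns = np.array([[1 if (p >> i) & 1 else -1 for i in range(mu)]
                         for p in range(2 ** mu)])             # (2^mu, mu)
    luts = patterns @ x.reshape(n_groups, mu).T                # (2^mu, n_groups)
    # Turn each weight row's bits into one table index per group.
    bits01 = (bit_codes.reshape(-1, n_groups, mu) > 0).astype(np.int64)
    idx = (bits01 << np.arange(mu)).sum(-1)                    # (m, n_groups)
    return luts[idx, np.arange(n_groups)].sum(-1)              # gather + accumulate

rng = np.random.default_rng(0)
x = rng.normal(size=64)
codes = rng.choice([-1.0, 1.0], size=(16, 64))
print(np.allclose(lut_dot(x, codes), codes @ x))               # True
```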
no code implementations • 25 Sep 2019 • Dongsoo Lee, Se Jung Kwon, Byeongwook Kim, Yongkweon Jeon, Baeseong Park, Jeongin Yun, Gu-Yeon Wei
Using various models, we show that simple weight updates to comply with compression formats, along with a long NR period, are enough to achieve a high compression ratio and high model accuracy.