2 code implementations • 2 Apr 2024 • Yuezhou Hu, Kang Zhao, Weiyu Huang, Jianfei Chen, Jun Zhu
Utilizing this metric, we propose three techniques to preserve accuracy: modifying the sparse-refined straight-through estimator by applying the masked decay term to the gradients, determining a feasible decay factor during the warm-up stage, and enhancing the model's quality with a dense fine-tuning procedure near the end of pre-training.
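As a rough illustration of the first technique, the sketch below adds a masked decay term to the gradients of a flat weight vector. It assumes 2:4 sparsity (the two largest-magnitude weights are kept in every group of four) and assumes the decay takes the form `grad + decay_factor * (1 - mask) * weight`. The function names, the pure-Python list representation, and the tie-breaking rule are hypothetical choices made for this example, not the authors' implementation.

```python
def two_four_mask(weights):
    """Return a 0/1 mask that keeps the 2 largest-magnitude entries in each group of 4.

    Assumption: groups are consecutive runs of 4 values; ties go to the earlier index.
    """
    if len(weights) % 4 != 0:
        raise ValueError("number of weights must be a multiple of 4")
    mask = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Positions of the two largest |w| within this group.
        keep = sorted(range(4), key=lambda i: (-abs(group[i]), i))[:2]
        mask.extend(1 if i in keep else 0 for i in range(4))
    return mask


def masked_decay_gradient(weights, grads, decay_factor):
    """Add decay_factor * w to the gradient of every pruned (masked-out) weight.

    Kept weights (mask == 1) receive their gradient unchanged.
    """
    mask = two_four_mask(weights)
    return [g + decay_factor * (1 - m) * w
            for w, g, m in zip(weights, grads, mask)]


if __name__ == "__main__":
    weights = [0.9, -0.1, 0.05, -0.7]
    grads = [0.2, 0.2, 0.2, 0.2]
    print(two_four_mask(weights))  # [1, 0, 0, 1]
    print(masked_decay_gradient(weights, grads, decay_factor=0.5))
```

In this sketch the decayed gradient would then be handed to the optimizer in place of the raw gradient, so that pruned weights are pulled toward zero while kept weights are updated as usual. The value of `decay_factor` here is arbitrary; the text above states that a feasible factor is determined during the warm-up stage.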