8 Apr 2019 • Yangyang Shi, Mei-Yuh Hwang, Xin Lei, Haoyu Sheng
Using knowledge distillation with trust regularization, we reduce the parameter count to one third of that of the previously published best model while maintaining its state-of-the-art perplexity on the Penn Treebank dataset.
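The paper's exact formulation is not reproduced here, but a minimal PyTorch sketch of distilling a language model with a trust-style regularizer might look as follows. The function name `kd_trust_loss`, the hyperparameters, and the specific trust term (weighting the soft-label loss by the teacher's probability of the true token) are illustrative assumptions; only the temperature-scaled distillation skeleton is the standard Hinton-style recipe.

```python
import torch
import torch.nn.functional as F

def kd_trust_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Illustrative distillation loss with a trust-style regularizer.

    student_logits, teacher_logits: (batch, vocab) next-token logits.
    targets: (batch,) ground-truth token ids.
    """
    # Hard-label cross-entropy against the ground-truth next tokens.
    ce = F.cross_entropy(student_logits, targets)

    # Soft-label KL divergence to the temperature-smoothed teacher
    # distribution, scaled by T^2 as in Hinton et al. (2015).
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    # Trust weight (an assumed form, not necessarily the paper's): the
    # teacher's own probability of the correct token, so the student
    # imitates the teacher more where the teacher is more reliable.
    with torch.no_grad():
        trust = (
            F.softmax(teacher_logits, dim=-1)
            .gather(1, targets.unsqueeze(1))
            .mean()
        )

    return alpha * ce + (1.0 - alpha) * trust * kd

# Example usage with toy shapes: 32 positions over a 10k-word vocabulary.
student_logits = torch.randn(32, 10000, requires_grad=True)
teacher_logits = torch.randn(32, 10000)
targets = torch.randint(0, 10000, (32,))
loss = kd_trust_loss(student_logits, teacher_logits, targets)
loss.backward()
```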