no code implementations • 8 Feb 2023 • Xubo Qin, Xiyuan Liu, Xiongfeng Zheng, Jie Liu, Yutao Zhu
Specifically, when the student models use a cross-encoder architecture, a pairwise loss on hard labels is critical for training them, whereas distillation objectives on intermediate Transformer layers may hurt performance.
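
To make the pairwise objective concrete, here is a minimal sketch of one common form of pairwise loss on hard labels: a RankNet-style logistic loss over (relevant, non-relevant) score pairs. The excerpt does not say which formulation the authors use, so the function names and the choice of logistic loss are assumptions for illustration only, and the scores stand in for the outputs of a hypothetical cross-encoder student.

```python
import math


def pairwise_logistic_loss(pos_score: float, neg_score: float) -> float:
    """Return log(1 + exp(-(pos_score - neg_score))).

    Assumption: a RankNet-style logistic loss; the source excerpt does not
    specify the exact pairwise loss. The hard label only says which document
    of the pair is relevant, so the loss depends on the score difference.
    """
    diff = pos_score - neg_score
    # Numerically stable softplus(-diff): never exponentiate a large positive value.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def batch_pairwise_loss(pairs: list[tuple[float, float]]) -> float:
    """Average the pairwise loss over (pos_score, neg_score) pairs."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    return sum(pairwise_logistic_loss(p, n) for p, n in pairs) / len(pairs)
```

The loss shrinks toward zero as the relevant document is scored further above the non-relevant one, and grows roughly linearly when the order is wrong, which is what pushes the student toward the correct ranking.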