1 code implementation • 13 May 2024 • Qi Chen, Xiubo Geng, Corby Rosset, Carolyn Buractaon, Jingwen Lu, Tao Shen, Kun Zhou, Chenyan Xiong, Yeyun Gong, Paul Bennett, Nick Craswell, Xing Xie, Fan Yang, Bryan Tower, Nikhil Rao, Anlei Dong, Wenqi Jiang, Zheng Liu, Mingqin Li, Chuanjie Liu, Zengzhong Li, Rangan Majumder, Jennifer Neville, Andy Oakley, Knut Magne Risvik, Harsha Vardhan Simhadri, Manik Varma, Yujing Wang, Linjun Yang, Mao Yang, Ce Zhang
Recent breakthroughs in large models have highlighted the critical significance of data scale, labels and modalities.
1 code implementation • 10 Dec 2022 • Hao Sun, Xiao Liu, Yeyun Gong, Anlei Dong, Jingwen Lu, Yan Zhang, Linjun Yang, Rangan Majumder, Nan Duan
Knowledge distillation is often used to transfer knowledge from a strong teacher model to a relatively weak student model.
1 code implementation • 21 Oct 2022 • Kun Zhou, Yeyun Gong, Xiao Liu, Wayne Xin Zhao, Yelong Shen, Anlei Dong, Jingwen Lu, Rangan Majumder, Ji-Rong Wen, Nan Duan, Weizhu Chen
Thus, we propose a simple ambiguous negatives sampling method, SimANS, which incorporates a new sampling probability distribution to sample more ambiguous negatives.
1 code implementation • 27 Sep 2022 • Zhenghao Lin, Yeyun Gong, Xiao Liu, Hang Zhang, Chen Lin, Anlei Dong, Jian Jiao, Jingwen Lu, Daxin Jiang, Rangan Majumder, Nan Duan
It is common for a better teacher model to produce a poor student through distillation because of the non-negligible gap between teacher and student.
no code implementations • 5 Feb 2020 • Nuo Wang Pierse, Jingwen Lu
We found that, with objective alignment, our 768 by 3 and 512 by 3 transformer language models can reach accuracy of 83.9%/82.5% for concept-of-interest tagging and 73.8%/70.2% for acronym detection using only 200 finetuning examples per task, outperforming the 768 by 3 model pretrained without objective alignment by +4.8%/+3.4% and +9.9%/+6.3%.