Syntactic Relevance XLNet Word Embedding Generation in Low-Resource Machine Translation

1 Jan 2021 · Nier Wu, Yatu Ji, Hongxu Hou

Semantic understanding is an important factor affecting the quality of machine translation for low-resource agglutinative languages. Common methods (sub-word modeling, pre-trained word embeddings, etc.) increase the length of the sequence, which leads to a surge in computation. At the same time, context-rich pre-trained word embeddings are also a precondition for improving semantic understanding. Although BERT uses a masked language model to generate dynamic embeddings in parallel, the mask-free fine-tuning stage is inconsistent with the masked pre-training data, which introduces artificial errors. Therefore, we propose a word embedding generation method based on an improved XLNet: it corrects this defect of the BERT model and mitigates the sampling redundancy of the original XLNet. Experiments on the CCMT2019 Mongolian-Chinese, Uyghur-Chinese, and Tibetan-Chinese tasks show that our method improves generalization ability and BLEU scores over the baseline, which verifies its effectiveness.
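The core idea in the abstract is to feed an NMT model contextual embeddings produced by a permutation language model (XLNet) rather than a masked one (BERT). As a minimal sketch of that embedding-extraction step, the Python snippet below pulls per-token contextual vectors from a stock pre-trained XLNet via the HuggingFace transformers library; the checkpoint name xlnet-base-cased, the example sentence, and the use of last_hidden_state are illustrative assumptions, not the authors' improved model.

import torch
from transformers import XLNetModel, XLNetTokenizer

# Stock XLNet checkpoint (an assumption for illustration; the paper
# trains its own improved XLNet on the low-resource language pairs).
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetModel.from_pretrained("xlnet-base-cased")
model.eval()

sentence = "Low-resource translation benefits from context-rich embeddings."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per sub-word token; in the paper's setting,
# such vectors would replace randomly initialized NMT input embeddings.
token_embeddings = outputs.last_hidden_state  # shape: (1, seq_len, hidden_size)
print(token_embeddings.shape)

Because XLNet is pre-trained with permutation language modeling, no [MASK] token ever appears in its input, so the pretrain/fine-tune discrepancy the abstract attributes to BERT does not arise.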
