XLSR is a multilingual speech representation model built on wav2vec 2.0. It is pretrained by solving a contrastive task over masked latent speech representations while jointly learning a quantization of the latents that is shared across languages. The model is then fine-tuned on labeled data, and experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining. A shared quantization module over the feature-encoder representations produces multilingual quantized speech units, whose embeddings serve as targets for a Transformer trained with contrastive learning. By sharing discrete tokens across languages, the model creates bridges between them.
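The contrastive objective described above can be illustrated with a minimal NumPy sketch: given a Transformer context vector at a masked timestep, the model must identify the true quantized latent among a set of distractors via cosine similarity. This is a simplified illustration, not the paper's implementation; the function name, the temperature value, and the use of plain NumPy are assumptions for clarity.

```python
import numpy as np

def contrastive_loss(context, target, distractors, temperature=0.1):
    """Simplified wav2vec 2.0-style contrastive loss (illustrative sketch).

    context:     context-network output at a masked timestep, shape (d,)
    target:      true quantized latent for that timestep, shape (d,)
    distractors: negative quantized latents, shape (k, d)
    """
    # Candidate set: true target first, then the distractors.
    candidates = np.vstack([target[None, :], distractors])
    # Cosine similarity between the context vector and each candidate.
    sims = candidates @ context / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(context) + 1e-8
    )
    logits = sims / temperature
    logits -= logits.max()  # numerical stability before softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    # Cross-entropy against the true target (index 0).
    return -np.log(probs[0])
```

When the context vector matches the true latent closely, the loss is near zero; a mismatched target among strong distractors yields a large loss, which is the signal that drives pretraining.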
Source: *Unsupervised Cross-lingual Representation Learning for Speech Recognition*
| Task | Papers | Share |
|---|---|---|
| Speech Recognition | 13 | 30.23% |
| Automatic Speech Recognition (ASR) | 7 | 16.28% |
| Language Modelling | 4 | 9.30% |
| Translation | 3 | 6.98% |
| Language Identification | 2 | 4.65% |
| Spoken language identification | 2 | 4.65% |
| Cross-Lingual Transfer | 2 | 4.65% |
| Retrieval | 1 | 2.33% |
| Voice Conversion | 1 | 2.33% |