Generative Audio Models

CTAL is a pre-training framework for learning strong audio-and-language representations with a Transformer. It aims to learn the intra-modality and inter-modality connections between audio and language through two proxy tasks on a large number of audio-and-language pairs: masked language modeling and masked cross-modal acoustic modeling. The pre-trained model, a Cross-modal Transformer for Audio-and-Language (CTAL), consists of two modules: a language stream encoder, which takes words as its input elements, and a text-referred audio stream encoder, which accepts both frame-level Mel-spectrograms and token-level output embeddings from the language stream.
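The two proxy tasks can be illustrated with a minimal sketch of how inputs might be corrupted before the model is asked to recover them. This is not the authors' implementation: the function names, the `[MASK]` placeholder, the 15% masking rate, zeroing out masked frames, and the mean-absolute-error reconstruction loss are all assumptions chosen for illustration.

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical placeholder symbol, not taken from the source


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Masked language modeling input: hide some words, remember where."""
    rng = rng or random.Random()
    masked = list(tokens)
    positions = [i for i in range(len(tokens)) if rng.random() < mask_prob]
    if not positions and tokens:
        # Guarantee at least one prediction target per example.
        positions = [rng.randrange(len(tokens))]
    for i in positions:
        masked[i] = MASK_TOKEN
    return masked, positions


def mask_frames(frames, mask_prob=0.15, rng=None):
    """Masked acoustic modeling input: zero out whole Mel-spectrogram frames."""
    rng = rng or random.Random()
    masked = [list(frame) for frame in frames]  # copy, leave the input intact
    positions = [i for i in range(len(frames)) if rng.random() < mask_prob]
    if not positions and frames:
        positions = [rng.randrange(len(frames))]
    for i in positions:
        masked[i] = [0.0] * len(frames[i])
    return masked, positions


def reconstruction_loss(predicted, original, positions):
    """Mean absolute error computed over the masked frames only (assumed loss)."""
    total, count = 0.0, 0
    for i in positions:
        for p, o in zip(predicted[i], original[i]):
            total += abs(p - o)
            count += 1
    return total / count if count else 0.0
```

In this sketch the language stream would be trained to predict the words at the masked token positions, while the audio stream, which also sees the language stream's token embeddings, would be trained to reconstruct the masked frames.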

Source: CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations


Tasks


Task                    Papers  Share
Emotion Classification  1       25.00%
Language Modelling      1       25.00%
Sentiment Analysis      1       25.00%
Speaker Verification    1       25.00%
