Text-to-Speech Models

Tacotron 2 is a neural network architecture for speech synthesis directly from text. It consists of two components:

  • a recurrent sequence-to-sequence feature prediction network with attention which predicts a sequence of mel spectrogram frames from an input character sequence
  • a modified version of WaveNet which generates time-domain waveform samples conditioned on the predicted mel spectrogram frames
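The two-stage decomposition above can be sketched as a pair of function interfaces. This is a minimal shape-only sketch: `predict_mel`, `vocode`, the frame count, and the `HOP` constant are placeholders for illustration, not the paper's code (80 mel channels matches the paper; the hop size here is assumed).

```python
import numpy as np

N_MELS = 80        # mel channels, as in the paper
HOP = 256          # hypothetical samples per spectrogram frame

def predict_mel(char_ids):
    """Stand-in for component 1, the seq2seq feature prediction
    network: character IDs in, mel spectrogram frames out."""
    n_frames = 4 * len(char_ids)             # placeholder output length
    return np.zeros((n_frames, N_MELS), dtype=np.float32)

def vocode(mel):
    """Stand-in for component 2, the modified WaveNet vocoder:
    mel frames in, time-domain waveform samples out."""
    return np.zeros(mel.shape[0] * HOP, dtype=np.float32)

mel = predict_mel([ord(c) for c in "hello"])   # (20, 80) mel frames
wav = vocode(mel)                              # (20 * 256,) samples
```

The mel spectrogram acts as the sole interface between the two networks, which is what allows each component to be trained separately.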

In contrast to the original Tacotron, Tacotron 2 uses simpler building blocks: vanilla LSTM and convolutional layers in the encoder and decoder replace the CBHG stacks and GRU recurrent layers. Tacotron 2 also drops the “reduction factor”, so each decoder step corresponds to a single spectrogram frame, and it uses location-sensitive attention instead of additive attention.
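Location-sensitive attention extends additive (Bahdanau) attention with location features computed from the attention weights accumulated over previous decoder steps, which encourages the alignment to move forward monotonically. A minimal numpy sketch of one attention step follows; all parameter shapes, the single smoothing filter (the paper uses 32 learned filters), and the random initialization are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

T, d_mem, d_query, d_attn = 6, 8, 4, 5     # toy dimensions (assumed)

# Hypothetical parameters; a real implementation learns these.
W_q = rng.normal(size=(d_attn, d_query))   # projects the decoder query
W_m = rng.normal(size=(d_attn, d_mem))     # projects the encoder memory
w_f = rng.normal(size=(d_attn, 1))         # projects location features
v = rng.normal(size=d_attn)                # energy projection
conv_filter = np.ones(3) / 3.0             # single filter stands in for the
                                           # paper's bank of learned filters

def location_sensitive_attention(query, memory, cum_weights):
    """Additive attention extended with location features derived
    from the cumulative attention weights of previous steps."""
    loc = np.convolve(cum_weights, conv_filter, mode="same")      # (T,)
    energies = v @ np.tanh(
        W_q @ query[:, None] + W_m @ memory.T + w_f * loc[None, :]
    )                                                             # (T,)
    weights = np.exp(energies - energies.max())
    weights /= weights.sum()                                      # softmax over time
    context = weights @ memory                                    # (d_mem,)
    return context, weights

memory = rng.normal(size=(T, d_mem))   # encoder outputs, one row per input step
query = rng.normal(size=d_query)       # current decoder state
cum_weights = np.zeros(T)              # sum of all previous alignments

context, weights = location_sensitive_attention(query, memory, cum_weights)
```

Because the energies see the convolved cumulative weights, positions the decoder has already attended to are distinguishable from unvisited ones, which plain additive attention cannot express.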

Source: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

Tasks


Task                        Papers   Share
Speech Synthesis                14  40.00%
Text-To-Speech Synthesis         4  11.43%
Decoder                          3   8.57%
Voice Cloning                    2   5.71%
Style Transfer                   2   5.71%
Acoustic Modelling               1   2.86%
Voice Conversion                 1   2.86%
Transliteration                  1   2.86%
Zero-Shot Learning               1   2.86%
