Text-to-Speech Models

Tacotron 2 is a neural network architecture for speech synthesis directly from text. It consists of two components:

  • a recurrent sequence-to-sequence feature prediction network with attention which predicts a sequence of mel spectrogram frames from an input character sequence
  • a modified version of WaveNet which generates time-domain waveform samples conditioned on the predicted mel spectrogram frames
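The two-stage decomposition above can be sketched as a pair of function interfaces. This is a minimal shape-only sketch: `predict_mel`, `vocode`, the frame count, and the `HOP` constant are placeholders for illustration, not the paper's code (80 mel channels matches the paper; the hop size here is assumed).

```python
import numpy as np

N_MELS = 80        # mel channels, as in the paper
HOP = 256          # hypothetical samples per spectrogram frame

def predict_mel(char_ids):
    """Stand-in for component 1, the seq2seq feature prediction
    network: character IDs in, mel spectrogram frames out."""
    n_frames = 4 * len(char_ids)             # placeholder output length
    return np.zeros((n_frames, N_MELS), dtype=np.float32)

def vocode(mel):
    """Stand-in for component 2, the modified WaveNet vocoder:
    mel frames in, time-domain waveform samples out."""
    return np.zeros(mel.shape[0] * HOP, dtype=np.float32)

mel = predict_mel([ord(c) for c in "hello"])   # (20, 80) mel frames
wav = vocode(mel)                              # (20 * 256,) samples
```

The mel spectrogram acts as the sole interface between the two networks, which is what allows each component to be trained separately.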

In contrast to the original Tacotron, Tacotron 2 uses simpler building blocks: vanilla LSTM and convolutional layers in the encoder and decoder replace the CBHG stacks and GRU recurrent layers. Tacotron 2 also drops the “reduction factor”, so each decoder step corresponds to a single spectrogram frame, and it uses location-sensitive attention instead of additive attention.
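Location-sensitive attention extends additive (Bahdanau) attention with location features computed from the attention weights accumulated over previous decoder steps, which encourages the alignment to move forward monotonically. A minimal numpy sketch of one attention step follows; all parameter shapes, the single smoothing filter (the paper uses 32 learned filters), and the random initialization are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

T, d_mem, d_query, d_attn = 6, 8, 4, 5     # toy dimensions (assumed)

# Hypothetical parameters; a real implementation learns these.
W_q = rng.normal(size=(d_attn, d_query))   # projects the decoder query
W_m = rng.normal(size=(d_attn, d_mem))     # projects the encoder memory
w_f = rng.normal(size=(d_attn, 1))         # projects location features
v = rng.normal(size=d_attn)                # energy projection
conv_filter = np.ones(3) / 3.0             # single filter stands in for the
                                           # paper's bank of learned filters

def location_sensitive_attention(query, memory, cum_weights):
    """Additive attention extended with location features derived
    from the cumulative attention weights of previous steps."""
    loc = np.convolve(cum_weights, conv_filter, mode="same")      # (T,)
    energies = v @ np.tanh(
        W_q @ query[:, None] + W_m @ memory.T + w_f * loc[None, :]
    )                                                             # (T,)
    weights = np.exp(energies - energies.max())
    weights /= weights.sum()                                      # softmax over time
    context = weights @ memory                                    # (d_mem,)
    return context, weights

memory = rng.normal(size=(T, d_mem))   # encoder outputs, one row per input step
query = rng.normal(size=d_query)       # current decoder state
cum_weights = np.zeros(T)              # sum of all previous alignments

context, weights = location_sensitive_attention(query, memory, cum_weights)
```

Because the energies see the convolved cumulative weights, positions the decoder has already attended to are distinguishable from unvisited ones, which plain additive attention cannot express.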

Source: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

Tasks


Task                        Papers   Share
Speech Synthesis                14  40.00%
Text-To-Speech Synthesis         4  11.43%
Decoder                          3   8.57%
Voice Cloning                    2   5.71%
Style Transfer                   2   5.71%
Acoustic Modelling               1   2.86%
Voice Conversion                 1   2.86%
Transliteration                  1   2.86%
Zero-Shot Learning               1   2.86%
