no code implementations • 25 Jan 2024 • Sunghee Jung, Won Jang, Jaesam Yoon, BongWan Kim
Zero-shot TTS demands additional effort to ensure clear pronunciation and speech quality, because it inherently requires replacing a core parameter (the speaker embedding or acoustic prompt) with a new one at the inference stage.
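The swap described above can be sketched in a few lines: the model is conditioned on a speaker embedding during training, and zero-shot inference replaces it with an embedding extracted from a short reference clip of an unseen speaker. This is a minimal, illustrative sketch; the additive conditioning and all names here are assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def condition(text_hidden, speaker_embedding):
    """Toy conditioning: add the speaker embedding to every text state.

    Real TTS models condition far more elaborately (e.g. via attention or
    normalization layers); this only illustrates the parameter swap.
    """
    return text_hidden + speaker_embedding[None, :]

text_hidden = rng.standard_normal((5, 8))  # 5 text tokens, hidden dim 8
seen_speaker = rng.standard_normal(8)      # embedding learned in training
unseen_speaker = rng.standard_normal(8)    # extracted from a reference clip

# Zero-shot inference: the only change is swapping the speaker embedding.
out = condition(text_hidden, unseen_speaker)
```

The point of the sketch is that the rest of the network is untouched; only the conditioning vector changes, which is why pronunciation and quality can degrade for unseen speakers.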
6 code implementations • 15 Jun 2021 • Won Jang, Dan Lim, Jaesam Yoon, BongWan Kim, Juntae Kim
Using full-band mel-spectrograms as input, we aim to generate high-resolution signals by adding a discriminator that takes spectrograms of multiple resolutions as its input.
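The multi-resolution spectrogram input described above can be sketched by computing magnitude spectrograms of the same waveform at several FFT sizes and hop lengths. The resolution set and function names below are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def stft_magnitude(signal, n_fft, hop):
    """Magnitude spectrogram via a simple framed FFT with a Hann window."""
    window = np.hanning(n_fft)
    frames = [
        signal[start:start + n_fft] * window
        for start in range(0, len(signal) - n_fft + 1, hop)
    ]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

# Hypothetical (n_fft, hop) pairs spanning fine-time to fine-frequency views.
RESOLUTIONS = [(512, 128), (1024, 256), (2048, 512)]

def multi_resolution_spectrograms(signal):
    """One magnitude spectrogram per resolution, as discriminator inputs."""
    return [stft_magnitude(signal, n_fft, hop) for n_fft, hop in RESOLUTIONS]

audio = np.random.randn(16000)  # 1 second of noise at 16 kHz (stand-in)
specs = multi_resolution_spectrograms(audio)
```

Each resolution trades time precision against frequency precision, so a discriminator seeing all of them can penalize artifacts that a single-resolution view would miss.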
2 code implementations • 19 Nov 2020 • Won Jang, Dan Lim, Jaesam Yoon
To preserve sound quality when the MelGAN-based structure is trained with a dataset of hundreds of speakers, we added multi-resolution spectrogram discriminators to sharpen the spectral resolution of the generated waveforms.
no code implementations • 15 May 2020 • Dan Lim, Won Jang, Gyeonghwan O, Heayoung Park, Bong-Wan Kim, Jaesam Yoon
We propose the Jointly trained Duration Informed Transformer (JDI-T), a feed-forward Transformer with a duration predictor, jointly trained without explicit alignments, to generate an acoustic feature sequence from an input text.
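The duration-informed idea above relies on expanding each text-side hidden state by its predicted duration so the sequence matches the acoustic-frame timeline (often called a length regulator in feed-forward TTS). This is a minimal sketch of that expansion step only; the names and shapes are assumptions, not JDI-T's implementation.

```python
import numpy as np

def length_regulate(encoder_states, durations):
    """Repeat each encoder state `durations[i]` times along the time axis,
    turning a phoneme-rate sequence into a frame-rate sequence."""
    return np.repeat(encoder_states, durations, axis=0)

hidden = np.arange(4 * 2, dtype=float).reshape(4, 2)  # 4 phonemes, dim 2
durations = np.array([1, 3, 2, 1])  # predicted frames per phoneme
frames = length_regulate(hidden, durations)
# frames has 1 + 3 + 2 + 1 = 7 rows, one per acoustic frame.
```

Training the duration predictor jointly with the Transformer removes the need for an external aligner to supply the per-phoneme frame counts.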