no code implementations • 5 Jun 2023 • Qianqian Dong, Zhiying Huang, Qiao Tian, Chen Xu, Tom Ko, Yunlong Zhao, Siyuan Feng, Tang Li, Kexin Wang, Xuxin Cheng, Fengpeng Yue, Ye Bai, Xi Chen, Lu Lu, Zejun Ma, Yuping Wang, Mingxuan Wang, Yuxuan Wang
For the speech synthesis part, we adopt the existing VALL-E X approach and build a unit-based audio language model.
no code implementations • 17 Sep 2022 • Ye Bai, Jie Li, Wenjing Han, Hao Ni, Kaituo Xu, Zhuo Zhang, Cheng Yi, Xiaorui Wang
Experimental results show that the proposed model achieves competitive performance with 1/3 of the parameters of the encoder, compared with the full-parameter model.
no code implementations • 17 Feb 2022 • Jiangyan Yi, Ruibo Fu, JianHua Tao, Shuai Nie, Haoxin Ma, Chenglong Wang, Tao Wang, Zhengkun Tian, Ye Bai, Cunhang Fan, Shan Liang, Shiming Wang, Shuai Zhang, Xinrui Yan, Le Xu, Zhengqi Wen, Haizhou Li, Zheng Lian, Bin Liu
Audio deepfake detection is an emerging topic, which was included in the ASVspoof 2021 challenge.
no code implementations • 15 Apr 2021 • Haoxin Ma, Jiangyan Yi, JianHua Tao, Ye Bai, Zhengkun Tian, Chenglong Wang
However, fine-tuning leads to performance degradation on previous data.
1 code implementation • 8 Apr 2021 • Jiangyan Yi, Ye Bai, JianHua Tao, Haoxin Ma, Zhengkun Tian, Chenglong Wang, Tao Wang, Ruibo Fu
Therefore, this paper develops such a dataset for half-truth audio detection (HAD).
no code implementations • 7 Apr 2021 • Zhengkun Tian, Jiangyan Yi, Ye Bai, JianHua Tao, Shuai Zhang, Zhengqi Wen
It takes a lot of computation and time to predict the blank tokens, yet only the non-blank tokens appear in the final output sequence.
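The motivation above can be illustrated with a minimal frame-synchronous decoding sketch: frames whose best label is the blank token contribute nothing to the output, so a decoder that detects them cheaply can skip the expensive per-token computation for those frames. This is a generic illustration of blank-collapsing, not the paper's exact algorithm; the toy vocabulary and logits are invented.

```python
BLANK = 0  # index of the blank token in the toy vocabulary

def greedy_ctc_decode(frame_logits):
    """Collapse repeated labels and drop blanks from per-frame argmaxes."""
    out, prev = [], None
    for logits in frame_logits:
        tok = max(range(len(logits)), key=logits.__getitem__)
        if tok != BLANK and tok != prev:
            out.append(tok)
        prev = tok
    return out

# toy example: 6 frames over vocab {0: blank, 1: 'a', 2: 'b'}
logits = [
    [0.9, 0.05, 0.05],  # blank -> skipped
    [0.1, 0.8, 0.1],    # 'a'
    [0.1, 0.8, 0.1],    # 'a' (repeat, collapsed)
    [0.9, 0.05, 0.05],  # blank -> skipped
    [0.1, 0.1, 0.8],    # 'b'
    [0.9, 0.05, 0.05],  # blank -> skipped
]
print(greedy_ctc_decode(logits))  # [1, 2]
```

Half the frames here are blanks, which is exactly the wasted work the paper targets.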
1 code implementation • 4 Apr 2021 • Zhengkun Tian, Jiangyan Yi, JianHua Tao, Ye Bai, Shuai Zhang, Zhengqi Wen, Xuefei Liu
To address these two problems, we propose a new model named the two-step non-autoregressive transformer (TSNAT), which improves the performance and accelerates the convergence of the NAR model by learning prior knowledge from a parameter-sharing AR model.
no code implementations • 15 Feb 2021 • Ye Bai, Jiangyan Yi, JianHua Tao, Zhengkun Tian, Zhengqi Wen, Shuai Zhang
Based on this idea, we propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
no code implementations • 28 Oct 2020 • Shuai Zhang, Jiangyan Yi, Zhengkun Tian, Ye Bai, JianHua Tao, Zhengqi Wen
In this paper, we propose a decoupled transformer model to use monolingual paired data and unpaired text data to alleviate the problem of code-switching data shortage.
Automatic Speech Recognition (ASR) +1
no code implementations • 28 Oct 2020 • Zhengkun Tian, Jiangyan Yi, Ye Bai, JianHua Tao, Shuai Zhang, Zhengqi Wen
Inspired by the success of two-pass end-to-end models, we introduce a transformer decoder and the two-stage inference method into the streaming CTC model.
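The two-stage idea can be sketched as follows: a streaming first pass (here CTC) proposes n-best hypotheses with low latency, and a full-context second pass re-scores them and picks the best. This is a generic two-pass rescoring sketch under assumed toy scores, not the paper's exact architecture.

```python
def two_stage_decode(nbest, rescore):
    """Second pass: re-score first-pass n-best hypotheses with a
    full-context model and return the highest-scoring one."""
    return max(nbest, key=rescore)

# hypothetical first-pass n-best list with second-pass log-probabilities
second_pass_scores = {"a cat": -1.2, "a cap": -2.5, "o cat": -3.0}
best = two_stage_decode(list(second_pass_scores), second_pass_scores.get)
print(best)  # a cat
```

The latency benefit comes from the first pass streaming results immediately, while the second pass only runs over a short candidate list.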
no code implementations • Pattern Recognition 2020 • Bocheng Zhao, JianHua Tao, Minghao Yang, Zhengkun Tian, Cunhang Fan, Ye Bai
Calligraphy imitation (CI) from a handful of target handwriting samples is so challenging that most existing writing-style analysis and handwriting generation methods do not perform satisfactorily on it.
no code implementations • 16 May 2020 • Zhengkun Tian, Jiangyan Yi, Jian-Hua Tao, Ye Bai, Shuai Zhang, Zhengqi Wen
To address this problem and improve the inference speed, we propose a spike-triggered non-autoregressive transformer model for end-to-end speech recognition, which introduces a CTC module to predict the length of the target sequence and accelerate the convergence.
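The length-prediction role of the CTC module can be sketched simply: each first frame of a non-blank run is a CTC "spike", and the number of spikes gives the predicted target length that the non-autoregressive decoder then fills in parallel. This is an illustrative sketch of the spike-counting idea only; the frame labels below are invented.

```python
def predict_length(frame_argmax, blank=0):
    """Count CTC spikes (first frames of non-blank runs); the spike
    count serves as the predicted target-sequence length."""
    n, prev = 0, blank
    for tok in frame_argmax:
        if tok != blank and tok != prev:
            n += 1
        prev = tok
    return n

# toy per-frame argmax labels: three spikes -> predicted length 3
print(predict_length([0, 3, 3, 0, 5, 0, 7]))  # 3
```

With the length fixed up front, all output positions can be decoded in one parallel step instead of token by token.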
no code implementations • 11 May 2020 • Ye Bai, Jiangyan Yi, Jian-Hua Tao, Zhengkun Tian, Zhengqi Wen, Shuai Zhang
Without beam search, the one-pass propagation greatly reduces the inference time cost of LASO.
no code implementations • 1 Apr 2020 • Jiangyan Yi, Jian-Hua Tao, Ye Bai, Zhengkun Tian, Cunhang Fan
The other is that POS tags are provided by an external POS tagger.
no code implementations • 19 Feb 2020 • Shuai Zhang, Jiangyan Yi, Zhengkun Tian, Jian-Hua Tao, Ye Bai
Recently, language identity information has been utilized to improve the performance of end-to-end code-switching (CS) speech recognition.
no code implementations • 6 Dec 2019 • Zhengkun Tian, Jiangyan Yi, Ye Bai, Jian-Hua Tao, Shuai Zhang, Zhengqi Wen
Once a fixed-length chunk of the input sequence is processed by the encoder, the decoder begins to predict symbols immediately.
no code implementations • 4 Dec 2019 • Ye Bai, Jiangyan Yi, Jian-Hua Tao, Zhengqi Wen, Zhengkun Tian, Shuai Zhang
To alleviate the above two issues, we propose a unified method called LST (Learn Spelling from Teachers) to integrate knowledge into an AED model from the external text-only data and leverage the whole context in a sentence.
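A minimal sketch of the distillation idea in LST: instead of training the AED model against one-hot targets alone, the loss uses soft labels that mix the ground truth with a teacher language model's distribution, so text-only knowledge flows into the ASR model at training time. The mixing weight `lam` and the toy distributions are illustrative assumptions, not values from the paper.

```python
import math

def lst_loss(student_probs, teacher_probs, target_idx, lam=0.5):
    """Cross-entropy against soft labels: a (1-lam)/lam mixture of the
    one-hot ground truth and the teacher LM's distribution."""
    loss = 0.0
    for i, p in enumerate(student_probs):
        soft = (1.0 - lam) * (1.0 if i == target_idx else 0.0) \
               + lam * teacher_probs[i]
        loss -= soft * math.log(p)
    return loss

# toy 3-token vocabulary; ground truth is token 0
loss = lst_loss([0.7, 0.2, 0.1], [0.5, 0.4, 0.1], target_idx=0)
```

With `lam = 0`, this reduces to ordinary cross-entropy; larger `lam` pushes the student toward the teacher LM's whole-sentence preferences.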
no code implementations • 28 Sep 2019 • Zhengkun Tian, Jiangyan Yi, Jian-Hua Tao, Ye Bai, Zhengqi Wen
Furthermore, a path-aware regularization is proposed to assist SA-T to learn alignments and improve the performance.
no code implementations • 13 Jul 2019 • Ye Bai, Jiangyan Yi, Jian-Hua Tao, Zhengkun Tian, Zhengqi Wen
Integrating an external language model into a sequence-to-sequence speech recognition system is non-trivial.
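For context, the most common baseline for such integration is shallow fusion, which interpolates acoustic and language-model scores at decode time; its limitations are part of what makes tighter integration non-trivial. This sketch shows shallow fusion only, not the method this paper proposes; the toy probabilities are invented.

```python
import math

def shallow_fusion(am_logprobs, lm_logprobs, lm_weight=0.3):
    """Per-token score: log p_am(token) + lm_weight * log p_lm(token)."""
    return [a + lm_weight * l for a, l in zip(am_logprobs, lm_logprobs)]

# toy 2-token decision: the AM prefers token 0, the LM prefers token 1
am = [math.log(0.6), math.log(0.4)]
lm = [math.log(0.2), math.log(0.8)]
fused = shallow_fusion(am, lm)
best = max(range(len(fused)), key=fused.__getitem__)
print(best)  # 1 -- the LM flips the decision
```

The interpolation weight must be tuned per task, and the fused score is a heuristic rather than a proper probability, which motivates the learned integration explored here.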