1 code implementation • 16 Jan 2024 • Zhenhui Ye, Tianyun Zhong, Yi Ren, Jiaqi Yang, Weichuang Li, Jiawei Huang, Ziyue Jiang, Jinzheng He, Rongjie Huang, Jinglin Liu, Chen Zhang, Xiang Yin, Zejun Ma, Zhou Zhao
One-shot 3D talking portrait generation aims to reconstruct a 3D avatar from an unseen image, and then animate it with a reference video or audio to generate a talking portrait video.
no code implementations • 15 Nov 2023 • Jin Qiu, Lu Huang, Boyu Li, Jun Zhang, Lu Lu, Zejun Ma
Deep biasing for the Transducer can improve the recognition performance of rare words or contextual entities, which is essential in practical applications, especially for streaming Automatic Speech Recognition (ASR).
Automatic Speech Recognition (ASR) +1
1 code implementation • 20 Oct 2023 • Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang
Hearing is arguably an essential ability of artificial intelligence (AI) agents in the physical world, which refers to the perception and understanding of general auditory information consisting of at least three types of sounds: speech, audio events, and music.
2 code implementations • 9 Oct 2023 • Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang
Audio-visual large language models (LLM) have drawn significant attention, yet the fine-grained combination of both input streams is rather under-explored, which is challenging but necessary for LLMs to understand general video inputs.
no code implementations • 25 Sep 2023 • Wenyi Yu, Changli Tang, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang
Q-Former-based LLMs can generalise well to out-of-domain datasets, where 12% relative WER reductions over the Whisper baseline ASR model were achieved on the Eval2000 test set without using any in-domain training data from Switchboard.
Automatic Speech Recognition (ASR) +3
no code implementations • 14 Jul 2023 • Ziyue Jiang, Jinglin Liu, Yi Ren, Jinzheng He, Zhenhui Ye, Shengpeng Ji, Qian Yang, Chen Zhang, Pengfei Wei, Chunfeng Wang, Xiang Yin, Zejun Ma, Zhou Zhao
However, the prompting mechanisms of zero-shot TTS still face challenges in the following aspects: 1) previous works of zero-shot TTS are typically trained with single-sentence prompts, which significantly restricts their performance when the data is relatively sufficient during the inference stage.
no code implementations • 27 Jun 2023 • Yahuan Cong, Haoyu Zhang, Haopeng Lin, Shichao Liu, Chunfeng Wang, Yi Ren, Xiang Yin, Zejun Ma
Cross-lingual timbre and style generalizable text-to-speech (TTS) aims to synthesize speech with a specific reference timbre or style that is never trained in the target language.
1 code implementation • 14 Jun 2023 • Xinghua Qu, Hongyang Liu, Zhu Sun, Xiang Yin, Yew Soon Ong, Lu Lu, Zejun Ma
Conversational recommender systems (CRSs) have become a crucial emerging research topic in the field of RSs, thanks to their natural advantages of explicitly acquiring user preferences via interactive conversations and revealing the reasons behind recommendations.
no code implementations • 9 Jun 2023 • Xianzhao Chen, Yist Y. Lin, Kang Wang, Yi He, Zejun Ma
In this paper, we improve the frame-level classifier for word timings in an E2E system by introducing label priors into the connectionist temporal classification (CTC) loss, adopted from prior works, and by combining low-level Mel-scale filter banks with the high-level ASR encoder output as the input feature.
Automatic Speech Recognition (ASR) +2
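The label-prior idea in the entry above can be sketched as re-weighting frame-level CTC posteriors by estimated label priors, which counteracts blank-dominated peaky distributions. A minimal numpy sketch; the function name, prior values, and shapes are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def apply_label_priors(log_posteriors, label_priors, prior_scale=1.0):
    """Subtract scaled log label priors from frame-level CTC
    log-posteriors, then re-normalize each frame.

    log_posteriors: (T, V) frame-level log-probabilities.
    label_priors:   (V,) estimated label prior probabilities.
    """
    adjusted = log_posteriors - prior_scale * np.log(label_priors)
    # Re-normalize so each frame is a valid log-distribution again.
    adjusted -= np.logaddexp.reduce(adjusted, axis=1, keepdims=True)
    return adjusted

# Toy example: 2 frames, 3 labels, with a blank-heavy posterior (index 0).
logp = np.log(np.array([[0.90, 0.05, 0.05],
                        [0.80, 0.15, 0.05]]))
priors = np.array([0.7, 0.2, 0.1])  # blank occurs very frequently
adj = apply_label_priors(logp, priors)
```

Because the blank label carries the largest prior, its re-weighted probability drops, which tends to spread the alignment mass over more informative frames.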
no code implementations • 7 Jun 2023 • Lu Huang, Boyu Li, Jun Zhang, Lu Lu, Zejun Ma
Domain adaptation using a text-only corpus is challenging in end-to-end (E2E) speech recognition.
no code implementations • 6 Jun 2023 • Ziyue Jiang, Yi Ren, Zhenhui Ye, Jinglin Liu, Chen Zhang, Qian Yang, Shengpeng Ji, Rongjie Huang, Chunfeng Wang, Xiang Yin, Zejun Ma, Zhou Zhao
3) We further use a VQGAN-based acoustic model to generate the spectrogram and a latent code language model to fit the distribution of prosody, since prosody changes quickly over time in a sentence, and language models can capture both local and long-range dependencies.
no code implementations • 6 Jun 2023 • Zhenhui Ye, Ziyue Jiang, Yi Ren, Jinglin Liu, Chen Zhang, Xiang Yin, Zejun Ma, Zhou Zhao
We are interested in a novel task, namely low-resource text-to-talking avatar.
no code implementations • 5 Jun 2023 • Qianqian Dong, Zhiying Huang, Qiao Tian, Chen Xu, Tom Ko, Yunlong Zhao, Siyuan Feng, Tang Li, Kexin Wang, Xuxin Cheng, Fengpeng Yue, Ye Bai, Xi Chen, Lu Lu, Zejun Ma, Yuping Wang, Mingxuan Wang, Yuxuan Wang
For the speech synthesis part, we adopt the existing VALL-E X approach and build a unit-based audio language model.
no code implementations • 29 May 2023 • Jiawei Huang, Yi Ren, Rongjie Huang, Dongchao Yang, Zhenhui Ye, Chen Zhang, Jinglin Liu, Xiang Yin, Zejun Ma, Zhou Zhao
Finally, we use LLMs to augment and transform a large amount of audio-label data into audio-text datasets to alleviate the problem of scarcity of temporal data.
Ranked #7 on Audio Generation on AudioCaps
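The augmentation step described in the entry above, transforming audio-label data into audio-text pairs, is done with LLMs in the paper; a hypothetical template-based stand-in illustrates the shape of that transformation (the function name and caption template are assumptions for illustration only):

```python
def labels_to_caption(events):
    """Turn an ordered list of (label, start, end) audio events into a
    plain-text caption, preserving the temporal ordering that the
    timestamps encode.
    """
    parts = [f"{label} from {start:.1f}s to {end:.1f}s"
             for label, start, end in events]
    return "An audio clip with " + ", then ".join(parts) + "."

caption = labels_to_caption([("dog barking", 0.0, 2.5),
                             ("car horn", 2.5, 4.0)])
```

An LLM rewriting step would then paraphrase such templated captions into more natural, diverse text while keeping the temporal relations intact.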
no code implementations • 27 May 2023 • Linhao Dong, Zhecheng An, Peihao Wu, Jun Zhang, Lu Lu, Zejun Ma
We also observe that the cross-modal representation extracted by CIF-PT obtains better performance than other neural interfaces on SLU tasks, including the dominant speech representation learned from self-supervised pre-training.
no code implementations • 19 May 2023 • Kaiqi Fu, Shaojun Gao, Shuju Shi, Xiaohai Tian, Wei Li, Zejun Ma
Specifically, we first pre-train the model using a reconstruction loss function, by masking phones and their durations jointly on a large amount of unlabeled speech and text prompts.
no code implementations • 1 May 2023 • Zhenhui Ye, Jinzheng He, Ziyue Jiang, Rongjie Huang, Jiawei Huang, Jinglin Liu, Yi Ren, Xiang Yin, Zejun Ma, Zhou Zhao
Recently, neural radiance field (NeRF) has become a popular rendering technique in this field since it could achieve high-fidelity and 3D-consistent talking face generation with a few-minute-long training video.
1 code implementation • 26 Apr 2023 • Bing Wang, Xinnian Liang, Jian Yang, Hui Huang, Shuangzhi Wu, Peihao Wu, Lu Lu, Zejun Ma, Zhoujun Li
Large Language Models (LLMs) are constrained by their inability to process lengthy inputs, resulting in the loss of critical historical information.
no code implementations • 21 Mar 2023 • Xingjian Du, Zijie Wang, Xia Liang, Huidong Liang, Bilei Zhu, Zejun Ma
Deep learning based methods have become a paradigm for cover song identification (CSI) in recent years, where the ByteCover systems have achieved state-of-the-art results on all the mainstream datasets of CSI.
no code implementations • 2 Mar 2023 • Chunfeng Wang, Peisong Huang, Yuxiang Zou, Haoyu Zhang, Shichao Liu, Xiang Yin, Zejun Ma
As a key component of automatic speech recognition (ASR) and the front-end in text-to-speech (TTS), grapheme-to-phoneme (G2P) plays the role of converting letters to their corresponding pronunciations.
no code implementations • 21 Feb 2023 • Wei Liu, Kaiqi Fu, Xiaohai Tian, Shuju Shi, Wei Li, Zejun Ma, Tan Lee
Recent studies on pronunciation scoring have explored the effect of introducing phone embeddings as reference pronunciation, but mostly in an implicit manner, i.e., addition or concatenation of the reference phone embedding and the actual pronunciation of the target phone as the phone-level pronunciation quality representation.
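The two implicit fusion schemes mentioned above, addition and concatenation of reference and actual phone features, can be sketched in a few lines (a minimal numpy sketch; the function name and embedding dimensions are illustrative assumptions):

```python
import numpy as np

def fuse_phone_features(ref_emb, actual_emb, mode="concat"):
    """Combine a reference phone embedding with the actual pronunciation
    feature of the target phone. Addition requires matching dimensions
    and mixes the two signals; concatenation keeps them separable.
    """
    if mode == "add":
        return ref_emb + actual_emb
    if mode == "concat":
        return np.concatenate([ref_emb, actual_emb])
    raise ValueError(f"unknown fusion mode: {mode}")

ref = np.ones(4)          # toy reference phone embedding
act = np.arange(4.0)      # toy actual pronunciation feature
added = fuse_phone_features(ref, act, mode="add")
cat = fuse_phone_features(ref, act, mode="concat")
```

The design trade-off is that addition preserves dimensionality but entangles the reference and the realization, whereas concatenation doubles the feature size but lets a downstream scorer compare the two explicitly.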
no code implementations • 20 Feb 2023 • Wei Liu, Kaiqi Fu, Xiaohai Tian, Shuju Shi, Wei Li, Zejun Ma, Tan Lee
A typical fluency scoring system generally relies on an automatic speech recognition (ASR) system to obtain time stamps in input speech for either the subsequent calculation of fluency-related features or directly modeling speech fluency with an end-to-end approach.
Automatic Speech Recognition (ASR) +3
1 code implementation • ICCV 2023 • Zhi Li, Pengfei Wei, Xiang Yin, Zejun Ma, Alex C. Kot
In our method, human pose and garment keypoints are extracted from source images and constructed as graphs to predict the garment keypoints at the target pose.
no code implementations • 12 Dec 2022 • Junhui Zhang, Junjie Pan, Xiang Yin, Zejun Ma
Speech-to-speech translation directly translates a speech utterance in one language into another language, and has great potential in tasks such as simultaneous interpretation.
1 code implementation • 13 Nov 2022 • Haotong Qin, Xudong Ma, Yifu Ding, Xiaoyang Li, Yang Zhang, Zejun Ma, Jiakai Wang, Jie Luo, Xianglong Liu
We highlight that benefiting from the compact architecture and optimized hardware kernel, BiFSMNv2 can achieve an impressive 25.1x speedup and 20.2x storage saving on edge hardware.
1 code implementation • 7 Nov 2022 • Huidong Liang, Xingjian Du, Bilei Zhu, Zejun Ma, Ke Chen, Junbin Gao
Existing graph contrastive learning methods rely on augmentation techniques based on random perturbations (e.g., randomly adding or dropping edges and nodes).
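The random edge-dropping perturbation mentioned above is the canonical augmentation in this family of methods; a minimal numpy sketch, where the edge-list representation and drop probability are illustrative assumptions:

```python
import numpy as np

def drop_edges(edge_index, drop_prob=0.2, seed=0):
    """Random edge-dropping augmentation for graph contrastive learning.

    edge_index: (2, E) array of edge endpoints; each kept edge survives
    independently with probability 1 - drop_prob, yielding one random
    "view" of the graph.
    """
    rng = np.random.default_rng(seed)
    keep = rng.random(edge_index.shape[1]) >= drop_prob
    return edge_index[:, keep]

edges = np.array([[0, 1, 2, 3, 0],
                  [1, 2, 3, 0, 2]])  # 5 directed edges on 4 nodes
view = drop_edges(edges, drop_prob=0.4)
```

Two independently sampled views of the same graph would then be pushed together in embedding space by the contrastive objective.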
no code implementations • 2 Nov 2022 • Rao Ma, Xiaobo Wu, Jin Qiu, Yanan Qin, HaiHua Xu, Peihao Wu, Zejun Ma
The proposed method can achieve significantly better performance on the target test sets while it gets minimal performance degradation on the general test set, compared with both shallow and ILME-based LM fusion methods.
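The shallow fusion baseline that the entry above compares against is a simple log-linear interpolation of ASR and external-LM scores at each decoding step; a minimal sketch, where the function name, probabilities, and weight are illustrative assumptions:

```python
import math

def shallow_fusion_score(asr_logp, lm_logp, lm_weight=0.3):
    """Shallow fusion: add a weighted external-LM log-probability to the
    ASR model's log-probability for a candidate token.
    """
    return asr_logp + lm_weight * lm_logp

# Two candidate tokens: the external LM boosts the second one enough
# to overturn the ASR model's initial preference.
cand_a = shallow_fusion_score(math.log(0.6), math.log(0.1))
cand_b = shallow_fusion_score(math.log(0.4), math.log(0.5))
```

ILME-based fusion additionally subtracts an estimate of the internal LM score learned implicitly by the E2E model, which is what makes it a stronger (and the harder-to-beat) baseline in the comparison above.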
no code implementations • 28 Oct 2022 • Yist Y. Lin, Tao Han, HaiHua Xu, Van Tung Pham, Yerbolat Khassanov, Tze Yuang Chong, Yi He, Lu Lu, Zejun Ma
One limitation of the end-to-end automatic speech recognition (ASR) framework is that its performance is compromised when train-test utterance lengths are mismatched.
1 code implementation • KDD 2022 • Xinghua Qu, Yew-Soon Ong, Abhishek Gupta, Pengfei Wei, Zhu Sun, Zejun Ma
Given such an issue, we denote the \emph{frame importance} as its contribution to the expected reward on a particular frame, and hypothesize that adapting such frame importance could benefit the performance of the distilled student policy.
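The frame-importance idea above amounts to weighting the per-frame distillation loss so that frames contributing more to the expected reward count more. A minimal numpy sketch of such a weighted distillation loss; the function name, loss form (cross-entropy to the teacher), and toy values are illustrative assumptions, not the paper's exact objective:

```python
import numpy as np

def weighted_distill_loss(teacher_probs, student_logp, frame_weights):
    """Frame-importance-weighted policy distillation loss.

    teacher_probs: (T, A) teacher action distributions per frame.
    student_logp:  (T, A) student log-probabilities per frame.
    frame_weights: (T,) non-negative importance of each frame.
    """
    per_frame = -(teacher_probs * student_logp).sum(axis=1)  # CE per frame
    w = frame_weights / frame_weights.sum()                  # normalize
    return float((w * per_frame).sum())

teacher = np.array([[0.9, 0.1], [0.5, 0.5]])
student = np.log(np.array([[0.8, 0.2], [0.5, 0.5]]))
uniform = weighted_distill_loss(teacher, student, np.array([1.0, 1.0]))
focused = weighted_distill_loss(teacher, student, np.array([1.0, 0.0]))
```

With all weight on the frame the student already fits well, the loss drops, which is the mechanism by which adapting frame importance can steer the student toward reward-relevant frames.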
no code implementations • 10 Jun 2022 • Junhui Zhang, Wudi Bao, Junjie Pan, Xiang Yin, Zejun Ma
In this paper, we propose a novel Chinese dialect TTS frontend with a translation module, which converts Mandarin text into dialectic expressions to improve the intelligibility and naturalness of synthesized speech.
no code implementations • Findings (NAACL) 2022 • Yu Lin, Zhecheng An, Peihao Wu, Zejun Ma
To tackle this issue, we propose an auxiliary gloss regularizer module to BERT pre-training (GR-BERT), to enhance word semantic similarity.
no code implementations • ICASSP 2022 • Xingjian Du, Ke Chen, Zijie Wang, Bilei Zhu, Zejun Ma
Convolutional neural network (CNN)-based methods have dominated the recent research of cover song identification (CSI).
Ranked #1 on Cover song identification on SHS100K-TEST
no code implementations • 9 Mar 2022 • Yizhou Lu, Mingkun Huang, Xinghua Qu, Pengfei Wei, Zejun Ma
It makes room for language specific modeling by pruning out unimportant parameters for each language, without requiring any manually designed language specific component.
no code implementations • 1 Mar 2022 • Kaiqi Fu, Shaojun Gao, Kai Wang, Wei Li, Xiaohai Tian, Zejun Ma
Moreover, we utilize multi-source information (e.g., MFCC and deep features) to further improve the scoring system performance.
1 code implementation • 21 Feb 2022 • Hang Zhao, Chen Zhang, Bilei Zhu, Zejun Ma, Kejun Zhang
To our knowledge, S3T is the first method combining the Swin Transformer with a self-supervised learning method for music classification.
1 code implementation • 14 Feb 2022 • Haotong Qin, Xudong Ma, Yifu Ding, Xiaoyang Li, Yang Zhang, Yao Tian, Zejun Ma, Jie Luo, Xianglong Liu
Then, to allow the instant and adaptive accuracy-efficiency trade-offs at runtime, we also propose a Thinnable Binarization Architecture to further liberate the acceleration potential of the binarized network from the topology perspective.
no code implementations • 9 Feb 2022 • Chen Shen, Yi Liu, Wenzhi Fan, Bin Wang, Shixue Wen, Yao Tian, Jun Zhang, Jingsheng Yang, Zejun Ma
For Track 1, we propose several approaches to empower the clustering-based speaker diarization system to handle overlapped speech.
1 code implementation • 2 Feb 2022 • Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick, Shlomo Dubnov
To combat these problems, we introduce HTS-AT: an audio transformer with a hierarchical structure to reduce the model size and training time.
Ranked #4 on Sound Event Detection on DESED
1 code implementation • 30 Jan 2022 • Minglun Han, Linhao Dong, Zhenlin Liang, Meng Cai, Shiyu Zhou, Zejun Ma, Bo Xu
Nowadays, most methods in end-to-end contextual speech recognition bias the recognition process towards contextual knowledge.
no code implementations • 26 Jan 2022 • Yufei Liu, Rao Ma, HaiHua Xu, Yi He, Zejun Ma, Weibin Zhang
In this paper we propose two novel approaches to estimate the ILM based on Listen-Attend-Spell (LAS) framework.
no code implementations • 17 Jan 2022 • Tianyi Xie, Liucheng Liao, Cheng Bi, Benlai Tang, Xiang Yin, Jianfei Yang, Mingjie Wang, Jiali Yao, Yang Zhang, Zejun Ma
The task of few-shot visual dubbing focuses on synchronizing the lip movements with arbitrary speech input for any talking head video.
1 code implementation • 15 Dec 2021 • Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick, Shlomo Dubnov
Our approach uses a single model for source separation of multiple sound types, and relies solely on weakly-labeled data for training.
Ranked #1 on Audio Source Separation on AudioSet
no code implementations • AAAI 2021 • Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick, Shlomo Dubnov
1 code implementation • 14 Oct 2021 • Jingning Xu, Benlai Tang, Mingjie Wang, Siyuan Bian, Wenyi Guo, Xiang Yin, Zejun Ma
To tackle this problem, most recent AdaIN-based architectures have been proposed to extract clothes and scenario features for generation.
no code implementations • 10 Oct 2021 • Chao Wang, Zhonghao Li, Benlai Tang, Xiang Yin, Yuan Wan, Yibiao Yu, Zejun Ma
Experiments show that, compared with the baseline models, our proposed model can significantly improve the naturalness of converted singing voices and the similarity with the target singer.
Automatic Speech Recognition (ASR) +2
1 code implementation • 8 Oct 2021 • Pengfei Wu, Junjie Pan, Chenchang Xu, Junhui Zhang, Lin Wu, Xiang Yin, Zejun Ma
In expressive speech synthesis, there are high requirements for emotion interpretation.
no code implementations • 8 Oct 2021 • Shaoshi Ling, Chen Shen, Meng Cai, Zejun Ma
In the recent trend of semi-supervised speech recognition, both self-supervised representation learning and pseudo-labeling have shown promising results.
no code implementations • 29 Sep 2021 • Xinghua Qu, Pengfei Wei, Mingyong Gao, Zhu Sun, Yew-Soon Ong, Zejun Ma
Adversarial examples in automatic speech recognition (ASR) sound natural to humans yet are capable of fooling well-trained ASR models into transcribing incorrectly.
no code implementations • 2 Apr 2021 • Lu Huang, Jingyu Sun, Yufeng Tang, JunFeng Hou, Jinkun Chen, Jun Zhang, Zejun Ma
This work describes an encoder pre-training procedure using frame-wise labels to improve the training of a streaming recurrent neural network transducer (RNN-T) model.
no code implementations • 27 Nov 2020 • Pengfei Wei, Xinghua Qu, Yew Soon Ong, Zejun Ma
Existing studies usually assume that the learned new feature representation is \emph{domain-invariant}, and thus train a transfer model $\mathcal{M}$ on the source domain.
no code implementations • 3 Nov 2020 • Mingkun Huang, Meng Cai, Jun Zhang, Yang Zhang, Yongbin You, Yi He, Zejun Ma
In this work we propose an inference technique, asynchronous revision, to unify streaming and non-streaming speech recognition models.
no code implementations • 3 Nov 2020 • Mingkun Huang, Jun Zhang, Meng Cai, Yang Zhang, Jiali Yao, Yongbin You, Yi He, Zejun Ma
In this work, we analyze the cause of the huge gradient variance in RNN-T training and propose a new \textit{normalized jointer network} to overcome it.
Automatic Speech Recognition (ASR) +2
no code implementations • 28 Oct 2020 • Zhonghao Li, Benlai Tang, Xiang Yin, Yuan Wan, Ling Xu, Chen Shen, Zejun Ma
Singing voice conversion (SVC) aims to convert the voice of one singer to that of other singers while keeping the singing content and melody.
1 code implementation • 27 Oct 2020 • Xingjian Du, Zhesong Yu, Bilei Zhu, Xiaoou Chen, Zejun Ma
We present in this paper ByteCover, which is a new feature learning method for cover song identification (CSI).
Ranked #2 on Cover song identification on Da-TACOS
1 code implementation • 27 Oct 2020 • Yuanbo Hou, Yi Deng, Bilei Zhu, Zejun Ma, Dick Botteldooren
Detecting an anchor's voice in live musical streams is an important preprocessing step for music and speech signal processing.
Sound • Multimedia • Audio and Speech Processing
no code implementations • 26 Oct 2020 • Zhesong Yu, Xingjian Du, Bilei Zhu, Zejun Ma
The rise of video-sharing platforms has attracted more and more people to shoot videos and upload them to the Internet.
no code implementations • 19 May 2020 • Wenjie Li, Benlai Tang, Xiang Yin, Yushi Zhao, Wei Li, Kang Wang, Hao Huang, Yuxuan Wang, Zejun Ma
Accent conversion (AC) transforms a non-native speaker's accent into a native accent while maintaining the speaker's voice timbre.
no code implementations • 23 Apr 2020 • Yu Gu, Xiang Yin, Yonghui Rao, Yuan Wan, Benlai Tang, Yang Zhang, Jitong Chen, Yuxuan Wang, Zejun Ma
This paper presents ByteSing, a Chinese singing voice synthesis (SVS) system based on duration allocated Tacotron-like acoustic models and WaveRNN neural vocoders.
no code implementations • 11 Nov 2019 • Junhui Zhang, Junjie Pan, Xiang Yin, Chen Li, Shichao Liu, Yang Zhang, Yuxuan Wang, Zejun Ma
In this paper, we propose a hybrid text normalization system using multi-head self-attention.
no code implementations • 11 Nov 2019 • Junjie Pan, Xiang Yin, Zhiling Zhang, Shichao Liu, Yang Zhang, Zejun Ma, Yuxuan Wang
In Mandarin text-to-speech (TTS) system, the front-end text processing module significantly influences the intelligibility and naturalness of synthesized speech.
no code implementations • 17 May 2017 • Xu Tian, Jun Zhang, Zejun Ma, Yi He, Juan Wei
The system that combines frame retaining with frame stacking reduces the time consumption of both training and decoding.
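The frame stacking and retaining operations above can be sketched as follows: stack groups of consecutive acoustic frames into wider frames, then keep only every k-th stacked frame to shorten the sequence. A minimal numpy sketch; the function name and the particular stack/retain factors are illustrative assumptions:

```python
import numpy as np

def stack_and_retain(frames, stack=3, retain_every=2):
    """Stack each group of `stack` consecutive acoustic frames into one
    wider frame, then retain every `retain_every`-th stacked frame,
    reducing sequence length (and thus compute) for training/decoding.

    frames: (T, D) -> roughly (T // stack // retain_every, stack * D)
    """
    T, D = frames.shape
    usable = T - T % stack               # drop the trailing remainder
    stacked = frames[:usable].reshape(-1, stack * D)
    return stacked[::retain_every]

x = np.arange(24.0).reshape(12, 2)   # 12 toy frames of dimension 2
y = stack_and_retain(x)              # shorter sequence of wider frames
```

Halving or thirding the frame rate this way shortens every recurrent unrolling step, which is where the reported training and decoding speedups come from.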
no code implementations • 21 Mar 2017 • Xu Tian, Jun Zhang, Zejun Ma, Yi He, Juan Wei, Peihao Wu, Wenchang Situ, Shuai Li, Yang Zhang
In this competitive framework, LSTM models of more than 7 layers are successfully trained on Shenma voice search data in Mandarin, and they outperform deep LSTM models trained by the conventional approach.
no code implementations • 3 Mar 2017 • Xu Tian, Jun Zhang, Zejun Ma, Yi He, Juan Wei
As training data grows rapidly, large-scale parallel training on multi-GPU clusters is now widely applied to neural network model learning. We present a new approach that applies the exponential moving average method in large-scale parallel training of neural network models.
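The exponential moving average step at the core of the approach above can be sketched as a simple per-parameter update; a minimal sketch, where the function name, decay value, and synchronization schedule are illustrative assumptions:

```python
def ema_update(avg_params, new_params, decay=0.9):
    """One exponential-moving-average step over model parameters: the
    averaged model moves a fraction (1 - decay) toward the freshly
    trained parameters at each synchronization round.
    """
    return [decay * a + (1.0 - decay) * p
            for a, p in zip(avg_params, new_params)]

# Toy run: two scalar "parameters", three synchronization rounds in
# which every worker reports the same new value 1.0.
avg = [0.0, 1.0]
for _ in range(3):
    avg = ema_update(avg, [1.0, 1.0], decay=0.9)
```

In a parallel setting, such an average smooths out the parameter noise introduced by asynchronous or infrequent synchronization, at the cost of the averaged model lagging slightly behind the latest workers.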