Multi-task learning improves synthetic speech detection

With the development of deep learning, synthetic speech has become increasingly realistic, making it easier to spoof Automatic Speaker Verification (ASV) systems. Many detection algorithms have been proposed against this malicious attack, based on mining more effective hand-crafted features and designing more powerful networks. In this paper, observing that deepening the network impairs its performance on unknown attacks, we argue that synthetic speech detection is an out-of-distribution (OOD) generalization problem, and we enhance the robustness of networks through multi-task learning. In our system, three auxiliary tasks assist synthetic speech detection: bonafide speech reconstruction, spoofing voice conversion, and speaker classification. Experimental results show that our approach can be applied to multiple architectures and significantly improves performance on both known attacks (development set) and unknown attacks (evaluation set). Moreover, our best-performing network is competitive with recent state-of-the-art (SOTA) systems, demonstrating the potential of multi-task learning for synthetic speech detection.
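The abstract does not give implementation details, but the multi-task setup it describes, a shared encoder feeding one main anti-spoofing head plus three auxiliary heads, can be illustrated with a minimal PyTorch sketch. All module choices, dimensions, and names below are illustrative assumptions, not the paper's reported configuration.

```python
import torch
import torch.nn as nn

class MultiTaskSpoofDetector(nn.Module):
    """Hypothetical multi-task model: one shared encoder, four task heads.
    Encoder type, layer sizes, and pooling are assumptions for illustration."""

    def __init__(self, feat_dim=60, hidden_dim=128, num_speakers=20):
        super().__init__()
        # Shared encoder over acoustic feature frames (e.g. hand-crafted
        # features such as LFCCs); the paper's actual backbone may differ.
        self.encoder = nn.LSTM(feat_dim, hidden_dim, num_layers=2,
                               batch_first=True)
        # Main task: bonafide vs. spoofed classification.
        self.detect_head = nn.Linear(hidden_dim, 2)
        # Auxiliary task 1: reconstruct bonafide speech features.
        self.reconstruct_head = nn.Linear(hidden_dim, feat_dim)
        # Auxiliary task 2: a stand-in for the paper's "spoofing voice
        # conversion" task, mapping embeddings to spoofed-speech features.
        self.convert_head = nn.Linear(hidden_dim, feat_dim)
        # Auxiliary task 3: speaker classification.
        self.speaker_head = nn.Linear(hidden_dim, num_speakers)

    def forward(self, x):
        h, _ = self.encoder(x)    # (batch, time, hidden_dim)
        pooled = h.mean(dim=1)    # simple temporal average pooling
        return {
            "detect": self.detect_head(pooled),
            "reconstruct": self.reconstruct_head(h),
            "convert": self.convert_head(h),
            "speaker": self.speaker_head(pooled),
        }
```

In a setup like this, training would typically minimize a weighted sum such as loss_detect + w1 * loss_reconstruct + w2 * loss_convert + w3 * loss_speaker, with the weights tuned so the auxiliary tasks regularize the shared encoder without overwhelming the main detection objective. The weighting scheme here is an assumption; the paper's actual loss formulation is not given in the abstract.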
