no code implementations • 31 Jan 2024 • Ankit Gupta, George Saon, Brian Kingsbury
The emergence of industrial-scale speech recognition (ASR) models such as Whisper and USM, trained on 1M hours of weakly labelled data and 12M hours of audio-only proprietary data, respectively, has led to a stronger need for large-scale public ASR corpora and competitive open-source pipelines.
no code implementations • 21 Nov 2023 • Xiaodong Cui, Ashish Mittal, Songtao Lu, Wei Zhang, George Saon, Brian Kingsbury
Soft random sampling (SRS) is a simple yet effective approach for efficient training of large-scale deep neural networks when dealing with massive data.
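A minimal sketch of this idea, assuming the variant in which every training example is kept independently with a fixed probability each epoch (the ratio 0.3 is an illustrative value, not the paper's setting):

```python
import random

def soft_random_sample(dataset, ratio, rng):
    """Per-epoch soft random sampling sketch: keep each example independently
    with probability `ratio`, so every epoch sees a different subset."""
    return [x for x in dataset if rng.random() < ratio]

rng = random.Random(0)
data = list(range(1000))
epoch_subset = soft_random_sample(data, 0.3, rng)  # roughly 30% of the data
```

Because each epoch draws a fresh subset, the model still sees most of the data over many epochs while each individual epoch is much cheaper.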
no code implementations • 19 Sep 2023 • Siddhant Arora, George Saon, Shinji Watanabe, Brian Kingsbury
Non-autoregressive (NAR) modeling has gained significant interest in speech processing since these models achieve dramatically lower inference time than autoregressive (AR) models while also achieving good transcription accuracy.
Automatic Speech Recognition (ASR) +2
no code implementations • 7 Sep 2023 • Takuma Udagawa, Masayuki Suzuki, Gakuto Kurata, Masayasu Muraoka, George Saon
However, existing works transfer only a single representation of the LLM (e.g., the last layer of a pretrained BERT), while the representation of a text is inherently non-unique and can be obtained variously from different layers, contexts and models.
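A minimal sketch of combining representations drawn from several layers instead of only the last one, using an assumed softmax-weighted sum (the paper's actual combination scheme may differ):

```python
import math

def combine_layer_representations(layer_vectors, weights):
    """Softmax-weighted sum over hidden states taken from several LLM layers,
    rather than transferring only the final layer's representation."""
    exps = [math.exp(w) for w in weights]
    z = sum(exps)
    alphas = [e / z for e in exps]        # normalized per-layer weights
    dim = len(layer_vectors[0])
    return [sum(a * vec[i] for a, vec in zip(alphas, layer_vectors)) for i in range(dim)]

# assumed toy hidden states from three layers of a pretrained encoder
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
combined = combine_layer_representations(layers, weights=[0.0, 0.0, 0.0])  # equal weights
```

With equal weights this is a plain layer average; in practice the weights would be learned jointly with the downstream task.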
Automatic Speech Recognition (ASR) +1
no code implementations • 27 Feb 2023 • George Saon, Ankit Gupta, Xiaodong Cui
We improve on the popular conformer architecture by replacing the depthwise temporal convolutions with diagonal state space (DSS) models.
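A toy sketch of the diagonal state space idea, assuming the standard formulation in which the layer materializes a long 1-D kernel from a few decaying complex modes and applies it as a causal convolution over the sequence (the mode values below are illustrative, not trained parameters):

```python
import cmath

def dss_kernel(lambdas, weights, length, dt=1.0):
    """Materialize the convolution kernel of a diagonal SSM:
    K[l] = Re( sum_n w_n * exp(lambda_n * l * dt) ), with Re(lambda_n) < 0
    so each mode decays over time."""
    return [
        sum((w * cmath.exp(lam * l * dt) for w, lam in zip(weights, lambdas)), 0j).real
        for l in range(length)
    ]

def causal_conv(x, k):
    """y[t] = sum_{l<=t} k[l] * x[t-l]: the sequence mixing that replaces
    the conformer's depthwise temporal convolution."""
    return [sum(k[l] * x[t - l] for l in range(min(len(k), t + 1))) for t in range(len(x))]

lambdas = [complex(-0.5, 1.0), complex(-0.2, -2.0)]   # assumed decaying complex modes
weights = [complex(1.0, 0.0), complex(0.5, 0.5)]
k = dss_kernel(lambdas, weights, length=16)
y = causal_conv([1.0] + [0.0] * 15, k)  # impulse response recovers the kernel
```

Unlike a fixed-width depthwise convolution, the kernel length here is limited only by the sequence length, which is what gives state space layers their long-range receptive field.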
no code implementations • 3 Aug 2022 • Jiatong Shi, George Saon, David Haws, Shinji Watanabe, Brian Kingsbury
Beam search, which is the dominant ASR decoding algorithm for end-to-end models, generates tree-structured hypotheses.
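The tree-structured nature of those hypotheses can be illustrated with a toy decoder in which every kept prefix branches over the vocabulary at each step and only the best branches survive (the distributions below are assumed, not from any model):

```python
import math

def beam_search(step_logprobs, beam_size):
    """Toy beam search: `step_logprobs[t][v]` is the log-probability of token v
    at step t. Extending every beam entry by every token grows a tree of
    prefixes, of which only the `beam_size` best branches are kept per step."""
    beams = [((), 0.0)]  # (token_sequence, cumulative_logprob)
    for dist in step_logprobs:
        candidates = [(seq + (v,), score + lp)
                      for seq, score in beams
                      for v, lp in enumerate(dist)]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

# assumed toy distributions over a 3-token vocabulary, two decoding steps
steps = [
    [math.log(0.6), math.log(0.3), math.log(0.1)],
    [math.log(0.2), math.log(0.7), math.log(0.1)],
]
best = beam_search(steps, beam_size=2)
```

With beam size 2 the second step only expands the two best single-token prefixes, so most of the hypothesis tree is never materialized.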
no code implementations • 28 Jul 2022 • Zvi Kons, Hagai Aronowitz, Edmilson Morais, Matheus Damasceno, Hong-Kwang Kuo, Samuel Thomas, George Saon
We propose using a recurrent neural network transducer (RNN-T)-based speech-to-text (STT) system as a common component that can be used for emotion recognition and language identification as well as for speech recognition.
no code implementations • 16 Jun 2022 • Andrea Fasoli, Chia-Yu Chen, Mauricio Serrano, Swagath Venkataramani, George Saon, Xiaodong Cui, Brian Kingsbury, Kailash Gopalakrishnan
We report on aggressive quantization strategies that greatly accelerate inference of Recurrent Neural Network Transducers (RNN-T).
no code implementations • 1 Apr 2022 • Takuma Udagawa, Masayuki Suzuki, Gakuto Kurata, Nobuyasu Itoh, George Saon
Large-scale language models (LLMs) such as GPT-2, BERT and RoBERTa have been successfully applied to ASR N-best rescoring.
no code implementations • 29 Mar 2022 • Xiaodong Cui, George Saon, Tohru Nagano, Masayuki Suzuki, Takashi Fukuda, Brian Kingsbury, Gakuto Kurata
We introduce two techniques, length perturbation and n-best based label smoothing, to improve generalization of deep neural network (DNN) acoustic models for automatic speech recognition (ASR).
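Length perturbation can be sketched as random frame dropping plus random frame duplication; the exact skip/insert scheme and rates used in the paper may differ:

```python
import random

def length_perturb(frames, drop_prob, dup_prob, rng):
    """Length perturbation sketch: randomly drop some frames and duplicate
    others, changing the utterance length while preserving frame order
    (assumed variant of the paper's scheme)."""
    out = []
    for f in frames:
        if rng.random() < drop_prob:
            continue              # frame dropped
        out.append(f)
        if rng.random() < dup_prob:
            out.append(f)         # frame duplicated
    return out

rng = random.Random(7)
utt = list(range(100))            # toy utterance of 100 frame indices
perturbed = length_perturb(utt, drop_prob=0.1, dup_prob=0.1, rng=rng)
```

Because the drop and duplication rates roughly cancel, the expected length stays close to the original while each training pass sees a differently warped utterance.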
Automatic Speech Recognition (ASR) +2
no code implementations • 26 Feb 2022 • Samuel Thomas, Hong-Kwang J. Kuo, Brian Kingsbury, George Saon
In this paper, we propose a novel text representation and training methodology that allows E2E SLU systems to be effectively constructed using these text resources.
no code implementations • 26 Feb 2022 • Samuel Thomas, Brian Kingsbury, George Saon, Hong-Kwang J. Kuo
We observe 20-45% relative word error rate (WER) reduction in these settings with this novel LM style customization technique using only unpaired text data from the new domains.
Automatic Speech Recognition (ASR) +1
no code implementations • 28 Jan 2022 • Hong-Kwang J. Kuo, Zoltan Tuske, Samuel Thomas, Brian Kingsbury, George Saon
The goal of spoken language understanding (SLU) systems is to determine the meaning of the input speech signal, unlike speech recognition which aims to produce verbatim transcripts.
no code implementations • 21 Oct 2021 • Xiaodong Cui, Wei Zhang, Abdullah Kayi, Mingrui Liu, Ulrich Finkler, Brian Kingsbury, George Saon, David Kung
Specifically, we study three variants of asynchronous decentralized parallel SGD (ADPSGD), namely, fixed and randomized communication patterns on a ring as well as a delay-by-one scheme.
Automatic Speech Recognition (ASR) +1
no code implementations • 4 Oct 2021 • Thomas Bohnstingl, Ayush Garg, Stanisław Woźniak, George Saon, Evangelos Eleftheriou, Angeliki Pantazi
Automatic speech recognition (ASR) is a capability which enables a program to process human speech into a written form.
Automatic Speech Recognition (ASR) +1
no code implementations • 27 Aug 2021 • Andrea Fasoli, Chia-Yu Chen, Mauricio Serrano, Xiao Sun, Naigang Wang, Swagath Venkataramani, George Saon, Xiaodong Cui, Brian Kingsbury, Wei Zhang, Zoltán Tüske, Kailash Gopalakrishnan
We investigate the impact of aggressive low-precision representations of weights and activations in two families of large LSTM-based architectures for Automatic Speech Recognition (ASR): hybrid Deep Bidirectional LSTM - Hidden Markov Models (DBLSTM-HMMs) and Recurrent Neural Network - Transducers (RNN-Ts).
Automatic Speech Recognition (ASR) +2
no code implementations • 24 Aug 2021 • Xiaodong Cui, Brian Kingsbury, George Saon, David Haws, Zoltan Tuske
By reducing the exposure bias, we show that we can further improve the accuracy of a high-performance RNNT ASR model and obtain state-of-the-art results on the 300-hour Switchboard dataset.
Automatic Speech Recognition (ASR) +2
no code implementations • 18 Aug 2021 • Jatin Ganhotra, Samuel Thomas, Hong-Kwang J. Kuo, Sachindra Joshi, George Saon, Zoltán Tüske, Brian Kingsbury
End-to-end spoken language understanding (SLU) systems that process human-human or human-computer interactions are often context independent and process each turn of a conversation independently.
no code implementations • 3 May 2021 • Zoltán Tüske, George Saon, Brian Kingsbury
Compensation of the decoder model with the probability ratio approach allows more efficient integration of an external language model, and we report 5.9% and 11.5% WER on the SWB and CHM parts of Hub5'00 with very simple LSTM models.
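A worked sketch of probability-ratio (density-ratio) fusion with assumed interpolation weights: the score of the implicit source-domain LM is discounted before the external LM score is added.

```python
import math

def probability_ratio_score(asr_logp, source_lm_logp, ext_lm_logp, lam, mu):
    """Probability-ratio fusion sketch: discount the end-to-end model's
    implicit source-domain LM before adding the external LM score.
    The weights `lam` and `mu` are assumed illustrative values."""
    return asr_logp - mu * source_lm_logp + lam * ext_lm_logp

# assumed toy log-probabilities for a single hypothesis
score = probability_ratio_score(
    asr_logp=math.log(0.30),
    source_lm_logp=math.log(0.10),
    ext_lm_logp=math.log(0.25),
    lam=0.6, mu=0.4,
)
```

Compared with plain shallow fusion (which only adds the external LM term), subtracting the source-domain LM keeps the decoder's internal language bias from being counted twice.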
Ranked #1 on Speech Recognition on Switchboard + Hub500
1 code implementation • 8 Apr 2021 • Samuel Thomas, Hong-Kwang J. Kuo, George Saon, Zoltán Tüske, Brian Kingsbury, Gakuto Kurata, Zvi Kons, Ron Hoory
We present a comprehensive study on building and adapting RNN transducer (RNN-T) models for spoken language understanding (SLU).
Automatic Speech Recognition (ASR) +2
no code implementations • 17 Mar 2021 • George Saon, Zoltán Tüske, Daniel Bolanos, Brian Kingsbury
The techniques pertain to architectural changes, speaker adaptation, language model fusion, model combination and general training recipe.
no code implementations • 24 Feb 2020 • Xiaodong Cui, Wei Zhang, Ulrich Finkler, George Saon, Michael Picheny, David Kung
The past decade has witnessed great progress in Automatic Speech Recognition (ASR) due to advances in deep learning.
Automatic Speech Recognition (ASR) +1
no code implementations • 4 Feb 2020 • Wei Zhang, Xiaodong Cui, Abdullah Kayi, Mingrui Liu, Ulrich Finkler, Brian Kingsbury, George Saon, Youssef Mroueh, Alper Buyuktosunoglu, Payel Das, David Kung, Michael Picheny
Decentralized Parallel SGD (D-PSGD) and its asynchronous variant Asynchronous Parallel SGD (AD-PSGD) are a family of distributed learning algorithms that have been demonstrated to perform well for large-scale deep learning tasks.
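The mixing step that distinguishes decentralized SGD from a global all-reduce can be sketched as each worker averaging only with its ring neighbours (a synchronous toy version; the local gradient update and asynchrony are omitted):

```python
def dpsgd_ring_average(params):
    """One synchronous D-PSGD mixing step on a ring topology: each worker
    replaces its parameter with the average of itself and its two ring
    neighbours. No global communication is needed."""
    n = len(params)
    return [(params[(i - 1) % n] + params[i] + params[(i + 1) % n]) / 3.0
            for i in range(n)]

# assumed scalar "models" held by four workers on a ring
workers = [0.0, 3.0, 6.0, 9.0]
mixed = dpsgd_ring_average(workers)   # local averaging; the global mean is preserved
```

Repeated mixing steps drive all workers toward consensus while each step costs only neighbour-to-neighbour communication, which is why these methods scale well.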
no code implementations • 20 Jan 2020 • Zoltán Tüske, George Saon, Kartik Audhkhasi, Brian Kingsbury
It is generally believed that direct sequence-to-sequence (seq2seq) speech recognition models are competitive with hybrid models only when a large amount of data, at least a thousand hours, is available for training.
Ranked #2 on Speech Recognition on swb_hub_500 WER fullSWBCH
no code implementations • 9 Aug 2019 • Michael Picheny, Zoltán Tüske, Brian Kingsbury, Kartik Audhkhasi, Xiaodong Cui, George Saon
This paper proposes that the community place focus on the MALACH corpus to develop speech recognition systems that are more robust with respect to accents, disfluencies and emotional speech.
no code implementations • 10 Jul 2019 • Wei Zhang, Xiaodong Cui, Ulrich Finkler, George Saon, Abdullah Kayi, Alper Buyuktosunoglu, Brian Kingsbury, David Kung, Michael Picheny
On commonly used public SWB-300 and SWB-2000 ASR datasets, ADPSGD can converge with a batch size 3X as large as the one used in SSGD, thus enabling training at a much larger scale.
Automatic Speech Recognition (ASR) +1
no code implementations • 30 Apr 2019 • Samuel Thomas, Masayuki Suzuki, Yinghui Huang, Gakuto Kurata, Zoltan Tuske, George Saon, Brian Kingsbury, Michael Picheny, Tom Dibert, Alice Kaiser-Schatzlein, Bern Samko
With recent advances in deep learning, considerable attention has been given to achieving automatic speech recognition performance close to human performance on tasks like conversational telephone speech (CTS) recognition.
Automatic Speech Recognition (ASR) +1
no code implementations • 10 Apr 2019 • Wei Zhang, Xiaodong Cui, Ulrich Finkler, Brian Kingsbury, George Saon, David Kung, Michael Picheny
We show that we can train the LSTM model using ADPSGD in 14 hours with 16 NVIDIA P100 GPUs to reach a 7.6% WER on the Hub5-2000 Switchboard (SWB) test set and a 13.1% WER on the CallHome (CH) test set.
Automatic Speech Recognition (ASR) +1
no code implementations • 8 Dec 2017 • Kartik Audhkhasi, Brian Kingsbury, Bhuvana Ramabhadran, George Saon, Michael Picheny
This is because A2W models recognize words from speech without any decoder, pronunciation lexicon, or externally-trained language model, making training and decoding with such models simple.
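The decoder-free recognition step of such an A2W model reduces to CTC-style collapsing of frame-level word posteriors, sketched here with an assumed toy label sequence:

```python
def ctc_greedy_decode(frame_argmax, blank=0):
    """CTC-style collapse for a direct acoustics-to-word model: merge repeated
    frame labels, then drop blanks. No pronunciation lexicon, decoder, or
    external LM is involved."""
    out, prev = [], None
    for label in frame_argmax:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# frame-level argmax word IDs (0 = blank); values are an assumed toy output
words = ctc_greedy_decode([0, 5, 5, 0, 0, 7, 7, 7, 0, 5])
```

Note that a repeated word separated by a blank (the trailing 5 here) survives the collapse, while consecutive repeats of the same label are merged.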
Automatic Speech Recognition (ASR) +4
no code implementations • 17 Oct 2017 • Xiaodong Cui, Vaibhava Goel, George Saon
An embedding-based speaker adaptive training (SAT) approach is proposed and investigated in this paper for deep neural network acoustic modeling.
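A common baseline form of embedding-based speaker adaptation is to concatenate a fixed speaker embedding (e.g. an i-vector) to every acoustic frame before it enters the network; the dimensions and values below are illustrative, and the paper's SAT approach learns on top of this kind of input:

```python
def append_speaker_embedding(frames, spk_embedding):
    """Speaker-adaptive input sketch: concatenate one fixed per-speaker
    embedding to every acoustic frame, so the network can condition its
    acoustic modeling on speaker identity."""
    return [list(f) + list(spk_embedding) for f in frames]

feats = [[0.1, 0.2], [0.3, 0.4]]   # toy 2-dim acoustic frames
ivec = [0.9, -0.9, 0.5]            # assumed 3-dim speaker embedding
adapted = append_speaker_embedding(feats, ivec)
```

Every frame of a given speaker carries the same embedding, so the network sees speaker identity as a constant side input rather than having to infer it per frame.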
no code implementations • 19 Sep 2017 • Gakuto Kurata, Bhuvana Ramabhadran, George Saon, Abhinav Sethy
Language models (LMs) based on Long Short Term Memory (LSTM) have shown good gains in many automatic speech recognition tasks.
Automatic Speech Recognition (ASR) +2
no code implementations • 22 Mar 2017 • Kartik Audhkhasi, Bhuvana Ramabhadran, George Saon, Michael Picheny, David Nahamoo
Our CTC word model achieves a word error rate of 13.0%/18.8% on the Hub5-2000 Switchboard/CallHome test sets without any LM or decoder, compared with 9.6%/16.0% for phone-based CTC with a 4-gram LM.
Automatic Speech Recognition (ASR) +4
no code implementations • 6 Mar 2017 • George Saon, Gakuto Kurata, Tom Sercu, Kartik Audhkhasi, Samuel Thomas, Dimitrios Dimitriadis, Xiaodong Cui, Bhuvana Ramabhadran, Michael Picheny, Lynn-Li Lim, Bergul Roomi, Phil Hall
This then raises two issues: what IS human performance, and how far down can we still drive speech recognition error rates?
Ranked #3 on Speech Recognition on Switchboard + Hub500
no code implementations • 27 Apr 2016 • George Saon, Tom Sercu, Steven Rennie, Hong-Kwang J. Kuo
We describe a collection of acoustic and language modeling techniques that lowered the word error rate of our English conversational telephone LVCSR system to a record 6.6% on the Switchboard subset of the Hub5 2000 evaluation testset.
Ranked #5 on Speech Recognition on swb_hub_500 WER fullSWBCH
no code implementations • 21 May 2015 • George Saon, Hong-Kwang J. Kuo, Steven Rennie, Michael Picheny
We describe the latest improvements to the IBM English conversational telephone speech recognition system.
Ranked #11 on Speech Recognition on Switchboard + Hub500
no code implementations • 5 Sep 2013 • Tara N. Sainath, Brian Kingsbury, Abdel-rahman Mohamed, George E. Dahl, George Saon, Hagen Soltau, Tomas Beran, Aleksandr Y. Aravkin, Bhuvana Ramabhadran
We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline.