no code implementations • 31 Jan 2024 • Ankit Gupta, George Saon, Brian Kingsbury
The emergence of industrial-scale speech recognition (ASR) models such as Whisper and USM, trained on 1M hours of weakly labelled data and 12M hours of audio-only proprietary data, respectively, has led to a stronger need for large-scale public ASR corpora and competitive open-source pipelines.
no code implementations • 21 Nov 2023 • Xiaodong Cui, Ashish Mittal, Songtao Lu, Wei Zhang, George Saon, Brian Kingsbury
Soft random sampling (SRS) is a simple yet effective approach for efficient training of large-scale deep neural networks when dealing with massive data.
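A minimal sketch of this idea, assuming the variant in which every training example is kept independently with a fixed probability each epoch (the ratio 0.3 is an illustrative value, not the paper's setting):

```python
import random

def soft_random_sample(dataset, ratio, rng):
    """Per-epoch soft random sampling sketch: keep each example independently
    with probability `ratio`, so every epoch sees a different subset."""
    return [x for x in dataset if rng.random() < ratio]

rng = random.Random(0)
data = list(range(1000))
epoch_subset = soft_random_sample(data, 0.3, rng)  # roughly 30% of the data
```

Because each epoch draws a fresh subset, the model still sees most of the data over many epochs while each individual epoch is much cheaper.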
no code implementations • 19 Sep 2023 • Siddhant Arora, George Saon, Shinji Watanabe, Brian Kingsbury
Non-autoregressive (NAR) modeling has gained significant interest in speech processing since these models achieve dramatically lower inference time than autoregressive (AR) models while also achieving good transcription accuracy.
Automatic Speech Recognition (ASR) +2
no code implementations • 7 Sep 2023 • Takuma Udagawa, Masayuki Suzuki, Gakuto Kurata, Masayasu Muraoka, George Saon
However, existing works transfer only a single representation of the LLM (e.g., the last layer of a pretrained BERT), while the representation of a text is inherently non-unique and can be obtained variously from different layers, contexts and models.
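A minimal sketch of combining representations drawn from several layers instead of only the last one, using an assumed softmax-weighted sum (the paper's actual combination scheme may differ):

```python
import math

def combine_layer_representations(layer_vectors, weights):
    """Softmax-weighted sum over hidden states taken from several LLM layers,
    rather than transferring only the final layer's representation."""
    exps = [math.exp(w) for w in weights]
    z = sum(exps)
    alphas = [e / z for e in exps]        # normalized per-layer weights
    dim = len(layer_vectors[0])
    return [sum(a * vec[i] for a, vec in zip(alphas, layer_vectors)) for i in range(dim)]

# assumed toy hidden states from three layers of a pretrained encoder
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
combined = combine_layer_representations(layers, weights=[0.0, 0.0, 0.0])  # equal weights
```

With equal weights this is a plain layer average; in practice the weights would be learned jointly with the downstream task.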
Automatic Speech Recognition (ASR) +1
no code implementations • 27 Feb 2023 • George Saon, Ankit Gupta, Xiaodong Cui
We improve on the popular conformer architecture by replacing the depthwise temporal convolutions with diagonal state space (DSS) models.
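A toy sketch of the diagonal state space idea, assuming the standard formulation in which the layer materializes a long 1-D kernel from a few decaying complex modes and applies it as a causal convolution over the sequence (the mode values below are illustrative, not trained parameters):

```python
import cmath

def dss_kernel(lambdas, weights, length, dt=1.0):
    """Materialize the convolution kernel of a diagonal SSM:
    K[l] = Re( sum_n w_n * exp(lambda_n * l * dt) ), with Re(lambda_n) < 0
    so each mode decays over time."""
    return [
        sum((w * cmath.exp(lam * l * dt) for w, lam in zip(weights, lambdas)), 0j).real
        for l in range(length)
    ]

def causal_conv(x, k):
    """y[t] = sum_{l<=t} k[l] * x[t-l]: the sequence mixing that replaces
    the conformer's depthwise temporal convolution."""
    return [sum(k[l] * x[t - l] for l in range(min(len(k), t + 1))) for t in range(len(x))]

lambdas = [complex(-0.5, 1.0), complex(-0.2, -2.0)]   # assumed decaying complex modes
weights = [complex(1.0, 0.0), complex(0.5, 0.5)]
k = dss_kernel(lambdas, weights, length=16)
y = causal_conv([1.0] + [0.0] * 15, k)  # impulse response recovers the kernel
```

Unlike a fixed-width depthwise convolution, the kernel length here is limited only by the sequence length, which is what gives state space layers their long-range receptive field.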
no code implementations • 3 Aug 2022 • Jiatong Shi, George Saon, David Haws, Shinji Watanabe, Brian Kingsbury
Beam search, which is the dominant ASR decoding algorithm for end-to-end models, generates tree-structured hypotheses.
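The tree-structured nature of those hypotheses can be illustrated with a toy decoder in which every kept prefix branches over the vocabulary at each step and only the best branches survive (the distributions below are assumed, not from any model):

```python
import math

def beam_search(step_logprobs, beam_size):
    """Toy beam search: `step_logprobs[t][v]` is the log-probability of token v
    at step t. Extending every beam entry by every token grows a tree of
    prefixes, of which only the `beam_size` best branches are kept per step."""
    beams = [((), 0.0)]  # (token_sequence, cumulative_logprob)
    for dist in step_logprobs:
        candidates = [(seq + (v,), score + lp)
                      for seq, score in beams
                      for v, lp in enumerate(dist)]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

# assumed toy distributions over a 3-token vocabulary, two decoding steps
steps = [
    [math.log(0.6), math.log(0.3), math.log(0.1)],
    [math.log(0.2), math.log(0.7), math.log(0.1)],
]
best = beam_search(steps, beam_size=2)
```

With beam size 2 the second step only expands the two best single-token prefixes, so most of the hypothesis tree is never materialized.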
no code implementations • 28 Jul 2022 • Zvi Kons, Hagai Aronowitz, Edmilson Morais, Matheus Damasceno, Hong-Kwang Kuo, Samuel Thomas, George Saon
We propose using a recurrent neural network transducer (RNN-T)-based speech-to-text (STT) system as a common component that can be used for emotion recognition and language identification as well as for speech recognition.
no code implementations • 16 Jun 2022 • Andrea Fasoli, Chia-Yu Chen, Mauricio Serrano, Swagath Venkataramani, George Saon, Xiaodong Cui, Brian Kingsbury, Kailash Gopalakrishnan
We report on aggressive quantization strategies that greatly accelerate inference of Recurrent Neural Network Transducers (RNN-T).
no code implementations • 1 Apr 2022 • Takuma Udagawa, Masayuki Suzuki, Gakuto Kurata, Nobuyasu Itoh, George Saon
Large-scale language models (LLMs) such as GPT-2, BERT and RoBERTa have been successfully applied to ASR N-best rescoring.
no code implementations • 29 Mar 2022 • Xiaodong Cui, George Saon, Tohru Nagano, Masayuki Suzuki, Takashi Fukuda, Brian Kingsbury, Gakuto Kurata
We introduce two techniques, length perturbation and n-best based label smoothing, to improve generalization of deep neural network (DNN) acoustic models for automatic speech recognition (ASR).
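Length perturbation can be sketched as random frame dropping plus random frame duplication; the exact skip/insert scheme and rates used in the paper may differ:

```python
import random

def length_perturb(frames, drop_prob, dup_prob, rng):
    """Length perturbation sketch: randomly drop some frames and duplicate
    others, changing the utterance length while preserving frame order
    (assumed variant of the paper's scheme)."""
    out = []
    for f in frames:
        if rng.random() < drop_prob:
            continue              # frame dropped
        out.append(f)
        if rng.random() < dup_prob:
            out.append(f)         # frame duplicated
    return out

rng = random.Random(7)
utt = list(range(100))            # toy utterance of 100 frame indices
perturbed = length_perturb(utt, drop_prob=0.1, dup_prob=0.1, rng=rng)
```

Because the drop and duplication rates roughly cancel, the expected length stays close to the original while each training pass sees a differently warped utterance.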
Automatic Speech Recognition (ASR) +2
no code implementations • 26 Feb 2022 • Samuel Thomas, Hong-Kwang J. Kuo, Brian Kingsbury, George Saon
In this paper, we propose a novel text representation and training methodology that allows E2E SLU systems to be effectively constructed using these text resources.
no code implementations • 26 Feb 2022 • Samuel Thomas, Brian Kingsbury, George Saon, Hong-Kwang J. Kuo
We observe 20-45% relative word error rate (WER) reduction in these settings with this novel LM style customization technique using only unpaired text data from the new domains.
Automatic Speech Recognition (ASR) +1
no code implementations • 28 Jan 2022 • Hong-Kwang J. Kuo, Zoltan Tuske, Samuel Thomas, Brian Kingsbury, George Saon
The goal of spoken language understanding (SLU) systems is to determine the meaning of the input speech signal, unlike speech recognition which aims to produce verbatim transcripts.
no code implementations • 21 Oct 2021 • Xiaodong Cui, Wei Zhang, Abdullah Kayi, Mingrui Liu, Ulrich Finkler, Brian Kingsbury, George Saon, David Kung
Specifically, we study three variants of asynchronous decentralized parallel SGD (ADPSGD), namely, fixed and randomized communication patterns on a ring as well as a delay-by-one scheme.
Automatic Speech Recognition (ASR) +1
no code implementations • 4 Oct 2021 • Thomas Bohnstingl, Ayush Garg, Stanisław Woźniak, George Saon, Evangelos Eleftheriou, Angeliki Pantazi
Automatic speech recognition (ASR) is a capability which enables a program to process human speech into a written form.
Automatic Speech Recognition (ASR) +1
no code implementations • 27 Aug 2021 • Andrea Fasoli, Chia-Yu Chen, Mauricio Serrano, Xiao Sun, Naigang Wang, Swagath Venkataramani, George Saon, Xiaodong Cui, Brian Kingsbury, Wei Zhang, Zoltán Tüske, Kailash Gopalakrishnan
We investigate the impact of aggressive low-precision representations of weights and activations in two families of large LSTM-based architectures for Automatic Speech Recognition (ASR): hybrid Deep Bidirectional LSTM - Hidden Markov Models (DBLSTM-HMMs) and Recurrent Neural Network - Transducers (RNN-Ts).
Automatic Speech Recognition (ASR) +2
no code implementations • 24 Aug 2021 • Xiaodong Cui, Brian Kingsbury, George Saon, David Haws, Zoltan Tuske
By reducing the exposure bias, we show that we can further improve the accuracy of a high-performance RNNT ASR model and obtain state-of-the-art results on the 300-hour Switchboard dataset.
Automatic Speech Recognition (ASR) +2
no code implementations • 18 Aug 2021 • Jatin Ganhotra, Samuel Thomas, Hong-Kwang J. Kuo, Sachindra Joshi, George Saon, Zoltán Tüske, Brian Kingsbury
End-to-end spoken language understanding (SLU) systems that process human-human or human-computer interactions are often context independent and process each turn of a conversation independently.
no code implementations • 3 May 2021 • Zoltán Tüske, George Saon, Brian Kingsbury
Compensation of the decoder model with the probability ratio approach allows more efficient integration of an external language model, and we report 5.9% and 11.5% WER on the SWB and CHM parts of Hub5'00 with very simple LSTM models.
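A worked sketch of probability-ratio (density-ratio) fusion with assumed interpolation weights: the score of the implicit source-domain LM is discounted before the external LM score is added.

```python
import math

def probability_ratio_score(asr_logp, source_lm_logp, ext_lm_logp, lam, mu):
    """Probability-ratio fusion sketch: discount the end-to-end model's
    implicit source-domain LM before adding the external LM score.
    The weights `lam` and `mu` are assumed illustrative values."""
    return asr_logp - mu * source_lm_logp + lam * ext_lm_logp

# assumed toy log-probabilities for a single hypothesis
score = probability_ratio_score(
    asr_logp=math.log(0.30),
    source_lm_logp=math.log(0.10),
    ext_lm_logp=math.log(0.25),
    lam=0.6, mu=0.4,
)
```

Compared with plain shallow fusion (which only adds the external LM term), subtracting the source-domain LM keeps the decoder's internal language bias from being counted twice.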
Ranked #1 on Speech Recognition on Switchboard + Hub500
1 code implementation • 8 Apr 2021 • Samuel Thomas, Hong-Kwang J. Kuo, George Saon, Zoltán Tüske, Brian Kingsbury, Gakuto Kurata, Zvi Kons, Ron Hoory
We present a comprehensive study on building and adapting RNN transducer (RNN-T) models for spoken language understanding (SLU).
Automatic Speech Recognition (ASR) +2
no code implementations • 17 Mar 2021 • George Saon, Zoltán Tüske, Daniel Bolanos, Brian Kingsbury
The techniques pertain to architectural changes, speaker adaptation, language model fusion, model combination and general training recipe.
no code implementations • 24 Feb 2020 • Xiaodong Cui, Wei Zhang, Ulrich Finkler, George Saon, Michael Picheny, David Kung
The past decade has witnessed great progress in Automatic Speech Recognition (ASR) due to advances in deep learning.
Automatic Speech Recognition (ASR) +1
no code implementations • 4 Feb 2020 • Wei Zhang, Xiaodong Cui, Abdullah Kayi, Mingrui Liu, Ulrich Finkler, Brian Kingsbury, George Saon, Youssef Mroueh, Alper Buyuktosunoglu, Payel Das, David Kung, Michael Picheny
Decentralized Parallel SGD (D-PSGD) and its asynchronous variant Asynchronous Parallel SGD (AD-PSGD) are a family of distributed learning algorithms that have been demonstrated to perform well for large-scale deep learning tasks.
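The mixing step that distinguishes decentralized SGD from a global all-reduce can be sketched as each worker averaging only with its ring neighbours (a synchronous toy version; the local gradient update and asynchrony are omitted):

```python
def dpsgd_ring_average(params):
    """One synchronous D-PSGD mixing step on a ring topology: each worker
    replaces its parameter with the average of itself and its two ring
    neighbours. No global communication is needed."""
    n = len(params)
    return [(params[(i - 1) % n] + params[i] + params[(i + 1) % n]) / 3.0
            for i in range(n)]

# assumed scalar "models" held by four workers on a ring
workers = [0.0, 3.0, 6.0, 9.0]
mixed = dpsgd_ring_average(workers)   # local averaging; the global mean is preserved
```

Repeated mixing steps drive all workers toward consensus while each step costs only neighbour-to-neighbour communication, which is why these methods scale well.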
no code implementations • 20 Jan 2020 • Zoltán Tüske, George Saon, Kartik Audhkhasi, Brian Kingsbury
It is generally believed that direct sequence-to-sequence (seq2seq) speech recognition models are competitive with hybrid models only when a large amount of data, at least a thousand hours, is available for training.
Ranked #2 on Speech Recognition on swb_hub_500 WER fullSWBCH
no code implementations • 9 Aug 2019 • Michael Picheny, Zoltán Tüske, Brian Kingsbury, Kartik Audhkhasi, Xiaodong Cui, George Saon
This paper proposes that the community place focus on the MALACH corpus to develop speech recognition systems that are more robust with respect to accents, disfluencies and emotional speech.
no code implementations • 10 Jul 2019 • Wei Zhang, Xiaodong Cui, Ulrich Finkler, George Saon, Abdullah Kayi, Alper Buyuktosunoglu, Brian Kingsbury, David Kung, Michael Picheny
On commonly used public SWB-300 and SWB-2000 ASR datasets, ADPSGD can converge with a batch size 3X as large as the one used in SSGD, thus enabling training at a much larger scale.
Automatic Speech Recognition (ASR) +1
no code implementations • 30 Apr 2019 • Samuel Thomas, Masayuki Suzuki, Yinghui Huang, Gakuto Kurata, Zoltan Tuske, George Saon, Brian Kingsbury, Michael Picheny, Tom Dibert, Alice Kaiser-Schatzlein, Bern Samko
With recent advances in deep learning, considerable attention has been given to achieving automatic speech recognition performance close to human performance on tasks like conversational telephone speech (CTS) recognition.
Automatic Speech Recognition (ASR) +1
no code implementations • 10 Apr 2019 • Wei Zhang, Xiaodong Cui, Ulrich Finkler, Brian Kingsbury, George Saon, David Kung, Michael Picheny
We show that we can train the LSTM model using ADPSGD in 14 hours with 16 NVIDIA P100 GPUs to reach a 7.6% WER on the Hub5-2000 Switchboard (SWB) test set and a 13.1% WER on the CallHome (CH) test set.
Automatic Speech Recognition (ASR) +1
no code implementations • 8 Dec 2017 • Kartik Audhkhasi, Brian Kingsbury, Bhuvana Ramabhadran, George Saon, Michael Picheny
This is because A2W models recognize words from speech without any decoder, pronunciation lexicon, or externally-trained language model, making training and decoding with such models simple.
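The decoder-free recognition step of such an A2W model reduces to CTC-style collapsing of frame-level word posteriors, sketched here with an assumed toy label sequence:

```python
def ctc_greedy_decode(frame_argmax, blank=0):
    """CTC-style collapse for a direct acoustics-to-word model: merge repeated
    frame labels, then drop blanks. No pronunciation lexicon, decoder, or
    external LM is involved."""
    out, prev = [], None
    for label in frame_argmax:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# frame-level argmax word IDs (0 = blank); values are an assumed toy output
words = ctc_greedy_decode([0, 5, 5, 0, 0, 7, 7, 7, 0, 5])
```

Note that a repeated word separated by a blank (the trailing 5 here) survives the collapse, while consecutive repeats of the same label are merged.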
Automatic Speech Recognition (ASR) +4
no code implementations • 17 Oct 2017 • Xiaodong Cui, Vaibhava Goel, George Saon
An embedding-based speaker adaptive training (SAT) approach is proposed and investigated in this paper for deep neural network acoustic modeling.
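A common baseline form of embedding-based speaker adaptation is to concatenate a fixed speaker embedding (e.g. an i-vector) to every acoustic frame before it enters the network; the dimensions and values below are illustrative, and the paper's SAT approach learns on top of this kind of input:

```python
def append_speaker_embedding(frames, spk_embedding):
    """Speaker-adaptive input sketch: concatenate one fixed per-speaker
    embedding to every acoustic frame, so the network can condition its
    acoustic modeling on speaker identity."""
    return [list(f) + list(spk_embedding) for f in frames]

feats = [[0.1, 0.2], [0.3, 0.4]]   # toy 2-dim acoustic frames
ivec = [0.9, -0.9, 0.5]            # assumed 3-dim speaker embedding
adapted = append_speaker_embedding(feats, ivec)
```

Every frame of a given speaker carries the same embedding, so the network sees speaker identity as a constant side input rather than having to infer it per frame.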
no code implementations • 19 Sep 2017 • Gakuto Kurata, Bhuvana Ramabhadran, George Saon, Abhinav Sethy
Language models (LMs) based on Long Short Term Memory (LSTM) have shown good gains in many automatic speech recognition tasks.
Automatic Speech Recognition (ASR) +2
no code implementations • 22 Mar 2017 • Kartik Audhkhasi, Bhuvana Ramabhadran, George Saon, Michael Picheny, David Nahamoo
Our CTC word model achieves a word error rate of 13.0%/18.8% on the Hub5-2000 Switchboard/CallHome test sets without any LM or decoder, compared with 9.6%/16.0% for phone-based CTC with a 4-gram LM.
Automatic Speech Recognition (ASR) +4
no code implementations • 6 Mar 2017 • George Saon, Gakuto Kurata, Tom Sercu, Kartik Audhkhasi, Samuel Thomas, Dimitrios Dimitriadis, Xiaodong Cui, Bhuvana Ramabhadran, Michael Picheny, Lynn-Li Lim, Bergul Roomi, Phil Hall
This then raises two issues: what IS human performance, and how far down can we still drive speech recognition error rates?
Ranked #3 on Speech Recognition on Switchboard + Hub500
no code implementations • 27 Apr 2016 • George Saon, Tom Sercu, Steven Rennie, Hong-Kwang J. Kuo
We describe a collection of acoustic and language modeling techniques that lowered the word error rate of our English conversational telephone LVCSR system to a record 6.6% on the Switchboard subset of the Hub5 2000 evaluation testset.
Ranked #5 on Speech Recognition on swb_hub_500 WER fullSWBCH
no code implementations • 21 May 2015 • George Saon, Hong-Kwang J. Kuo, Steven Rennie, Michael Picheny
We describe the latest improvements to the IBM English conversational telephone speech recognition system.
Ranked #11 on Speech Recognition on Switchboard + Hub500
no code implementations • 5 Sep 2013 • Tara N. Sainath, Brian Kingsbury, Abdel-rahman Mohamed, George E. Dahl, George Saon, Hagen Soltau, Tomas Beran, Aleksandr Y. Aravkin, Bhuvana Ramabhadran
We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline.