Search Results for author: Laurent Girin

Found 23 papers, 4 papers with code

Fill in the Gap! Combining Self-supervised Representation Learning with Neural Audio Synthesis for Speech Inpainting

no code implementations • 30 May 2024 • Ihab Asaad, Maxime Jacquelin, Olivier Perrotin, Laurent Girin, Thomas Hueber

In the present study, we investigate the use of a speech SSL model for speech inpainting, that is, reconstructing a missing portion of a speech signal from its surrounding context, i.e., fulfilling a downstream task that is very similar to the pretext task.
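To make the inpainting setup concrete, here is a minimal sketch: a segment of a signal is masked, and a trivial context-based baseline (linear interpolation, standing in for the SSL + neural synthesis pipeline of the paper) fills the gap. The function names and the toy signal are illustrative, not from the paper.

```python
def mask_segment(signal, start, length):
    """Replace signal[start:start+length] with None (the missing portion)."""
    masked = list(signal)
    for i in range(start, start + length):
        masked[i] = None
    return masked

def inpaint_linear(masked):
    """Fill an interior gap by interpolating between the surrounding context."""
    filled = list(masked)
    i = 0
    while i < len(filled):
        if filled[i] is None:
            j = i
            while filled[j] is None:
                j += 1
            left, right = filled[i - 1], filled[j]
            for k in range(i, j):
                t = (k - i + 1) / (j - i + 1)
                filled[k] = (1 - t) * left + t * right
            i = j
        else:
            i += 1
    return filled

signal = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
masked = mask_segment(signal, 2, 2)   # [0.0, 1.0, None, None, 4.0, 5.0]
print(inpaint_linear(masked))         # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
```

A learned inpainter replaces `inpaint_linear` with a model conditioned on the same surrounding context; the masking-and-reconstruction structure is what makes the downstream task resemble the SSL pretext task.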

Mixture of Dynamical Variational Autoencoders for Multi-Source Trajectory Modeling and Separation

no code implementations • 7 Dec 2023 • Xiaoyu Lin, Laurent Girin, Xavier Alameda-Pineda

In this paper, we propose a latent-variable generative model called mixture of dynamical variational autoencoders (MixDVAE) to model the dynamics of a system composed of multiple moving sources.

Audio Source Separation • Multi-Object Tracking +1

Unsupervised speech enhancement with deep dynamical generative speech and noise models

no code implementations • 13 Jun 2023 • Xiaoyu Lin, Simon Leglaive, Laurent Girin, Xavier Alameda-Pineda

This work builds on a previous work on unsupervised speech enhancement using a dynamical variational autoencoder (DVAE) as the clean speech model and non-negative matrix factorization (NMF) as the noise model.

Speech Enhancement

A multimodal dynamical variational autoencoder for audiovisual speech representation learning

no code implementations • 5 May 2023 • Samir Sadok, Simon Leglaive, Laurent Girin, Xavier Alameda-Pineda, Renaud Séguier

The latent space is structured to dissociate the latent dynamical factors that are shared between the modalities from those that are specific to each modality.

Disentanglement • Image Denoising +2

Speech Modeling with a Hierarchical Transformer Dynamical VAE

no code implementations • 7 Mar 2023 • Xiaoyu Lin, Xiaoyu Bie, Simon Leglaive, Laurent Girin, Xavier Alameda-Pineda

The dynamical variational autoencoders (DVAEs) are a family of latent-variable deep generative models that extends the VAE to model a sequence of observed data and a corresponding sequence of latent vectors.

Speech Enhancement
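The DVAE family described above can be illustrated with a toy ancestral-sampling sketch: unlike a static VAE, each latent z_t and observation x_t depends on the past of both sequences, z_t ~ p(z_t | z_{t-1}, x_{t-1}) and x_t ~ p(x_t | z_t, x_{t-1}). Scalar linear-Gaussian maps stand in here for the deep (recurrent or Transformer) networks used in the papers; all coefficient names are illustrative.

```python
import random

def dvae_sample(T, a=0.8, b=0.1, c=1.0, d=0.5, seed=0):
    """Sample a sequence from a toy linear-Gaussian DVAE generative model."""
    rng = random.Random(seed)
    z_prev, x_prev = 0.0, 0.0
    zs, xs = [], []
    for _ in range(T):
        # Latent dynamics: prior over z_t conditioned on the past of both sequences
        z = a * z_prev + b * x_prev + rng.gauss(0.0, 1.0)
        # Observation model: x_t depends on the current latent and the past output
        x = c * z + d * x_prev + rng.gauss(0.0, 0.1)
        zs.append(z)
        xs.append(x)
        z_prev, x_prev = z, x
    return zs, xs

zs, xs = dvae_sample(T=50)
print(len(zs), len(xs))  # 50 50
```

Setting b = d = 0 recovers a state-space-model-like special case; the general dependency structure is what distinguishes the DVAE family from a VAE applied frame by frame.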

BERT, can HE predict contrastive focus? Predicting and controlling prominence in neural TTS using a language model

no code implementations • 4 Jul 2022 • Brooke Stephenson, Laurent Besacier, Laurent Girin, Thomas Hueber

We collect a corpus of utterances containing contrastive focus and we evaluate the accuracy of a BERT model, finetuned to predict quantized acoustic prominence features, on these samples.

Language Modelling • Speech Synthesis +1

Learning and controlling the source-filter representation of speech with a variational autoencoder

1 code implementation • 14 Apr 2022 • Samir Sadok, Simon Leglaive, Laurent Girin, Xavier Alameda-Pineda, Renaud Séguier

Using only a few seconds of labeled speech signals generated with an artificial speech synthesizer, we propose a method to identify the latent subspaces encoding $f_0$ and the first three formant frequencies. We show that these subspaces are orthogonal and, based on this orthogonality, we develop a method to accurately and independently control the source-filter speech factors within the latent subspaces.
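The role of orthogonality in independent control can be shown with a toy example: moving a latent vector along one direction leaves its coordinate along any orthogonal direction unchanged. The 1-D "f0" and "formant" directions below are hand-picked stand-ins for the learned subspaces in the paper.

```python
def dot(u, v):
    """Standard inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def move_along(z, direction, delta):
    """Shift latent z by delta along a unit-norm direction."""
    return [zi + delta * di for zi, di in zip(z, direction)]

u = [1.0, 0.0, 0.0]   # pretend "f0" subspace direction
v = [0.0, 1.0, 0.0]   # pretend "first formant" direction, orthogonal to u
z = [0.2, -0.5, 0.7]  # some latent code

z_shifted = move_along(z, u, 2.0)      # change the "f0" factor only
print(dot(z_shifted, v) == dot(z, v))  # True: the formant coordinate is untouched
```

This is the geometric mechanism behind independent factor control: because `dot(u, v) == 0`, edits within one subspace are invisible to the projection onto the other.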

Repeat after me: Self-supervised learning of acoustic-to-articulatory mapping by vocal imitation

no code implementations • 5 Apr 2022 • Marc-Antoine Georges, Julien Diard, Laurent Girin, Jean-Luc Schwartz, Thomas Hueber

We propose a computational model of speech production combining a pre-trained neural articulatory synthesizer able to reproduce complex speech stimuli from a limited set of interpretable articulatory parameters, a DNN-based internal forward model predicting the sensory consequences of articulatory commands, and an internal inverse model based on a recurrent neural network recovering articulatory commands from the acoustic speech input.

Self-Supervised Learning

Unsupervised Multiple-Object Tracking with a Dynamical Variational Autoencoder

no code implementations • 18 Feb 2022 • Xiaoyu Lin, Laurent Girin, Xavier Alameda-Pineda

In this paper, we present an unsupervised probabilistic model and associated estimation algorithm for multi-object tracking (MOT) based on a dynamical variational autoencoder (DVAE), called DVAE-UMOT.

Multi-Object Tracking • Multiple Object Tracking +3

A Survey of Sound Source Localization with Deep Learning Methods

no code implementations • 8 Sep 2021 • Pierre-Amaury Grumiaux, Srđan Kitić, Laurent Girin, Alexandre Guérin

This article is a survey on deep learning methods for single and multiple sound source localization.

Unsupervised Speech Enhancement using Dynamical Variational Auto-Encoders

1 code implementation • 23 Jun 2021 • Xiaoyu Bie, Simon Leglaive, Xavier Alameda-Pineda, Laurent Girin

We propose an unsupervised speech enhancement algorithm that combines a DVAE speech prior pre-trained on clean speech signals with a noise model based on nonnegative matrix factorization, and we derive a variational expectation-maximization (VEM) algorithm to perform speech enhancement.

Representation Learning • Speech Enhancement +2
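The NMF noise model mentioned above can be sketched in isolation: a nonnegative matrix V (e.g. a noise spectrogram, frequency × time) is approximated as W @ H via multiplicative updates for the Euclidean cost. In the paper this factorization is combined with a DVAE speech prior inside a variational EM algorithm; only the NMF part is shown here, with pure-Python matrices for self-containment.

```python
import random

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def nmf(V, rank, n_iter=200, seed=0):
    """Euclidean NMF, V (F x T) ~= W (F x rank) @ H (rank x T), multiplicative updates."""
    rng = random.Random(seed)
    F, T = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(F)]
    H = [[rng.random() + 0.1 for _ in range(T)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + 1e-12)
              for j in range(T)] for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + 1e-12)
              for j in range(rank)] for i in range(F)]
    return W, H

# A rank-1 nonnegative matrix is recovered almost exactly
V = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
W, H = nmf(V, rank=1)
V_hat = matmul(W, H)
print(abs(V_hat[1][2] - 6.0) < 1e-3)  # True
```

The multiplicative form keeps W and H nonnegative by construction, which is why NMF is a natural fit for modeling noise power spectrograms.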

What the Future Brings: Investigating the Impact of Lookahead for Incremental Neural TTS

no code implementations • 4 Sep 2020 • Brooke Stephenson, Laurent Besacier, Laurent Girin, Thomas Hueber

In this paper, we study the behavior of a neural sequence-to-sequence TTS system when used in an incremental mode, i.e., when generating speech output for token n, the system has access to n + k tokens from the text sequence.

Decoder • Sentence +2
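The incremental-with-lookahead setup studied above is simple to state in code: when producing output for token n, the model only sees tokens up to n + k. The helper below (an illustrative stand-in, not from the paper) extracts the visible context for each generation step.

```python
def incremental_contexts(tokens, k):
    """Yield (current_token, visible_context) pairs for lookahead k."""
    for n in range(len(tokens)):
        yield tokens[n], tokens[: n + k + 1]

tokens = ["the", "cat", "sat", "down"]
for tok, ctx in incremental_contexts(tokens, k=1):
    print(tok, ctx)
# e.g. when generating output for "cat" (n=1, k=1) the visible context is
# ["the", "cat", "sat"]: the full sentence is never required.
```

The paper's question is then how the quality of the synthesized speech degrades as k shrinks, since a smaller lookahead means lower latency but less of the sentence available when each token is rendered.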

Dynamical Variational Autoencoders: A Comprehensive Review

1 code implementation • 28 Aug 2020 • Laurent Girin, Simon Leglaive, Xiaoyu Bie, Julien Diard, Thomas Hueber, Xavier Alameda-Pineda

Recently, a series of papers have presented different extensions of the VAE to process sequential data, which model not only the latent space but also the temporal dependencies within a sequence of data vectors and corresponding latent vectors, relying on recurrent neural networks or state-space models.

3D Human Dynamics • Resynthesis +2

A Recurrent Variational Autoencoder for Speech Enhancement

no code implementations • 24 Oct 2019 • Simon Leglaive, Xavier Alameda-Pineda, Laurent Girin, Radu Horaud

This paper presents a generative approach to speech enhancement based on a recurrent variational autoencoder (RVAE).

Speech Enhancement

Audio-visual Speech Enhancement Using Conditional Variational Auto-Encoders

no code implementations • 7 Aug 2019 • Mostafa Sadeghi, Simon Leglaive, Xavier Alameda-Pineda, Laurent Girin, Radu Horaud

Variational auto-encoders (VAEs) are deep generative latent variable models that can be used for learning the distribution of complex data.

Speech Enhancement

A variance modeling framework based on variational autoencoders for speech enhancement

1 code implementation • 5 Feb 2019 • Simon Leglaive, Laurent Girin, Radu Horaud

In this paper we address the problem of enhancing speech signals in noisy mixtures using a source separation approach.

Speech Enhancement

Variational Bayesian Inference for Audio-Visual Tracking of Multiple Speakers

no code implementations • 28 Sep 2018 • Yutong Ban, Xavier Alameda-Pineda, Laurent Girin, Radu Horaud

We propose a variational inference model which amounts to approximating the joint distribution with a factorized distribution.

Bayesian Inference • Variational Inference +1

Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models

no code implementations • 11 Jun 2018 • Fanny Roche, Thomas Hueber, Samuel Limier, Laurent Girin

This study investigates the use of non-linear unsupervised dimensionality reduction techniques to compress a music dataset into a low-dimensional representation which can be used in turn for the synthesis of new sounds.

Audio and Speech Processing • Sound

Co-Localization of Audio Sources in Images Using Binaural Features and Locally-Linear Regression

no code implementations • 12 Aug 2014 • Antoine Deleforge, Radu Horaud, Yoav Schechner, Laurent Girin

Indeed, we demonstrate that the method can be used for audio-visual fusion, namely to map speech signals onto images and hence to spatially align the audio and visual modalities, thus making it possible to discriminate between speaking and non-speaking faces.

regression
