Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection

1 Dec 2022 · Rahul Sharma, Shrikanth Narayanan

Active speaker detection in videos addresses associating a source face, visible in the video frames, with the underlying speech in the audio modality. The two primary sources of information to derive such a speech-face relationship are i) visual activity and its interaction with the speech signal and ii) co-occurrences of speakers' identities across modalities in the form of face and speech. The two approaches have their limitations: the audio-visual activity models get confused with other frequently occurring vocal activities, such as laughing and chewing, while the speakers' identity-based methods are limited to videos having enough disambiguating information to establish a speech-face association. Since the two approaches are independent, we investigate their complementary nature in this work. We propose a novel unsupervised framework to guide the speakers' cross-modal identity association with the audio-visual activity for active speaker detection. Through experiments on entertainment media videos from two benchmark datasets, the AVA active speaker (movies) and Visual Person Clustering Dataset (TV shows), we show that a simple late fusion of the two approaches enhances the active speaker detection performance.
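
The "simple late fusion" mentioned in the abstract can be pictured with a small sketch. This is an illustration only, not the authors' implementation: it assumes each visible face already has one score in [0, 1] from an audio-visual activity model and one from a cross-modal identity association step, and it combines them with a convex weight. The function names, the `weight` parameter, and the `threshold` value are all hypothetical.

```python
from typing import List, Optional, Sequence


def late_fusion(activity_scores: Sequence[float],
                identity_scores: Sequence[float],
                weight: float = 0.5) -> List[float]:
    """Fuse two per-face score lists with a convex combination.

    Hypothetical helper: `weight` is the share given to the activity
    scores; the paper's actual fusion rule may differ.
    """
    if len(activity_scores) != len(identity_scores):
        raise ValueError("score lists must have the same length")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [weight * a + (1.0 - weight) * i
            for a, i in zip(activity_scores, identity_scores)]


def pick_active_speaker(fused: Sequence[float],
                        threshold: float = 0.5) -> Optional[int]:
    """Return the index of the best-scoring face, or None if no face
    reaches the (assumed) decision threshold."""
    if not fused:
        return None
    best = max(range(len(fused)), key=lambda k: fused[k])
    return best if fused[best] >= threshold else None


if __name__ == "__main__":
    # Two faces in one frame: the activity model favours face 0,
    # the identity association favours face 1.
    fused = late_fusion([1.0, 0.0], [0.0, 1.0], weight=0.25)
    print(fused, pick_active_speaker(fused))
```

With a low `weight`, the identity evidence dominates and can override an activity score inflated by non-speech mouth movement such as laughing or chewing; with a high `weight`, the activity model decides when the identity evidence is too sparse to disambiguate.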

Code

rash1993/movie-asd official

Tasks

Audio-Visual Active Speaker Detection

Datasets

AVA

AVA-ActiveSpeaker

VPCD

Results from the Paper

Ranked #1 on Audio-Visual Active Speaker Detection on VPCD

Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Audio-Visual Active Speaker Detection	AVA-ActiveSpeaker	GSCMIA	validation mean average precision	92.86%	# 8
Audio-Visual Active Speaker Detection	VPCD	GSCMIA	mean average precision	83.90	# 1
