Audio captioning
40 papers with code • 2 benchmarks • 4 datasets
Audio captioning is the task of describing audio content in natural language. The general approach is to use an audio encoder (e.g. PANN, CAV-MAE) to encode the audio and a decoder (e.g. a transformer) to generate the text. Caption quality is commonly judged with machine translation metrics (BLEU, METEOR, ROUGE) and image captioning metrics (SPICE, CIDEr), although these are not well suited to audio; attempts have therefore been made to use metrics based on pretrained language models, such as Sentence-BERT.
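The encoder-decoder pipeline described above can be sketched as follows. This is a minimal illustration under assumed dimensions, not any particular paper's model: a stand-in recurrent encoder plays the role of a pretrained audio encoder such as PANN, and a transformer decoder generates caption tokens conditioned on the encoded audio.

```python
import torch
import torch.nn as nn

class AudioCaptioner(nn.Module):
    """Sketch of a generic audio captioning model (hypothetical sizes)."""

    def __init__(self, n_mels=64, d_model=128, vocab_size=1000):
        super().__init__()
        # Stand-in audio encoder; a real system would use e.g. PANN or CAV-MAE.
        self.encoder = nn.GRU(n_mels, d_model, batch_first=True)
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, mel, tokens):
        # mel: (batch, time, n_mels); tokens: (batch, seq) of caption token ids
        memory, _ = self.encoder(mel)            # (batch, time, d_model)
        tgt = self.embed(tokens)                 # (batch, seq, d_model)
        # Causal mask so each position only attends to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.decoder(tgt, memory, tgt_mask=mask)
        return self.out(h)                       # (batch, seq, vocab_size)

model = AudioCaptioner()
# Two clips of 100 mel frames each, paired with 12-token caption prefixes.
logits = model(torch.randn(2, 100, 64), torch.randint(0, 1000, (2, 12)))
print(tuple(logits.shape))  # (2, 12, 1000)
```

At inference time, captions would be generated autoregressively (e.g. greedy or beam search) by feeding the decoder its own previous outputs.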
Most implemented papers
Clotho: An Audio Captioning Dataset
Audio captioning is the novel task of general audio content description using free text.
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research
To address this data scarcity issue, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions.
CL4AC: A Contrastive Loss for Audio Captioning
Automated audio captioning (AAC) is a cross-modal translation task that aims to use natural language to describe the content of an audio clip.
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Recently, instruction-following audio-language models have received broad attention for audio interaction with humans.
Audio Caption in a Car Setting with a Sentence-Level Loss
Captioning has attracted much attention in image and video understanding while a small amount of work examines audio captioning.
Temporal Sub-sampling of Audio Feature Sequences for Automated Audio Captioning
In this work we present an approach that focuses on explicitly taking advantage of this difference of lengths between sequences, by applying a temporal sub-sampling to the audio input sequence.
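The idea of temporal sub-sampling can be illustrated with a short sketch. This is a generic illustration, not the paper's exact method: the audio feature sequence is far longer than the caption, so consecutive frames are averaged to shorten it (the pooling factor and shapes here are assumptions).

```python
import numpy as np

def temporal_subsample(features, factor=2):
    """Average every `factor` consecutive frames to shorten the sequence.

    features: array of shape (time, dim); returns (time // factor, dim).
    Trailing frames that do not fill a full group are dropped.
    """
    t, d = features.shape
    t_trim = t - t % factor
    return features[:t_trim].reshape(t_trim // factor, factor, d).mean(axis=1)

# A hypothetical 101-frame sequence of 64-dim audio features.
feats = np.random.randn(101, 64)
short = temporal_subsample(feats, factor=4)
print(short.shape)  # (25, 64)
```

Shortening the encoder output this way reduces the length mismatch between the audio sequence and the much shorter caption the decoder must produce.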
Multi-task Regularization Based on Infrequent Classes for Audio Captioning
Audio captioning is a multi-modal task, focusing on using natural language for describing the contents of general audio.
WaveTransformer: A Novel Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information
Automated audio captioning (AAC) is a novel task, where a method takes as an input an audio sample and outputs a textual description (i.e. a caption) of its contents.
MusCaps: Generating Captions for Music Audio
Content-based music information retrieval has seen rapid progress with the adoption of deep learning.
The SJTU System for DCASE2021 Challenge Task 6: Audio Captioning Based on Encoder Pre-training and Reinforcement Learning
This report proposes an audio captioning system for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 Challenge Task 6.