Understanding Long Videos in One Multimodal Language Model Pass

25 Mar 2024  ·  Kanchana Ranasinghe, Xiang Li, Kumara Kahatapitiya, Michael S. Ryoo

Large Language Models (LLMs), known to encode strong world knowledge, have enabled recent approaches to achieve excellent performance on long-video understanding benchmarks, but at high inference cost. In this work, we first propose Likelihood Selection, a simple technique that unlocks faster inference in autoregressive LLMs for the multiple-choice tasks common in long-video benchmarks. Beyond faster inference, we find that the resulting models yield surprisingly good accuracy on long-video tasks even with no video-specific information. Building on this, we inject video-specific, object-centric information extracted from off-the-shelf pre-trained models and use natural language as the medium for information fusion. Our resulting Multimodal Video Understanding (MVU) framework demonstrates state-of-the-art performance across long-video and fine-grained action recognition benchmarks. Code available at: https://github.com/kahnchana/mvu
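The abstract describes Likelihood Selection only at a high level: rather than autoregressively generating an answer, the LLM scores each multiple-choice option by its likelihood under the model and picks the highest-scoring one. Below is a minimal, hedged sketch of that idea using a Hugging Face causal LM; the function name likelihood_select, the length-normalised scoring, and the stand-in "gpt2" checkpoint are illustrative assumptions, not the authors' exact implementation (see the official repository for that).

```python
# Sketch of likelihood selection for multiple-choice QA with a causal LM.
# Assumption: each option is scored by the mean log-probability of its tokens
# appended to the prompt; the MVU paper may use a different normalisation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def likelihood_select(model, tokenizer, prompt, options):
    """Return the option whose tokens get the highest average log-likelihood
    when appended to the prompt (one forward pass per option, no decoding)."""
    scores = []
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    for option in options:
        full_ids = tokenizer(prompt + " " + option, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(full_ids).logits  # (1, seq_len, vocab)
        # Log-probability of each next token given all preceding tokens.
        log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
        targets = full_ids[:, 1:]
        token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        # Keep only the option tokens (approximate boundary via prompt length).
        option_lp = token_lp[:, prompt_len - 1:]
        scores.append(option_lp.mean().item())
    return options[scores.index(max(scores))]


if __name__ == "__main__":
    name = "gpt2"  # small stand-in model; MVU builds on a much larger LLM
    tok = AutoTokenizer.from_pretrained(name)
    lm = AutoModelForCausalLM.from_pretrained(name).eval()
    question = "Question: What is the person most likely doing? Answer:"
    choices = ["cooking a meal", "repairing a bicycle", "painting a wall"]
    print(likelihood_select(lm, tok, question, choices))
```

Because each option needs only a single forward pass (and they can be batched), this avoids token-by-token decoding, which is the source of the speedup the abstract refers to.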

Results

Task: Zero-Shot Video Question Answer

Dataset              Model      Metric               Value   Global Rank
EgoSchema (fullset)  MVU (13B)  Accuracy (%)         37.6    #12
EgoSchema (subset)   MVU (13B)  Accuracy (%)         60.3    #2
EgoSchema (subset)   MVU (13B)  Inference Speed (s)  2.42    #1
NExT-QA              MVU (13B)  Accuracy (%)         55.2    #11
