TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Action Classification	Kinetics-400	STAM (64 Frames)	Acc@1	80.5	# 89
Action Classification	Kinetics-400	STAM (64 Frames)	FLOPs (G) x views	1040x1	# 1
Action Classification	Kinetics-400	STAM (16 Frames)	Acc@1	79.3	# 107
Action Classification	Kinetics-400	STAM (16 Frames)	FLOPs (G) x views	270x1	# 1
Action Recognition	UCF101	STAM-32 (ImageNet/Kinetics pretraining)	3-fold Accuracy	97	# 23

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/an-image-is-worth-16x16-words-what-is-a-video/action-recognition-in-videos-on-ucf101)](https://paperswithcode.com/sota/action-recognition-in-videos-on-ucf101?p=an-image-is-worth-16x16-words-what-is-a-video)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/an-image-is-worth-16x16-words-what-is-a-video/action-classification-on-kinetics-400)](https://paperswithcode.com/sota/action-classification-on-kinetics-400?p=an-image-is-worth-16x16-words-what-is-a-video)`

An Image is Worth 16x16 Words, What is a Video Worth?

25 Mar 2021 · Gilad Sharir, Asaf Noy, Lihi Zelnik-Manor

Leading methods in action recognition try to distill information from both the spatial and temporal dimensions of an input video. Methods that reach state-of-the-art (SotA) accuracy usually use 3D convolution layers to abstract the temporal information from video frames. Such convolutions require sampling short clips from the input video, where each clip is a collection of closely sampled frames. Since each short clip covers only a small fraction of the input video, multiple clips are sampled at inference to cover the video's whole temporal length. This increases the computational load and is impractical for real-world applications. We address the computational bottleneck by significantly reducing the number of frames required for inference. Our approach relies on a temporal transformer that applies global attention over video frames, and thus better exploits the salient information in each frame. Our approach is therefore very input efficient and can achieve SotA results (on the Kinetics dataset) with a fraction of the data (frames per video), computation, and latency. Specifically, on Kinetics-400 we reach $80.5\%$ top-1 accuracy with $30\times$ fewer frames per video and $40\times$ faster inference than the current leading method. Code is available at: https://github.com/Alibaba-MIIL/STAM
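
The abstract's central idea, global attention applied across the frames of a video, can be illustrated with a small dependency-free sketch. This is not the authors' STAM implementation: it assumes each frame has already been reduced to an embedding vector by some spatial model, it uses the embeddings directly as queries, keys, and values (no learned projections, no multiple heads), and the mean pooling at the end is a hypothetical choice made only to produce a single video-level vector. The function names `temporal_attention` and `video_embedding` are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(frames):
    """Single-head global self-attention across frame embeddings.

    frames: list of T embeddings, each a list of D floats.
    Assumption: queries, keys, and values are the embeddings themselves
    (no learned projections), which keeps the sketch dependency-free.
    Returns (outputs, weights); weights[i][j] is how strongly frame i
    attends to frame j.
    """
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for query in frames:
        # Every frame is compared with every other frame, so the
        # attention is global over the sampled frames.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in frames]
        row = softmax(scores)
        weights.append(row)
        outputs.append(
            [sum(w * frame[d] for w, frame in zip(row, frames)) for d in range(dim)]
        )
    return outputs, weights


def video_embedding(frames):
    """Mean-pool the attended frames into one vector (hypothetical pooling)."""
    outputs, _ = temporal_attention(frames)
    dim = len(outputs[0])
    return [sum(o[d] for o in outputs) / len(outputs) for d in range(dim)]


# Four toy "frames" with two-dimensional embeddings.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
outputs, weights = temporal_attention(frames)
video = video_embedding(frames)
```

Because each frame attends to all sampled frames in one pass, the cost of this step grows with the square of the number of frames, which is one reason a small frame budget per video matters.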

Code

Alibaba-MIIL/STAM (official), 219 stars

lucidrains/STAM-pytorch, 122 stars

Tasks

Action Classification

Action Recognition

Datasets

UCF101

Kinetics

Kinetics 400

Results from the Paper

Ranked #23 on Action Recognition on UCF101 (using extra training data)

Methods

3D Convolution • Convolution • GELU
