Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition

ICCV 2021 · Heeseung Kwon, Manjin Kim, Suha Kwak, Minsu Cho

Spatio-temporal convolution often fails to learn motion dynamics in videos, so an effective motion representation is required for video understanding in the wild. In this paper, we propose a rich and robust motion representation based on spatio-temporal self-similarity (STSS). Given a sequence of frames, STSS represents each local region as similarities to its neighbors in space and time. By converting appearance features into relational values, it enables the learner to better recognize structural patterns in space and time. We leverage the whole volume of STSS and let our model learn to extract an effective motion representation from it. The proposed neural block, dubbed SELFY, can be easily inserted into neural architectures and trained end-to-end without additional supervision. With a sufficient volume of the neighborhood in space and time, it effectively captures long-term interaction and fast motion in the video, leading to robust action recognition. Our experimental analysis demonstrates its superiority over previous methods for motion modeling, as well as its complementarity to spatio-temporal features from direct convolution. On the standard action recognition benchmarks Something-Something V1 & V2, Diving-48, and FineGym, the proposed method achieves state-of-the-art results.
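
To make the STSS idea concrete, below is a minimal sketch of how such a similarity tensor can be computed, assuming cosine similarity over a fixed spatio-temporal neighborhood. The function name `stss`, the zero-padding in space, the index clamping in time, and the tensor layout are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def stss(feats, time_radius=1, space_radius=4):
    """Sketch of a spatio-temporal self-similarity (STSS) tensor.

    feats: (T, C, H, W) frame-wise feature maps.
    Returns a (T, H, W, L, U, U) tensor with L = 2*time_radius + 1 and
    U = 2*space_radius + 1, where entry [t, h, w, l, u, v] is the cosine
    similarity between position (t, h, w) and its neighbor at offset
    (l - time_radius, u - space_radius, v - space_radius).
    """
    T, C, H, W = feats.shape
    feats = F.normalize(feats, dim=1)           # dot product == cosine similarity
    L, U = 2 * time_radius + 1, 2 * space_radius + 1
    # Boundary handling (an assumption): zero-pad space, clamp time indices.
    padded = F.pad(feats, (space_radius,) * 4)  # (T, C, H + U - 1, W + U - 1)
    out = feats.new_zeros(T, H, W, L, U, U)
    for l, dt in enumerate(range(-time_radius, time_radius + 1)):
        # Frames shifted by dt, clamped at the clip boundaries.
        neigh = padded[(torch.arange(T) + dt).clamp(0, T - 1)]
        for u in range(U):
            for v in range(U):
                shifted = neigh[:, :, u:u + H, v:v + W]     # (T, C, H, W)
                out[:, :, :, l, u, v] = (feats * shifted).sum(dim=1)
    return out

# Example: an 8-frame clip of 64-channel features at 14x14 resolution.
x = torch.randn(8, 64, 14, 14)
print(stss(x).shape)  # torch.Size([8, 14, 14, 3, 9, 9])
```

A block like SELFY would then feed this relational volume to learnable layers that extract motion features from it; the nested loops above are for clarity and would be vectorized (e.g., via unfold) in practice.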

Results from the Paper


Ranked #18 on Action Recognition on Something-Something V1 (using extra training data)

| Task | Dataset | Model | Top-1 Accuracy (Rank) | Top-5 Accuracy (Rank) | Uses Extra Training Data |
| --- | --- | --- | --- | --- | --- |
| Action Recognition | Something-Something V1 | SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, 2 clips) | 56.6 (#18) | 84.4 (#8) | Yes (ImageNet) |
| Action Recognition | Something-Something V1 | SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, a single clip) | 55.8 (#23) | 83.9 (#11) | Yes (ImageNet) |
| Action Recognition | Something-Something V1 | SELFYNet-TSM-R50 (16 frames, ImageNet pretrained) | 54.3 (#31) | 82.9 (#14) | Yes (ImageNet) |
| Action Recognition | Something-Something V2 | SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, 2 clips) | 67.7 (#60) | 91.1 (#43) | Yes (ImageNet) |
| Action Recognition | Something-Something V2 | SELFYNet-TSM-R50En (8+16 frames, ImageNet pretrained, a single clip) | 67.4 (#63) | 91.0 (#47) | Yes (ImageNet) |
| Action Recognition | Something-Something V2 | SELFYNet-TSM-R50 (16 frames, ImageNet pretrained) | 65.7 (#84) | 89.8 (#63) | Yes (ImageNet) |
