C2T-Net: Channel-Aware Cross-Fused Transformer-Style Networks for Pedestrian Attribute Recognition

WACV 2024  ·  Doanh C. Bui, Thinh V. Le, Ba Hung Ngo

Pedestrian attribute recognition (PAR) is a challenging task with practical importance in security applications such as surveillance. In the scope of the UPAR challenge, this paper introduces the Channel-Aware Cross-Fused Transformer-Style Networks (C2T-Net). The network integrates two powerful transformer-style backbones, the Swin Transformer (SwinT) and a customized variant of the vanilla Vision Transformer (EVA ViT), to capture both the local and global aspects of a person for precise attribute recognition. To model the intricate relationships among channels, a channel-aware self-attention mechanism is devised and integrated into each SwinT block. Furthermore, features from the two networks are combined through cross-fusion, enabling each network to mutually amplify the textural nuances present in the other. The efficacy of the proposed model is demonstrated on three PAR benchmarks: PA100K, PETA, and the UPAR2024 private test set. On PA100K, our approach achieves state-of-the-art results among models that do not employ any pre-training techniques. On PETA, our performance remains competitive with other cutting-edge models. Notably, our model achieved runner-up performance on the UPAR2024 track-1 test set. Source code is available at https://github.com/caodoanh2001/upar_challenge.
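To make the channel-aware idea concrete, below is a minimal PyTorch sketch of a self-attention block that attends across channels rather than spatial tokens, in the spirit of cross-covariance attention (as in XCiT). The class and parameter names are illustrative assumptions, not the authors' implementation; see the linked repository for the actual C2T-Net code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAwareSelfAttention(nn.Module):
    """Illustrative sketch: attention computed across channels, so the
    attention map has shape (C/h x C/h) per head and models inter-channel
    relationships instead of token-to-token relationships."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, C) tokens
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 4, 1)      # each: (B, heads, C/h, N)
        q = F.normalize(q, dim=-1)                # stabilise channel attention
        k = F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)).softmax(dim=-1)  # (B, heads, C/h, C/h)
        out = (attn @ v).permute(0, 3, 1, 2).reshape(B, N, C)
        return self.proj(out)

# Hypothetical usage on Swin-style window tokens:
tokens = torch.randn(2, 49, 96)                   # (batch, tokens, channels)
y = ChannelAwareSelfAttention(dim=96, num_heads=4)(tokens)  # same shape out
```

Because the attention map is channel-by-channel, its cost scales with channel width rather than token count, which is one common motivation for this style of attention.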

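The cross-fusion step can likewise be sketched as bidirectional cross-attention between the two branches, where each backbone's tokens query the other's. Again, this is a hedged illustration assuming both feature sets have already been projected to a shared embedding dimension; the module names, pooling, and head are hypothetical stand-ins for whatever the paper actually uses.

```python
import torch
import torch.nn as nn

class CrossFusion(nn.Module):
    """Illustrative sketch: each branch attends to the other so that local
    (SwinT) and global (EVA ViT) features mutually enhance one another."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.swin_from_vit = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vit_from_swin = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, f_swin: torch.Tensor, f_vit: torch.Tensor) -> torch.Tensor:
        # f_swin: (B, Ns, C), f_vit: (B, Nv, C), both in a shared dim C
        s = self.norm_s(f_swin + self.swin_from_vit(f_swin, f_vit, f_vit)[0])
        v = self.norm_v(f_vit + self.vit_from_swin(f_vit, f_swin, f_swin)[0])
        # Pool each branch and concatenate into one fused descriptor
        return torch.cat([s.mean(dim=1), v.mean(dim=1)], dim=-1)  # (B, 2C)

# Hypothetical usage for multi-label attribute prediction:
fusion = CrossFusion(dim=768)
f_swin = torch.randn(2, 49, 768)    # e.g. Swin stage-4 tokens
f_vit = torch.randn(2, 197, 768)    # e.g. EVA ViT tokens
logits = nn.Linear(2 * 768, 40)(fusion(f_swin, f_vit))  # 40 attributes
```

For PAR, a head like the final linear layer above would typically be trained with a per-attribute sigmoid and binary cross-entropy, since attributes are a multi-label target.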

Datasets

PA-100K · PETA · UPAR
Task                              Dataset  Model    Metric    Value   Global Rank  Uses Extra Training Data
Pedestrian Attribute Recognition  PA-100K  C2T-Net  Accuracy  87.2%   #4           —
Pedestrian Attribute Recognition  PETA     C2T-Net  Accuracy  88.20%  #2           —

Methods

Swin Transformer · EVA ViT · Channel-aware self-attention · Cross-fusion