Prompt Guided Transformer for Multi-Task Dense Prediction

28 Jul 2023  ·  Yuxiang Lu, Shalayiding Sirejiding, Yue Ding, Chunlin Wang, Hongtao Lu ·

Task-conditional architectures offer an advantage in parameter efficiency but fall short in performance compared to state-of-the-art multi-decoder methods. How to trade off performance against model parameters is therefore an important and difficult problem. In this paper, we introduce a simple and lightweight task-conditional model called the Prompt Guided Transformer (PGT) to address this trade-off. Our approach designs a Prompt-conditioned Transformer block that incorporates task-specific prompts into the self-attention mechanism, achieving global dependency modeling and parameter-efficient feature adaptation across multiple tasks. This block is integrated into both the shared encoder and the decoder, enhancing the capture of intra- and inter-task features. Moreover, we design a lightweight decoder that further reduces parameter usage, accounting for only 2.7% of the total model parameters. Extensive experiments on two multi-task dense prediction benchmarks, PASCAL-Context and NYUD-v2, demonstrate that our approach achieves state-of-the-art results among task-conditional methods while using fewer parameters, striking a favorable balance between performance and parameter size.
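The core mechanism described above — steering a shared transformer toward one task by injecting task-specific prompt tokens into self-attention — can be sketched as follows. This is a minimal single-head illustration in NumPy, not the paper's implementation: the function name `prompt_conditioned_attention`, the prompt shapes, and the toy dimensions are all assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prompt_conditioned_attention(tokens, prompt, Wq, Wk, Wv):
    """Single-head self-attention where task-specific prompt tokens are
    concatenated to the key/value sequence (a common prompt-conditioning
    scheme, assumed here for illustration).  The projection weights
    Wq/Wk/Wv are shared across tasks; only the small prompt differs per
    task, which is what makes the adaptation parameter-efficient."""
    kv_in = np.concatenate([prompt, tokens], axis=0)  # (P+N, d)
    Q = tokens @ Wq          # queries come from the image tokens only
    K = kv_in @ Wk           # keys/values see prompt + image tokens
    V = kv_in @ Wv
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d))  # (N, P+N) attention weights
    return attn @ V          # (N, d): task-adapted token features

# Toy usage: 4 image tokens, a 2-token task prompt, embedding dim 8.
rng = np.random.default_rng(0)
d = 8
tokens = rng.standard_normal((4, d))
seg_prompt = rng.standard_normal((2, d))   # learnable per task in practice
Wq, Wk, Wv = [rng.standard_normal((d, d)) for _ in range(3)]
out = prompt_conditioned_attention(tokens, seg_prompt, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

Because the image tokens and weights are shared, swapping in a different task's prompt changes the attention pattern, and hence the output features, without any task-specific projection matrices.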


Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Semantic Segmentation | NYU Depth v2 | PGT (Swin-S) | Mean IoU | 46.43 | # 78 |
| Semantic Segmentation | NYU Depth v2 | PGT (Swin-T) | Mean IoU | 41.61 | # 94 |
| Surface Normal Estimation | NYU-Depth V2 | PGT (Swin-S) | Mean Angle Error | 19.24 | # 3 |
| Surface Normal Estimation | NYU-Depth V2 | PGT (Swin-T) | Mean Angle Error | 20.06 | # 4 |
| Boundary Detection | NYU-Depth V2 | PGT (Swin-S) | odsF | 78.04 | # 2 |
| Boundary Detection | NYU-Depth V2 | PGT (Swin-T) | odsF | 77.05 | # 3 |
| Monocular Depth Estimation | NYU-Depth V2 | PGT (Swin-S) | RMSE | 0.5468 | # 64 |
| Monocular Depth Estimation | NYU-Depth V2 | PGT (Swin-T) | RMSE | 0.59 | # 69 |
