no code implementations • 6 May 2024 • Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Jameel Hassan, Muzammal Naseer, Federico Tombari, Fahad Shahbaz Khan, Salman Khan
Recent advancements in Large Language Models (LLMs) have led to the development of Video Large Multi-modal Models (Video-LMMs) that can handle a wide range of video understanding tasks.
1 code implementation • 4 Jan 2024 • Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Muzammal Naseer, Luc van Gool, Federico Tombari
While effective, most of these works require labeled data, which is often impractical, and they struggle to generalize to new datasets due to over-fitting on the source data.
no code implementations • NeurIPS 2023 • Jameel Hassan, Hanan Gani, Noor Hussein, Muhammad Uzair Khattak, Muzammal Naseer, Fahad Shahbaz Khan, Salman Khan
The promising zero-shot generalization of vision-language models such as CLIP has led to their adoption, via prompt learning, for numerous downstream tasks.
2 code implementations • ICCV 2023 • Syed Talal Wasim, Muhammad Uzair Khattak, Muzammal Naseer, Salman Khan, Mubarak Shah, Fahad Shahbaz Khan
Video transformer designs are based on self-attention that can model global context at a high computational cost.
Ranked #1 on Action Recognition on Diving-48
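The entry above notes that video transformers rely on self-attention, which models global context at high computational cost. A minimal NumPy sketch (not the paper's actual architecture; all shapes and weights here are toy placeholders) shows where that cost comes from: scaled dot-product attention builds an n × n score matrix, so compute and memory grow quadratically with the number of tokens, which is especially severe for video, where tokens span space and time.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a token sequence.

    x: (n, d) token embeddings. The (n, n) score matrix below is what
    makes global context quadratic in sequence length n.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (n, n): every token attends to every token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v                               # (n, d) context-mixed outputs

rng = np.random.default_rng(0)
n, d = 8, 16                                         # toy sizes: 8 tokens, 16-dim embeddings
x = rng.standard_normal((n, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)
```

For a video clip, n is (frames × patches per frame), so doubling either the spatial or temporal resolution quadruples the attention cost, which is what motivates the more efficient designs this line of work explores.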
2 code implementations • ICCV 2023 • Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan
To the best of our knowledge, this is the first regularization framework for prompt learning that avoids overfitting by jointly attending to pre-trained model features, the training trajectory during prompting, and the textual diversity.
Ranked #2 on Prompt Engineering on ImageNet V2
1 code implementation • CVPR 2023 • Hanoona Rasheed, Muhammad Uzair Khattak, Muhammad Maaz, Salman Khan, Fahad Shahbaz Khan
Since training on a similar scale for videos is infeasible, recent approaches focus on the effective transfer of image-based CLIP to the video domain.
2 code implementations • CVPR 2023 • Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, Fahad Shahbaz Khan
Pre-trained vision-language (V-L) models such as CLIP have shown excellent generalization ability to downstream tasks.
Ranked #2 on Prompt Engineering on ImageNet-A
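Several of the entries above build on CLIP's zero-shot classification, where an image is assigned to the class whose text-prompt embedding is most similar to the image embedding. A minimal sketch of that mechanism, using random vectors as stand-ins for the real image and text encoders (the encoders, dimensions, and embeddings here are illustrative assumptions, not CLIP's actual components):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs):
    """CLIP-style zero-shot classification by cosine similarity.

    image_emb: (d,) embedding of one image.
    text_embs: (k, d), one row per class prompt such as
    "a photo of a {class}". Returns the index of the best class.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                     # cosine similarity with each class prompt
    return int(np.argmax(sims)), sims

rng = np.random.default_rng(0)
text_embs = rng.standard_normal((3, 32))                   # 3 classes, 32-dim toy embeddings
image_emb = text_embs[1] + 0.1 * rng.standard_normal(32)   # image close to class 1
pred, sims = zero_shot_classify(image_emb, text_embs)
print(pred)  # 1
```

Prompt learning, the subject of these papers, replaces the hand-written prompt text with learnable context vectors that are optimized on downstream data while the CLIP encoders stay frozen.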
1 code implementation • 7 Jul 2022 • Hanoona Rasheed, Muhammad Maaz, Muhammad Uzair Khattak, Salman Khan, Fahad Shahbaz Khan
Two popular forms of weak supervision used in open-vocabulary detection (OVD) are pretrained CLIP models and image-level supervision.
Ranked #1 on Open Vocabulary Object Detection on OpenImages-v4