no code implementations • 15 Jan 2024 • Darshan Singh S, Zeeshan Khan, Makarand Tapaswi
We use the SRL and verb information to create rule-based detailed captions, making sure they capture most of the visual concepts.
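The rule-based composition from SRL output can be sketched as follows; the role labels (ARG0, V, ARG1, …) and their ordering are illustrative assumptions, not the paper's exact rules.

```python
def caption_from_srl(frame):
    """Compose a caption from one SRL frame (a verb plus labelled
    arguments) by emitting roles in a fixed agent-verb-patient order.
    The role inventory and ordering here are illustrative assumptions."""
    order = ["ARG0", "V", "ARG1", "ARG2", "ARGM-LOC", "ARGM-MNR"]
    return " ".join(frame[role] for role in order if role in frame)
```

For example, `caption_from_srl({"ARG0": "a man", "V": "opens", "ARG1": "the door"})` yields `"a man opens the door"`.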
no code implementations • 26 Nov 2023 • Prajneya Kumar, Eshika Khandelwal, Makarand Tapaswi, Vishnu Sreekumar
Understanding the factors that determine video memorability has important applications in areas such as educational technology and advertising.
no code implementations • 8 Sep 2023 • Aroof Aimen, Arsh Verma, Makarand Tapaswi, Narayanan C. Krishnan
Real-world application of chest X-ray abnormality classification requires dealing with several challenges: (i) limited training data; (ii) training and evaluation sets that are derived from different domains; and (iii) classes that appear during training may have partial overlap with classes of interest during evaluation.
1 code implementation • CVPR 2023 • Dhruv Srivastava, Aditya Kumar Singh, Makarand Tapaswi
Towards this goal, we formulate emotion understanding as predicting a diverse and multi-label set of emotions at the level of a movie scene and for each character.
no code implementations • 22 Mar 2023 • Dhaval Taunk, Lakshya Khanna, Pavan Kandru, Vasudeva Varma, Charu Sharma, Makarand Tapaswi
Commonsense question-answering (QA) methods combine the power of pre-trained Language Models (LM) with the reasoning provided by Knowledge Graphs (KG).
Ranked #8 on Question Answering on OpenBookQA
1 code implementation • CVPR 2023 • Piyush Bagad, Makarand Tapaswi, Cees G. M. Snoek
Our work serves as a first step towards probing and instilling a sense of time in existing video-language models without the need for data and compute-intense training from scratch.
Ranked #1 on Video-Text Retrieval on Test-of-Time (using extra training data)
no code implementations • 2 Dec 2022 • Jaidev Shriram, Makarand Tapaswi, Vinoo Alluri
Reading, much like music listening, is an immersive experience that transports readers while taking them on an emotional journey.
no code implementations • 23 Nov 2022 • Arsh Verma, Makarand Tapaswi
The chest radiograph (chest X-ray, CXR) is a popular medical imaging modality used by radiologists across the world to diagnose heart or lung conditions.
1 code implementation • 17 Nov 2022 • ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev
In this work we propose a language-conditioned transformer model for grounding 3D objects and their spatial relations.
no code implementations • 29 Oct 2022 • Darshan Singh S, Anchit Gupta, C. V. Jawahar, Makarand Tapaswi
We formulate lecture segmentation as an unsupervised task that leverages visual, textual, and OCR cues from the lecture, while clip representations are fine-tuned on a pretext self-supervised task of matching the narration with the temporally aligned visual content.
no code implementations • 19 Oct 2022 • Zeeshan Khan, C. V. Jawahar, Makarand Tapaswi
Recently, Video Situation Recognition (VidSitu) has been framed as a structured prediction task over multiple events, their relationships, and verb-role pairs attached to descriptive entities.
2 code implementations • 11 Sep 2022 • Pierre-Louis Guhur, ShiZhe Chen, Ricardo Garcia, Makarand Tapaswi, Ivan Laptev, Cordelia Schmid
In human environments, robots are expected to accomplish a variety of manipulation tasks given simple natural language instructions.
Ranked #2 on Robot Manipulation on RLBench (Succ. Rate (10 tasks, 100 demos/task) metric)
1 code implementation • 24 Aug 2022 • ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev
Our resulting HM3D-AutoVLN dataset is an order of magnitude larger than existing VLN datasets in terms of navigation environments and instructions.
Ranked #1 on Visual Navigation on SOON Test
2 code implementations • 3 Aug 2022 • Vladimir Petrik, Mohammad Nomaan Qureshi, Josef Sivic, Makarand Tapaswi
We evaluate our approach on a 3D reconstruction task that consists of 54 video demonstrations sourced from 9 actions such as "pull something from right to left" or "put something in front of something".
1 code implementation • CVPR 2022 • ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev
To balance the complexity of large action space reasoning and fine-grained language grounding, we dynamically combine a fine-scale encoding over local observations and a coarse-scale encoding on a global map via graph transformers.
Ranked #4 on Visual Navigation on SOON Test
1 code implementation • 10 Nov 2021 • Rahul Vigneswaran, Marc T. Law, Vineeth N. Balasubramanian, Makarand Tapaswi
Oversampling instances of the tail classes attempts to solve this imbalance.
Ranked #1 on Long-tail Learning on mini-ImageNet-LT
2 code implementations • ICCV 2021 • Pierre-Louis Guhur, Makarand Tapaswi, ShiZhe Chen, Ivan Laptev, Cordelia Schmid
Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging.
Ranked #3 on Vision and Language Navigation on VLN Challenge
1 code implementation • 13 Nov 2020 • Vladimír Petrík, Makarand Tapaswi, Ivan Laptev, Josef Sivic
We evaluate our method on simple single- and two-object actions from the Something-Something dataset.
no code implementations • 5 Apr 2020 • Vivek Sharma, Makarand Tapaswi, M. Saquib Sarfraz, Rainer Stiefelhagen
We demonstrate our method on the challenging task of learning representations for video face clustering.
1 code implementation • 5 Apr 2020 • Vivek Sharma, Makarand Tapaswi, Rainer Stiefelhagen
True understanding of videos comes from a joint analysis of all its modalities: the video frames, the audio track, and any accompanying text such as closed captions.
1 code implementation • CVPR 2020 • Anna Kukleva, Makarand Tapaswi, Ivan Laptev
Localizing the pair of interacting characters in video is a time-consuming process; instead, we train our model to learn from clip-level weak labels.
1 code implementation • 30 Dec 2019 • Atef Chaudhury, Makarand Tapaswi, Seung Wook Kim, Sanja Fidler
Understanding stories is a challenging reading comprehension problem for machines as it requires reading a large volume of text and following long-range dependencies.
1 code implementation • ICCV 2019 • Makarand Tapaswi, Marc T. Law, Sanja Fidler
Understanding videos such as TV series and movies requires analyzing who the characters are and what they are doing.
4 code implementations • ICCV 2019 • Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic
In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations.
Ranked #4 on Temporal Action Localization on CrossTask
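Learning a joint embedding from videos paired with transcribed narrations can be illustrated with a standard contrastive objective, where matched video/narration pairs sit on the diagonal of a similarity matrix. This InfoNCE-style loss is a common choice and an assumption here, not necessarily the paper's exact objective.

```python
import numpy as np


def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """InfoNCE-style loss over a batch of (video, narration) pairs.
    Row i of each matrix is one clip / its transcribed narration;
    matched pairs form the diagonal of the similarity matrix."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # pairwise cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(np.diag(probs)).mean()   # pull matched pairs together
```

The appeal of narration supervision is that these positive pairs come for free from temporal alignment, with no manual annotation.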
1 code implementation • 3 Mar 2019 • Vivek Sharma, Makarand Tapaswi, M. Saquib Sarfraz, Rainer Stiefelhagen
In this paper, we address video face clustering using unsupervised methods.
1 code implementation • ICLR 2019 • Seung Wook Kim, Makarand Tapaswi, Sanja Fidler
Thus, a module for a new task learns to query existing modules and composes their outputs in order to produce its own output.
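The compositional idea — a new task module querying existing modules and combining their outputs — can be sketched as below; the functional form is an assumption for illustration, not the paper's architecture.

```python
def new_module(existing_modules, combine):
    """Build a module for a new task that queries a list of existing
    modules on the same input and composes their outputs with a learned
    (here: user-supplied) combination function. Sketch only."""
    def forward(x):
        outputs = [module(x) for module in existing_modules]
        return combine(outputs)
    return forward
```

For instance, with two existing modules `lambda x: x + 1` and `lambda x: x * 2` and `sum` as the combiner, the new module maps 3 to 10.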
no code implementations • CVPR 2018 • Yuhao Zhou, Makarand Tapaswi, Sanja Fidler
We are interested in enabling automatic 4D cinema by parsing physical and special effects from untrimmed movies.
no code implementations • CVPR 2018 • Paul Vicol, Makarand Tapaswi, Lluis Castrejon, Sanja Fidler
Towards this goal, we introduce a novel dataset called MovieGraphs which provides detailed, graph-based annotations of social situations depicted in movie clips.
1 code implementation • ICCV 2017 • Ruiyu Li, Makarand Tapaswi, Renjie Liao, Jiaya Jia, Raquel Urtasun, Sanja Fidler
We address the problem of recognizing situations in images.
Ranked #9 on Situation Recognition on imSitu
no code implementations • 22 Nov 2016 • Manuel Martinez, Monica Haurilet, Ziad Al-Halah, Makarand Tapaswi, Rainer Stiefelhagen
The Earth Mover's Distance (EMD) computes the optimal cost of transforming one distribution into another, given a known transport metric between them.
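In one dimension with unit ground distance between adjacent bins, the EMD reduces to the L1 distance between the two cumulative distributions, which admits a compact sketch:

```python
def emd_1d(p, q):
    """Earth Mover's Distance between two histograms on the same
    equally spaced bins, with unit transport cost between neighbouring
    bins. In 1D this equals the L1 distance between the CDFs."""
    assert len(p) == len(q)
    assert abs(sum(p) - sum(q)) < 1e-9  # equal total mass required
    cost, carry = 0.0, 0.0
    for pi, qi in zip(p, q):
        carry += pi - qi   # mass that must still be moved rightwards
        cost += abs(carry)  # moving it one bin costs |carry|
    return cost
```

For example, moving a unit of mass two bins over, `emd_1d([1, 0, 0], [0, 0, 1])`, costs 2.0; the general (multi-dimensional) EMD instead requires solving a transportation linear program.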
no code implementations • CVPR 2016 • Ziad Al-Halah, Makarand Tapaswi, Rainer Stiefelhagen
In this work, we aim to carry out attribute-based zero-shot classification in an unsupervised manner.
1 code implementation • CVPR 2016 • Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, Sanja Fidler
We introduce the MovieQA dataset which aims to evaluate automatic story comprehension from both video and text.
no code implementations • CVPR 2015 • Makarand Tapaswi, Martin Bauml, Rainer Stiefelhagen
Such an alignment facilitates finding differences between the adaptation and the original source, and also acts as a basis for deriving rich descriptions from the novel for the video clips.
1 code implementation • CVPR 2014 • Makarand Tapaswi, Martin Bauml, Rainer Stiefelhagen
We present a novel way to automatically summarize and represent the storyline of a TV episode by visualizing character interactions as a chart.
no code implementations • CVPR 2013 • Martin Bauml, Makarand Tapaswi, Rainer Stiefelhagen
We address the problem of person identification in TV series.