Multimodal Reasoning
37 papers with code • 3 benchmarks • 4 datasets
Reasoning jointly over inputs from multiple modalities, such as images and text.
Most implemented papers
e-SNLI-VE: Corrected Visual-Textual Entailment with Natural Language Explanations
The recently proposed SNLI-VE corpus for recognising visual-textual entailment is a large, real-world dataset for fine-grained multimodal reasoning.
Dual Attention Networks for Multimodal Reasoning and Matching
We propose Dual Attention Networks (DANs) which jointly leverage visual and textual attention mechanisms to capture fine-grained interplay between vision and language.
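The abstract only gestures at the mechanism, so below is a minimal, hypothetical sketch of a single dual-attention step in PyTorch: separate attention over visual regions and text tokens, conditioned on a shared memory vector and fused into an updated memory. The layer sizes, scoring function, and fusion are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only: one dual-attention step over visual regions and
# text tokens, conditioned on a shared memory vector. Shapes and layers are
# hypothetical, not the DAN paper's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttentionStep(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.vis_score = nn.Linear(dim, 1)   # scores visual regions
        self.txt_score = nn.Linear(dim, 1)   # scores text tokens
        self.fuse = nn.Linear(2 * dim, dim)  # merges the two attended vectors

    def forward(self, visual, textual, memory):
        # visual: (B, R, D) region features; textual: (B, T, D) token
        # features; memory: (B, D) joint context from the previous step.
        v_att = F.softmax(self.vis_score(visual * memory.unsqueeze(1)), dim=1)
        t_att = F.softmax(self.txt_score(textual * memory.unsqueeze(1)), dim=1)
        v_ctx = (v_att * visual).sum(dim=1)   # attended visual summary
        t_ctx = (t_att * textual).sum(dim=1)  # attended textual summary
        return torch.tanh(self.fuse(torch.cat([v_ctx, t_ctx], dim=-1)))

# Toy usage with random features; iterating this step refines the memory.
step = DualAttentionStep(dim=512)
memory = torch.zeros(2, 512)
memory = step(torch.randn(2, 36, 512), torch.randn(2, 12, 512), memory)
print(memory.shape)  # torch.Size([2, 512])
```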
WebQA: Multihop and Multimodal QA
Scaling Visual Question Answering (VQA) to the open-domain and multi-hop nature of web searches requires fundamental advances in visual representation learning, knowledge aggregation, and language generation.
Multimodal Analogical Reasoning over Knowledge Graphs
Analogical reasoning is fundamental to human cognition and holds an important place in various fields.
Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Language Models
We propose Graph-of-Thought (GoT) reasoning, which models human thought processes not only as a chain but also as a graph.
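As a toy illustration of the chain-versus-graph distinction (not the paper's GoT model), the sketch below encodes thought steps as graph nodes whose edges may merge several earlier thoughts, then evaluates them in topological order; the `derive` function is a hypothetical stand-in for a model call.

```python
# Minimal sketch of reasoning as a graph rather than a chain: nodes are
# thought steps, edges let later steps combine several earlier ones.
# Hypothetical illustration of the general idea, not the GoT paper's model.
from graphlib import TopologicalSorter

# Each thought depends on zero or more earlier thoughts (its parents).
thought_parents = {
    "premise_a": [],
    "premise_b": [],
    "combine": ["premise_a", "premise_b"],  # merges two branches: impossible in a chain
    "conclude": ["combine"],
}

def derive(thought, parent_results):
    # Stand-in for a model call that produces the next thought.
    return f"{thought}({', '.join(parent_results)})" if parent_results else thought

results = {}
for node in TopologicalSorter(thought_parents).static_order():
    results[node] = derive(node, [results[p] for p in thought_parents[node]])

print(results["conclude"])  # conclude(combine(premise_a, premise_b))
```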
Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious Challenges in Multimodal Reasoning
We present AlgoPuzzleVQA, a new dataset designed to challenge and evaluate the capabilities of multimodal language models in solving algorithmic puzzles that require visual understanding, language understanding, and complex algorithmic reasoning.
PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns
As recognizing patterns and abstracting concepts are key to general intelligence, we introduce PuzzleVQA, a collection of puzzles based on abstract patterns.
DMRM: A Dual-channel Multi-hop Reasoning Model for Visual Dialog
Visual Dialog is a vision-language task that requires an AI agent to engage in a conversation with humans grounded in an image.
A Multimodal Framework for the Detection of Hateful Memes
Online hate speech is increasingly multimodal in nature, often taking the form of memes.
UniT: Multimodal Multitask Learning with a Unified Transformer
We propose UniT, a Unified Transformer model to simultaneously learn the most prominent tasks across different domains, ranging from object detection to natural language understanding and multimodal reasoning.
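For intuition, here is a hedged sketch of the shared-backbone, task-specific-heads pattern that the UniT abstract alludes to; the layer sizes, task names, and heads are hypothetical placeholders rather than the paper's architecture.

```python
# Hedged sketch of a shared transformer encoder with per-task output heads.
# Dimensions, tasks, and pooling are assumptions for illustration only.
import torch
import torch.nn as nn

class SharedBackboneMultitask(nn.Module):
    def __init__(self, dim=256, num_classes=None):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # shared across tasks
        num_classes = num_classes or {"vqa": 10, "nlu": 3}  # hypothetical tasks
        self.heads = nn.ModuleDict(
            {task: nn.Linear(dim, n) for task, n in num_classes.items()}
        )

    def forward(self, tokens, task):
        # tokens: (B, T, D) already-embedded inputs from any modality.
        pooled = self.encoder(tokens).mean(dim=1)  # simple mean pooling
        return self.heads[task](pooled)            # task-specific output head

model = SharedBackboneMultitask()
logits = model(torch.randn(2, 8, 256), task="vqa")
print(logits.shape)  # torch.Size([2, 10])
```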