Caption Generation
89 papers with code • 1 benchmarks • 1 datasets
Libraries
Use these libraries to find Caption Generation models and implementationsMost implemented papers
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images.
Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks
Over the last decade, Convolutional Neural Network (CNN) models have been highly successful in solving complex vision problems.
Recurrent Neural Network Regularization
We present a simple regularization technique for Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units.
Microsoft COCO Captions: Data Collection and Evaluation Server
In this paper we describe the Microsoft COCO Caption dataset and evaluation server.
Where to put the Image in an Image Caption Generator
When a recurrent neural network language model is used for caption generation, the image information can be fed to the neural network either by directly incorporating it in the RNN -- conditioning the language model by `injecting' image features -- or in a layer following the RNN -- conditioning the language model by `merging' image features.
Scalable Bayesian Optimization Using Deep Neural Networks
Bayesian optimization is an effective methodology for the global optimization of functions with expensive evaluations.
Sequence to Sequence -- Video to Text
Our LSTM model is trained on video-sentence pairs and learns to associate a sequence of video frames to a sequence of words in order to generate a description of the event in the video clip.
An Actor-Critic Algorithm for Sequence Prediction
We present an approach to training neural networks to generate sequences using actor-critic methods from reinforcement learning (RL).
Deep Reinforcement Learning For Sequence to Sequence Models
In this survey, we consider seq2seq problems from the RL point of view and provide a formulation combining the power of RL methods in decision-making with sequence-to-sequence models that enable remembering long-term memories.
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
The availability of large-scale image captioning and visual question answering datasets has contributed significantly to recent successes in vision-and-language pre-training.