no code implementations • 3 Apr 2024 • Gabriela Ben Melech Stan, Raanan Yehezkel Rohekar, Yaniv Gurwicz, Matthew Lyle Olson, Anahita Bhiwandiwalla, Estelle Aflalo, Chenfei Wu, Nan Duan, Shao-Yen Tseng, Vasudev Lal
In this work, we present a novel interactive application aimed at understanding the internal mechanisms of large vision-language models.
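For illustration, a hedged sketch of the kind of raw signal such a tool exposes: per-layer attention weights from a vision-language model served through Hugging Face transformers (the checkpoint and prompt template below are assumptions, not necessarily the application's actual backend):

```python
# Sketch: request attention maps from a VLM via output_attentions=True.
# Checkpoint and prompt template are illustrative assumptions.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("example.jpg")
prompt = "USER: <image>\nWhat is in this picture? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions holds one (batch, heads, seq, seq) tensor per layer; the
# rows for text tokens against the columns for image tokens are the raw
# text-to-image attention an interactive viewer would render as heatmaps.
print(len(out.attentions), out.attentions[0].shape)
```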
no code implementations • 29 Mar 2024 • Musashi Hinck, Matthew L. Olson, David Cobbley, Shao-Yen Tseng, Vasudev Lal
We train a suite of multimodal foundation models (MMFMs) using the popular LLaVA framework with the recently released Gemma family of large language models (LLMs).
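A minimal sketch of this recipe's wiring, assuming a CLIP vision tower, a small Gemma checkpoint, and a single linear projector (the paper's exact configuration may differ):

```python
# LLaVA-style wiring sketch: a CLIP vision tower feeding a Gemma LLM through
# a learned linear projector. Checkpoint names and the single-layer projector
# are illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, CLIPVisionModel

vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
llm = AutoModelForCausalLM.from_pretrained("google/gemma-2b")

# Project visual patch features into the LLM's embedding space.
projector = nn.Linear(vision_tower.config.hidden_size, llm.config.hidden_size)

pixel_values = torch.randn(1, 3, 336, 336)                   # dummy preprocessed image
patch_feats = vision_tower(pixel_values).last_hidden_state   # (1, patches+1, vis_dim)
visual_tokens = projector(patch_feats)                       # (1, patches+1, llm_dim)

# In the LLaVA recipe, these visual tokens are spliced into the text embedding
# sequence before the LLM forward pass; training follows the usual two stages
# (projector warm-up, then instruction tuning).
```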
no code implementations • 6 Nov 2023 • Gabriela Ben Melech Stan, Diana Wofk, Estelle Aflalo, Shao-Yen Tseng, Zhipeng Cai, Michael Paulitsch, Vasudev Lal
Our models are fine-tuned from existing pretrained models on datasets containing panoramic/high-resolution RGB images, depth maps and captions.
1 code implementation • 31 May 2023 • Xiao Xu, Bei Li, Chenfei Wu, Shao-Yen Tseng, Anahita Bhiwandiwalla, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan
With only 4M VLP data, ManagerTower achieves superior performance on various downstream VL tasks, notably 79.15% accuracy on VQAv2 Test-Std, 86.56% IR@1 and 95.64% TR@1 on Flickr30K.
2 code implementations • 18 May 2023 • Gabriela Ben Melech Stan, Diana Wofk, Scottie Fox, Alex Redden, Will Saxton, Jean Yu, Estelle Aflalo, Shao-Yen Tseng, Fabio Nonato, Matthias Muller, Vasudev Lal
This paper proposes a Latent Diffusion Model for 3D (LDM3D) that generates both an image and a depth map from a given text prompt, allowing users to create RGBD images from text.
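LDM3D checkpoints were released on the Hugging Face Hub; a usage sketch, assuming the Intel/ldm3d-4c checkpoint and the StableDiffusionLDM3DPipeline from diffusers (adjust to the checkpoint card if names differ):

```python
# Sketch of text-to-RGBD generation with a released LDM3D pipeline.
import torch
from diffusers import StableDiffusionLDM3DPipeline

pipe = StableDiffusionLDM3DPipeline.from_pretrained(
    "Intel/ldm3d-4c", torch_dtype=torch.float16
).to("cuda")

output = pipe("a photo of a cozy reading nook, warm light")
rgb = output.rgb[0]      # PIL image
depth = output.depth[0]  # aligned depth map
rgb.save("nook_rgb.png")
depth.save("nook_depth.png")
```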
no code implementations • 24 Aug 2022 • Avinash Madasu, Estelle Aflalo, Gabriela Ben Melech Stan, Shachar Rosenman, Shao-Yen Tseng, Gedas Bertasius, Vasudev Lal
In this paper, we propose MuMUR, a framework that utilizes knowledge transfer from a multilingual model to boost the performance of multi-modal (image and video) retrieval.
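A generic sketch of such knowledge transfer (not MuMUR's exact objective): regress the multimodal text encoder onto a frozen multilingual teacher so that non-English queries map near their English counterparts:

```python
# Generic cross-model transfer sketch: pull the multimodal (student) text
# embedding toward a frozen multilingual (teacher) embedding of the same
# query. Placeholder objective, not the paper's exact loss.
import torch.nn.functional as F

def transfer_loss(student_emb, teacher_emb):
    """student_emb, teacher_emb: (batch, dim) embeddings of the same queries."""
    return F.mse_loss(F.normalize(student_emb, dim=-1),
                      F.normalize(teacher_emb, dim=-1).detach())
```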
1 code implementation • CVPR 2022 • Estelle Aflalo, Meng Du, Shao-Yen Tseng, Yongfei Liu, Chenfei Wu, Nan Duan, Vasudev Lal
Breakthroughs in transformer-based models have revolutionized not only the NLP field, but also vision and multimodal systems.
no code implementations • 8 Feb 2022 • Vin Sachidananda, Shao-Yen Tseng, Erik Marchi, Sachin Kajarekar, Panayiotis Georgiou
By aligning audio representations to pretrained language representations and utilizing contrastive information between acoustic inputs, CALM is able to bootstrap audio embeddings competitive with existing audio representation models in only a few hours of training time.
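The alignment objective can be sketched as a symmetric InfoNCE loss over paired audio/text embeddings (the temperature and the encoders producing the embeddings are placeholders, not the paper's architecture):

```python
# Symmetric InfoNCE sketch for audio-to-language alignment.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(audio_emb, text_emb, temperature=0.07):
    """audio_emb, text_emb: (batch, dim) embeddings of paired utterances/transcripts."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(len(logits), device=logits.device)
    # Matched pairs sit on the diagonal; off-diagonal entries act as negatives.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```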
1 code implementation • Findings (NAACL) 2022 • Yongfei Liu, Chenfei Wu, Shao-Yen Tseng, Vasudev Lal, Xuming He, Nan Duan
Self-supervised vision-and-language pretraining (VLP) aims to learn transferable multi-modal representations from large-scale image-text data and to achieve strong performance on a broad range of vision-language tasks after finetuning.
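For context, one standard VLP objective, image-text matching (ITM), in sketch form (a generic illustration, not this paper's specific method):

```python
# ITM sketch: a binary head over the fused [CLS] representation decides
# whether an image and caption form a true pair. Generic illustration only.
import torch
import torch.nn as nn

class ITMHead(nn.Module):
    def __init__(self, hidden_dim=768):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, 2)  # matched vs. mismatched

    def forward(self, fused_cls):                   # (batch, hidden_dim)
        return self.classifier(fused_cls)

head = ITMHead()
logits = head(torch.randn(4, 768))  # e.g. 2 true pairs + 2 shuffled negatives
```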
1 code implementation • 10 Sep 2019 • Shao-Yen Tseng, Panayiotis Georgiou, Shrikanth Narayanan
Word embeddings such as ELMo have recently been shown to model word semantics more effectively through contextualized learning on large-scale language corpora, yielding significant improvements in the state of the art across many natural language tasks.
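The core property is easy to demonstrate: the same word receives different vectors in different contexts. A sketch using a generic transformer encoder as a stand-in for the ELMo model discussed in the paper:

```python
# Contextualized-embedding demo: "bank" gets a different vector depending on
# its sentence. BERT is used here as a stand-in for ELMo.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = enc(**inputs).last_hidden_state[0]  # (seq_len, dim)
    idx = tok.convert_ids_to_tokens(inputs["input_ids"][0]).index(word)
    return hidden[idx]

v1 = word_vector("the bank raised interest rates", "bank")
v2 = word_vector("we sat on the river bank", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))  # < 1.0: context changes the vector
```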
no code implementations • 31 Aug 2019 • Prashanth Gurunath Shivakumar, Shao-Yen Tseng, Panayiotis Georgiou, Shrikanth Narayanan
In this work we draw motivation from psycholinguistics and propose incorporating behavioral information into the context of language modeling.
no code implementations • 2 Aug 2019 • Sandeep Nallan Chakravarthula, Haoqi Li, Shao-Yen Tseng, Maija Reblin, Panayiotis Georgiou
Cancer impacts the quality of life of both those diagnosed and their spouse caregivers, and can also influence their day-to-day behaviors.
no code implementations • 18 Jul 2018 • Shao-Yen Tseng, Brian Baucom, Panayiotis Georgiou
Unsupervised learning has been an attractive method for easily deriving meaningful data representations from vast amounts of unlabeled data.