TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Visual Question Answering (VQA)	GQA test-dev	CuMo-7B	Accuracy	64.9	# 3
visual instruction following	LLaVA-Bench	CuMo-7B	avg score	85.7	# 1
Visual Question Answering	MMBench	CuMo-7B	GPT-3.5 score	73.0	# 1
Visual Question Answering	MM-Vet	CuMo-7B	GPT-4 score	51.0	# 18
Visual Question Answering	MM-Vet	CuMo-7B	Params	7B	# 1
Visual Question Answering (VQA)	VQA v2 test-dev	CuMo-7B	Accuracy	82.2	# 6

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/cumo-scaling-multimodal-llm-with-co-upcycled/visual-instruction-following-on-llava-bench)](https://paperswithcode.com/sota/visual-instruction-following-on-llava-bench?p=cumo-scaling-multimodal-llm-with-co-upcycled)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/cumo-scaling-multimodal-llm-with-co-upcycled/visual-question-answering-on-mmbench)](https://paperswithcode.com/sota/visual-question-answering-on-mmbench?p=cumo-scaling-multimodal-llm-with-co-upcycled)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/cumo-scaling-multimodal-llm-with-co-upcycled/visual-question-answering-on-gqa-test-dev)](https://paperswithcode.com/sota/visual-question-answering-on-gqa-test-dev?p=cumo-scaling-multimodal-llm-with-co-upcycled)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/cumo-scaling-multimodal-llm-with-co-upcycled/visual-question-answering-on-vqa-v2-test-dev)](https://paperswithcode.com/sota/visual-question-answering-on-vqa-v2-test-dev?p=cumo-scaling-multimodal-llm-with-co-upcycled)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/cumo-scaling-multimodal-llm-with-co-upcycled/visual-question-answering-on-mm-vet)](https://paperswithcode.com/sota/visual-question-answering-on-mm-vet?p=cumo-scaling-multimodal-llm-with-co-upcycled)`

CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts

9 May 2024 · Jiachen Li, Xinyao Wang, Sijie Zhu, Chia-Wen Kuo, Lu Xu, Fan Chen, Jitesh Jain, Humphrey Shi, Longyin Wen

Recent advancements in Multimodal Large Language Models (LLMs) have focused primarily on scaling, both by increasing the amount of text-image pair data and by enhancing the underlying LLMs, to improve performance on multimodal tasks. However, these scaling approaches are computationally expensive and overlook the significance of improving model capabilities from the vision side. Inspired by the successful application of Mixture-of-Experts (MoE) in LLMs, which improves model scalability during training while keeping inference costs similar to those of smaller models, we propose CuMo. CuMo incorporates co-upcycled Top-K sparsely-gated Mixture-of-Experts blocks into both the vision encoder and the MLP connector, thereby enhancing multimodal LLMs with minimal additional activated parameters during inference. CuMo first pre-trains the MLP blocks and then, during the visual instruction tuning stage, initializes each expert in the MoE block from the pre-trained MLP block. Auxiliary losses are used to keep the load balanced across experts. CuMo outperforms state-of-the-art multimodal LLMs within each model size group across various VQA and visual-instruction-following benchmarks, while training exclusively on open-sourced datasets. The code and model weights for CuMo are open-sourced at https://github.com/SHI-Labs/CuMo.
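The upcycling and Top-K routing described in the abstract can be illustrated with a small, dependency-free sketch. This is not the CuMo implementation: the function names (`upcycle`, `moe_forward`, `load_balance_loss`), the two-layer ReLU MLP, the renormalized softmax gate, and the Switch-Transformer-style balancing term are all assumptions chosen for illustration, since the abstract does not specify the exact expert architecture or the form of the auxiliary losses.

```python
import copy
import math
import random

def mlp(weights, x):
    """Two-layer MLP with ReLU; weights = (W1, W2) as nested lists (illustrative)."""
    w1, w2 = weights
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def upcycle(pretrained_mlp, num_experts):
    """Initialize every expert as an independent copy of the pre-trained MLP."""
    return [copy.deepcopy(pretrained_mlp) for _ in range(num_experts)]

def moe_forward(experts, router, x, k):
    """Route one token x to its top-k experts and mix their outputs."""
    probs = softmax([sum(w * xi for w, xi in zip(row, x)) for row in router])
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the selected experts
    out = [0.0] * len(experts[0][1])
    for i in top:
        y = mlp(experts[i], x)
        out = [o + (probs[i] / norm) * yi for o, yi in zip(out, y)]
    return out, probs, top

def load_balance_loss(all_probs, all_top, num_experts):
    """Assumed auxiliary loss: N * sum_i f_i * P_i, where f_i is the share of
    routing slots given to expert i and P_i is its mean router probability."""
    n_tokens = len(all_probs)
    total_slots = sum(len(t) for t in all_top)
    frac = [sum(t.count(i) for t in all_top) / total_slots for i in range(num_experts)]
    mean_p = [sum(p[i] for p in all_probs) / n_tokens for i in range(num_experts)]
    return num_experts * sum(f * p for f, p in zip(frac, mean_p))

# Usage: a randomly initialized "pre-trained" MLP stands in for real weights.
random.seed(0)
dim, hidden, num_experts, k = 4, 8, 4, 2
pretrained = (
    [[random.gauss(0, 1) for _ in range(dim)] for _ in range(hidden)],
    [[random.gauss(0, 1) for _ in range(hidden)] for _ in range(dim)],
)
experts = upcycle(pretrained, num_experts)
router = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
x = [random.gauss(0, 1) for _ in range(dim)]
moe_out, probs, top = moe_forward(experts, router, x, k)
dense_out = mlp(pretrained, x)
```

Under these assumptions, the MoE block reproduces the dense MLP's output immediately after upcycling, because every expert starts as an identical copy and the gate weights over the selected experts sum to one. The experts only diverge once instruction tuning updates them separately. Only `k` of the `num_experts` experts run per token, which is why the activated parameter count stays close to that of the dense model.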

Code

shi-labs/cumo official

Tasks

Image Captioning

Instruction Following

visual instruction following

Visual Question Answering

Visual Question Answering (VQA)

Datasets

GQA

Visual Question Answering v2.0

TextVQA

ScienceQA

MMBench

MM-Vet

SEED-Bench

LLaVA-Bench

MathVista

COST

Results from the Paper

Ranked #1 on Visual Question Answering on MMBench (GPT-3.5 score metric)

Methods

MoE
