Search Results for author: Wenke Xia

Found 7 papers, 6 papers with code

Learning Manipulation by Predicting Interaction

1 code implementation · 1 Jun 2024 · Jia Zeng, Qingwen Bu, Bangjun Wang, Wenke Xia, Li Chen, Hao Dong, Haoming Song, Dong Wang, Di Hu, Ping Luo, Heming Cui, Bin Zhao, Xuelong Li, Yu Qiao, Hongyang Li

We propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction (MPI) to enhance the visual representation. Given a pair of keyframes representing the initial and final states, along with a language instruction, the algorithm predicts the transition frame and detects the interaction object.

Representation Learning
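
The two-part objective described in this entry (predicting the transition frame and detecting the interaction object from initial/final keyframes plus an instruction) could be sketched roughly as below. This is not the authors' code: the tiny convolutional encoder, the 56x56 reconstruction target, the box parameterization, and all dimensions are illustrative assumptions.

```python
# Hedged sketch of a two-headed MPI-style pre-training objective:
# a shared encoder consumes the initial and final keyframes plus a language
# embedding; one head reconstructs the transition frame, another regresses
# a bounding box for the interaction object. All sizes are made up.
import torch
import torch.nn as nn


class MPIPretrainSketch(nn.Module):
    def __init__(self, lang_dim=512, hidden=256):
        super().__init__()
        # Tiny conv encoder shared by both keyframes (stand-in for a real backbone).
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fuse = nn.Sequential(nn.Linear(2 * 64 + lang_dim, hidden), nn.ReLU())
        # Head 1: predict a downsampled (3 x 56 x 56) transition frame.
        self.transition_head = nn.Linear(hidden, 3 * 56 * 56)
        # Head 2: detect the interaction object as a normalized box (x, y, w, h).
        self.detection_head = nn.Sequential(nn.Linear(hidden, 4), nn.Sigmoid())

    def forward(self, initial_frame, final_frame, lang_emb):
        h = self.fuse(torch.cat(
            [self.frame_encoder(initial_frame),
             self.frame_encoder(final_frame),
             lang_emb], dim=-1))
        pred_frame = self.transition_head(h).view(-1, 3, 56, 56)
        pred_box = self.detection_head(h)
        return pred_frame, pred_box


# Toy usage: random tensors stand in for keyframes and instruction embeddings.
model = MPIPretrainSketch()
init_f, final_f = torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224)
lang = torch.randn(2, 512)
pred_frame, pred_box = model(init_f, final_f, lang)
loss = nn.functional.mse_loss(pred_frame, torch.randn(2, 3, 56, 56)) \
     + nn.functional.l1_loss(pred_box, torch.rand(2, 4))
loss.backward()
```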

SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation

no code implementations · 30 May 2024 · Junjie Zhang, Chenjia Bai, Haoran He, Wenke Xia, Zhigang Wang, Bin Zhao, Xiu Li, Xuelong Li

In this paper, we propose SAM-E, a novel architecture for robot manipulation by leveraging a vision-foundation model for generalizable scene understanding and sequence imitation for long-term action reasoning.

Instruction Following · Robot Manipulation · +1
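
A rough sketch of the two ingredients named in this entry: a frozen vision backbone for scene understanding and a policy head trained by sequence imitation, i.e. predicting a chunk of future actions rather than a single step. A torchvision ResNet stands in for the actual promptable foundation model, and the action dimension, horizon, and loss are assumptions, not SAM-E's implementation.

```python
# Hedged sketch: frozen vision features + action-sequence behavior cloning.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class SequenceImitationPolicy(nn.Module):
    def __init__(self, action_dim=7, horizon=8, feat_dim=512):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()          # expose 512-d pooled features
        for p in backbone.parameters():      # keep the "foundation" backbone frozen
            p.requires_grad = False
        self.backbone = backbone
        self.horizon = horizon
        self.action_dim = action_dim
        # Predict a whole action chunk (horizon x action_dim) in one shot.
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, horizon * action_dim),
        )

    def forward(self, obs):
        feat = self.backbone(obs)
        return self.head(feat).view(-1, self.horizon, self.action_dim)


# Toy usage: imitate expert action sequences with an L2 behavior-cloning loss.
policy = SequenceImitationPolicy()
obs = torch.randn(4, 3, 224, 224)
expert_actions = torch.randn(4, 8, 7)
loss = nn.functional.mse_loss(policy(obs), expert_actions)
loss.backward()
```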

Robust Cross-Modal Knowledge Distillation for Unconstrained Videos

1 code implementation · 16 Apr 2023 · Wenke Xia, Xingjian Li, Andong Deng, Haoyi Xiong, Dejing Dou, Di Hu

However, such semantic consistency from synchronization is hard to guarantee in unconstrained videos, due to irrelevant modality noise and differentiated semantic correlation.

Action Recognition · Audio Tagging · +3

Balanced Audiovisual Dataset for Imbalance Analysis

1 code implementation · 14 Feb 2023 · Wenke Xia, Xu Zhao, Xincheng Pang, Changqing Zhang, Di Hu

Surprisingly, we find that multimodal models with existing imbalance algorithms consistently perform worse than unimodal ones on specific subsets, in accordance with the modality bias.
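
The per-subset comparison behind this finding can be illustrated with a small evaluation helper: score a multimodal model and a unimodal baseline separately on labeled subsets of the test set (e.g., audio-dominant vs. visual-dominant samples) rather than only reporting overall accuracy. The data-loader interface and subset tags below are hypothetical.

```python
# Hedged sketch of per-subset accuracy reporting; batch layout is assumed.
from collections import defaultdict
import torch


@torch.no_grad()
def accuracy_by_subset(model, loader):
    """Return {subset_name: accuracy}; each sample carries a 'subset' tag."""
    correct, total = defaultdict(int), defaultdict(int)
    for inputs, labels, subsets in loader:
        preds = model(inputs).argmax(dim=-1)
        for p, y, s in zip(preds, labels, subsets):
            correct[s] += int(p == y)
            total[s] += 1
    return {s: correct[s] / total[s] for s in total}


# Usage idea: compare accuracy_by_subset(multimodal_model, test_loader) against
# accuracy_by_subset(audio_only_model, test_loader) to surface the subsets where
# the fused model falls behind its unimodal counterpart.
```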

Revisiting Pre-training in Audio-Visual Learning

1 code implementation · 7 Feb 2023 · Ruoxuan Feng, Wenke Xia, Di Hu

Specifically, we explore the effects of pre-trained models on two audio-visual learning scenarios: cross-modal initialization and multi-modal joint learning.

Audio-Visual Learning
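
As an illustration of the cross-modal initialization scenario mentioned in this entry, a common recipe is to initialize an audio (spectrogram) encoder from image-pretrained weights by treating the spectrogram as a one-channel image. The sketch below follows that recipe with a torchvision ResNet; it is an assumption-laden example, not necessarily the paper's exact setup.

```python
# Hedged sketch of cross-modal initialization: reuse ImageNet-pretrained weights
# for an audio encoder that operates on log-mel spectrograms.
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights


def build_audio_encoder_from_image_pretraining(num_classes=10):
    # Start from an ImageNet-pretrained backbone (vision pre-training).
    net = resnet18(weights=ResNet18_Weights.DEFAULT)
    # Adapt the stem to 1-channel spectrograms by averaging the RGB filters,
    # roughly preserving the pretrained first-layer statistics.
    old_conv = net.conv1
    new_conv = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    with torch.no_grad():
        new_conv.weight.copy_(old_conv.weight.mean(dim=1, keepdim=True))
    net.conv1 = new_conv
    # Replace the classifier head for the downstream audio task.
    net.fc = nn.Linear(net.fc.in_features, num_classes)
    return net


# Toy usage: a batch of 1-channel log-mel spectrograms (128 mel bins x 256 frames).
audio_encoder = build_audio_encoder_from_image_pretraining()
spec = torch.randn(4, 1, 128, 256)
logits = audio_encoder(spec)  # shape: (4, 10)
```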

TikTalk: A Video-Based Dialogue Dataset for Multi-Modal Chitchat in Real World

1 code implementation · 14 Jan 2023 · Hongpeng Lin, Ludan Ruan, Wenke Xia, Peiyu Liu, Jingyuan Wen, Yixin Xu, Di Hu, Ruihua Song, Wayne Xin Zhao, Qin Jin, Zhiwu Lu

Experimental results indicate that models incorporating large language models (LLMs) can generate more diverse responses, while the model utilizing knowledge graphs to introduce external knowledge performs best overall.

Knowledge Graphs
