Event-Based Video Reconstruction Using Transformer
Event cameras, which output events by detecting spatio-temporal brightness changes, bring a novel paradigm to image sensors, offering high dynamic range and low latency. Previous works have achieved impressive performance on event-based video reconstruction by introducing convolutional neural networks (CNNs). However, the intrinsic locality of convolutional operations limits their ability to model long-range dependencies, which are crucial to many vision tasks. In this paper, we present a hybrid CNN-Transformer network for event-based video reconstruction (ET-Net), which combines the fine local information from CNNs with the global context from Transformers. In addition, we propose a Token Pyramid Aggregation strategy that implements multi-scale token integration, relating internal and intersecting semantic concepts in the token space. Experimental results demonstrate that our proposed method achieves superior performance over state-of-the-art methods on multiple real-world event datasets. The code is available at https://github.com/WarranWeng/ET-Net
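The multi-scale token idea can be illustrated with a minimal sketch: pool a feature map at several scales, flatten each pooled map into tokens, and concatenate the sequences so that attention can relate concepts across scales. This is a hypothetical toy illustration in plain Python, not the authors' implementation (which operates on learned CNN features in PyTorch); the function names `avg_pool` and `token_pyramid` are ours.

```python
# Toy sketch of multi-scale token aggregation (illustrative, not the ET-Net code).
# A 2D feature map is average-pooled at several scales; each pooled map is
# flattened into tokens and the sequences are concatenated, giving a single
# token set that mixes fine and coarse spatial context.

def avg_pool(feat, k):
    """Average-pool an HxW map (list of lists) with kernel and stride k."""
    h, w = len(feat), len(feat[0])
    pooled = []
    for i in range(0, h, k):
        row = []
        for j in range(0, w, k):
            block = [feat[i + di][j + dj] for di in range(k) for dj in range(k)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled

def token_pyramid(feat, scales=(1, 2, 4)):
    """Flatten the pooled map at each scale into one concatenated token list."""
    tokens = []
    for k in scales:
        pooled = avg_pool(feat, k)
        tokens.extend(v for row in pooled for v in row)
    return tokens

# A 4x4 map yields 16 + 4 + 1 = 21 tokens across scales 1, 2, 4.
feat = [[float(i * 4 + j) for j in range(4)] for i in range(4)]
print(len(token_pyramid(feat)))  # 21
```

In the actual network, each scale's tokens carry positional embeddings and the concatenated sequence is fed to the Transformer, so attention weights can link a coarse token to the fine tokens it covers.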
Task | Dataset | Model | Metric Name | Metric Value | Global Rank
---|---|---|---|---|---
Event-based Object Segmentation | DDD17-SEG | ETNet | mIoU | 0.34 | # 2
Event-based Object Segmentation | DSEC-SEG | ETNet | mIoU | 0.36 | # 2
Video Reconstruction | Event-Camera Dataset | ET-Net | Mean Squared Error | 0.047 | # 2
Video Reconstruction | Event-Camera Dataset | ET-Net | LPIPS | 0.224 | # 2
Video Reconstruction | MVSEC | ET-Net | Mean Squared Error | 0.107 | # 2
Video Reconstruction | MVSEC | ET-Net | LPIPS | 0.489 | # 2
Event-based Object Segmentation | MVSEC-SEG | ETNet | mIoU | 0.37 | # 2
Event-based Object Segmentation | RGBE-SEG | ETNet | mIoU | 0.35 | # 2