The YOLO model that still excels in document layout analysis
Document layout analysis helps people better understand and use the information in a document. However, the diversity of document layouts and the considerable variation in aspect ratios among document objects pose significant challenges. In this study, we design the Multi-Convolutional Deformable Separation (MCDS) module as the main structure of the network, using the YOLO model as a baseline. Integrating this module into the Backbone and Neck layers significantly enhances image feature extraction. Moreover, we incorporate ParNet-Attention to direct the network's focus toward document objects through parallel subnetworks, facilitating more exhaustive feature extraction. To improve the model's predictions, the Decouple Fusion Head (DFH) is employed in the Head layer; it leverages multi-scale features on top of a decoupled head, thereby enhancing prediction accuracy. Our proposed model achieves strong performance on three public datasets with differing characteristics: ICDAR-POD, PubLayNet, and IIIT-AR-13K. Notably, on ICDAR-POD it attains the best mean Average Precision (mAP) at both IoU$_{0.6}$ and IoU$_{0.8}$, reaching 96.2 and 94.4, respectively.
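To make the decoupled-head idea behind the DFH concrete, here is a minimal NumPy sketch of a generic decoupled detection head, in which classification and box regression use separate branches rather than a shared one. This is an illustration of the general technique only, not the paper's DFH: the actual DFH additionally fuses multi-scale features, and all layer sizes, weights, and names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(x, w, b):
    """A single fully connected layer: x @ w + b."""
    return x @ w + b

def decoupled_head(feat, n_classes=5, n_box=4):
    """Toy decoupled head over pooled features.

    feat: (N, C) array of features for N candidate locations.
    Returns per-location class logits and box offsets, each produced
    by its own branch with its own (here randomly initialised) weights.
    """
    c = feat.shape[1]
    # Separate parameter sets for the two branches -- the defining
    # property of a decoupled head, versus one shared output layer.
    w_cls, b_cls = rng.normal(size=(c, n_classes)), np.zeros(n_classes)
    w_box, b_box = rng.normal(size=(c, n_box)), np.zeros(n_box)
    cls_logits = linear(feat, w_cls, b_cls)  # class scores per location
    box_preds = linear(feat, w_box, b_box)   # box offsets per location
    return cls_logits, box_preds

feat = rng.normal(size=(8, 16))              # 8 locations, 16-dim features
cls_logits, box_preds = decoupled_head(feat)
print(cls_logits.shape, box_preds.shape)     # (8, 5) (8, 4)
```

Decoupling lets each task learn its own representation, which YOLO-family detectors have found improves accuracy over a single coupled output layer.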