Abstract:
Seal text recognition is crucial for modern digital document processing and identity verification. However, seal text detection and recognition in complex real-world scenarios generally suffer from insufficient accuracy. To address this challenge, a deep learning-based seal text detection and recognition scheme is proposed as follows: (1) To enhance the model's feature extraction capability for seal text detection in complex scenarios, a residual-structured multi-scale feature fusion attention module (RES-EMA) is designed for text feature fusion on top of the Differentiable Binarization Network (DBNet) architecture, yielding a novel seal text detection model (RE-DBNet). (2) A new seal text recognition model (CASVTR) is derived from the SVTR model: in the feature extraction stage, the local mixing (LM) module is replaced with a convolutional (Conv) module to construct a Conv+GM encoder, which strengthens the model's ability to extract character features; in the feature decoding stage, a Transformer-based CTC decoder is designed to address the multi-path decoding and feature misalignment problems that arise in the decoding process of traditional models, thereby improving text recognition accuracy. A series of comparative experiments is then conducted on the public ICDAR2023-ReST dataset. Specifically, the seal text detection performance of RE-DBNet is compared with that of VitDet, TPSNet, and TrOCR; the seal text recognition performance of CASVTR is compared with that of SAR, ABINet, SRN, and other models; text detection models integrated with different attention modules (RES-EMA, EMA, SE, and CBAM) are compared; and seal text recognition models with different encoders (Conv+GM and LM+GM) and decoders (CTC, CTC+LSTM, CTC+BiLSTM, and CTC+Transformer) are compared.
The experimental results demonstrate that: (1) The recall of RE-DBNet for seal text detection reaches 98.54%, which is 1.94% higher than that of VitDet, the comparison model with the best recall. The accuracy and average normalized edit distance (ANED) of CASVTR for seal text recognition are 90.32% and 0.9874, respectively, representing improvements of 1.84% and 0.0187 over ParseQ, the best-performing comparison model. (2) Compared with the DBNet-EMA model (which incorporates the EMA module), RE-DBNet achieves increases of 2.7%, 0.03%, and 1.47% in Precision, Recall, and Hmean, respectively. The SVTR-Conv model with the Conv+GM encoder improves accuracy by 0.85% over the original SVTR model. The CASVTR model with the Transformer-based CTC decoder improves Accuracy by 2.43% and ANED by 0.0038 over the SVTR-Conv-BiLSTM model. In conclusion, the proposed deep learning-based scheme effectively addresses the low accuracy of seal text detection in complex scenarios, as well as multi-path interference and feature misalignment in CTC decoding during the recognition stage, and can thus provide reliable technical support for seal text detection and recognition in practical, complex business scenarios.
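
The abstract reports Accuracy, ANED, and Hmean without stating their formulas. The following Python sketch illustrates one common way such metrics are computed. It assumes that ANED follows the convention in which the score is 1 minus the mean of the per-sample edit distance divided by the longer string length (so higher is better, consistent with the reported 0.9874), and that Hmean is the harmonic mean of Precision and Recall. The function names `edit_distance`, `aned`, `accuracy`, and `hmean` are hypothetical and are not taken from the paper; the exact definitions used in the experiments may differ.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def aned(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Assumed definition: 1 - mean(edit_distance / max(len(pred), len(ref)))."""
    if len(predictions) != len(references) or not references:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for pred, ref in zip(predictions, references):
        longest = max(len(pred), len(ref))
        if longest > 0:  # two empty strings contribute zero distance
            total += edit_distance(pred, ref) / longest
    return 1.0 - total / len(references)


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of samples whose predicted text matches the reference exactly."""
    if len(predictions) != len(references) or not references:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)


def hmean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed meaning of Hmean)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, a single wrong character in a long seal string lowers ANED only slightly but counts as a full miss for Accuracy, which is consistent with the ANED values in the abstract being much closer to 1 than the Accuracy values.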