Seal Text Detection and Recognition Based on Deep Learning

张涵; 徐丽格; 胡东浩; 余宝贤; 李百成; 张翊

doi:10.6054/j.jscnun.2025060

Abstract

Abstract: Seal text recognition is important for digital document processing and identity verification, but detection and recognition accuracy remain insufficient in complex real-world scenes. To address this, the paper proposes a deep learning-based scheme for seal text detection and recognition. (1) To strengthen feature extraction for seal text detection in complex scenes, a residual multi-scale feature fusion attention module (RES-EMA) is designed for text feature fusion on top of the Differentiable Binarization Network (DBNet), yielding a new seal text detection model, RE-DBNet. (2) A new seal text recognition model, CASVTR, is derived from SVTR. In the feature extraction stage, the local mixing (LM) module is replaced with a convolutional (Conv) module to build a Conv+GM encoder that strengthens character feature extraction. In the decoding stage, a Transformer-based CTC decoder is designed to address the multi-path decoding and feature misalignment problems of conventional decoders and thereby improve recognition accuracy. Experiments are conducted on the public ICDAR2023-ReST dataset: RE-DBNet is compared with VitDet, TPSNet, and TrOCR on seal text detection; CASVTR is compared with SAR, ABINet, SRN, and other models on seal text recognition; the attention modules RES-EMA, EMA, SE, and CBAM are compared within the text detection model; and recognition models with different encoders (Conv+GM, LM+GM) and decoders (CTC, CTC+LSTM, CTC+BiLSTM, CTC+Transformer) are compared.
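The multi-path property of CTC decoding mentioned above can be illustrated with a minimal sketch. It shows only the standard CTC collapse rule (merge repeated symbols, then drop blanks), not the Transformer-based CTC decoder of CASVTR, whose details are not given here. The blank symbol `-`, the function name `ctc_collapse`, and the example paths are assumptions chosen for illustration.

```python
from itertools import groupby

BLANK = "-"  # hypothetical blank symbol chosen for this illustration


def ctc_collapse(path: str, blank: str = BLANK) -> str:
    """Apply the standard CTC collapse: merge repeated symbols, then drop blanks."""
    merged = (symbol for symbol, _ in groupby(path))
    return "".join(symbol for symbol in merged if symbol != blank)


# Three different frame-level paths (made-up examples) that collapse to one text.
paths = ["SS-EE-AAL", "S-E-A-L--", "-SEEA-LL-"]
decoded = [ctc_collapse(p) for p in paths]
```

Because many frame-level paths collapse to the same string, a CTC decoder has to reason over a set of alignments rather than a single one, which is the multi-path issue the abstract refers to.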
The results show the following. (1) RE-DBNet reaches a seal text detection recall of 98.54%, which is 1.94% higher than VitDet, the model with the best recall among the compared models. CASVTR reaches a seal text recognition accuracy of 90.32% and an average normalized edit distance (ANED) of 0.9874, improvements of 1.84% and 0.0187 over ParseQ, the best-performing compared model. (2) Compared with DBNet-EMA (DBNet with the EMA module), RE-DBNet improves Precision, Recall, and Hmean by 2.7%, 0.03%, and 1.47%, respectively. SVTR-Conv, which uses the Conv+GM encoder, improves Accuracy by 0.85% over the original SVTR. CASVTR, which uses the Transformer-based CTC decoder, improves Accuracy by 2.43% and ANED by 0.0038 over SVTR-Conv-BiLSTM. In conclusion, the proposed scheme effectively addresses low seal text detection accuracy in complex scenes, as well as multi-path interference and feature misalignment in CTC decoding during the recognition stage, and can provide reliable technical support for seal text detection and recognition in practical, complex business scenarios.
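For readers unfamiliar with the reported metrics, the following is a minimal sketch of how Hmean and an edit-distance-based recognition score are commonly computed. It assumes that Hmean is the harmonic mean of Precision and Recall and that ANED is reported as one minus the length-normalized Levenshtein distance, averaged over samples (so higher is better). The exact evaluation protocol is not stated in the abstract, and all function names below are hypothetical.

```python
def hmean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (often called F1 or Hmean)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def average_normalized_similarity(preds: list[str], targets: list[str]) -> float:
    """Assumed ANED form: mean of 1 - distance / max(len(pred), len(target))."""
    scores = []
    for pred, target in zip(preds, targets):
        longest = max(len(pred), len(target))
        if longest == 0:
            scores.append(1.0)  # two empty strings count as a perfect match
        else:
            scores.append(1.0 - edit_distance(pred, target) / longest)
    return sum(scores) / len(scores)
```

Under this assumed definition, a prediction that misses one character of a four-character seal text scores 0.75 for that sample, while an exact match scores 1.0.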
