Abstract:
To enhance the objectivity and accuracy of teaching-quality assessment in special delivery classrooms, a Multimodal Feature Fusion Network (MFFN) is developed for teacher-behavior recognition. Two challenges, the implicit nature of many teaching behaviors and low inter-class discriminability, are addressed by integrating textual, acoustic and visual cues: an Implicit-Feature Aggregation Network (IFANet) extracts latent behavioral evidence from instructional texts; a Multi-dimensional Voice Information Aggregation (MVIA) module strengthens acoustic distinction among similar behaviors; and an improved YOLOv11 network captures fine-grained visual features. A dedicated dataset of teaching behaviors collected from special delivery classrooms is constructed, and comprehensive comparative and ablation experiments are conducted. MFFN surpasses state-of-the-art baselines in precision, recall and F1-score, registering improvements of 4.8%, 2.1% and 2.5% in precision, recall and mAP@0.5, respectively, together with a 24.3% gain in mAP@0.5:0.95 over the standard YOLOv11. The proposed framework provides a solid foundation for subsequent educational applications such as objective teacher-competence evaluation and professional development.