Mandarin lip recognition based on MSAF with multimodal task

CLC number: TN929  Document code: A  Article ID: 1000-582X(2026)04-107-10
doi:10.11835/j.issn.1000-582X.2026.04.010
RONG Yujun, WU Xianhai, CAI Fenglin, YANG Tongxin, LI Penghua
(1. China Mobile (Hangzhou) Information Technology Co., Ltd., Hangzhou 310000, P. R. China; 2. School of Computer Science and Engineering (School of Artificial Intelligence), Chongqing University of Science and Technology, Chongqing 401331, P. R. China; 3. School of Automation, Chongqing University of Posts and Telecommunications, Chongqing 400065, P. R. China)
Abstract: Multimodal lip recognition aims to improve the accuracy and robustness of speech recognition by integrating lip movements with speech information, and it can also help specific user groups communicate. However, existing lip-reading models predominantly focus on English datasets, leaving research on Chinese lip recognition at an early stage. To address the challenges of processing features from different modalities, integrating them, and achieving comprehensive multimodal fusion, we propose a multimodal split attention fusion audio visual recognition (MSAFVR) model. In experiments on the Chinese Mandarin lip reading (CMLR) dataset, MSAFVR achieves 92.95% accuracy in Chinese lip reading, surpassing state-of-the-art Mandarin lip reading models.
Keywords: lip recognition; multimodal task; multimodal split-attention fusion
With the continuous development of artificial intelligence technology, lip recognition has gradually become a popular research field[2].