Automated audio captioning method based on multi-modal representation learning

Tan Liwen1, Zhou Yi1, Liu Yin1, Cao Yin2† (1. School of Communication & Information Engineering, Chongqing University of Posts & Telecommunications, Chongqing 400065, China; 2. Dept. of Intelligent Science, Xi'an Jiaotong-Liverpool University, Suzhou Jiangsu 215000, China)

Abstract: Modality discrepancies have long posed significant challenges for automated audio captioning (AAC) and for multi-modal research in general. Helping models comprehend text information plays a pivotal role in establishing a seamless connection between the text and audio modalities. Recent studies have concentrated on narrowing the disparity between these two modalities via contrastive learning. However, bridging the gap between them merely by employing a simple contrastive loss function is challenging. To reduce the influence of modal differences and enhance the model's utilization of the features of both modalities, this paper proposes SimTLNet, an audio captioning method based on multi-modal representation learning. It introduces a novel representation module, TRANSLATOR, constructs a twin representation structure, and jointly optimizes the model weights through contrastive learning and momentum updates, which enables the model to concurrently learn the common high-dimensional semantic information shared by the audio and text modalities. The proposed method achieves 0.251, 0.782, and 0.480 for METEOR, CIDEr, and SPIDEr-FL on the AudioCaps dataset and 0.187, 0.475, and 0.303 on the Clotho V2 dataset, respectively, which are comparable with state-of-the-art methods and effectively bridge the difference between the two modalities.

Key words: audio captioning; representation learning; contrastive learning; modality discrepancies; twin network
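
The abstract describes jointly optimizing model weights through contrastive learning and momentum updates, without giving the exact formulation. The sketch below illustrates those two generic ingredients only: a symmetric InfoNCE-style contrastive loss over paired audio and text embeddings, and an exponential-moving-average (momentum) update of one branch's weights from the other. It is not SimTLNet itself. The function names, the temperature of 0.07, and the momentum coefficient of 0.99 are all assumptions chosen for illustration, and the TRANSLATOR module is not modeled.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(audio_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    Pair i (audio_embs[i], text_embs[i]) is the positive; every other
    combination in the batch acts as a negative. The temperature value
    is an illustrative assumption, not taken from the paper.
    """
    a = [l2_normalize(v) for v in audio_embs]
    t = [l2_normalize(v) for v in text_embs]
    n = len(a)
    # logits[i][j] = cosine(audio_i, text_j) / temperature
    logits = [
        [sum(x * y for x, y in zip(a[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class of row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    # Audio-to-text uses the rows, text-to-audio uses the columns.
    columns = [list(c) for c in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(columns))


def momentum_update(online, target, m=0.99):
    """EMA update: target <- m * target + (1 - m) * online (m is assumed)."""
    return [m * t_w + (1.0 - m) * o_w for o_w, t_w in zip(online, target)]
```

In this toy setting, correctly paired embeddings yield a loss near zero while mismatched pairs yield a large loss, and the momentum branch moves only a small step toward the online branch on each update.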

0 Introduction

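Automated audio captioning (AAC) is a multi-modal generation task that combines the audio and text modalities to generate a descriptive caption for an audio clip [1].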


Application Research of Computers, 2025, Issue 6