A Video Captioning Model with Reverse-Focus Fine-Grained Multimodal Semantic Alignment

CLC number: TP391  Document code: A  Article ID: 1001-3695(2025)07-009-1986-08

doi: 10.19734/j.issn.1001-3695.2024.11.0492

Abstract: Existing video captioning methods often introduce multimodal information to assist models in extracting critical and fine-grained details from complex and dynamic visual content. However, these methods tend to overlook the semantic gaps caused by representational differences among modalities. To bridge these gaps, facilitate effective cross-modal alignment and efficient fusion, and enhance the extraction of fine-grained semantic information, this paper proposes a reverse-focus fine-grained multimodal semantic alignment model for video captioning (RM4Cap). The model incorporates an image-text pair corpus and semantically aligns video with images, thereby indirectly aligning video representations with the text in the image-text pairs. It also employs a reverse attention focusing algorithm to suppress redundant scene information while highlighting inconspicuous objects and their interactions. Experiments on the MSVD and MSRVTT datasets show that the model significantly outperforms existing methods on metrics such as CIDEr and BLEU-4. It effectively resolves the alignment challenges and redundancy issues in multimodal fusion, further demonstrating its ability to narrow the cross-modal semantic gap.

Key words: video captioning; multimodal; reverse attention; semantic alignment; semantic gap

0 Introduction

Video captioning is a cross-modal task that connects vision and language by describing visual content in natural language.

Application Research of Computers

2025, Issue 07
