A Video Captioning Model with Reverse-Focus Fine-Grained Multimodal Semantic Alignment

CLC number: TP391  Document code: A  Article ID: 1001-3695(2025)07-009-1986-08

doi:10.19734/j.issn.1001-3695.2024.11.0492

Abstract: Existing video captioning methods often introduce multimodal information to help models extract critical, fine-grained details from complex and dynamic visual content. However, these methods tend to overlook the semantic gaps caused by representational differences among modalities. To bridge these gaps, facilitate effective cross-modal alignment and efficient fusion, and enhance the extraction of fine-grained semantic information, this paper proposes a reverse-focus fine-grained multimodal semantic alignment model for video captioning (RM4Cap). The model incorporates an image-text pair corpus and performs semantic alignment between video and image, thereby indirectly aligning video representations with the text in the image-text pairs. It also uses a reverse attention focusing algorithm to suppress redundant scene information while highlighting inconspicuous objects and their interactions. Experiments conducted on the MSVD and MSRVTT datasets show that the model significantly outperforms existing methods on metrics such as CIDEr and BLEU-4. It effectively resolves the alignment challenges and redundancy issues in multimodal fusion, further demonstrating its ability to narrow the cross-modal semantic gap.

Key words: video captioning; multimodal; reverse attention; semantic alignment; semantic gap

0 Introduction

Video captioning is a cross-modal task that connects vision and language by describing visual content in natural language.
