Audio-Visual Segmentation Network with Multi-Dimensional Cross-Attention Fusion


doi:10.19734/j.issn.1001-3695.2024.08.0369

Audio-visual segmentation network with multi-dimensional cross-attention fusion

Li Fanfan, Zhang Yuanyuan, Zhang Yonglong, Zhu Junwu† (School of Information Engineering, Yangzhou University, Yangzhou, Jiangsu 225100, China)

Abstract: Audio-visual segmentation (AVS) aims to locate and accurately segment the sounding objects in images based on both visual and auditory information. While most existing research focuses primarily on exploring methods for audio-visual information fusion, there is insufficient in-depth exploration of fine-grained audio-visual analysis, particularly in aligning continuous audio features with spatial pixel-level information. Therefore, this paper proposed an audio-visual segmentation attention fusion (AVSAF) method based on contrastive learning. Firstly, the method used a multi-head cross-attention mechanism and memory tokens to construct an audio-visual token fusion module, reducing the loss of multi-modal information. Secondly, it introduced contrastive learning to minimize the discrepancy between audio and visual features and enhance their alignment. A dual-layer decoder was then employed to accurately predict and segment the target's position. Finally, it carried out a large number of experiments on the S4 and MS3 sub-datasets of the AVSBench-Object dataset. The J-value is increased by 3.04 and 4.71 percentage points respectively, and the F-value is increased by 2.4 and 3.5 percentage points respectively, which fully proves the effectiveness of the proposed method in audio-visual segmentation tasks.

Key words: audio-visual segmentation; multi-modal; contrastive learning; attention mechanism
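The abstract describes an audio-visual token fusion module built on multi-head cross-attention with memory tokens, together with a contrastive objective that aligns audio and visual features. The PyTorch sketch below illustrates one plausible reading of those two components; the module structure, dimensions, number of memory tokens, and the InfoNCE-style loss are illustrative assumptions rather than the authors' exact design.

```python
# Hedged sketch of (a) cross-attention fusion of visual and audio tokens with
# learnable memory tokens and (b) a symmetric contrastive alignment loss.
# All names and hyperparameters are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioVisualTokenFusion(nn.Module):
    """Visual tokens attend to concatenated [audio tokens; memory tokens]."""

    def __init__(self, dim=256, num_heads=8, num_memory_tokens=16):
        super().__init__()
        # Learnable memory tokens intended to retain multi-modal context.
        self.memory = nn.Parameter(torch.randn(1, num_memory_tokens, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, visual_tokens, audio_tokens):
        # visual_tokens: (B, N_v, C) flattened spatial features; audio_tokens: (B, N_a, C)
        b = visual_tokens.size(0)
        kv = self.norm_kv(torch.cat([audio_tokens, self.memory.expand(b, -1, -1)], dim=1))
        fused, _ = self.cross_attn(self.norm_q(visual_tokens), kv, kv)
        x = visual_tokens + fused          # residual connection
        return x + self.ffn(x)             # fused visual tokens, same shape as input


def audio_visual_contrastive_loss(audio_emb, visual_emb, temperature=0.07):
    """Symmetric InfoNCE: matched audio/visual pairs within a batch are positives."""
    a = F.normalize(audio_emb, dim=-1)     # (B, C) pooled audio embedding
    v = F.normalize(visual_emb, dim=-1)    # (B, C) pooled visual embedding
    logits = a @ v.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    fusion = AudioVisualTokenFusion()
    vis = torch.randn(2, 1024, 256)        # e.g. 32x32 spatial tokens per frame
    aud = torch.randn(2, 5, 256)           # e.g. 5 audio tokens per clip
    fused = fusion(vis, aud)
    loss = audio_visual_contrastive_loss(aud.mean(1), vis.mean(1))
    print(fused.shape, loss.item())
```

In this reading, the memory tokens act as extra key/value slots so that cross-attention can carry context beyond the current audio segment, while the contrastive term pulls paired audio and visual embeddings together before decoding.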

0 Introduction

Human perception is multi-dimensional, encompassing vision, hearing, touch, taste, and smell.
