Progressive Co-Prompting Learning for Vision-Language Models

doi: 10.19734/j.issn.1001-3695.2024.10.0446
ProgCoPL: progressive co-prompting learning for vision-language models
Tao Junjie1, Zhang Weifeng1,2+, Wang Yuxia3, Miao Yi1, Xu Ling1 (Jiaxing, Zhejiang 314000, China)
Abstract: The large-scale pre-trained vision-language model CLIP aligns images and texts in a shared semantic space, demonstrating robust generalization capabilities across diverse downstream tasks. However, existing prompt learning methods often independently insert learnable prompt vectors into each layer of CLIP's visual and text encoders. This approach results in limited cross-modal interaction, with independent prompts across layers failing to effectively guide the encoders in capturing task-relevant information. To address these issues, this paper proposed ProgCoPL. This method introduced text-guided prompt vectors into the visual encoder layers and vision-guided prompt vectors into the text encoder layers, thereby enhancing cross-modal interaction and alignment. Furthermore, ProgCoPL incorporated information transmission channels between prompt vectors across layers, enabling hierarchical and progressive integration of task-specific information. Experiments on 11 datasets show that ProgCoPL efficiently adapts CLIP to downstream tasks, significantly improving its cross-dataset generalization ability. ProgCoPL outperforms existing methods in multiple generalization tests, particularly achieving notable advancements in cross-dataset scenarios.
Key words: multimodal; prompt learning; vision-language model; Transformer encoder
0 Introduction
Large-scale vision-language models (V-L models) have become one of the core technologies in today's cross-modal machine intelligence research.
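To make the mechanism summarized in the abstract concrete, the following is a minimal PyTorch-style sketch of its two ingredients: prompts for each encoder layer that are guided by the other modality, and a transmission channel that carries prompt information forward from layer to layer. All names and dimensions here (CoPromptLayer, t2v_proj, text_carry, depth, etc.) are illustrative assumptions for exposition, not ProgCoPL's actual implementation.

# Hypothetical sketch of cross-modally guided prompts with progressive
# inter-layer transmission; names are illustrative, not the paper's code.
import torch
import torch.nn as nn

class CoPromptLayer(nn.Module):
    """Produces the prompt vectors for one pair of encoder layers, using
    both modalities and the previous layer's prompts."""
    def __init__(self, n_prompts: int, d_text: int, d_vision: int):
        super().__init__()
        # Learnable base prompts for each modality at this layer.
        self.text_prompt = nn.Parameter(torch.randn(n_prompts, d_text) * 0.02)
        self.vis_prompt = nn.Parameter(torch.randn(n_prompts, d_vision) * 0.02)
        # Cross-modal guidance: map text prompts into the visual space and
        # vision prompts into the text space.
        self.t2v_proj = nn.Linear(d_text, d_vision)
        self.v2t_proj = nn.Linear(d_vision, d_text)
        # Inter-layer channel: fuse this layer's prompts with the previous
        # layer's, so task information accumulates progressively with depth.
        self.text_carry = nn.Linear(d_text, d_text)
        self.vis_carry = nn.Linear(d_vision, d_vision)

    def forward(self, prev_text, prev_vis):
        # Text-guided visual prompts and vision-guided text prompts.
        vis_p = self.vis_prompt + self.t2v_proj(self.text_prompt)
        txt_p = self.text_prompt + self.v2t_proj(self.vis_prompt)
        # Progressive transmission from the previous layer, if any.
        if prev_text is not None:
            txt_p = txt_p + self.text_carry(prev_text)
            vis_p = vis_p + self.vis_carry(prev_vis)
        return txt_p, vis_p

# Usage sketch: one co-prompt module per encoder layer, chained in depth.
depth, n_prompts, d_text, d_vision = 12, 4, 512, 768
layers = nn.ModuleList(CoPromptLayer(n_prompts, d_text, d_vision)
                       for _ in range(depth))
txt_p = vis_p = None
for layer in layers:
    txt_p, vis_p = layer(txt_p, vis_p)
    # txt_p / vis_p would be concatenated onto the token sequences of the
    # corresponding CLIP text / visual encoder layers (omitted here).

Under these assumptions, the residual additions let each layer both inject fresh task-specific prompts and refine what earlier layers have already accumulated, which is the progressive behavior the abstract attributes to the inter-layer channels.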