Paper-Cut Image Classification Based on CLIP Text Feature Enhancement


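Keywords: visual language large model; paper-cut classification; few-shot classification; modality fusion; prompt learning
CLC number: TP391  Document code: A  Article ID: 1001-3695(2025)07-010-1994-09  doi: 10.19734/j.issn.1001-3695.2024.11.0485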

Abstract: To address the challenges of large modality gaps between text and image features and insufficient class prototype representation in paper-cut image classification, this paper proposed a CLIP-based text feature enhancement method (CLIP visual text enhancer, C-VTE). The method extracted text features through manual prompt templates, designed a visual-text enhancement module, and employed cross attention and proportional residual connections to fuse image and text features, thereby reducing modality discrepancy and enhancing the expressive ability of category features. Experiments on a paper-cut dataset and four public datasets including Caltech101 validated its effectiveness. For base-class classification on the paper-cut dataset, C-VTE achieved 72.51% average accuracy, outperforming existing methods by 3.14 percentage points. In few-shot classification tasks on public datasets, it attained 84.78% average accuracy with a 2.45 percentage-point improvement. Ablation experiments demonstrate that both the modality fusion module and the proportional residual components contribute significantly to performance improvement. The method offers novel insights for efficient adaptation of vision-language models in downstream classification tasks, and is particularly suited to few-shot learning and base-class dominated scenarios.

Key words: visual language large model; paper-cut classification; few-shot classification; multimodal fusion; prompt learning
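
The fusion step named in the abstract (cross attention followed by a proportional residual connection) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the mixing weight `alpha`, and the choice of text features as queries with image features as keys and values are all assumptions made for illustration, and the learned query/key/value projections of a real attention layer are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention(queries, keys, values):
    """Scaled dot-product attention without learned projections (assumption)."""
    scale = math.sqrt(len(queries[0]))
    width = len(values[0])
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)]
        )
    return attended


def enhance_text_features(text_feats, image_feats, alpha=0.8):
    """Mix each text feature with its image-attended version.

    alpha is a hypothetical residual ratio: alpha = 1 keeps the original
    text feature, alpha = 0 keeps only the attended image information.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    attended = cross_attention(text_feats, image_feats, image_feats)
    return [
        [alpha * t + (1.0 - alpha) * a for t, a in zip(t_row, a_row)]
        for t_row, a_row in zip(text_feats, attended)
    ]


def classify(image_feat, class_feats):
    """Return the index of the class feature with the highest cosine similarity."""
    def cosine(a, b):
        return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

    sims = [cosine(image_feat, c) for c in class_feats]
    return sims.index(max(sims))


# Toy 2-dimensional features: two class text features, two image features.
text_feats = [[1.0, 0.0], [0.0, 1.0]]
image_feats = [[0.9, 0.1], [0.8, 0.2]]
enhanced = enhance_text_features(text_feats, image_feats, alpha=0.8)
predicted_class = classify([0.9, 0.1], enhanced)
```

In this sketch the residual ratio keeps most of the original text feature while pulling it toward the image feature space, which is one plausible reading of how such a connection could narrow the modality gap.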

0 Introduction

In the field of intangible cultural heritage, paper-cuts exist mainly in the form of images, with complex categories and large quantities.
