Confusing sample-driven inter-cluster distribution optimization for short text clustering

CLC number: TP391    Document code: A    Article ID: 1001-3695(2025)10-013-2996-09

doi:10.19734/j.issn.1001-3695.2025.03.0075

Confusing sample-driven inter-cluster optimization method for short text clustering

Enkaer Nuertai 1,2,3, Ma Bo 1,2,3†, Wang Zhen 1,2,3, Aizimaiti Ainiwaer 1,2,3, Tuerhong Wusiman 1,3, Yang Yating 1,2,3

Abstract: Short text clustering aims to partition unlabeled short text instances into different semantic clusters. This task faces challenges from indistinguishable confusing samples and overlapping feature distributions between semantically similar clusters. To tackle these challenges, this paper proposed a confusing sample-driven inter-cluster distribution optimization method for short text clustering. This method firstly sampled high-uncertainty instances based on information entropy as confusing samples, and selected their neighboring cluster samples as candidates. Then, it leveraged large language models for semantic discrimination and formed "confusing-positive-negative" triplets. Meanwhile, this method adopted a data augmentation method based on parameter random perturbation to generate instance-level positives for each sample. Finally, it performed joint optimization of the inter-cluster distribution within a contrastive learning framework. Experimental results on four public short text datasets demonstrate that the proposed method outperforms existing state-of-the-art models, with an average accuracy improvement of 5.14% and an average normalized mutual information increase of 2.51%. Further analysis confirms that the method significantly enhances semantic discrimination of confusing samples between clusters and effectively alleviates feature overlap between semantically similar clusters.

Key words: confusing samples; short text clustering; large language models; contrastive learning
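
The first step described in the abstract, picking high-uncertainty instances by information entropy and collecting candidates from neighboring clusters, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each sample already has a soft cluster-assignment distribution, that "confusing" means the `top_k` highest-entropy samples, and that a sample's neighboring clusters are its most probable clusters. The function names and the toy `assignments` matrix are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one soft cluster-assignment distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_confusing(assignments, top_k):
    """Return indices of the top_k highest-entropy samples, most uncertain first.

    Assumption: a fixed top_k cutoff; the paper may use a different criterion.
    """
    order = sorted(
        range(len(assignments)),
        key=lambda i: entropy(assignments[i]),
        reverse=True,
    )
    return order[:top_k]


def neighbor_clusters(probs, n_neighbors=2):
    """Return the n_neighbors most probable cluster ids for one sample.

    Assumption: "neighboring clusters" are approximated by assignment
    probability; candidates for positive/negative judgment would be drawn
    from these clusters.
    """
    ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
    return ranked[:n_neighbors]


# Toy soft assignments for four samples over three clusters (made-up values).
assignments = [
    [0.98, 0.01, 0.01],  # confident
    [0.40, 0.35, 0.25],  # spread across all clusters
    [0.05, 0.90, 0.05],  # confident
    [0.50, 0.49, 0.01],  # torn between clusters 0 and 1
]

confusing = select_confusing(assignments, top_k=2)
for idx in confusing:
    print(idx, neighbor_clusters(assignments[idx]))
```

In this sketch, the selected samples and the candidates from their neighboring clusters would then be passed to a large language model that decides which candidate is the semantic positive and which is the negative, yielding the "confusing-positive-negative" triplets used in the contrastive objective.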

0 Introduction

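With the acceleration of digitalization, short texts have become one of the main forms of information dissemination and interaction (for example, social media posts, instant messages, news headlines, and product reviews), and their volume and importance have grown markedly.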
