Confusing-sample-driven inter-cluster distribution optimization for short text clustering

CLC number: TP391 Document code: A Article ID: 1001-3695(2025)10-013-2996-09
doi:10.19734/j.issn.1001-3695.2025.03.0075
Confusing sample-driven inter-cluster optimization method for short text clustering
Enkaer Nuertai 1,2,3, Ma Bo 1,2,3†, Wang Zhen 1,2,3, Aizimaiti Ainiwaer 1,2,3, Tuerhong Wusiman 1,3, Yang Yating 1,2,3 (1.Xinjangltefic&strecef,Uin;Uitfd myofSciencesngbfiti
Abstract: Short text clustering aims to partition unlabeled short text instances into different semantic clusters. This task faces challenges from indistinguishable confusing samples and overlapping feature distributions between semantically similar clusters. To tackle these challenges, this paper proposed a confusing sample-driven inter-cluster distribution optimization method for short text clustering. This method first sampled high-uncertainty instances based on information entropy as confusing samples, and selected samples from their neighboring clusters as candidates. Then, it leveraged large language models for semantic discrimination and formed "confusing-positive-negative" triplets. Meanwhile, this method adopted a data augmentation method based on random parameter perturbation to generate instance-level positives for each sample. Finally, it performed joint optimization of the inter-cluster distribution within a contrastive learning framework. Experimental results on four public short text datasets demonstrate that the proposed method outperforms existing state-of-the-art models, with an average accuracy improvement of 5.14% and an average normalized mutual information increase of 2.51%. Further analysis confirms that the method significantly enhances semantic discrimination of confusing samples between clusters and effectively alleviates feature overlap between semantically similar clusters.
Key words: confusing samples; short text clustering; large language models; contrastive learning
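The entropy-based selection step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the soft cluster-assignment input, the top-k selection rule, the reading of "neighboring clusters" as the most probable clusters of an instance, and the names `assignment_entropy`, `select_confusing`, and `neighboring_clusters` are all assumptions made for illustration, and the example probabilities are made up.

```python
import math
from typing import List, Sequence


def assignment_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one soft cluster-assignment distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_confusing(assignments: Sequence[Sequence[float]], top_k: int) -> List[int]:
    """Return indices of the top_k instances with the highest assignment entropy.

    Assumption: "high uncertainty" is approximated by a simple top-k cut;
    the paper may use a threshold or another sampling rule.
    """
    entropies = [assignment_entropy(row) for row in assignments]
    ranked = sorted(range(len(assignments)), key=lambda i: entropies[i], reverse=True)
    return ranked[:top_k]


def neighboring_clusters(probs: Sequence[float], n: int = 2) -> List[int]:
    """Return the n most probable clusters of one instance.

    Assumption: these stand in for the "neighboring clusters" from which
    candidate samples would be drawn.
    """
    return sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)[:n]


# Hypothetical soft assignments over three clusters for four short texts.
assignments = [
    [0.98, 0.01, 0.01],  # confidently in cluster 0
    [0.40, 0.35, 0.25],  # spread over all clusters
    [0.05, 0.90, 0.05],  # confidently in cluster 1
    [0.50, 0.49, 0.01],  # torn between clusters 0 and 1
]
confusing = select_confusing(assignments, top_k=2)
candidates = {i: neighboring_clusters(assignments[i]) for i in confusing}
```

In this toy input the two ambiguous rows are selected, and each is paired with the two clusters it is most likely to belong to. In the method described above, candidates drawn from such clusters would then be passed to a large language model to decide which serve as positives and which as negatives.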
0 Introduction
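With the acceleration of digitalization, short text has become one of the main forms of information dissemination and interaction (for example, social media posts, instant messages, news headlines, and product reviews), and its volume and importance have grown markedly.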