Data pruning method based on consistency committee of small-scale heterogeneous language models

CLC number: TP391   Document code: A   Article ID: 1001-3695(2026)01-013-0110-10

doi: 10.19734/j.issn.1001-3695.2026.05.0139

Data pruning method based on consistency committee of small-scale heterogeneous language models

Wang Kaiwen, Wang Yunzhe†, Tan Wei, Fu Qiming, Lu You, Chen Jianping (School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009, China)

Abstract: Large language model (LLM) fine-tuning performance strongly depends on the quality of training data. Existing single-model perplexity-based data evaluation methods had limitations, including perplexity bias, where low-perplexity samples could still be mispredicted, and cross-model divergence, where different models produced inconsistent perplexity on the same samples. To address these issues, this study proposed a method based on the consistency of a heterogeneous committee of small language models to evaluate data value from two perspectives. The method calculated the coefficient of variation of perplexity across multiple models to measure model divergence, and it computed prediction difficulty by combining the similarity between predicted outputs and reference answers. Based on these evaluations, the algorithm proposed the MMCS (multi-model consistency score) metric to select high-quality training data. Experimental results show that data filtered by MMCS achieves better fine-tuning performance than traditional methods on two mainstream LLMs and three public datasets, obtaining optimal results in 27 out of 36 comparative experiments. This finding provides a new approach for efficient data pruning: the evaluation method based on multi-model divergence proves effective in improving the marginal utility of training data.

Keywords: large language model; data pruning; multi-model committee; perplexity

0 Introduction

With the rapid development of large language models (LLMs), competition around optimizing their performance has become increasingly intense.
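Although the full method section lies beyond this excerpt, the two per-sample signals named in the abstract can be illustrated concretely. The Python sketch below computes the coefficient of variation of perplexity across a committee of small models and a similarity-based prediction difficulty, then combines them into one ranking score. The weighting parameter alpha, the particular similarity function, the linear combination, and the keep-highest-score pruning rule are all illustrative assumptions; they are not the paper's exact MMCS definition.

import math

def perplexity_cv(perplexities: list[float]) -> float:
    """Coefficient of variation of one sample's perplexity across the
    committee models; a large value means the models diverge on this
    sample. Perplexity is >= 1, so the mean is never zero."""
    mean = sum(perplexities) / len(perplexities)
    var = sum((p - mean) ** 2 for p in perplexities) / len(perplexities)
    return math.sqrt(var) / mean

def prediction_difficulty(similarities: list[float]) -> float:
    """Difficulty from the similarity between each committee model's
    predicted output and the reference answer (values in [0, 1]; the
    similarity metric itself -- e.g. ROUGE or embedding cosine -- is an
    assumption). Low average similarity marks a hard sample."""
    return 1.0 - sum(similarities) / len(similarities)

def mmcs_score(perplexities: list[float], similarities: list[float],
               alpha: float = 0.5) -> float:
    """Hypothetical linear combination of the two signals into a single
    per-sample score; the paper's actual MMCS formula may differ."""
    return (alpha * perplexity_cv(perplexities)
            + (1.0 - alpha) * prediction_difficulty(similarities))

def prune(dataset: list, scores: list[float], keep_ratio: float = 0.5) -> list:
    """Keep the keep_ratio fraction of samples with the highest scores.
    Whether high or low scores mark 'high-quality' data is itself an
    assumption here; the abstract only says MMCS drives the selection."""
    k = max(1, int(len(dataset) * keep_ratio))
    ranked = sorted(range(len(dataset)), key=lambda i: scores[i], reverse=True)
    return [dataset[i] for i in ranked[:k]]

# Example: one sample scored under a four-model committee.
ppl = [12.3, 15.8, 11.1, 20.4]   # per-model perplexity on the sample
sim = [0.62, 0.55, 0.70, 0.48]   # per-model output similarity to the reference
print(mmcs_score(ppl, sim))      # single scalar used for ranking

In the full pipeline described by the abstract, several small heterogeneous language models would score every candidate sample in this way, and the target LLM would then be fine-tuned only on the retained subset.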
