Topic inference model based on retrieval augmented generation

CLC number: TP391  Document code: A  Article ID: 1001-3695(2025)10-016-3019-08
doi: 10.19734/j.issn.1001-3695.2025.03.0059
Pan Lihu, Li Jie, Zhao Hongyan† (College of Computer Science & Technology, Taiyuan University of Science & Technology, Taiyuan 030024, China)
Abstract: To address the challenges of insufficient topic diversity and limited model interpretability faced by large language models (LLMs) in topic modeling, this paper proposes a topic inference model based on retrieval augmented generation (TIM_RAG), which achieves topic generation and distribution inference through a three-stage architecture. First, in the RAG retrieval phase, it designs a multi-dimensional document similarity retrieval method that filters similar documents exhibiting both term-frequency relevance and deep semantic associations, thus enriching the topic information of each individual document and enhancing topic diversity. Second, in the RAG generation phase, it implements a multi-perspective topic generation strategy, which uses chain-of-thought prompting to guide the LLM in extracting multi-angle topic terms and generates intermediate reasoning steps to increase process transparency. Finally, in the independent inference phase, it introduces an embedding transport plan based on optimal transport theory to model the document-topic and topic-word semantic relationships, significantly improving model interpretability. Experiments on the WikiText-103, BBC News, and 20 Newsgroups datasets show that TIM_RAG effectively mitigates topic diversity limitations while enhancing topic modeling performance.
Key words: topic model; large language model; retrieval augmented generation; chain-of-thought; optimal transport theory
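The abstract names two computational ingredients without giving formulas here: a retrieval score that combines term-frequency relevance with semantic similarity, and a transport plan derived from optimal transport theory. The following is a minimal sketch of how such components are commonly built, not the authors' implementation. It assumes a linear blend of two cosine similarities (the weight `alpha` is hypothetical) and an entropy-regularized transport plan computed with Sinkhorn iterations (`eps` and `n_iter` are likewise illustrative choices).

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def term_vector(tokens, vocab):
    """Raw term-frequency vector of a tokenized document over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]


def hybrid_similarity(q_tokens, d_tokens, q_emb, d_emb, vocab, alpha=0.5):
    """Blend lexical and embedding similarity.

    Assumption: a simple weighted sum; `alpha` is a hypothetical mixing weight.
    """
    lexical = cosine(term_vector(q_tokens, vocab), term_vector(d_tokens, vocab))
    semantic = cosine(q_emb, d_emb)
    return alpha * lexical + (1.0 - alpha) * semantic


def sinkhorn_plan(cost, row_mass, col_mass, eps=0.1, n_iter=200):
    """Entropy-regularized optimal transport plan via Sinkhorn iterations.

    cost[i][j] is the cost of moving mass from source i (e.g. a document)
    to target j (e.g. a topic); row_mass and col_mass are the marginals.
    """
    n, m = len(cost), len(cost[0])
    # Gibbs kernel: low cost means high affinity.
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


if __name__ == "__main__":
    # Toy data: two documents, two topics, 2-dimensional embeddings.
    doc_embs = [[1.0, 0.0], [0.0, 1.0]]
    topic_embs = [[1.0, 0.0], [0.0, 1.0]]
    cost = [[1.0 - cosine(d, t) for t in topic_embs] for d in doc_embs]
    plan = sinkhorn_plan(cost, [0.5, 0.5], [0.5, 0.5])
    for row in plan:
        print([round(x, 4) for x in row])
```

In this sketch, each row of the returned plan can be read as how a document's mass is distributed over topics, so normalizing a row gives a document-topic distribution; the same routine applied to topic and word embeddings would give topic-word weights. In practice a tested library implementation of Sinkhorn would likely be preferable to this hand-written loop.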
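0 Introduction
With the rapid development of information technology, text data has grown explosively. How to automatically and effectively infer the latent topic structure of a document collection (i.e., topic inference) has become a core challenge in the field of text analysis.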