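An Automated Framework for Generating Large-Model Fine-Tuning Instructions for Polysemous-Word Example Sentence Corpus Generation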

Abstract: First, a manual instruction set containing a body description set and a list of instruction examples is constructed as the initial input for the instruction pool. Then, instructions from the instruction pool are fed into the large model to generate a number of machine-generated instructions together with their corresponding corpora, and the generated corpora are refined with text correction to obtain the desired polysemy example sentence corpus. Finally, an edit distance algorithm is used to deduplicate the machine instructions, and a spectral clustering algorithm is used to cluster the candidate machine instructions, thereby achieving automated generation of machine instructions. By updating the instruction pool, iterative generation of the polysemy example sentence corpus is realized. The results show that the constructed polysemy example sentence dataset and its corresponding large-model machine instruction set exhibit good linguistic diversity and content diversity. The constructed polysemy example sentence dataset meets the needs of second language learners in terms of sentence length, sentiment, standard vocabulary difficulty level, and topics.
Keywords: large language model; instruction generation; polysemy; example sentence generation; ChatGPT
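As an illustration of the edit-distance step used to drop near-duplicate machine instructions, the following is a minimal sketch rather than the authors' implementation. The function names `edit_distance` and `deduplicate`, the length normalization, and the `threshold` value of 0.3 are all assumptions made for this example; the abstract does not state which variant of edit distance or which cutoff is used.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with two rows of dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def deduplicate(instructions, threshold=0.3):
    """Keep an instruction only if it is not too close to one already kept.

    `threshold` is a hypothetical cutoff on the length-normalized distance;
    it is not taken from the paper.
    """
    kept = []
    for cand in instructions:
        is_duplicate = False
        for existing in kept:
            longest = max(len(cand), len(existing)) or 1
            if edit_distance(cand, existing) / longest < threshold:
                is_duplicate = True
                break
        if not is_duplicate:
            kept.append(cand)
    return kept
```

Under these assumptions, the surviving instructions would then be passed on to the clustering step; a lower threshold keeps more near-variants, a higher one prunes more aggressively.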

Chinese is a complex language with abundant polysemy, in which a single character or word carries several distinct meanings.
