“博古问津” (Bogu-Wenjin): A Knowledge-Graph-Enhanced Multimodal Large Model for the Cultural Heritage Domain

CLC number: TP391.4  DOI: 10.16152/j.cnki.xdxbzr.2025-06-006
Bogu-Wenjin: A cultural heritage domain multimodal large model enhanced by knowledge graphs
ZHAO Wanqing¹,², XU Chaoyang¹, XIE Zhiwei¹, ZHANG Shaobo¹,², ZHANG Xiaodan¹,², PENG Jinye¹,² (1. School of Electronic Information [School of Artificial Intelligence], Northwest University, Xi'an 710127, China; 2. Shaanxi Key Laboratory of Higher Education Institution of Generative Artificial Intelligence and Mixed Reality, Xi'an 710127, China)
Abstract: In recent years, large language models (LLMs) and multimodal large models (MLMs) have made significant achievements in natural language processing and multimodal content understanding. However, these general-purpose models have obvious shortcomings on tasks related to cultural heritage: they misinterpret domain-specific terminology, lack the cultural and historical background needed for more than superficial answers, and suffer from knowledge hallucination, so their results often fail to meet practical needs. In response to these challenges, this paper first proposes Bogu-Wenjin, a multimodal large model oriented towards the field of cultural heritage. The study first designs a semi-automated strategy to construct a large-scale multimodal cultural heritage dataset and builds a multimodal knowledge graph from it. Using the constructed dataset, a general-purpose large model is trained in two stages, image-text alignment and instruction fine-tuning, to adapt it to the specific needs of the cultural heritage field. In addition, the knowledge graph is introduced as an auxiliary knowledge base, and the credibility and interpretability of the model on cultural heritage Q&A tasks are effectively improved through graph-text retrieval and relationship retrieval strategies. Experimental results show that Bogu-Wenjin performs well on artifact image description, attribute question answering, and relationship question understanding. Compared with general multimodal large models, it significantly improves the ability to understand and answer questions about complex cultural content, raising the comprehensive score over the second-best model by 21.4%, 53%, and 20.6% on artifact image description, artifact attribute questions, and artifact relationship questions, respectively.
Keywords: multimodal large models; cultural heritage; visual question answering; knowledge graphs; model fine-tuning; knowledge enhancement
With the rapid development of artificial intelligence technology, large language models (LLMs) [1-3] have become a breakthrough in the field of natural language processing.
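The abstract describes using a knowledge graph as an auxiliary knowledge base that is queried before the model answers. The paper's actual retrieval pipeline is not shown in this excerpt, so the following is only a minimal sketch of the general idea under stated assumptions: the knowledge graph is a small list of (head, relation, tail) triples, the artifact shown in the image has already been identified by name (the image-matching step is omitted), and the retrieved facts are simply prepended to the question as text. All names here (`Triple`, `TOY_KG`, `retrieve_attributes`, `retrieve_relations`, `build_prompt`) and all entities are hypothetical placeholders, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One knowledge-graph fact: head entity, relation, tail value."""
    head: str
    relation: str
    tail: str


# Toy knowledge graph. Entities and facts are illustrative placeholders,
# not data from the paper's cultural heritage dataset.
TOY_KG = [
    Triple("Artifact A", "dynasty", "Dynasty X"),
    Triple("Artifact A", "material", "bronze"),
    Triple("Artifact B", "dynasty", "Dynasty X"),
    Triple("Artifact B", "material", "jade"),
]


def retrieve_attributes(kg, entity):
    """Attribute lookup: every triple whose head is the given entity."""
    return [t for t in kg if t.head == entity]


def retrieve_relations(kg, a, b):
    """Relationship lookup: (relation, value) pairs shared by two entities."""
    facts_a = {(t.relation, t.tail) for t in kg if t.head == a}
    return sorted(
        (t.relation, t.tail)
        for t in kg
        if t.head == b and (t.relation, t.tail) in facts_a
    )


def build_prompt(question, facts):
    """Prepend retrieved triples to the question as grounding context."""
    lines = [f"- {t.head} | {t.relation} | {t.tail}" for t in facts]
    return (
        "Known facts:\n"
        + "\n".join(lines)
        + f"\n\nQuestion: {question}\nAnswer using only the facts above."
    )


if __name__ == "__main__":
    facts = retrieve_attributes(TOY_KG, "Artifact A")
    print(build_prompt("What is Artifact A made of?", facts))
    print(retrieve_relations(TOY_KG, "Artifact A", "Artifact B"))
```

Grounding the prompt in explicit triples is one plausible way to obtain the credibility and interpretability benefits the abstract mentions, since each answer can be traced back to the listed facts; the authors' own method may differ in how entities are matched and how retrieved knowledge is fed to the model.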