Named entity recognition method for the tourism domain oriented to user-generated content

DOI:10.16652/j.issn.1004-373x.2026.10.004
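Keywords: named entity recognition; user-generated content; tourism corpus; ERNIE2.0; bidirectional long short-term memory network; multi-head self-attention; adversarial training
CLC number: TN912.3-34; TP391.1
Document code: A
Article ID: 1004-373X(2026)10-0022-07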
Method of domain-specific named entity recognition for tourism-oriented user generated content
Xu Chun 1, Liu Peizhen 1, Yan Rong 2
(1. School of Information Management, Xinjiang University of Finance and Economics, Urumqi 830012, China; 2. School of Tourism, Xinjiang University of Finance and Economics, Urumqi 830012, China)
Abstract: Named entity recognition (NER) in tourism-oriented user-generated content (UGC) faces three challenges in web-based contexts: a large amount of noisy data, ambiguous boundaries of nested entities, and lengthy texts. To address them, an NER model integrating multi-head self-attention and adversarial training is proposed. ERNIE 2.0 is used to encode the tourism corpus and generate semantically enriched dynamic word embeddings. Adversarial training methods such as the fast gradient method (FGM) and projected gradient descent (PGD) are introduced at the embedding layer: small perturbations are injected into the word vectors to generate adversarial examples that simulate the noise characteristics of tourism UGC. A hybrid feature extraction layer combining a bidirectional long short-term memory (BiLSTM) network with a multi-head self-attention mechanism is constructed to capture entity boundary information and long-distance dependencies in the text, and to dynamically adjust the distribution of feature weights. A conditional random field (CRF) decodes the globally optimal label sequence. Experiments were conducted on a self-built tourism dataset and the open-source news dataset CLUENER2020. The results show that the accuracy, recall, and F1-score of the proposed model on both datasets are all higher than those of the baseline model. This indicates that the model maintains high recognition accuracy on datasets from different domains, which supports its generalization and robustness.
Keywords: named entity recognition; user generated content; tourism corpus; ERNIE2.0; bidirectional long short-term memory network; multi-head self-attention; adversarial training
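The embedding-layer perturbation described in the abstract can be illustrated with a minimal sketch. It assumes the commonly used L2-normalized form of FGM, r = ε · g / ‖g‖₂, and a PGD variant that takes a step of size α and projects the accumulated perturbation back onto an L2 ball of radius ε. The function names and the values of `epsilon` and `alpha` are hypothetical; the excerpt does not state the hyperparameters or the exact formulation the authors used.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in vec))


def fgm_perturbation(grad, epsilon=1.0):
    """FGM-style perturbation: epsilon * g / ||g||_2 (assumed form).

    `grad` stands for the loss gradient w.r.t. one embedding vector.
    """
    norm = l2_norm(grad)
    if norm == 0.0:
        # No gradient signal: leave the embedding unperturbed.
        return [0.0 for _ in grad]
    return [epsilon * g / norm for g in grad]


def project_l2(delta, epsilon):
    """Project a perturbation onto the L2 ball of radius epsilon."""
    norm = l2_norm(delta)
    if norm <= epsilon:
        return list(delta)
    return [epsilon * d / norm for d in delta]


def pgd_step(delta, grad, alpha, epsilon):
    """One PGD iteration: normalized step of size alpha, then projection."""
    step = fgm_perturbation(grad, alpha)
    moved = [d + s for d, s in zip(delta, step)]
    return project_l2(moved, epsilon)


if __name__ == "__main__":
    grad = [3.0, 4.0]  # toy gradient, hypothetical values
    print(fgm_perturbation(grad, epsilon=1.0))
    print(pgd_step([0.0, 0.0], grad, alpha=2.0, epsilon=1.0))
```

In a real training loop the perturbation would be added to the embedding weights, a second forward and backward pass would accumulate the adversarial loss gradient, and the embeddings would be restored before the optimizer step.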
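Named entity recognition (NER) [1] is an important subtask of natural language processing (NLP). Its main goal is to automatically identify entities in unstructured text and classify them into predefined categories.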
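To make the task definition concrete, the following sketch turns a predicted tag sequence into typed entity spans. It assumes a BIO tagging scheme, which is common for sequence-labeling NER but is not confirmed by this excerpt; the function name `bio_to_spans`, the label `SCENE`, and the example sentence are hypothetical.

```python
def bio_to_spans(tags):
    """Convert BIO tags into (label, start, end) spans, end exclusive.

    An I- tag that does not continue the current entity closes it and
    is otherwise ignored, since it has no opening B- tag.
    """
    spans = []
    start, label = None, None
    # A trailing "O" sentinel closes an entity that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


if __name__ == "__main__":
    tokens = ["We", "visited", "Tianchi", "Lake", "yesterday"]
    tags = ["O", "O", "B-SCENE", "I-SCENE", "O"]
    for label, start, end in bio_to_spans(tags):
        print(label, " ".join(tokens[start:end]))
```

Span-level output of this kind is also what entity-level precision, recall, and F1 are usually computed over, by comparing predicted spans with gold spans.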