Design and Implementation of an Article Originality Detection System Based on Spark and the SimHash Algorithm

Li Changdong
(East China Normal University, Shanghai 200241, China)
Abstract: Traditional text-based originality detection is built on a platform's submission mechanism, so it cannot check the originality of articles that already exist. This paper proposes an article originality detection system based on Spark and the SimHash algorithm, which uses big data technology to pair non-original articles with their original counterparts and thereby implement a dynamic "read the original" function.
Key words: Spark; originality detection; SimHash
CLC number: TP391    Document code: A
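To make the pairing idea in the abstract concrete, the following is a minimal sketch of SimHash fingerprinting and Hamming-distance comparison, not the paper's implementation. The tokenizer, the use of MD5 as the per-token hash, the equal token weights, and the `max_distance=3` threshold are all assumptions made for illustration. How the system itself segments text, weights terms, or distributes the work on Spark is not shown in this excerpt.

```python
import hashlib
import re

BITS = 64  # fingerprint width; 64 bits is a common choice for SimHash


def tokenize(text):
    # Assumption: simple lowercase word tokens. Chinese text would need
    # a proper word segmenter instead of this regular expression.
    return re.findall(r"\w+", text.lower())


def token_hash(token):
    # Assumption: the first 8 bytes of MD5 serve as a stable 64-bit hash.
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(text):
    # Each token votes +1 or -1 on every bit position (equal weights here).
    votes = [0] * BITS
    for token in tokenize(text):
        h = token_hash(token)
        for i in range(BITS):
            votes[i] += 1 if (h >> i) & 1 else -1
    # A bit of the fingerprint is set where the votes are positive.
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    # Number of bit positions in which two fingerprints differ.
    return bin(a ^ b).count("1")


def is_near_duplicate(text_a, text_b, max_distance=3):
    # Hypothetical helper: the threshold is an assumed tuning parameter.
    return hamming_distance(simhash(text_a), simhash(text_b)) <= max_distance
```

Under these assumptions, a candidate article would be paired with an earlier article whose fingerprint lies within `max_distance` bits. In a Spark job, `simhash` could be applied to each article in a map step, with the comparison done on the resulting fingerprints.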
1 Background
Internet content platforms are where content gathers.