Text-Mining Research Group

The Influence of Text Pre-processing on Plagiarism Detection

This paper explores the influence of text pre-processing techniques on plagiarism detection. We examine stop-word removal, lemmatization, number replacement, synonymy recognition, and word generalization. We also look into the influence of punctuation and of word order within N-grams. All of these techniques are evaluated by their impact on the F1-measure and on execution speed. Our experiments were performed on a Czech corpus of plagiarized documents about politics. At the end of the paper, we propose what we consider to be the best combination of text pre-processing techniques.
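As a rough illustration of what some of these pre-processing steps do, the following Python sketch applies stop-word removal, dictionary-based lemmatization, and number replacement before building word N-grams, with an option to ignore word order inside each N-gram. It is not the authors' implementation: the English stop-word list, the tiny lemma table, the `<num>` placeholder, the function names, and the use of Jaccard overlap as a similarity score are all assumptions made for the example, whereas the paper works with Czech text and its own resources.

```python
import re

# Hypothetical resources for illustration only; the paper uses Czech data.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "is", "are", "was", "were"}
LEMMAS = {"taxes": "tax", "raised": "raise", "raises": "raise", "ministers": "minister"}
NUMBER_TOKEN = "<num>"  # assumed placeholder that replaces every number


def preprocess(text, remove_stop_words=True, lemmatize=True, replace_numbers=True):
    """Lowercase the text, drop punctuation, and apply the selected steps."""
    # Keeping only runs of word characters discards punctuation.
    tokens = re.findall(r"\w+", text.lower())
    result = []
    for token in tokens:
        if replace_numbers and token.isdigit():
            result.append(NUMBER_TOKEN)
            continue
        if remove_stop_words and token in STOP_WORDS:
            continue
        if lemmatize:
            # Table lookup stands in for a real lemmatizer.
            token = LEMMAS.get(token, token)
        result.append(token)
    return result


def ngrams(tokens, n=3, keep_order=True):
    """Return the set of word N-grams; optionally ignore order within each one."""
    grams = set()
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        grams.add(tuple(gram) if keep_order else tuple(sorted(gram)))
    return grams


def jaccard(a, b):
    """Jaccard overlap of two N-gram sets (an assumed similarity measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


source = preprocess("The minister raised taxes in 2008.")
suspect = preprocess("Ministers raises the tax in 2009")
similarity = jaccard(ngrams(source), ngrams(suspect))
```

In this toy case both sentences reduce to the same token sequence after pre-processing, so their trigram sets coincide, while the raw token sequences share no trigram at all. Setting `keep_order=False` additionally treats reordered N-grams as equal, which mirrors the word-order question raised in the abstract.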

Keywords: Plagiarism, Copy Detection, Natural Language Processing, Stop-words, Lemmatization, Synonymy, WordNet, Thesaurus

Year: 2009

Authors of this publication:

Zdeněk Češka

E-mail: zceska@kiv.zcu.cz
WWW: http://www.kiv.zcu.cz/en/department/members/detail.html?login=zceska

Zdeněk has worked for various international companies in the field of software engineering. He holds a Master's degree and a PhD in Computer Science and Engineering. His research interests include mathematics and algorithmization, plagiarism detection, multilingual processing, text classification, and other related fields.

Chris Fox

E-mail: foxcj@essex.ac.uk
WWW: http://dces.essex.ac.uk/staff/foxcj/

Chris is a reader at the School of Computer Science and Electronic Engineering, University of Essex. His research focuses on the philosophy of language and formal semantics.

Related Projects:

Automatic Plagiarism Detection
Authors:	Zdeněk Češka
Desc.:	This project focuses on automatic plagiarism detection in written text. Its main principle is the application of Latent Semantic Analysis in conjunction with word N-grams.