Automatic Plagiarism Detection Based on Latent Semantic Analysis

Plagiarism is a widespread problem that has attracted considerable attention in recent years. The main objective of this PhD thesis is to apply the Latent Semantic Analysis (LSA) framework to plagiarism detection in written text. The issues specific to this field are discussed in detail. To infer the latent semantics of a given text, Singular Value Decomposition (SVD) is employed to carry out the large statistical computations involved, which is why the proposed method is called SVDPlag. To cope with the large number of N-grams extracted from the text, feature selection and, subsequently, random indexing are applied. The thesis also examines how text pre-processing influences the accuracy of plagiarism detection and explores the aspects of a multilingual environment. Approaches in common use are discussed and compared with the proposed method. The experiments used a Czech corpus of 1,500 text documents about politics, created manually by students. The results indicate that the SVDPlag method significantly improves the accuracy of plagiarism detection and outperforms the other methods.
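The general idea of combining word N-grams with LSA can be illustrated with a minimal sketch. This is not the SVDPlag implementation from the thesis: the function names, the example sentences, and the choice of k = 2 are all assumptions made for illustration, and feature selection, random indexing, and text pre-processing are left out. To keep the example free of external dependencies, the truncated SVD is approximated by power iteration on the document Gram matrix; a practical system would more likely call a linear-algebra library's SVD routine.

```python
import math
from collections import Counter


def word_ngrams(text, n=2):
    """Return the word N-grams (as tuples) of a lower-cased text."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_matrix(docs, n=2):
    """Build a document-by-N-gram count matrix, one row per document."""
    counts = [Counter(word_ngrams(d, n)) for d in docs]
    vocab = sorted(set().union(*counts))
    return [[c[g] for g in vocab] for c in counts]


def top_eigenpairs(gram, k, iterations=200):
    """Approximate the k largest eigenpairs of a symmetric matrix by
    power iteration with deflation (a stand-in for a library SVD)."""
    size = len(gram)
    work = [row[:] for row in gram]
    pairs = []
    for _ in range(k):
        # Deterministic, non-uniform start vector (an arbitrary choice).
        vec = [1.0 + i / size for i in range(size)]
        for _ in range(iterations):
            nxt = [sum(work[i][j] * vec[j] for j in range(size))
                   for i in range(size)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm == 0.0:
                break
            vec = [x / norm for x in nxt]
        # Rayleigh quotient gives the eigenvalue estimate.
        value = sum(vec[i] * sum(work[i][j] * vec[j] for j in range(size))
                    for i in range(size))
        pairs.append((value, vec))
        # Deflate: remove the component just found.
        work = [[work[i][j] - value * vec[i] * vec[j] for j in range(size)]
                for i in range(size)]
    return pairs


def latent_vectors(matrix, k=2):
    """Map each document to k latent dimensions (rows of U_k * Sigma_k).

    The eigenvectors of A * A^T are the left singular vectors of A, and
    the square roots of the eigenvalues are its singular values."""
    size = len(matrix)
    gram = [[sum(a * b for a, b in zip(matrix[i], matrix[j]))
             for j in range(size)] for i in range(size)]
    pairs = top_eigenpairs(gram, k)
    return [[math.sqrt(max(val, 0.0)) * vec[i] for val, vec in pairs]
            for i in range(size)]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical example documents: the second is a near copy of the first.
docs = [
    "the government approved the new budget for public schools",
    "the government approved the new budget for public hospitals",
    "heavy rain flooded several villages along the river yesterday",
]
vectors = latent_vectors(build_matrix(docs, n=2), k=2)
print(cosine(vectors[0], vectors[1]))  # near copy: similarity close to 1
print(cosine(vectors[0], vectors[2]))  # unrelated: similarity close to 0
```

In this toy setting the near-copy pair ends up almost parallel in the latent space, while the unrelated document is nearly orthogonal to both. A detector built along these lines would flag pairs whose similarity exceeds some threshold; how that threshold is chosen is outside the scope of this sketch.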

Keywords: Plagiarism, Copy Detection, Comparison, N-grams, Random Indexing, Feature Compression, Singular Value Decomposition, Latent Semantic Analysis, Lemmatization, Thesaurus, WordNet, Multilingual Processing

Year: 2009

Authors of this publication:

Zdeněk Češka


Zdeněk has worked for various international companies in the field of Software Engineering. He holds a Master's degree and a PhD in Computer Science and Engineering. His research interests include Mathematics and Algorithm Design, Plagiarism Detection, Multilingual Processing, Text Classification, and other related fields.

Related Projects:


Automatic Plagiarism Detection

Authors: Zdeněk Češka
Desc.: This project focuses on automatic plagiarism detection in written text. Its main principle is the application of Latent Semantic Analysis in conjunction with word N-grams.