
Automatic Plagiarism Detection Based on Latent Semantic Analysis
Plagiarism is a widespread problem that currently attracts considerable attention. The main objective of this PhD thesis is to apply the Latent Semantic Analysis (LSA) framework to plagiarism detection in written text. This field faces various issues, which are discussed thoroughly. To infer the latent semantics of a given text, Singular Value Decomposition (SVD) is employed to carry out the large statistical computations involved, which is why the proposed method is called SVDPlag. To cope with the large number of N-grams extracted from the text, feature selection and subsequently random indexing are applied. The thesis also examines the influence of text pre-processing on the accuracy of plagiarism detection and explores aspects of a multilingual environment. Various approaches in common use are discussed and compared with the proposed method. A Czech corpus of 1,500 text documents about politics, created manually by students, was used for the experiments. The results indicate that the SVDPlag method significantly improves the accuracy of plagiarism detection and outperforms the other methods.
Keywords: Plagiarism, Copy Detection, Comparison, N-grams, Random Indexing, Feature Compression, Singular Value Decomposition, Latent Semantic Analysis, Lemmatization, Thesaurus, WordNet, Multilingual Processing
Year: 2009
Authors of this publication:

Zdeněk Češka
E-mail: zceska@kiv.zcu.cz
WWW: http://www.kiv.zcu.cz/en/department/members/detail.html?login=zceska
Related Projects:

Automatic Plagiarism Detection
Authors: Zdeněk Češka
Description: This project focuses on the particular field of automatic plagiarism detection in written text. The main principle of this project is the application of Latent Semantic Analysis in conjunction with word N-grams.
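
The combination of word N-grams and Latent Semantic Analysis described above can be illustrated with a minimal sketch. This is not the SVDPlag implementation: the function names `word_ngrams` and `lsa_similarity`, the binary N-gram weighting, the whitespace tokenization, and the choice of `k` are all assumptions made for illustration, and feature selection, random indexing, and lemmatization are omitted.

```python
import numpy as np


def word_ngrams(text, n=2):
    """Return the set of word n-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def lsa_similarity(documents, n=2, k=2):
    """Pairwise cosine similarity of documents in a k-dimensional latent space.

    Hypothetical helper: builds an n-gram-by-document matrix, reduces it
    with a truncated SVD, and compares the reduced document vectors.
    """
    grams = [word_ngrams(d, n) for d in documents]
    vocab = sorted(set().union(*grams))
    index = {g: i for i, g in enumerate(vocab)}

    # Binary n-gram-by-document matrix (assumed weighting scheme).
    A = np.zeros((len(vocab), len(documents)))
    for j, doc_grams in enumerate(grams):
        for g in doc_grams:
            A[index[g], j] = 1.0

    # Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(s))

    # Document coordinates in the latent space, scaled by the singular values.
    docs = (s[:k, None] * Vt[:k]).T

    # Normalize rows so that the dot product equals the cosine similarity.
    norms = np.linalg.norm(docs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty documents
    docs = docs / norms
    return docs @ docs.T


if __name__ == "__main__":
    corpus = [
        "the government approved the new budget yesterday",
        "the government approved the new budget last week",
        "the football team won the final match",
    ]
    print(np.round(lsa_similarity(corpus, n=2, k=2), 3))
```

In this sketch, documents that share most of their bigrams end up pointing in nearly the same direction of the latent space and score close to 1, while documents with no N-grams in common score close to 0. A detector would flag pairs whose similarity exceeds some threshold; the threshold value is not given in the abstract.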