In Czech: Využití techniky náhodného indexování v oblasti detekce plagiátů

Plagiarism is a widespread problem that attracts great interest these days because of the ease with which electronic documents can be copied. This paper extends the application of Latent Semantic Analysis (LSA) to plagiarism detection and proposes new improvements. The main subject of the paper is the application of a feature compression technique to overcome the problem of processing large amounts of data. A further issue discussed is the normalization of document similarity. A Czech corpus of 1,500 text documents about politics was used for the experiments; it included documents that had been manually plagiarized by students. The results indicate that the proposed compression technique substantially reduces execution time. Moreover, they show that the newly proposed document similarity normalization formula increases the accuracy of plagiarism detection.
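To give a rough idea of what feature compression by Random Indexing can look like, here is a minimal sketch in Python. It is a generic illustration and not the paper's implementation: the function names (`word_ngrams`, `index_vector`, `compress`, `cosine`), the parameters `dim` and `nonzero`, and the use of plain cosine similarity are all assumptions made for this example. The paper's own compression settings and its similarity normalization formula are not reproduced here.

```python
import math
import random
from collections import Counter


def word_ngrams(text, n=2):
    """Return the word n-grams (as tuples) of a lowercased text."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def index_vector(feature, dim=64, nonzero=4, seed=0):
    """Sparse ternary random index vector for a feature: {position: +1 or -1}.

    Seeding the generator with the feature itself makes the vector
    reproducible without storing a table of all features (an assumption
    of this sketch, chosen for simplicity).
    """
    rng = random.Random(f"{seed}:{feature}")
    positions = rng.sample(range(dim), nonzero)
    return {p: rng.choice((-1, 1)) for p in positions}


def compress(text, dim=64, n=2):
    """Project a document's n-gram counts into a dim-dimensional vector."""
    vec = [0.0] * dim
    for gram, count in Counter(word_ngrams(text, n)).items():
        for pos, sign in index_vector(gram, dim).items():
            vec[pos] += sign * count
    return vec


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


if __name__ == "__main__":
    original = "the government approved the new budget after a long debate"
    suspect = "the government approved the new budget after a short debate"
    print(cosine(compress(original), compress(suspect)))
```

The point of the sketch is that every document ends up as a vector of fixed length `dim`, however many distinct n-grams the corpus contains, so a later step such as LSA has far fewer features to process. The price is some random noise in the similarities, which shrinks as `dim` grows.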

Keywords: Plagiarism, Copy Detection, Comparison, Random Indexing, Feature Compression, Latent Semantic Analysis, Singular Value Decomposition

Year: 2009

Download: Full text [237 kB]

Authors of this publication:

Zdeněk Češka


Zdeněk has worked for various international companies in the field of Software Engineering. He holds a Master's degree and a PhD in Computer Science and Engineering. His research interests include Mathematics and Algorithms, Plagiarism Detection, Multilingual Processing, Text Classification, and other related fields.

Related Projects:


Automatic Plagiarism Detection

Authors: Zdeněk Češka
Desc.: This project focuses on automatic plagiarism detection in written text. Its main principle is the application of Latent Semantic Analysis in conjunction with word N-grams.