Text-Mining Research Group

Automatic Plagiarism Detection
Keywords:	Plagiarism, Copy Detection, Paraphrasing, N-grams, WordNet, Text-preprocessing, Multilingual Processing, Latent Semantic Analysis, Singular Value Decomposition
Description:	This project addresses automatic plagiarism detection in written text. Overlapping parts of documents are identified through common phrases, which are represented as word N-grams. We employ Latent Semantic Analysis as a mathematical framework to infer associations among the N-grams contained in the examined documents. The project also deals with text pre-processing, multilingual processing, and feature selection.
Status:	Finished
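
As a rough illustration of the N-gram idea in the description, the sketch below extracts word N-grams from two texts and reports the phrases they share. It is a simplified example, not the project's implementation: the function names, the whitespace tokenisation, and the overlap ratio (shared N-grams divided by the size of the smaller set) are all assumptions made for this example.

```python
def word_ngrams(text, n=3):
    """Return the set of word N-grams in a text (naive whitespace tokenisation)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_ngrams(doc_a, doc_b, n=3):
    """N-grams that occur in both documents, i.e. candidate overlapping phrases."""
    return word_ngrams(doc_a, n) & word_ngrams(doc_b, n)


def overlap_ratio(doc_a, doc_b, n=3):
    """Shared N-grams relative to the smaller document (hypothetical measure)."""
    grams_a, grams_b = word_ngrams(doc_a, n), word_ngrams(doc_b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / min(len(grams_a), len(grams_b))


original = "the quick brown fox jumps over the lazy dog"
suspect = "a quick brown fox jumps over a sleeping cat"

for gram in sorted(shared_ngrams(original, suspect)):
    print(" ".join(gram))
print(overlap_ratio(original, suspect))
```

Exact matching of this kind only finds verbatim overlap; the Latent Semantic Analysis step mentioned in the description is what the project uses to infer associations among N-grams beyond exact matches.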

People on this project:

Zdeněk Češka

E-mail: zceska@kiv.zcu.cz
WWW: http://www.kiv.zcu.cz/en/department/members/detail.html?login=zceska

Zdeněk has worked for various international companies in the field of Software Engineering. He holds a Master's degree and a PhD in Computer Science and Engineering. His research interests include Mathematics and Algorithms, Plagiarism Detection, Multilingual Processing, Text Classification, and related fields.

Publications:

Automatic Plagiarism Detection Based on Latent Semantic Analysis
Authors:	Zdeněk Češka
Source:	VDM Verlag Dr. Müller, Saarbrücken, Germany, August 2010. ISBN 978-3-639-28207-8.
Download:	Full text

Automatic Plagiarism Detection Based on Latent Semantic Analysis
Authors:	Zdeněk Češka
Source:	PhD Thesis - Faculty of Applied Sciences, University of West Bohemia, Pilsen, Czech Republic, August 2009.

In Czech: Porovnání technik předzpracování textu pro detekci plagiátů (Comparison of Text Pre-processing Techniques for Plagiarism Detection)
Authors:	Zdeněk Češka
Source:	Proceedings of the 8th Annual Conference ZNALOSTI 2009, Brno, Czech Republic, pp. 293-296, February 2009. ISBN 978-80-227-3015-0.
Download:	Full text [289 kB]

In Czech: Využití techniky náhodného indexování v oblasti detekce plagiátů (Using the Random Indexing Technique in Plagiarism Detection)
Authors:	Zdeněk Češka
Source:	Proceedings of the ITAT 2009, Information Technologies - Applications and Theory, Kralova studna, Slovakia, pp. 23-26, September 2009. ISBN 978-80-970179-1-0.
Download:	Full text [237 kB]

The Influence of Text Pre-processing on Plagiarism Detection
Authors:	Zdeněk Češka, Chris Fox
Source:	Proceedings of the 7th International Conference on Recent Advances in Natural Language Processing (RANLP 2009), Borovets, Bulgaria, pp. 55-59, September 2009. ISSN 1313-8502.
Download:	Full text [258 kB]

Free-Text Plagiarism Detection Based on Latent Semantic Analysis
Authors:	Zdeněk Češka
Source:	Technical Report No. DCSE/TR-2008-01, Pilsen, Czech Republic, April 2008.

In Czech: Extrakce N-gramů z rozsáhlých textů (N-gram Extraction from Large Texts)
Authors:	Zdeněk Češka, Ivo Hanák, Roman Tesař
Source:	Proceedings of the 7th Annual Conference ZNALOSTI 2008, Bratislava, Slovakia, pp. 54-65, February 2008. ISBN 978-80-227-2827-0.
Download:	Full text [568 kB]

In Czech: Využití moderních přístupů pro detekci plagiátů (Using Modern Approaches for Plagiarism Detection)
Authors:	Zdeněk Češka
Source:	Proceedings of the ITAT 2008, Information Technologies - Applications and Theory, Hrebienok, Slovakia, pp. 23-26, September 2008. ISBN 978-80-969184-8-5.
Download:	Full text [626 kB]

Multilingual Plagiarism Detection
Authors:	Zdeněk Češka, Michal Toman, Karel Ježek
Source:	Artificial Intelligence: Methodology, Systems, and Applications, LNCS/LNAI 5253, pp. 83-92, Springer-Verlag Berlin Heidelberg, the 13th International Conference on Artificial Intelligence: Methodology, Systems, Applications (AIMSA 2008), Varna, Bulgaria, September 2008. ISSN 0302-9743. ISBN 978-3-540-85775-4.
Download:	Full text

Plagiarism Detection based on Singular Value Decomposition
Authors:	Zdeněk Češka
Source:	Advances in Natural Language Processing, LNCS/LNAI 5221, pp. 108-119, Springer-Verlag Berlin Heidelberg, the 6th International Conference on Natural Language Processing (GoTAL 2008), Gothenburg, Sweden, August 2008. ISSN 0302-9743. ISBN 978-3-540-85286-5.
Download:	Full text

In Czech: Využití N-gramů pro odhalování plagiátů (Using N-grams for Plagiarism Detection)
Authors:	Zdeněk Češka
Source:	Proceedings of the ITAT 2007, Information Technologies - Applications and Theory, Polana, Slovakia, pp. 63-66, September 2007. ISBN 978-80-969184-6-1.
Download:	Full text [212 kB]

Teraman: A Tool for N-gram Extraction from Large Datasets
Authors:	Zdeněk Češka, Ivo Hanák, Roman Tesař
Source:	Proceedings of the IEEE 3rd International Conference on Intelligent Computer Communication and Processing (IEEE ICCP 2007), Cluj-Napoca, Romania, pp. 209-216, September 2007. ISBN 978-1-4244-1491-8.
Download:	Full text

The Future of Copy Detection Techniques
Authors:	Zdeněk Češka
Source:	Proceedings of the 1st Young Researchers Conference on Applied Sciences (YRCAS 2007), Pilsen, Czech Republic, pp. 5-10, November 2007. ISBN 978-80-7043-574-8.
Download:	Full text [374 kB]

Related Downloads:

SVDPlag v1.0
Size:	2 kB
Desc:	This tool identifies cases of plagiarism in written text. It employs an advanced technique based on the Latent Semantic Analysis (LSA) framework, which involves large-scale statistical computation. For that purpose, Singular Value Decomposition (SVD) is used to infer associations among the common N-grams contained in the examined documents. The tool also supports various text pre-processing techniques. The library was developed in C# for the .NET Framework 3.5, which is required to run it, along with a 64-bit operating system. The supported architecture is x86-64. The tool depends on the 64-bit Extreme Optimization Numerical Libraries for .NET, version 3.5. Older or 32-bit versions of these libraries are not supported.
Related:	Automatic Plagiarism Detection
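
To give a feel for how SVD can relate documents through their N-grams, here is a small Python sketch. It is not SVDPlag's code (the tool is written in C# on top of a numerical library). The function names, the binary N-gram-by-document matrix, and the choice of k are assumptions for this example. Instead of calling an SVD routine, the sketch obtains the right singular vectors from the eigenvectors of the Gram matrix A^T A, using plain power iteration with deflation so that it runs with the standard library only.

```python
import math
import random


def word_ngrams(text, n=2):
    """Return the set of word N-grams in a text (naive whitespace tokenisation)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def top_eigenpairs(matrix, k, iters=300, seed=0):
    """Top-k eigenpairs of a symmetric PSD matrix via power iteration with deflation."""
    rng = random.Random(seed)
    size = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(size)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(work[i][j] * v[j] for j in range(size)) for i in range(size)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break  # the deflated matrix is numerically zero
            v = [x / norm for x in w]
        w = [sum(work[i][j] * v[j] for j in range(size)) for i in range(size)]
        eigenvalue = sum(v[i] * w[i] for i in range(size))  # Rayleigh quotient
        pairs.append((eigenvalue, v))
        for i in range(size):  # deflate: remove the component just found
            for j in range(size):
                work[i][j] -= eigenvalue * v[i] * v[j]
    return pairs


def lsa_document_vectors(documents, n=2, k=2):
    """Rank-k latent coordinates of each document (k must not exceed len(documents))."""
    ngram_sets = [word_ngrams(doc, n) for doc in documents]
    # For a binary N-gram-by-document matrix A, entry (i, j) of A^T A is the
    # number of N-grams that documents i and j have in common.
    gram = [[float(len(a & b)) for b in ngram_sets] for a in ngram_sets]
    pairs = top_eigenpairs(gram, k)
    # Singular values are the square roots of the eigenvalues of A^T A.
    return [[math.sqrt(max(val, 0.0)) * vec[d] for val, vec in pairs]
            for d in range(len(documents))]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [
    "the quick brown fox jumps over the lazy dog",
    "a quick brown fox jumps over a sleeping cat",
    "singular value decomposition factors a matrix",
]
vectors = lsa_document_vectors(docs, n=2, k=2)
print(cosine(vectors[0], vectors[1]))  # documents sharing several bigrams
print(cosine(vectors[0], vectors[2]))  # documents sharing none
```

In the reduced space, the first two documents, which share several bigrams, end up far more similar to each other than either is to the third. Power iteration is used here only to keep the example dependency-free; for a realistic corpus a dedicated SVD routine would be the practical choice.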

University of West Bohemia
