In Czech: Extrakce N-gramů z rozsáhlých textů

In this paper, we present an algorithm for N-gram extraction from large datasets. To examine the overall time and memory complexity of our algorithm, we used the "Web 1T 5-gram Version 1" corpus released by Google. The experiments indicate that our approach compares favorably with other available solutions in both speed and the amount of data processed.

Keywords: N-gram Extraction, Large Data Processing, Batch Processing

Year: 2008

Download: Full text [568 kB]

Authors of this publication:


Zdeněk Češka


E-mail: zceska@kiv.zcu.cz
WWW: http://www.kiv.zcu.cz/en/department/members/detail.html?login=zceska

Zdeněk has worked for various international companies in the field of Software Engineering. He holds a Master's degree and a PhD in Computer Science and Engineering. His research interests include Mathematics & Algorithmization, Plagiarism Detection, Multilingual Processing, Text Classification, and other related fields.

Ivo Hanák


E-mail: hanak@kiv.zcu.cz
WWW: http://herakles.zcu.cz/~hanak/

Ivo graduated from UWB in 2003, specializing in computer graphics. Currently, he is a PhD student interested in computer graphics and digital holography.

Roman Tesař


Phone: +420 377632479
E-mail: roman.tesar@gmail.com
WWW: http://www.sweb.cz/romant1/CV.pdf

Roman is a PhD student at the Department of Computer Science and Engineering, Faculty of Applied Sciences, University of West Bohemia in Pilsen, Czech Republic. His work is focused on the utilization of word n-grams in text classification and document filtering.

Related Projects:


Project

Document Classification

Authors:  Jiří Hynek, Karel Ježek, Michal Toman, Roman Tesař, Zdeněk Češka, Petr Grolmus
Desc.: Use of inductive machine learning methods in classification of short text documents.

Project

Automatic Plagiarism Detection

Authors:  Zdeněk Češka
Desc.: This project focuses on automatic plagiarism detection in written text. Its main principle is the application of Latent Semantic Analysis in conjunction with word N-grams.