
Teraman: A Tool for N-gram Extraction from Large Datasets
In natural language processing (NLP), text documents are usually represented by single words. Recent studies have shown that this approach can often be improved by employing other, more sophisticated features. Among these, N-grams in particular have been used successfully, and many algorithms and procedures for their extraction have been proposed. However, most of them are not primarily intended for processing large volumes of data, which has become a critical task. In this paper we present an algorithm for N-gram extraction from huge datasets. Our experiments indicate that, compared with other available solutions, our approach achieves outstanding results in terms of both speed and the amount of data processed.
Keywords: N-gram Extraction, Large Data Processing, Batch Processing
Year: 2007
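
For readers unfamiliar with the feature type discussed in the abstract, the following is a minimal, illustrative sketch of word N-gram counting over a stream of text lines. It is not the Teraman algorithm presented in the paper; the function names and the naive whitespace tokenization are assumptions chosen only to show what "N-gram extraction" refers to.

```python
# Illustrative word N-gram counting (not the Teraman algorithm from the paper).
# Lines are processed one at a time so that a large corpus need not fit in memory.

from collections import Counter
from typing import Iterable, Iterator, Tuple


def ngrams(tokens: list, n: int) -> Iterator[Tuple[str, ...]]:
    """Yield all contiguous word N-grams of length n from a token list."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def count_ngrams(lines: Iterable[str], n: int) -> Counter:
    """Accumulate N-gram frequencies over an iterable of text lines."""
    counts: Counter = Counter()
    for line in lines:
        tokens = line.lower().split()  # naive whitespace tokenization (assumption)
        counts.update(ngrams(tokens, n))
    return counts


if __name__ == "__main__":
    sample = [
        "natural language processing uses n-grams",
        "n-grams extend single word features",
    ]
    for gram, freq in count_ngrams(sample, 2).most_common(3):
        print(" ".join(gram), freq)
```

A real large-scale extractor such as the one described in the paper would additionally deal with batching, disk-based merging, and frequency pruning; this sketch only shows the core notion of sliding a window of n words over tokenized text.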

Authors of this publication:

Zdeněk Češka
E-mail: zceska@kiv.zcu.cz
WWW: http://www.kiv.zcu.cz/en/department/members/detail.html?login=zceska

Ivo Hanák
E-mail: hanak@kiv.zcu.cz
WWW: http://herakles.zcu.cz/~hanak/

Roman Tesař
Phone: +420 377632479
E-mail: roman.tesar@gmail.com
WWW: http://www.sweb.cz/romant1/CV.pdf
Related Projects:

Document Classification
Authors: Jiří Hynek, Karel Ježek, Michal Toman, Roman Tesař, Zdeněk Češka, Petr Grolmus
Desc.: Use of inductive machine learning methods in classification of short text documents.

Automatic Plagiarism Detection
Authors: Zdeněk Češka
Desc.: This project focuses on automatic plagiarism detection in written text. Its main principle is the application of Latent Semantic Analysis in conjunction with word N-grams.