Text-Mining Research Group

In Czech: Vliv normalizace slov na klasifikaci textů

On the Impact of Morphological Normalization on Text Categorization

In this paper we compare various lemmatization and stemming algorithms that are often used in natural language processing (NLP). We present a lemmatization algorithm that utilizes the multilingual thesaurus EuroWordNet (EWN). We describe the algorithm in detail and compare it with other widely used word normalization algorithms on two different corpora. We also discuss the influence of word normalization on the classification task in general.

Keywords: word normalisation, lemmatisation, stemming, classification

Year: 2007

Authors of this publication:

Michal Toman

E-mail: mtoman@kiv.zcu.cz

Michal graduated from UWB in 2003, specializing in software engineering. He is currently a PhD student interested in information retrieval, multilingual text processing, word sense disambiguation and knowledge discovery.

Roman Tesař

Phone: +420 377632479
E-mail: roman.tesar@gmail.com
WWW: http://www.sweb.cz/romant1/CV.pdf

Roman is a PhD student at the Department of Computer Science and Engineering, Faculty of Applied Sciences, University of West Bohemia in Pilsen, Czech Republic. His work is focused on the utilization of word n-grams in text classification and document filtering.

Karel Ježek

Phone: +420 377632475
E-mail: jezek_ka@kiv.zcu.cz
WWW: https://cs.wikipedia.org/wiki/Karel_Je%C5%BEek_(informatik)

Karel is the former group coordinator and a supervisor of PhD students working on the research projects of this group.

Related Projects:

Document Classification
Authors:	Jiří Hynek, Karel Ježek, Michal Toman, Roman Tesař, Zdeněk Češka, Petr Grolmus
Desc.:	Use of inductive machine learning methods in the classification of short text documents.