Influence of Word Normalization on Text Classification

In this paper we compare various lemmatization and stemming algorithms that are often used in natural language processing (NLP). We describe the algorithm in detail and compare it with other widely used word normalization algorithms on two different corpora. We present promising results obtained by our EWN-based lemmatization approach in comparison to other techniques. We also discuss the influence of word normalization on the classification task in general.
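
To make the distinction between the two families of word normalization concrete, the following is a minimal sketch in Python. It is not the EWN-based algorithm from the paper: the suffix list, the lookup table, and the function names `toy_stem`, `toy_lemmatize`, and `normalize` are all hypothetical and chosen only for illustration. A stemmer trims suffixes by rule, while a lemmatizer maps a word form to its dictionary form.

```python
# Toy illustration only: neither function is the paper's EWN-based method.

# Hypothetical suffix list for the rule-based stemmer, checked in order.
SUFFIXES = ("ing", "ed", "es", "s")

# Hypothetical lookup table standing in for a real lemma dictionary.
LEMMA_TABLE = {
    "went": "go",
    "mice": "mouse",
    "studies": "study",
    "running": "run",
}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str) -> str:
    """Return the dictionary form if known, otherwise the lowercased word."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)


def normalize(text: str, normalizer) -> list[str]:
    """Split on whitespace and apply the chosen normalizer to each token."""
    return [normalizer(token) for token in text.split()]
```

With these assumptions, `normalize("Mice went running", toy_stem)` leaves the irregular forms untouched and yields a truncated stem for "running", whereas `normalize("Mice went running", toy_lemmatize)` maps all three tokens to dictionary forms. Differences of this kind in the resulting feature vocabulary are what a comparison of normalization methods for classification measures.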

Keywords: lemmatization, classification, EuroWordNet, stemming, word normalization

Year: 2006

Download: Full text [167 kB]

Authors of this publication:

Michal Toman


Michal graduated from UWB in 2003, specializing in software engineering. He is currently a PhD student interested in information retrieval, multilingual text processing, word sense disambiguation and knowledge discovery.

Roman Tesař

Phone: +420 377632479

Roman is a PhD student at the Department of Computer Science and Engineering, Faculty of Applied Sciences, University of West Bohemia in Pilsen, Czech Republic. His work is focused on the utilization of word n-grams in text classification and document filtering.

Karel Ježek

Phone: +420 377632475

Karel is the former group coordinator and a supervisor of PhD students working on the research projects of this group.

Related Projects:


Document Classification

Authors: Jiří Hynek, Karel Ježek, Michal Toman, Roman Tesař, Zdeněk Češka, Petr Grolmus
Desc.: Use of inductive machine learning methods in the classification of short text documents.