Text-Mining Research Group

In Czech: Klasifikace Suffix Tree frázemi - srovnání s metodou Itemsets

Classification based on Suffix Tree Phrases in Comparison with the Itemsets method

In this paper we present a text classification method based on Suffix Tree (ST) phrases. We describe how to obtain ST-phrases from the training corpora, how to evaluate them, and how to use them for text classification. The advantages and disadvantages of this approach are discussed and compared with the Itemsets method, on which the Suffix Tree classification is based. We also explain how the threshold for multiclass classification is determined. We examine the influence of document length on classification effectiveness and compare the impact of higher-order itemsets and ST-phrases in both methods. Finally, the results are compared with those of other popular text classification methods.
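To make the general idea of phrase-based classification concrete, the following is a minimal Python sketch. It is not the algorithm from the paper: word n-grams up to a fixed length stand in for the phrases a word-level suffix tree would expose, and the weighting (phrase length times the share of class documents containing the phrase), the length normalization, and the relative threshold are all illustrative assumptions. The names `phrases`, `train`, `score`, and `classify` are hypothetical.

```python
from collections import Counter, defaultdict


def phrases(text, max_len=3):
    """Return the set of word phrases of 1..max_len consecutive words.

    Assumption: these n-grams approximate the paths of a word-level
    suffix tree cut off at depth max_len.
    """
    words = text.lower().split()
    return {
        tuple(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    }


def train(corpus, max_len=3):
    """Count, per class, in how many training documents each phrase occurs.

    corpus is a list of (text, label) pairs.
    """
    doc_freq = defaultdict(Counter)
    class_docs = Counter()
    for text, label in corpus:
        class_docs[label] += 1
        doc_freq[label].update(phrases(text, max_len))
    return doc_freq, class_docs


def score(text, doc_freq, class_docs, max_len=3):
    """Score a document against every class.

    Assumed weighting: each matching phrase contributes its length times
    the share of class documents that contain it; the sum is divided by
    the number of phrases in the document to dampen length effects.
    """
    doc_phrases = phrases(text, max_len)
    if not doc_phrases:
        return {label: 0.0 for label in class_docs}
    scores = {}
    for label, n_docs in class_docs.items():
        total = sum(
            len(p) * doc_freq[label][p] / n_docs for p in doc_phrases
        )
        scores[label] = total / len(doc_phrases)
    return scores


def classify(text, doc_freq, class_docs, rel_threshold=0.5, max_len=3):
    """Return every class scoring at least rel_threshold times the best score.

    The relative threshold is an assumption made for this sketch; it lets
    a document receive more than one class.
    """
    scores = score(text, doc_freq, class_docs, max_len)
    best = max(scores.values(), default=0.0)
    if best == 0.0:
        return []
    return sorted(c for c, s in scores.items() if s >= rel_threshold * best)
```

Longer phrases get more weight here so that a matching three-word phrase counts for more than three unrelated single-word matches, which is one simple way to let higher-order phrases influence the result.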

Keywords: text classification, document collection, itemsets, Suffix Tree, document evaluation, threshold determination

Year: 2005

Download:

Full text [301 kB]

Authors of this publication:

Roman Tesař

Phone: +420 377632479
E-mail: roman.tesar@gmail.com
WWW: http://www.sweb.cz/romant1/CV.pdf

Roman is a PhD student at the Department of Computer Science and Engineering, Faculty of Applied Sciences, University of West Bohemia in Pilsen, Czech Republic. His work is focused on the utilization of word n-grams in text classification and document filtering.

Karel Ježek

Phone: +420 377632475
E-mail: jezek_ka@kiv.zcu.cz
WWW: https://cs.wikipedia.org/wiki/Karel_Je%C5%BEek_(informatik)

Karel is the former group coordinator and a supervisor of PhD students working on the research projects of this group.

Related Projects:

Internet Content Filtering
Authors:	Roman Tesař, Karel Ježek
Desc.:	This project covers processing and analyzing Web sites, classifying them by their content, and searching for other Web sites with similar content.

Document Classification
Authors:	Jiří Hynek, Karel Ježek, Michal Toman, Roman Tesař, Zdeněk Češka, Petr Grolmus
Desc.:	Use of inductive machine learning methods in classification of short text documents.
