Extending the Single Words-Based Document Model: A Comparison of Bigrams and 2-Itemsets
The basic approach in text categorization is to represent documents by single words. Often, however, other features are used to achieve better classification results. In this paper, we focus on bigrams and 2-itemsets. We compare the improvement in classification accuracy obtained when these features extend the single words-based document representation on two standard text corpora: Reuters-21578 and 20 Newsgroups. For this comparison we use the multinomial Naive Bayes classifier and five different feature selection approaches. Algorithms for discovering bigrams and 2-itemsets are presented as well. Our results show a statistically significant improvement when bigrams, and likewise 2-itemsets, are incorporated. In the case of 2-itemsets, however, it is important to use an appropriate feature selection method. By contrast, classification accuracy improves even when only a simple feature selection approach is applied to discover bigrams. We conclude that, in our case, extending the document representation with 2-itemsets is not very effective, because bigrams achieve better results and are less resource-consuming to discover.
Keywords: Machine learning, feature selection, text categorization, document model, n-grams, bigrams, itemsets, comparison
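The difference between the two feature types compared in the abstract can be illustrated with a minimal Python sketch. It assumes the common definitions: a bigram is an ordered pair of adjacent words, and a 2-itemset is an unordered pair of distinct words that co-occur in the same document. The function names (tokenize, bigrams, two_itemsets, extended_features) are hypothetical, the whitespace tokenizer is a simplification, and the code does not reproduce the discovery algorithms or the feature selection methods presented in the paper.

```python
from collections import Counter
from itertools import combinations


def tokenize(text):
    # Simplified tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def bigrams(tokens):
    # Ordered pairs of adjacent tokens; word order matters.
    return list(zip(tokens, tokens[1:]))


def two_itemsets(tokens):
    # Unordered pairs of distinct words co-occurring in one document;
    # position and order within the document are ignored.
    return [frozenset(pair) for pair in combinations(sorted(set(tokens)), 2)]


def extended_features(text):
    # Single-word counts extended with bigram counts, as one possible
    # input representation for a multinomial classifier.
    tokens = tokenize(text)
    features = Counter(tokens)
    features.update("_".join(pair) for pair in bigrams(tokens))
    return features


if __name__ == "__main__":
    doc = "oil prices rise as oil demand grows"
    print(bigrams(tokenize(doc)))
    print(len(two_itemsets(tokenize(doc))))
    print(extended_features(doc))
```

Under these definitions a document with n distinct words yields n(n-1)/2 candidate 2-itemsets but at most one bigram per adjacent word position, which suggests why itemset discovery can be the more resource-consuming of the two.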
Year: 2006
Authors of this publication:
Roman Tesař
Phone: +420 377632479
E-mail: roman.tesar@gmail.com
WWW: http://www.sweb.cz/romant1/CV.pdf
Václav Strnad
E-mail: vaclav.strnad@seznam.cz
Karel Ježek
Phone: +420 377632475
E-mail: jezek_ka@kiv.zcu.cz
WWW: https://cs.wikipedia.org/wiki/Karel_Je%C5%BEek_(informatik)
Related Projects:
Internet Content Filtering
Authors: Roman Tesař, Karel Ježek
Desc.: This project covers processing and analyzing Web sites, classifying them by their content, and searching for other Web sites with similar content.
Document Classification
Authors: Jiří Hynek, Karel Ježek, Michal Toman, Roman Tesař, Zdeněk Češka, Petr Grolmus
Desc.: Use of inductive machine learning methods in the classification of short text documents.