Extending the Single Words-Based Document Model: A Comparison of Bigrams and 2-Itemsets

The basic approach in text categorization is to represent documents by single words. However, other features are often used to achieve better classification results. In this paper, we focus on bigrams and 2-itemsets. We compare the improvement in classification accuracy when these features are used to extend the single words-based document representation on two standard text corpora: Reuters-21578 and 20 Newsgroups. For this comparison we use the multinomial Naive Bayes classifier and five different feature selection approaches. Algorithms for discovering bigrams and 2-itemsets are presented as well. Our results show a statistically significant improvement when bigrams, and likewise 2-itemsets, are incorporated. However, in the case of 2-itemsets it is important to use an appropriate feature selection method. In contrast, the classification accuracy improves even when a simple feature selection approach is applied to discover bigrams. We conclude that, in our case, extending the document representation with 2-itemsets is not very effective, because bigrams achieve better results and discovering them is less resource-consuming.
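To make the two feature types concrete, the following is a minimal sketch and not the discovery algorithm from the paper. It assumes that a bigram is an ordered pair of adjacent words and that a 2-itemset is an unordered pair of distinct words co-occurring anywhere in the same document. The function names, the toy corpus, and the `min_support` threshold are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def bigrams(tokens):
    """Ordered pairs of adjacent words in one document."""
    return list(zip(tokens, tokens[1:]))


def two_itemsets(tokens):
    """Unordered pairs of distinct words that co-occur in one document.

    Assumption: the whole document is treated as one transaction.
    """
    return list(combinations(sorted(set(tokens)), 2))


def frequent_pairs(docs, pair_fn, min_support=2):
    """Keep pairs that appear in at least `min_support` documents."""
    counts = Counter()
    for tokens in docs:
        counts.update(set(pair_fn(tokens)))  # count each pair once per document
    return {pair for pair, n in counts.items() if n >= min_support}


def extend(tokens, pair_fn, selected):
    """Single-word features plus the selected pair features."""
    extra = [" ".join(p) for p in pair_fn(tokens) if p in selected]
    return list(tokens) + extra


# Toy corpus (hypothetical, for illustration only).
docs = [
    "oil prices rise".split(),
    "oil prices fall".split(),
    "prices of oil rise".split(),
]

frequent_bigrams = frequent_pairs(docs, bigrams)
frequent_itemsets = frequent_pairs(docs, two_itemsets)
print(extend(docs[0], bigrams, frequent_bigrams))
print(extend(docs[0], two_itemsets, frequent_itemsets))
```

Even on this toy corpus, the number of candidate 2-itemsets grows with the square of the document vocabulary, while the number of bigrams grows only with document length, which is consistent with the abstract's remark that bigrams are less resource-consuming to discover. The extended feature lists could then be passed to any bag-of-features classifier, such as multinomial Naive Bayes.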

Keywords: Machine learning, feature selection, text categorization, document model, n-grams, bigrams, itemsets, comparison

Year: 2006

Authors of this publication:

Roman Tesař

Phone: +420 377632479
E-mail: roman.tesar@gmail.com
WWW: http://www.sweb.cz/romant1/CV.pdf

Roman is a PhD student at the Department of Computer Science and Engineering, Faculty of Applied Sciences, University of West Bohemia in Pilsen, Czech Republic. His work is focused on the utilization of word n-grams in text classification and document filtering.

Václav Strnad

E-mail: vaclav.strnad@seznam.cz

Václav graduated from the University of West Bohemia in 2003, specializing in software engineering. He currently works as a .NET developer for a commercial company. Occasionally, in his free time, he works on text classification and internet document filtering in cooperation with Roman Tesař.

Karel Ježek

Phone: +420 377632475
E-mail: jezek_ka@kiv.zcu.cz
WWW: https://cs.wikipedia.org/wiki/Karel_Je%C5%BEek_(informatik)

Karel is the former group coordinator and a supervisor of PhD students working on research projects of this group.

Related Projects:


Internet Content Filtering

Authors: Roman Tesař, Karel Ježek
Desc.: This project covers processing and analyzing Web sites, classifying them by their content, and searching for other Web sites with similar content.

Document Classification

Authors: Jiří Hynek, Karel Ježek, Michal Toman, Roman Tesař, Zdeněk Češka, Petr Grolmus
Desc.: Use of inductive machine learning methods for the classification of short text documents.