
Document Classification | |
Keywords: | classification, clustering, categorization, classifier, machine learning, spam, filter, unsolicited mail, content |
Description: | Use of inductive machine learning methods in classification of short text documents. Research includes implementation of the Itemsets classifier, Naive Bayes classifier, NBCI (Naive Bayes Combined with Itemsets), and TFxIDF classifier, in addition to clustering algorithms. Application of classification algorithms based on inductive machine learning in filtering of unsolicited mail (spam). |
Status: | Finished |
People on this project:

Jiří Hynek
Phone: +420 603492837
E-mail: jhynek@kiv.zcu.cz
WWW: http://www.kiv.zcu.cz/staff/osobni.php?id_osoby=147&lang=EN
Jiri, a co-founder of the Text-Mining Research Group, works as a lecturer at the Dept. of Computer Science and Engineering. His research interests include machine learning and language-related problems. Jiri’s teaching activity is focused on good writing style and technical writing in general.

Karel Ježek
Phone: +420 377632475
E-mail: jezek_ka@kiv.zcu.cz
WWW: https://cs.wikipedia.org/wiki/Karel_Je%C5%BEek_(informatik)
Karel is the former group coordinator and a supervisor of PhD students working on the group's research projects.

Michal Toman
E-mail: mtoman@kiv.zcu.cz
Michal graduated from UWB in 2003, specializing in software engineering. Currently, he is a PhD student interested in information retrieval, multilingual text processing, word sense disambiguation, and knowledge discovery.

Roman Tesař
Phone: +420 377632479
E-mail: roman.tesar@gmail.com
WWW: http://www.sweb.cz/romant1/CV.pdf
Roman is a PhD student at the Department of Computer Science and Engineering, Faculty of Applied Sciences, University of West Bohemia in Pilsen, Czech Republic. His work is focused on the utilization of word n-grams in text classification and document filtering.

Zdeněk Češka
E-mail: zceska@kiv.zcu.cz
WWW: http://www.kiv.zcu.cz/en/department/members/detail.html?login=zceska
Zdeněk has worked for various international companies in the field of software engineering. He has earned a Master's degree and a PhD in Computer Science and Engineering. His research interests include mathematics and algorithmization, plagiarism detection, multilingual processing, text classification, and related fields.

Petr Grolmus
E-mail: indy@civ.zcu.cz
Petr was a co-founder of the Text-Mining Research Group. His interest was mainly focused on identifying user profiles based on users' behavior on the Web.
Publications:

In Czech: Extrakce N-gramů z rozsáhlých textů | |
Authors: | Zdeněk Češka, Ivo Hanák, Roman Tesař |
Source: | Proceedings of the 7th Annual Conference ZNALOSTI 2008, Bratislava, Slovakia, pp. 54-65, February 2008. ISBN 978-80-227-2827-0. |

In Czech: Rozšíření bag-of-words modelu dokumentu: srovnání bigramů a 2-itemsetů | |
Authors: | Roman Tesař, Massimo Poesio, Václav Strnad, Karel Ježek |
Source: | In Proceedings of Znalosti 2007 Conference, Ostrava, Czech Republic, pp. 131-142, ISBN 978-80-248-1279-3, February 2007. |

In Czech: Vliv normalizace slov na klasifikaci textů | |
Authors: | Michal Toman, Roman Tesař, Karel Ježek |
Source: | Znalosti 2007, Ostrava |

Teraman: A Tool for N-gram Extraction from Large Datasets | |
Authors: | Zdeněk Češka, Ivo Hanák, Roman Tesař |
Source: | Proceedings of the IEEE 3rd International Conference on Intelligent Computer Communication and Processing (IEEE ICCP 2007), Cluj-Napoca, Romania, pp. 209-216, September 2007. ISBN 978-1-4244-1491-8. |

The Fight against Spam - A Machine Learning Approach | |
Authors: | Karel Ježek, Jiří Hynek |
Source: | Proceedings of the 11th International Conference on Electronic Publishing, Vienna, Austria, ISBN 978-3-85437-292-9 |

Extending the Single Words-Based Document Model: A Comparison of Bigrams and 2-Itemsets | |
Authors: | Roman Tesař, Massimo Poesio, Václav Strnad, Karel Ježek |
Source: | The 2006 ACM Symposium on Document Engineering (DocEng’06), Amsterdam, Netherlands, ACM Press (New York, NY, USA), ISBN 1-59593-515-0, pages 138-146. |

Influence of Word Normalization on Text Classification | |
Authors: | Michal Toman, Roman Tesař, Karel Ježek |
Source: | InSciT 2006, Proceeding of Multidisciplinary Approaches to Global Information Systems, vol II, Merida, Spain |

Documents Categorization in Multilingual Environment | |
Authors: | Karel Ježek, Michal Toman |
Source: | ElPub2005, pp.97-104, Leuven, Belgium 2005, Peeters Publishing, ISBN 90-429-1645-1 |

In Czech: Klasifikace Suffix Tree frázemi - srovnání s metodou Itemsets | |
Authors: | Roman Tesař, Karel Ježek |
Source: | Znalosti 2005 conference, Stará Lesná, Slovakia, ISBN 80-248-0755-6, pages 144-153. |

In Czech: Kategorizace textů metodou NBCI | |
Authors: | Martin Kučera, Karel Ježek, Jiří Hynek |
Source: | Proceedings of the 2nd Annual Conference Znalosti 2003, Czech Republic. Vojtěch Svátek (Ed). VŠB-Technická univerzita Ostrava, Czech Republic, ISBN 80-248-0229-5 |

User Profile Identification Based on Text Mining | |
Authors: | Petr Grolmus, Jiří Hynek, Karel Ježek |
Source: | Proceedings of 6th International Conference on Information Systems Implementation and Modelling – ISIM ‘03 Brno, Czech Republic. Miroslav Beneš (Ed.). MARQ, Czech Republic, ISBN 80-85988-84-4 |

Use of Text Mining Methods in a Digital Library | |
Authors: | Jiří Hynek, Karel Ježek |
Source: | Proceedings of the Sixth International Conference on Electronic Publishing – elpub2002 Karlovy Vary, Czech Republic, Joao A. Carvalho, Arved Hübler, Anna A. Baptista (Eds). Verlag für Wissenschaft und Forschung Berlin, Germany, ISBN 3-897-0035 |

Document Classification Using Itemsets | |
Authors: | Jiří Hynek, Karel Ježek |
Source: | Proceedings of 34th Spring International Conference MOSIS 2000, Rožnov pod Radhoštěm, Czech Republic, J. Zendulka (Ed.). MARQ, Czech Republic, ISBN 80-85988-45-3 |
Related Downloads:

Teraman v1.0 | |
Size: | 2 kB |
Desc: | Teraman is a tool for N-gram extraction from large text datasets. Because it processes text in batches, it can handle texts much larger than the available memory. The process consists of three steps: pre-processing and indexing, counting N-grams, and de-indexing. The tool is written in C# and requires the .NET Framework 2.0 to run. |
Related: | Document Classification |
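
The three steps named in the Teraman description can be illustrated with a minimal Python sketch. This is not Teraman's code (the tool itself is written in C#), and every function name, the `batch_size` parameter, and the pickle-based temporary files are assumptions made for illustration only. The sketch maps words to integer ids, counts N-grams in bounded batches that are spilled to disk as sorted runs, merges the runs, and finally maps the ids back to words.

```python
import heapq
import os
import pickle
import re
import tempfile
from collections import Counter, deque
from itertools import groupby


def index_tokens(lines, vocab):
    """Step 1 (assumed form): tokenize and replace each word with an integer id."""
    for line in lines:
        for word in re.findall(r"\w+", line.lower()):
            yield vocab.setdefault(word, len(vocab))


def _spill(counts, tmpdir, k):
    """Write one batch of counts to disk, sorted by n-gram."""
    path = os.path.join(tmpdir, f"batch{k}.pkl")
    with open(path, "wb") as f:
        for item in sorted(counts.items()):
            pickle.dump(item, f)
    return path


def _read(path):
    """Stream (ngram, count) pairs back from one spilled batch."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return


def count_batches(ids, n, batch_size, tmpdir):
    """Step 2a: count n-grams, spilling whenever a batch holds batch_size distinct n-grams."""
    paths, counts, window = [], Counter(), deque(maxlen=n)
    for token_id in ids:
        window.append(token_id)
        if len(window) == n:
            counts[tuple(window)] += 1
            if len(counts) >= batch_size:
                paths.append(_spill(counts, tmpdir, len(paths)))
                counts = Counter()
    if counts:
        paths.append(_spill(counts, tmpdir, len(paths)))
    return paths


def merge_counts(paths):
    """Step 2b: merge the sorted batches and sum the counts of equal n-grams."""
    merged = heapq.merge(*(_read(p) for p in paths))
    for ngram, group in groupby(merged, key=lambda item: item[0]):
        yield ngram, sum(count for _, count in group)


def extract_ngrams(lines, n=2, batch_size=1000):
    """Run all three steps; step 3 (de-indexing) maps ids back to words."""
    vocab = {}
    with tempfile.TemporaryDirectory() as tmpdir:
        paths = count_batches(index_tokens(lines, vocab), n, batch_size, tmpdir)
        words = {i: w for w, i in vocab.items()}
        return {tuple(words[i] for i in ngram): count
                for ngram, count in merge_counts(paths)}
```

Calling `extract_ngrams(open("corpus.txt", encoding="utf-8"), n=2)` on a hypothetical file `corpus.txt` would stream it line by line. For simplicity, the sketch lets N-grams span line boundaries and returns the final counts as an in-memory dictionary; a tool aimed at datasets larger than memory would presumably write the merged result to disk as well.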