Text-Mining Research Group

Internet Content Filtering
Keywords:	topic detection, web content analysis, improper content
Description:	The Internet has become a popular information medium. Hyperlinks open new ways of presenting information, and they are widely used for searching and browsing. More and more people use the Internet for both work and leisure. However, its rapid growth and open public access also cause problems. Publishing is unrestricted: anyone can publish anything on the Internet, and anyone can view it. Many web sites contain indecent, violent, or otherwise objectionable content. Parents worry that their children might be exposed to pornography, violence, or extremism, or be approached by pedophiles. In cyberspace, people can easily change their identity and present themselves as whoever they want to be, and it is very difficult to trace the owners of objectionable web sites. Some servers are even dedicated to particularly objectionable or illegal content. Our primary task is to detect such servers and individual sites in various languages. Many commercial applications for Internet content filtering rely on a database of pre-classified web sites. Our approach instead lets users set their own level of objectionability while browsing. We then analyze the stored objectionable sites (which usually contain links to further inappropriate sites) to find the servers that host most of them. The goal of the system is to facilitate the work of governmental institutions in preventing and combating Internet crime. It can also be used in public institutions such as schools, universities, and libraries.
Status:	Finished
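
The server-detection step in the description above can be illustrated with a minimal sketch. It is not the project's actual implementation: it assumes each stored page already carries a numeric objectionability score from some classifier (not shown), and the function name `flagged_hosts`, the score scale, and the example URLs are all hypothetical. Pages at or above the user-chosen threshold are grouped by host name, and hosts are ranked by how many such pages they serve.

```python
from collections import Counter
from urllib.parse import urlsplit


def flagged_hosts(scored_urls, threshold):
    """Rank hosts by the number of pages scoring at or above the threshold.

    scored_urls: iterable of (url, score) pairs; the scores are assumed to
    come from a separate classifier and to lie on a scale the user understands.
    threshold: the user's own objectionability level (hypothetical parameter).
    """
    counts = Counter()
    for url, score in scored_urls:
        if score >= threshold:
            host = urlsplit(url).hostname
            if host:  # skip entries without a parsable host name
                counts[host] += 1
    # Hosts with the most flagged pages come first.
    return counts.most_common()


# Hypothetical stored pages with made-up scores.
stored = [
    ("http://a.example/page1", 0.9),
    ("http://a.example/page2", 0.8),
    ("http://b.example/index", 0.95),
    ("http://c.example/", 0.1),
]
ranking = flagged_hosts(stored, threshold=0.5)
```

Lowering the threshold makes the filter stricter, so more pages count toward each host. A fuller version might also follow the outgoing links of flagged pages, since the description notes that they often lead to further inappropriate sites.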

People on this project:

Roman Tesař

Phone: +420 377632479
E-mail: roman.tesar@gmail.com
WWW: http://www.sweb.cz/romant1/CV.pdf

Roman is a PhD student at the Department of Computer Science and Engineering, Faculty of Applied Sciences, University of West Bohemia in Pilsen, Czech Republic. His work is focused on the utilization of word n-grams in text classification and document filtering.

Karel Ježek

Phone: +420 377632475
E-mail: jezek_ka@kiv.zcu.cz
WWW: https://cs.wikipedia.org/wiki/Karel_Je%C5%BEek_(informatik)

Karel is the former group coordinator and a supervisor of PhD students working on the group's research projects.

Publications:

Extending the Single Words-Based Document Model: A Comparison of Bigrams and 2-Itemsets
Authors:	Roman Tesař, Massimo Poesio, Václav Strnad, Karel Ježek
Source:	The 2006 ACM Symposium on Document Engineering (DocEng’06), Amsterdam, Netherlands, ACM Press (New York, NY, USA), ISBN 1-59593-515-0, pages 138-146.
Download:	Full text

A comparison of two algorithms for discovering repeated word sequences
Authors:	Roman Tesař, Dalibor Fiala, François Rousselot, Karel Ježek
Source:	The 6th International Conference on Data Mining, Text Mining and their Business Applications (Data Mining 2005), Skiathos, Greece, ISBN 1-84564-017-9, pages 121-131, WIT Transactions on Information and Communication Technologies, ISSN 1743-3517.
Download:	Full text [245 kB]

In Czech: Klasifikace Suffix Tree frázemi - srovnání s metodou Itemsets (Classification Using Suffix Tree Phrases: A Comparison with the Itemsets Method)
Authors:	Roman Tesař, Karel Ježek
Source:	Znalosti 2005 conference, Stará Lesná, Slovakia, ISBN 80-248-0755-6, pages 144-153.
Download:	Full text [301 kB]
