Internet Content Filtering

Keywords: topic detection, web content analysis, improper content
Description: The Internet has become a popular information medium. Hyperlinks open up new ways of presenting information, and they are widely used for searching for and taking in information. More and more people use the Internet for work and leisure. However, the rapid growth of the Internet and of public access to it also causes problems.

There are no restrictions: anybody can publish anything on the Internet, and anybody can see it. Many web sites contain indecent, violent, or otherwise objectionable content. Parents worry that their children might be exposed to pornography, violence, or extremism, or be approached by pedophiles. In cyberspace, people can easily change their identity and be whoever they want to be. It is very difficult to trace the owners of objectionable web sites. There are even servers devoted to especially objectionable or forbidden content.

Our primary task is to detect these servers or individual sites in various languages. Many commercial applications for Internet content filtering rely on a database of pre-classified web sites. Our approach instead lets users set their own threshold of objectionability while browsing. We then analyze the stored objectionable web sites (which usually contain links to further inappropriate sites) to find the servers that host most of them.
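
The last step, finding the servers that host most of the stored sites, can be illustrated with a minimal sketch. It is not the project's implementation: the function name rank_servers, the numeric score in [0, 1], the threshold parameter, and the example URLs are all assumptions made for illustration. The sketch keeps only pages whose score reaches the user's threshold and counts them per host.

```python
from collections import Counter
from urllib.parse import urlsplit


def rank_servers(flagged_pages, threshold):
    """Count pages per host, keeping only pages scored at or above the threshold.

    flagged_pages: iterable of (url, score) pairs; the score scale is hypothetical.
    Returns a list of (host, page_count) pairs, most frequent host first.
    """
    counts = Counter()
    for url, score in flagged_pages:
        if score >= threshold:
            host = urlsplit(url).hostname
            if host:  # skip entries without a parsable host
                counts[host] += 1
    return counts.most_common()


# Hypothetical stored pages: (URL, objectionability score in [0, 1]).
flagged_pages = [
    ("http://a.example/page1", 0.9),
    ("http://a.example/page2", 0.7),
    ("http://b.example/index", 0.8),
    ("http://c.example/home", 0.2),
]

ranking = rank_servers(flagged_pages, threshold=0.5)
print(ranking)  # [('a.example', 2), ('b.example', 1)]
```

Raising the threshold makes the filter more permissive, so fewer pages are counted and fewer servers appear in the ranking.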

The goal of this system is to support governmental institutions in preventing and combating Internet crime. It is also intended for use in public institutions (schools, universities, libraries).
Status: Finished

People on this project:

Roman Tesař

Phone: +420 377632479

Roman is a PhD student at the Department of Computer Science and Engineering, Faculty of Applied Sciences, University of West Bohemia in Pilsen, Czech Republic. His work is focused on the utilization of word n-grams in text classification and document filtering.

Karel Ježek

Phone:  +420 377632475

Karel is the former group coordinator and a supervisor of PhD students working on the research projects of this group.



Extending the Single Words-Based Document Model: A Comparison of Bigrams and 2-Itemsets

Authors: Roman Tesař, Massimo Poesio, Václav Strnad, Karel Ježek
Source: The 2006 ACM Symposium on Document Engineering (DocEng’06), Amsterdam, Netherlands, ACM Press (New York, NY, USA), ISBN 1-59593-515-0, pages 138-146.
Download: Full text

A comparison of two algorithms for discovering repeated word sequences

Authors: Roman Tesař, Dalibor Fiala, François Rousselot, Karel Ježek
Source: The 6th International Conference on Data Mining, Text Mining and their Business Applications (Data Mining 2005), Skiathos, Greece, ISBN 1-84564-017-9, pages 121-131, WIT Transactions on Information and Communication Technologies, ISSN 1743-3517.
Download: Full text [245 kB]
View record in Web of Science®

In Czech: Klasifikace Suffix Tree frázemi - srovnání s metodou Itemsets (Classification with Suffix Tree Phrases: A Comparison with the Itemsets Method)

Authors: Roman Tesař, Karel Ježek
Source: Znalosti 2005 conference, Stará Lesná, Slovakia, ISBN 80-248-0755-6, pages 144-153.
Download: Full text [301 kB]