Extracting Information from Web Content and Structure

Extracting Information from Web Content and Structure

Web is a vast data repository. By mining from this data efficiently, we can gain valuable knowledge. Unfortunately, in addition to useful content there are also many Web documents considered harmful (e.g. pornography, terrorism, illegal drugs). Web mining that includes three main areas ÔÇô content, structure, and usage mining ÔÇô may help us detect and eliminate these sites. In this paper, we concentrate on applications of Web content and Web structure mining. First, we introduce a system for detection of pornographic textual Web pages. We discuss its classification methods and depict its architecture. Second, we present analysis of relations among Czech academic computer science Web sites. We give an overview of ranking algorithms and determine importance of the sites we analyzed.

Keywords: Web mining, information retrieval, classification, ranking algorithms

Year: 2006

Download: download Full text [213 kB]

Authors of this publication:


Dalibor Fiala


E-mail: dalfia@kiv.zcu.cz
WWW: http://www.kiv.zcu.cz/~dalfia/

Dalibor is an associate professor at the Department of Computer Science and Engineering at the University of West Bohemia in Pilsen, Czech Republic. He is interested in web mining, information retrieval, and information science.

Roman Tesa┼Ö


Phone: +420 377632479
E-mail: roman.tesar@gmail.com
WWW: http://www.sweb.cz/romant1/CV.pdf

Roman is a PhD student at the Department of Computer Science and Engineering, Faculty of Applied Sciences, University of West Bohemia in Pilsen, Czech Republic. His work is focused on the utilization of word n-grams in text classification and document filtering.

Karel Je┼żek


Phone:  +420 377632475, 377632400
E-mail: jezek_ka@kiv.zcu.cz
WWW: http://www-kiv.zcu.cz/~jezek_ka/

Karel is a group coordinator and a supervisor of PhD students working at research projects of this Group.

Fran├žois Rousselot


E-mail: francois.rousselot@insa-strasbourg.fr

Fran├žois is interested in knowledge acquisition and computational linguistics.

Related Projects:


Project

Extracting Information from Web Content and Structure

Authors:  Dalibor Fiala, Roman Tesa┼Ö, Karel Je┼żek
Desc.:This project deals with classification of Web documents and determination of authoritative Web sites. It was supported in part by the Ministry of Education of the Czech Republic under grant FRVS 1347/2005/G1.