
Extracting Information from Web Content and Structure
The Web is a vast data repository, and mining this data efficiently can yield valuable knowledge. Unfortunately, alongside useful content, the Web also holds many documents considered harmful (e.g., those concerning pornography, terrorism, or illegal drugs). Web mining, which comprises three main areas (content, structure, and usage mining), may help us detect and eliminate these sites. In this paper, we concentrate on applications of Web content mining and Web structure mining. First, we introduce a system for detecting pornographic textual Web pages; we discuss its classification methods and describe its architecture. Second, we present an analysis of the relations among Czech academic computer science Web sites; we give an overview of ranking algorithms and determine the importance of the sites we analyzed.
Keywords: Web mining, information retrieval, classification, ranking algorithms
Year: 2006

Authors of this publication:

Dalibor Fiala
Phone: +420 377632429
E-mail: dalfia@kiv.zcu.cz
WWW: http://www.kiv.zcu.cz/~dalfia/

Roman Tesař
Phone: +420 377632479
E-mail: roman.tesar@gmail.com
WWW: http://www.sweb.cz/romant1/CV.pdf

Karel Ježek
Phone: +420 377632475
E-mail: jezek_ka@kiv.zcu.cz
WWW: https://cs.wikipedia.org/wiki/Karel_Je%C5%BEek_(informatik)

François Rousselot
E-mail: francois.rousselot@insa-strasbourg.fr

Related Projects:

Extracting Information from Web Content and Structure
Authors: Dalibor Fiala, Roman Tesař, Karel Ježek
Description: This project deals with the classification of Web documents and the determination of authoritative Web sites. It was supported in part by the Ministry of Education of the Czech Republic under grant FRVS 1347/2005/G1.