A research group focused on knowledge mining from texts was established at the Department of Computer Science and Engineering in 1999. The original two-member team, consisting of Karel Ježek (the supervisor) and Jiří Hynek (then a Ph.D. student), was gradually augmented by other computer science students in both master's and doctoral programs. The team chose the name Text-Mining Research Group, abbreviated to TMRG.

Our activities were formerly supported by the “Information Systems and Technologies” research program, and a research grant entitled “Cooperation of Technical Universities and the State in the Fight against Computer Crime” awarded by the Ministry of Education. Our current funding comes from the IInd National Research Program – “Natural Language Communication with the Semantic Web” research grant.

The exponential growth of information available on the web and in electronic databases and libraries makes manual analysis infeasible. Automatic information processing is therefore a great challenge, attracting a number of researchers working at the interface between computer science and linguistics.

The tasks we deal with are characterized by high dimensionality combined with a high volume of data to be processed. It is therefore essential to reduce the dimensionality of the problem and to find efficient algorithms for data storage, management, selection and processing.
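
To make the dimensionality problem concrete, the following minimal sketch shows one common way of shrinking a bag-of-words vocabulary: document-frequency thresholding. It is an illustration only, not a description of the methods used by our group; the function names (`tokenize`, `select_terms`, `vectorize`), the thresholds and the sample documents are all assumptions chosen for the example.

```python
from collections import Counter

def tokenize(text):
    # Simplifying assumption: lowercase and split on whitespace only.
    return text.lower().split()

def select_terms(docs, min_df=2, max_df_ratio=0.9):
    """Keep terms that occur in at least min_df documents and in at most
    max_df_ratio of all documents (hypothetical thresholds)."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        # Count each term once per document (document frequency).
        df.update(set(tokenize(doc)))
    return sorted(
        term for term, count in df.items()
        if count >= min_df and count / n_docs <= max_df_ratio
    )

def vectorize(doc, vocab):
    # Represent a document as term counts over the reduced vocabulary.
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocab]

# Made-up sample documents, for illustration only.
docs = [
    "text mining extracts knowledge from text",
    "web mining analyses hypertext links",
    "text classification filters unsolicited mail",
    "clustering groups similar text documents",
]

vocab = select_terms(docs, min_df=2, max_df_ratio=0.9)
vector = vectorize(docs[0], vocab)
```

Rare terms are dropped because they carry little statistical evidence, and terms present in nearly every document are dropped because they do not discriminate between documents. Each document is then represented by a much shorter vector.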

We work on tasks that involve classification, searching, filtering, clustering, and summarization methods designed for extensive text and hypertext databases.

Our methods can be applied to tasks such as automatic filtering of unsolicited mail or web pages, information search refinement, generation of abstracts and summaries, detection of illegal web sites, and analysis of the authority ranking of web sites.

We have also initiated a new on-line Czech-English dictionary project (SPOT) with the aim of standardizing the Czech technical terminology currently used by computer and information professionals.

Our long-term objective is to create a robust system that extracts knowledge from semi-structured data in a multilingual web environment in order to infer new information or knowledge that is not contained explicitly in the original data.

