Text-Mining Research Group

Dictionaries On-line

GNU/FDL English-Czech Dictionary
Desc.:	English-Czech online dictionary mostly based on the i-spell database.

SPOT On-line Dictionary
Desc.:	Still in progress; may be temporarily unavailable. A dictionary administered and updated by volunteers under the Dept. of Computer Science and Engineering at the University of West Bohemia. Terminology focuses mostly on computer science and engineering.

Other Research Groups

Amphora Research Group at VSB-TU Ostrava
Desc.:	The research activities of ARG relate mostly to applications in Information Retrieval and related disciplines (e.g. data indexing and storage, data modeling, data compression, and text retrieval).

Knowledge Discovery Group at FI MU Brno
Desc.:	The Knowledge Discovery Group aims at the development of pre-processing methods for data mining, natural language learning, mining in spatio-temporal data, classification of difficult patterns, integration of data mining tools with database systems, and other topics.

Natural Language Engineering and Web Applications Group at University of Essex
Desc.:	The aim of research in Natural Language Engineering (NLE) is to endow computer systems with the ability to process natural language. This ability is essential for applications such as information retrieval and web search, information extraction and data mining, text summarization, and speech technology. NLE techniques for morphological analysis, part-of-speech tagging, word prediction, and term extraction are already in use in real-world applications in these areas, and the technology required for applications such as news summarization or spoken dialogue systems (e.g., systems that can engage in a dialogue with customers to give information about train timetables) is already at a very advanced state of development.

Text Corpora - Useful Sources

American National Corpus
Desc.:	The American National Corpus (ANC) project is creating a massive electronic collection of American English, including texts of all genres and transcripts of spoken data produced from 1990 onward. The ANC will provide the most comprehensive picture of American English ever created, and will serve as a resource for education, linguistic and lexicographic research, and technology development.

British National Corpus
Desc.:	The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written.

Czech National Corpus
Desc.:	The Czech National Corpus (CNC) is a non-commercial, academic project focused on building up a large computer-based corpus, containing mainly written Czech. CNC presents a very large, modern and valuable language and informational base.

Michigan Corpus of Academic Spoken English
Desc.:	A collection of transcripts of academic speech events recorded at the University of Michigan.

Text Corpora and Corpus Linguistics
Desc.:	Useful information on text corpora and concordancing. The site was originally a Corpus Linguistics site at Rice University.

Text Corpus Toolkit
Desc.:	The Text Corpus Toolkit is a web application designed to facilitate analysis and administration of various text corpora via a simple web interface. Standard text collections include Reuters, Enron spam, Ling spam, 20Newsgroups, and others. The Toolkit can be used by text-mining researchers to generate various statistics on text corpora.

UCL Survey of English Usage
Desc.:	The Survey of English Usage carries out research in English Linguistics and was the first centre in Europe to do research with corpora. The Survey is based in the Department of English Language and Literature at UCL.

Text-Mining Research Group

University of West Bohemia
