Text-Mining Research Group

Exploration of Semantic Spaces Obtained from Czech Corpora

This paper examines semantic relations between Czech words. Knowledge of these relations is important in many research fields, such as information retrieval, machine translation, and document clustering. We extracted the relations from newspaper articles, generating many semantic spaces with the LSA, HAL and COALS algorithms. Experiments were conducted with various parameter settings and different ways of preprocessing the corpus. The preprocessing included lemmatization and an attempt to use only "open class" words. The computed relations between words were evaluated using the Czech equivalent of the Rubenstein-Goodenough test. The results indicate whether these algorithms, originally developed for English, can also be used for Czech texts.
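As a rough illustration of the kind of pipeline the abstract describes, the sketch below builds word vectors from windowed co-occurrence counts, compares words by cosine similarity, and correlates the model's scores with human ratings, which is the general idea behind a Rubenstein-Goodenough style evaluation. It is not the paper's implementation: plain symmetric counting stands in for HAL, LSA and COALS, Pearson correlation is an assumed choice of measure, and the tiny corpus, the word pairs and the ratings are invented for the example.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """For each word, count the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


# Hypothetical, already lemmatized toy corpus (not the newspaper data).
sentences = [
    ["car", "drive", "road"],
    ["automobile", "drive", "road"],
    ["bird", "fly", "sky"],
    ["car", "park", "garage"],
]

# Hypothetical human similarity ratings on a 0-4 scale (invented values).
human_ratings = {
    ("car", "automobile"): 3.9,
    ("car", "bird"): 0.5,
    ("bird", "automobile"): 0.3,
}

vectors = cooccurrence_vectors(sentences, window=2)
pairs = list(human_ratings)
model_scores = [cosine(vectors[a], vectors[b]) for a, b in pairs]
human_scores = [human_ratings[p] for p in pairs]
r = pearson(model_scores, human_scores)
print(f"correlation with human ratings: {r:.3f}")
```

A higher correlation means the semantic space orders word pairs more like human judges do, which is how different algorithms, parameter settings and preprocessing variants can be compared against each other.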

Keywords: Information retrieval, Semantic space, LSA, HAL, COALS, Rubenstein-Goodenough test

Year: 2011

Download:

Full text [691 kB]

Authors of this publication:

Lubomír Krčmář

E-mail: lkrcmar@kiv.zcu.cz

Luboš graduated from the University of West Bohemia in 2009 and is now a PhD student. His research focuses on natural language processing, information retrieval, and the semantic similarity of texts of varying length. He is especially interested in the automatic extraction of collocations and idiomatic expressions from large corpora.

Karel Ježek

Phone: +420 377632475
E-mail: jezek_ka@kiv.zcu.cz
WWW: https://cs.wikipedia.org/wiki/Karel_Je%C5%BEek_(informatik)

Karel is the former group coordinator and a supervisor of PhD students working on the group's research projects.

Miloslav Konopík

Related Projects:

Exploration of Semantic Spaces
Authors:	Karel Ježek, Lubomír Krčmář, Miloslav Konopík
Desc.:	This work focuses on semantic relations between words and the application of these relations in research fields such as information retrieval, machine translation, and document clustering.