
Automatic Dialog Act Corpus Creation from Web Pages
This work presents two complementary tools for creating textual corpora for linguistic research. The chosen application domain is automatic dialog act recognition, but the proposed tools may also be applied to any other research area concerned with dialog processing. The first tool captures relevant dialogs from freely available resources on the World Wide Web. Filtering and parsing of these web pages rely on a set of hand-crafted rules. A second set of rules is then applied to perform automatic segmentation and dialog act tagging. The second tool is then used as a post-processing step to manually check and correct tagging errors where needed. In this paper, both tools are presented, and the performance of automatic tagging is evaluated on a dialog corpus extracted from an online Czech journal. We show that reasonably good dialog act labeling accuracy can be achieved, greatly reducing the cost of building such corpora.
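To make the rule-based segmentation and tagging step more concrete, the following is a minimal sketch in Python. It is not the authors' rule set: the label names, the English wh-word list, the punctuation-based rules, and the function names (`segment`, `tag`, `label_turn`) are all assumptions chosen for illustration, and the actual tool works on Czech text with its own hand-crafted rules.

```python
import re

# Hypothetical wh-word list and label names; the paper's actual rules
# and dialog act inventory are not reproduced here.
WH_WORDS = ("who", "what", "when", "where", "why", "how", "which")


def segment(turn: str) -> list[str]:
    """Split a dialog turn into sentence-like units.

    A boundary is assumed after '.', '!' or '?' followed by whitespace.
    """
    parts = re.split(r"(?<=[.!?])\s+", turn.strip())
    return [p for p in parts if p]


def tag(utterance: str) -> str:
    """Assign a dialog act label using simple surface rules (illustrative only)."""
    text = utterance.strip()
    words = text.split()
    first_word = words[0].lower().strip(",.!?") if words else ""
    if text.endswith("?"):
        # Questions opening with a wh-word are treated as open questions.
        return "wh_question" if first_word in WH_WORDS else "yes_no_question"
    if text.endswith("!"):
        return "order"
    return "statement"


def label_turn(turn: str) -> list[tuple[str, str]]:
    """Segment a turn and tag each resulting unit."""
    return [(unit, tag(unit)) for unit in segment(turn)]
```

Output from rules of this kind would still need the manual checking pass described above, since surface cues such as final punctuation misclassify many utterances.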
Keywords: automatic labeling, corpus, dialog act, Internet
Year: 2010

Authors of this publication:

Pavel Král
Phone: +420 377 632 454
E-mail: pkral@kiv.zcu.cz
WWW: http://home.zcu.cz/~pkral/