Commas recovery with syntactic features in French and in Czech

Automatic speech transcripts can be made more readable and more useful for further processing by enriching them with punctuation marks and other meta-linguistic information. In this work, we study how to improve the automatic recovery of one of the most difficult punctuation marks, the comma, in French and in Czech. We show that comma detection performance is substantially improved in both languages by integrating syntactic features derived from dependency structures into our baseline Conditional Random Field model. We further study the relative impact of language-independent vs. language-specific features, and show that combining both gives the largest improvement. Finally, we discuss the robustness of these features to speech recognition errors.
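
To make the idea of syntactic features for a sequence labeler more concrete, here is a minimal Python sketch of per-token feature extraction from a dependency parse. It is an illustration only and not the feature set used in the paper: the `Token` fields, the feature names (`head_dist`, `linked_to_next`, and so on), the tag labels and the English example sentence are all assumptions made for this sketch. Each token gets one feature dictionary, which a CRF-style model could use to decide whether a comma follows that token.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union

Feature = Union[str, int, bool]


@dataclass
class Token:
    """One word of a parsed sentence (hypothetical structure for this sketch)."""
    form: str             # surface word form
    pos: str              # part-of-speech tag
    head: Optional[int]   # index of the dependency head, None for the root
    deprel: str           # label of the dependency relation to the head


def token_features(sent: List[Token], i: int) -> Dict[str, Feature]:
    """Build features for the decision 'does a comma follow token i?'."""
    tok = sent[i]
    feats: Dict[str, Feature] = {
        "word": tok.form.lower(),
        "pos": tok.pos,
        "deprel": tok.deprel,
    }
    if tok.head is None:
        feats["is_root"] = True
    else:
        # Syntactic context: category of the head and signed distance to it.
        feats["head_pos"] = sent[tok.head].pos
        feats["head_dist"] = tok.head - i
    if i + 1 < len(sent):
        nxt = sent[i + 1]
        feats["next_word"] = nxt.form.lower()
        feats["next_pos"] = nxt.pos
        # True when a dependency arc directly links this token and the next.
        feats["linked_to_next"] = (nxt.head == i) or (tok.head == i + 1)
    else:
        feats["eos"] = True
    return feats


def sentence_features(sent: List[Token]) -> List[Dict[str, Feature]]:
    """One feature dictionary per token, in sentence order."""
    return [token_features(sent, i) for i in range(len(sent))]


# Unpunctuated example "yes i agree" with an assumed parse rooted at "agree".
sent = [
    Token("yes", "INTJ", 2, "discourse"),
    Token("i", "PRON", 2, "nsubj"),
    Token("agree", "VERB", None, "root"),
]
feats = sentence_features(sent)
```

In this sketch, features such as `head_dist` and `linked_to_next` depend only on the tree shape and so could be computed the same way for any language, while the tag and relation labels depend on the annotation scheme of the parser that produced the tree.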

Keywords: commas recovery, conditional random fields, Czech, dependency parsing, French, punctuation detection

Year: 2011

Download: Full text [150 kB]

Authors of this publication:

Pavel Král

Phone: +420 377 632 454
E-mail: pkral@kiv,

Pavel is a lecturer/researcher at the Department of Computer Science and Engineering at the University of West Bohemia in Pilsen (Czech Republic). His research is focused on automatic speech processing, dialog act recognition, syntactic parsing, punctuation annotation and document classification.