Text-Mining Research Group

Multilingual Media Monitoring and Text Analysis – Challenges for Highly Inflected Languages

We present the highly multilingual news analysis system Europe Media Monitor (EMM), which gathers an average of 175,000 online news articles per day in tens of languages, categorises the news items and extracts named entities and various other information from them. We also give an overview of EMM’s text mining tool set, focusing on how the software deals with highly inflected languages such as those of the Slavic and Finno-Ugric language families. We ask three questions: How can extraction patterns be adapted to such languages? How can extracted named entities be de-inflected? And will document categorisation benefit from lemmatising the texts?

Keywords: multilingual news analysis, Europe Media Monitor, categorisation, NER

Year: 2013

Authors of this publication:

Ralf Steinberger

Maud Ehrmann

Julia Pajzs

Mohamed Ebrahim

Josef Steinberger

E-mail: jstein@kiv.zcu.cz

Josef is an associate professor at the Department of Computer Science and Engineering at the University of West Bohemia in Pilsen, Czech Republic. He is interested in media monitoring and analysis, mainly automatic text summarisation, sentiment analysis and coreference resolution.