Automatic keyphrase extraction based on NLP and statistical methods

In this article we present an experimental approach to automatic keyphrase extraction based on statistical methods and WordNet-based pattern evaluation. Automatically extracted keyphrases are important for automatic tagging and clustering because manually assigned keyphrases are insufficient in most cases. Keyphrase candidates are extracted in a new way that combines graph methods (TextRank) with statistical methods (TF*IDF). Keyword candidates are merged with named entities and stop words according to natural-language part-of-speech (POS) patterns. Automatic keyphrases are generated as TF*IDF-weighted unigrams. Keyphrases describe the main ideas of documents in a human-readable way. The approach is evaluated on articles extracted from news websites. Each article contains manually assigned topics/categories, which are used for keyword evaluation.
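
To make the TF*IDF weighting of unigrams mentioned above concrete, here is a minimal sketch in Python. It is an illustration only and not the authors' implementation: the function name `tfidf_unigrams`, the plain logarithmic IDF variant, and the three-document toy corpus are all assumptions made for this example.

```python
import math
from collections import Counter

def tfidf_unigrams(docs):
    """Return one {term: weight} dict per tokenized document.

    Hypothetical helper: uses relative term frequency and
    idf = log(N / df), one common TF*IDF variant among several.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Toy corpus (made up for illustration), already tokenized into unigrams.
docs = [
    "keyphrase extraction with textrank".split(),
    "keyphrase clustering with linked data".split(),
    "news article tagging".split(),
]
weights = tfidf_unigrams(docs)

# The two highest-weighted unigrams of the first document.
top = sorted(weights[0], key=weights[0].get, reverse=True)[:2]
print(top)
```

Terms that occur in only one document of the corpus receive a higher weight than terms shared across documents, which is why such unigrams are plausible keyphrase candidates.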

Keywords: keyphrase extraction, Wordnet, TextRank, TFIDF, NLP

Year: 2011

Authors of this publication:

Martin Dostal


Martin graduated from the University of West Bohemia in 2009, specializing in software engineering. He is interested in the Semantic Web, information retrieval, and question answering.

Karel Ježek

Phone: +420 377632475

Karel is the former group coordinator and a supervisor of PhD students working on the group's research projects.

Related Projects:


Document Clustering and Linked Data

Authors: Karel Ježek, Martin Dostal
Desc.: Unsupervised methods for automatic tagging and clustering based on information extraction from Linked Data.