Automatic keyphrase extraction based on NLP and statistical methods

In this article we present our experimental approach to automatic keyphrase extraction based on statistical methods and WordNet-based pattern evaluation. Automatic keyphrases are important for automatic tagging and clustering because manually assigned keyphrases are not sufficient in most cases. Keyphrase candidates are extracted in a new way derived from a combination of graph methods (TextRank) and statistical methods (TF*IDF). Keyword candidates are merged with named entities and stop words according to natural-language part-of-speech (POS) patterns. Automatic keyphrases are generated as TF*IDF-weighted unigrams. Keyphrases describe the main ideas of documents in a human-readable way. The approach is evaluated on articles extracted from news websites. Each article contains manually assigned topics/categories, which are used for keyword evaluation.
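
As a rough illustration of the TF*IDF weighting of unigrams mentioned in the abstract, the following minimal Python sketch ranks the words of each document by term frequency multiplied by inverse document frequency. It is not the implementation from the publication: the function name `tfidf_unigrams`, the whitespace tokenization, the unsmoothed `log(N / df)` form of IDF, and the toy documents are all assumptions made for this example, and the TextRank, named-entity, and POS-pattern steps are left out.

```python
import math
from collections import Counter


def tfidf_unigrams(docs, top_n=3):
    """Return the top_n (term, score) pairs per tokenized document.

    Hypothetical helper for illustration only; uses plain
    tf = count / document length and idf = log(N / df).
    """
    n_docs = len(docs)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    ranked = []
    for tokens in docs:
        tf = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        top = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
        ranked.append(top)
    return ranked


if __name__ == "__main__":
    # Toy corpus (invented for this example).
    docs = [
        "the election results surprised the voters".split(),
        "the team won the match".split(),
        "the voters watched the match".split(),
    ]
    for doc_keywords in tfidf_unigrams(docs):
        print([term for term, _ in doc_keywords])
```

In this toy corpus a word such as "the" occurs in every document, so its IDF is log(1) = 0 and it drops to the bottom of the ranking without an explicit stop-word list. A real system would typically use a smoothed IDF variant and a much larger background corpus.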

Keywords: keyphrase extraction, Wordnet, TextRank, TFIDF, NLP

Year: 2011

Authors of this publication:

Martin Dostal


Martin graduated from the University of West Bohemia in 2009, specializing in software engineering. He is interested in the Semantic Web, information retrieval, and question answering.

Karel Ježek

Karel is the group coordinator and a supervisor of PhD students working on the research projects of this group.

Related Projects:


Document Clustering and Linked Data

Authors: Karel Ježek, Martin Dostal
Desc.: Unsupervised methods for automatic tagging and clustering based on information extraction from Linked Data.