Identifying Novel Information using Latent Semantic Analysis in the WiQA Task at CLEF 2006

From the perspective of WiQA, Wikipedia can be considered as a set of articles, each having a unique title. In the WiQA corpus, articles are divided into sentences (snippets), each with its own identifier. Given a title, the task is to find snippets which are Important and Novel relative to the article. We indexed the corpus by sentence using Terrier. In our two-stage system, snippets were first retrieved if they contained an exact match with the title. Candidates were then passed to the Latent Semantic Analysis component, which judged them Novel if they did not match the text of the article. The test data was varied: some articles were long, some were short and some were empty. We prepared a training collection of twenty topics and used this for tuning the system. During evaluation on 65 topics divided into the categories Person, Location, Organization and None, we submitted two runs. In the first, the ten best snippets were returned and in the second the twenty best. Run 1 performed best, with an Average Yield per Topic of 2.46 and a Precision of 0.37. We also studied performance on six different topic types: Person, Location, Organization and None (all specified in the corpus), Empty (no text) and Long (a lot of text). In Run 1, Precision was good for Person and Organization (0.46 and 0.44) and worst for Long (0.24). Compared to other groups, our performance was in the middle of the range, except for Precision, where our system was equal to the best. We attribute this to our use of exact title matches in the IR stage. We found that judging snippets Novel when preparing training data was fairly easy, but that judging them Important was subjective. In future work we will vary the approach used depending on the topic type, exploit co-references in conjunction with exact matches, and make use of the elaborate hyperlink structure, which is a unique and most interesting aspect of Wikipedia.
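The two-stage idea described in the abstract (exact title matching followed by an LSA-based novelty check) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `title_matches` and `novel_snippets`, the raw term-count weighting, the number of latent dimensions `k` and the cosine `threshold` are all assumptions made for illustration, and the Terrier retrieval step is replaced here by a simple in-memory phrase match.

```python
import re
import numpy as np


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def title_matches(title, snippets):
    """Stage 1 (assumed form): keep snippets that contain the title
    as an exact, case-insensitive phrase."""
    pattern = re.compile(r"\b" + re.escape(title.lower()) + r"\b")
    return [s for s in snippets if pattern.search(s.lower())]


def novel_snippets(article_sentences, candidates, k=2, threshold=0.8):
    """Stage 2 (assumed form): a candidate counts as novel if its cosine
    similarity in the latent space to every article sentence is below
    `threshold`. Both `k` and `threshold` are hypothetical values."""
    if not article_sentences:
        # An empty article has nothing to match, so every candidate is novel.
        return list(candidates)

    docs = list(article_sentences) + list(candidates)
    vocab = sorted({t for d in docs for t in tokenize(d)})
    index = {t: i for i, t in enumerate(vocab)}

    # Term-by-sentence matrix of raw counts (rows: terms, columns: sentences).
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for t in tokenize(d):
            A[index[t], j] += 1.0

    # Truncated SVD: keep the k largest singular values.
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(S))
    vecs = (np.diag(S[:k]) @ Vt[:k]).T  # one row per sentence

    # Normalise rows so that dot products are cosine similarities.
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    vecs = vecs / np.where(norms == 0, 1.0, norms)

    n = len(article_sentences)
    sims = vecs[n:] @ vecs[:n].T  # candidates x article sentences
    return [c for c, row in zip(candidates, sims) if row.max() < threshold]
```

In use, the snippets returned by `title_matches` for a given title would be passed as `candidates` to `novel_snippets`, together with the sentences of the article on that title. With very small inputs the latent space is unreliable, so in practice `k` and `threshold` would need tuning on training topics, as the abstract describes for the real system.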

Keywords: Question answering, latent semantic analysis, information filtering

Year: 2007

Journal ISSN: 0302-9743

Authors of this publication:

Richard F. E. Sutcliffe

Richard is an academic visitor at the Department of Computer Science, University of Essex.

Josef Steinberger


Josef is an associate professor at the Department of Computer Science and Engineering at the University of West Bohemia in Pilsen, Czech Republic. He is interested in media monitoring and analysis, mainly automatic text summarisation, sentiment analysis and coreference resolution.

Udo Kruschwitz


Udo is a lecturer at the Department of Computer Science, University of Essex.

Related Projects:


Automatic Text Summarisation

Authors:  Josef Steinberger, Karel Ježek, Michal Campr, Jiří Hynek
Desc.: Automatic text summarisation using various text mining methods, mainly Latent Semantic Analysis (LSA).