Text-Mining Research Group

A comparison of two algorithms for discovering repeated word sequences

We will make the readers of this paper familiar with two basic approaches to repeated sequences extraction – a suffix tree based method and an inverted list based method. The first algorithm (ST) makes use of a tree data structure known from suffix tree clustering (STC) where each node represents one word and the root represents the null word. Thus, each path from the root is a phrase occurring somewhere in the corpus. The second applies an inverted index (broadly used in information retrieval), more specifically its hash table implementation (HT). Occurrences of each word in this index are employed to construct lists of wordsfollowing the current word in different places of the corpus. These lists are then modified in a recursive fashion to satisfy the constraints of repeated segments. We will introduce both methods and compare them in terms of their time and space complexity, efficiency, effectiveness and results yielded with some sample input data. We will conclude that whereas the ST-algorithm is better as to time cost, it is outperformed by the HT-approach with respect to space. Finally, we will suggest several possible applications of repeated sequences.
The available full text is a preprint of the article.

Keywords: repeated sequences, repeated segments, frequent phrases, suffix tree, algorithms, comparison

Year: 2005

Download:

Full text [245 kB]

View record in Web of Science®

Authors of this publication:

Roman Tesař

Phone: +420 377632479
E-mail: roman.tesar@gmail.com
WWW: http://www.sweb.cz/romant1/CV.pdf

Roman is a PhD student at the Department of Computer Science and Engineering, Faculty of Applied Sciences, University of West Bohemia in Pilsen, Czech Republic. His work is focused on the utilization of word n-grams in text classification and document filtering.

Dalibor Fiala

Phone: +420 377 63 2429
E-mail: dalfia@kiv.zcu.cz
WWW: http://www.kiv.zcu.cz/~dalfia/

Dalibor is the research group coordinator and an associate professor at the Department of Computer Science and Engineering at the University of West Bohemia in Pilsen, Czech Republic. He is interested in data mining, web mining, information retrieval, informetrics, and information science.

François Rousselot

E-mail: francois.rousselot@insa-strasbourg.fr

François is interested in knowledge acquisition and computational linguistics.

Karel Ježek

Phone: +420 377632475
E-mail: jezek_ka@kiv.zcu.cz
WWW: https://cs.wikipedia.org/wiki/Karel_Je%C5%BEek_(informatik)

Karel is the former group coordinator and a supervisor of PhD students working at research projects of this Group.

Related Projects:

Internet Content Filtering
Authors:	Roman Tesař, Karel Ježek
Desc.:	This project includes Web sites processing, analyzing, classification by means of their content and searching for other Web sites with similar content.