
A comparison of two algorithms for discovering repeated word sequences
We will make the readers of this paper familiar with two basic approaches to repeated sequences extraction – a suffix tree based method and an inverted list based method. The first algorithm (ST) makes use of a tree data structure known from suffix tree clustering (STC) where each node represents one word and the root represents the null word. Thus, each path from the root is a phrase occurring somewhere in the corpus. The second applies an inverted index (broadly used in information retrieval), more specifically its hash table implementation (HT). Occurrences of each word in this index are employed to construct lists of wordsfollowing the current word in different places of the corpus. These lists are then modified in a recursive fashion to satisfy the constraints of repeated segments. We will introduce both methods and compare them in terms of their time and space complexity, efficiency, effectiveness and results yielded with some sample input data. We will conclude that whereas the ST-algorithm is better as to time cost, it is outperformed by the HT-approach with respect to space. Finally, we will suggest several possible applications of repeated sequences.
The available full text is a preprint of the article.
Keywords: repeated sequences, repeated segments, frequent phrases, suffix tree, algorithms, comparison
Year: 2005

Authors of this publication:

Roman Tesař
Phone: +420 377632479
E-mail: roman.tesar@gmail.com
WWW: http://www.sweb.cz/romant1/CV.pdf

Dalibor Fiala
Phone: +420 377 63 2429
E-mail: dalfia@kiv.zcu.cz
WWW: http://www.kiv.zcu.cz/~dalfia/

François Rousselot
E-mail: francois.rousselot@insa-strasbourg.fr

Karel Ježek
Phone: +420 377632475
E-mail: jezek_ka@kiv.zcu.cz
WWW: https://cs.wikipedia.org/wiki/Karel_Je%C5%BEek_(informatik)
Related Projects:

Internet Content Filtering | |
Authors: | Roman Tesař, Karel Ježek |
Desc.: | This project includes Web sites processing, analyzing, classification by means of their content and searching for other Web sites with similar content. |