Text Indexing for Simple Regular Expressions

  • Hideo Bannai
  • Philip Bille
  • Inge Li Gørtz
  • Gad M. Landau
  • Gonzalo Navarro
  • Nicola Prezza
  • Teresa Anna Steiner
  • Simon Rumle Tarnow

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

We study the problem of indexing a text T[1..n] ∈ Σ^n so that, later, given a query regular expression pattern R of size m = |R|, we can report all the occ substrings T[i..j] of T matching R. The problem is known to be hard for arbitrary patterns R, so in this paper we consider the following two types of patterns: (1) character-class Kleene-star patterns of the form P1 D* P2, where P1 and P2 are strings and D = {c1, ..., ck} ⊂ Σ is a character class (shorthand for the regular expression (c1|c2|···|ck)), and (2) string Kleene-star patterns of the form P1 P* P2, where P, P1 and P2 are strings. In case (1), we describe an index of O(n log^{1+ϵ} n) space (for any constant ϵ > 0) solving queries in time O(m + log n / log log n + occ) on constant-sized alphabets. We also describe a general solution for any alphabet size. This result is conditioned on the existence of an anchor: a character of P1 P2 that does not belong to D. We justify this assumption by proving that no efficient indexing solution can exist if an anchor is not present unless the Set Disjointness Conjecture fails. In case (2), we describe an index of size O(n) answering queries in time O(m + (occ + 1) log^ϵ n) on any alphabet size.
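To make the two query types concrete, the following is a minimal brute-force sketch of what a query is expected to report. It is not the index described in the abstract and makes no attempt to reach the stated bounds; it only scans the text directly. The function names are hypothetical, positions are 0-indexed (the abstract uses 1-indexed T[1..n]), and P1 and P2 are assumed to be non-empty.

```python
def match_char_class_star(text, p1, d, p2):
    """Return all (i, j), 0-indexed inclusive, with text[i..j] matching p1 D* p2.

    Brute-force illustration only; `d` is a set of single characters.
    """
    occ = []
    n = len(text)
    start = text.find(p1)
    while start != -1:
        k = start + len(p1)  # first position after this occurrence of p1
        while True:
            # Zero or more characters of D have been consumed; try to close with p2.
            if text.startswith(p2, k):
                occ.append((start, k + len(p2) - 1))
            # Extend the D* run by one character if possible.
            if k < n and text[k] in d:
                k += 1
            else:
                break
        start = text.find(p1, start + 1)
    return occ


def match_string_star(text, p1, p, p2):
    """Return all (i, j), 0-indexed inclusive, with text[i..j] matching p1 p* p2."""
    occ = []
    start = text.find(p1)
    while start != -1:
        k = start + len(p1)
        while True:
            # Zero or more copies of p have been consumed; try to close with p2.
            if text.startswith(p2, k):
                occ.append((start, k + len(p2) - 1))
            # Extend the run by one more copy of p if possible.
            if p and text.startswith(p, k):
                k += len(p)
            else:
                break
        start = text.find(p1, start + 1)
    return occ


if __name__ == "__main__":
    # 'a' and 'b' are outside D = {'x'}, so P1 P2 contains an anchor.
    print(match_char_class_star("abxxcabc", "ab", {"x"}, "c"))
    print(match_string_star("abcdcdef", "ab", "cd", "e"))
```

Note that a single starting position can yield several occurrences when characters of P2 also belong to D, which is why occ counts substrings rather than starting positions.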

Original language: English
Title of host publication: 36th Annual Symposium on Combinatorial Pattern Matching, CPM 2025
Editors: Paola Bonizzoni, Veli Mäkinen
Publisher: Schloss Dagstuhl - Leibniz-Zentrum für Informatik GmbH, Dagstuhl Publishing
ISBN (Electronic): 9783959773690
State: Published - 10 Jun 2025
Event: 36th Annual Symposium on Combinatorial Pattern Matching, CPM 2025 - Milan, Italy
Duration: 17 Jun 2025 to 19 Jun 2025

Publication series

Name: Leibniz International Proceedings in Informatics, LIPIcs
Volume: 331
ISSN (Print): 1868-8969

Conference

Conference: 36th Annual Symposium on Combinatorial Pattern Matching, CPM 2025
Country/Territory: Italy
City: Milan
Period: 17/06/25 to 19/06/25

Bibliographical note

Publisher Copyright:
© Hideo Bannai, Philip Bille, Inge Li Gørtz, Gad M. Landau, Gonzalo Navarro, Nicola Prezza, Teresa Anna Steiner, and Simon Rumle Tarnow.

Keywords

  • Text indexing
  • data structures
  • regular expressions

ASJC Scopus subject areas

  • Software
