This dissertation investigates the computational foundations of sequence analysis in the context of pangenomics, a rapidly evolving field in computational biology. Classical string algorithms, originally developed for linear DNA sequences, encounter new challenges when extended to nonlinear, graph-like pangenome representations. To address this, we study variable stringsāgeneralized models for representing sets of similar sequences compactly, including elastic-degenerate strings (ED strings), founder graphs, and weighted sequences. The thesis makes three core contributions. First, it explores exact and approximate pattern matching algorithms for variable strings, establishing tight upper and lower bounds. Second, it introduces novel methods for comparing pangenomic data structures, including algorithms for intersection detection, matching statistics, and distance-based comparisons. Third, it proposes space-efficient indexing strategies for weighted sequences, enabling probabilistic pattern queries under uncertainty.
More information on the thesis