Toward efficient algorithms for sublinear search in large genome databases

Léo Ackermann

IRISA, Rennes

December 11, 2025, 12:20 in S6

Abstract

The explosion of genomic data presents a significant computational challenge for all downstream analyses. Particularly challenging are microbial collections, which already encompass several millions of genomes (e.g., AllTheBacteria: 2.4M bacterial isolates; GISAID: 15M SARS-CoV-2 sequences) and are further growing exponentially. Many standard bioinformatics analyses lean on pairwise distances between genomes, including phylogeny inference and genome clustering. As a  result, the scalability of multiple downstream analyses relies on the efficient computation and storage of pairwise distance matrices.

However, while the efficient computation of distances have been addressed by modern sketching-based methods such as Mash and successors, the storage and indexing of the resulting matrices remain a significant challenge. In fact, due to their quadratic size in the number of genomes, these matrices already surpass most storage capacities (e.g., 24 TB required for AllTheBacteria) and are thus heavily truncated when stored. This calls for a dedicated data structure that would be space-efficient and support near-constant-time distance retrieval queries. However, despite all the recent advances in compression of restricted families of matrices, such as sparse or low-rank ones, to the best of our knowledge, no scalable method is currently available for large genome-distance matrices.

In this presentation, we will discuss our ongoing work on subquadratic compression of distance matrices of large bacterial genome collections. Building upon insights from prior work, our approach takes advantage of the peculiar structure of those collections, that can extensively be explained by their underlying phylogeny. As a first step, we focus on phylogenetic trees as a candidate backbone for a compact data structure for pairwise distances.