masters_thesis_fournier_2024_xpln_ktest.pdf

Introduction

The complexity of biological systems is intricately tied to the regulation of gene expression, which governs the cellular activities that define life. At the heart of every living organism, the cell operates as the fundamental unit, housing genetic material organized into chromosomes within the nucleus, alongside other essential organelles distributed in the cytoplasm. Cellular activity is primarily directed by the expression of genes encoded in nuclear DNA, where the transcription of these genes produces messenger RNA (mRNA). This mRNA is subsequently translated into proteins, which carry out various cellular functions. This gene expression process is not static; it is regulated and subject to random variations, reflecting the inherent stochastic nature of cellular behavior.

Understanding these variations is crucial for elucidating the molecular basis of cellular behavior and the underlying causes of various diseases. Transcriptomics, the study of the complete set of RNA transcripts produced by the genome under specific conditions or within particular cell types, has emerged as a pivotal area in molecular biology. By analyzing the transcriptome (the entire collection of mRNA molecules in a cell or population of cells), researchers gain insight into gene expression patterns, regulatory mechanisms, and the functional elements of the genome. This analysis also aids in identifying biomarkers and therapeutic targets for various diseases. Because the transcriptome is dynamic, reflecting how gene expression varies across conditions, time points, and cellular environments, its study is an invaluable tool for understanding gene regulation and cellular function on a genome-wide scale. The advent of high-throughput sequencing technologies has significantly advanced the field, enabling researchers to explore gene expression with unprecedented depth and precision.

As research in this field progresses, the need for more sophisticated analytical tools becomes evident, particularly when dealing with the high-dimensional and often sparse data produced by single-cell experiments. Traditional methods, which predominantly focus on univariate comparisons, are limited in their ability to capture the complex, multivariate relationships inherent in these datasets. Such limitations can result in a loss of critical information, hindering the identification of subtle yet biologically significant patterns of gene expression.

To address these challenges, this report explores the use of kernel-based distribution comparison tests for transcriptomic data. Kernel methods, recognized for their flexibility and robustness in managing complex data structures, provide a powerful alternative to traditional linear techniques, particularly where non-linear relationships are prevalent. Kernel testing techniques make it possible to detect complex patterns that conventional approaches might overlook.

However, as the complexity of these analytical methods increases, so does the necessity for enhanced interpretability. The importance of explainability in machine learning, particularly in biological research, cannot be overstated. Researchers must not only understand whether differences in gene expression exist between conditions but also discern which specific genes or features contribute to these differences. In this context, the report introduces a novel framework that enhances the interpretability of kernel-based methods through sensitivity analysis. This framework employs advanced tools such as Sobol’ indices and Derivative-based Global Sensitivity Measures (DGSMs) to quantitatively assess the influence of individual genes on the observed differences between cellular populations.

The structure of this report reflects the progression from theoretical foundations to practical implementations. Chapter 1 introduces the foundational concepts, exploring the significance of differential analysis in biological experiments and the application of kernel-based statistical tests for analyzing complex gene expression patterns.

Chapter 2 then turns to the critical role of explainable machine learning in biological research, emphasizing the necessity for interpretability in analytical models. This chapter introduces advanced methodologies from sensitivity analysis and proposes new interpretability tools for distribution comparison.

Finally, Chapter 3 focuses on the practical implementation of these theoretical concepts. It provides detailed applications of the interpretability tools to kernel methods, demonstrating how they can be effectively utilized in real-world biological data analysis.

Kernel Distribution Comparison with Transcriptomic Data

Differential Expression Analysis and Single-Cell Experiments

Introduction to Differential Expression Analysis

Differential Expression Analysis (DEA) is a cornerstone technique in transcriptomics, enabling researchers to identify genes that are expressed at significantly different levels under varying conditions. Originally developed for bulk RNA sequencing (RNA-Seq) data, DEA compares the average gene expression levels between two or more conditions, such as diseased versus healthy tissues, to detect differentially expressed genes (DEGs) [@dea_robinson]. The identification of DEGs provides crucial insights into the molecular mechanisms underlying specific phenotypes, guiding further experimental validations, such as gene knockouts or overexpression studies, to confirm the functional roles of these genes.

In bulk RNA-Seq, DEA operates under the assumption that the transcriptomic data represents an average gene expression across a population of cells. This assumption, however, overlooks the inherent heterogeneity within cellular populations, where individual cells might exhibit distinct expression profiles that are masked by population averaging. Consequently, bulk RNA-Seq and associated DEA methods fall short in capturing the granularity of gene expression at the single-cell level. This granularity is particularly important in studies of complex tissues, where diverse cell types, states, and even rare sub-populations can coexist, contributing to the overall tissue function and response to stimuli.
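The comparison of average expression levels, and what it can miss, can be illustrated with a minimal sketch. It relies on synthetic Gaussian data and uses a per-gene Welch t-statistic as a simple stand-in for the dedicated count models that actual DEA tools rely on; the function `welch_t` and all variable names are hypothetical and chosen for illustration only.

```python
import numpy as np

def welch_t(x, y):
    """Per-gene Welch t-statistic between two (cells x genes) matrices."""
    mean_x, mean_y = x.mean(axis=0), y.mean(axis=0)
    var_x, var_y = x.var(axis=0, ddof=1), y.var(axis=0, ddof=1)
    return (mean_x - mean_y) / np.sqrt(var_x / len(x) + var_y / len(y))

rng = np.random.default_rng(0)
n = 500  # cells per condition (synthetic)

# Condition A: both genes are unimodal.
a = np.column_stack([rng.normal(1.0, 1.0, n), rng.normal(2.0, 1.0, n)])

# Condition B: gene 0 has a shifted mean; gene 1 keeps the same mean (2.0)
# but is split into two sub-populations centered at 0 and 4.
in_first_mode = rng.random(n) < 0.5
b_gene1 = np.where(in_first_mode, rng.normal(0.0, 1.0, n), rng.normal(4.0, 1.0, n))
b = np.column_stack([rng.normal(2.0, 1.0, n), b_gene1])

t = welch_t(a, b)
print("t-statistics per gene:", t)
print("variance of gene 1 in A vs B:", a[:, 1].var(ddof=1), b[:, 1].var(ddof=1))
```

In this toy setting, the mean shift of gene 0 yields a large statistic, whereas gene 1 has the same average in both conditions even though its distribution differs markedly. A comparison of averages alone is therefore expected to be far less sensitive to the second kind of difference.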

Advancements in Single-Cell Sequencing Technologies

Single-cell RNA sequencing (scRNA-Seq) is one of the most widely used technologies for measuring gene expression at the single-cell level. By isolating and sequencing mRNAs from individual cells, scRNA-Seq captures a representative subset of the transcripts present in the cytoplasm [@single_cell_gawad].

The resulting data for a given population $\ell$ are typically organized in a high-dimensional matrix $\Ybf_\ell = (Y_{\ell, i})_{1\leq i \leq n_\ell} \in \R^{n_\ell\times p}$, where rows, indexed by $1 \leq i \leq n_\ell$, represent cells (observations) and columns, indexed by $1 \leq j \leq p$, represent transcripts (features). The component $Y_{\ell, i}^j \in \R$ thus corresponds to the expression level of gene $j$ in cell $i$ from population $\ell$.
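As a concrete illustration of this layout, the following sketch stores one cells-by-genes array per population. The population labels, sizes, and Poisson-distributed counts are assumptions made for the example and do not come from any dataset used in this report.

```python
import numpy as np

rng = np.random.default_rng(0)

p = 5                                    # number of genes (features), shared by all populations
n_cells = {"control": 4, "treated": 3}   # n_ell for each population ell (hypothetical labels)

# One (n_ell x p) matrix per population; Poisson counts are a toy stand-in for real data.
Y = {ell: rng.poisson(lam=1.0, size=(n, p)) for ell, n in n_cells.items()}

i, j = 2, 1                              # 0-based cell and gene indices
cell_profile = Y["control"][i]           # expression profile of cell i (length p)
value = Y["control"][i, j]               # expression level of gene j in cell i

for ell, matrix in Y.items():
    print(ell, matrix.shape)
```

Note that the code uses 0-based indices, whereas the text indexes cells and genes from 1.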

The large scale and high dimensionality of scRNA-Seq datasets, whether in the number of cells $n_\ell$ or the number of genes $p$, present significant challenges, both in the computational resources required for analysis and in the statistical difficulties inherent in high-dimensional data, commonly referred to as the curse of dimensionality [@high_dim_stats_giraud].

These datasets are sparse because not all genes are expressed in every cell, and expressed genes are not continuously active. The technical steps involved in generating scRNA-Seq data (such as droplet-based single-cell isolation, mRNA library preparation, and sequencing) introduce biases that can be difficult to distinguish from the biological variability inherent to individual cells [@single_cell_contamination_young]. As a result, efficient algorithms and analytical methods are needed to process and interpret these complex datasets.
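Sparsity of this kind is commonly summarized by the share of zero entries in the expression matrix. The sketch below computes a few such summaries on a synthetic count matrix; the gamma-distributed gene rates and the matrix dimensions are arbitrary assumptions intended only to produce many zeros, not a model of any real experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes = 200, 50

# Toy counts: low per-gene Poisson rates give many zeros (assumed, not fitted to data).
rates = rng.gamma(shape=0.5, scale=1.0, size=n_genes)
Y = rng.poisson(lam=rates, size=(n_cells, n_genes))

zero_fraction = np.mean(Y == 0)            # overall share of zero entries
detection_rate = np.mean(Y > 0, axis=0)    # per gene: share of cells with a nonzero count
genes_per_cell = np.sum(Y > 0, axis=1)     # per cell: number of genes with a nonzero count

print("zero fraction:", zero_fraction)
print("median genes detected per cell:", np.median(genes_per_cell))
```

Such summaries do not by themselves separate technical zeros from biological ones; they only describe how sparse the observed matrix is.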