Chapter 5 General conclusion and outlook

Chapters 2 and 3 of this thesis focused on broadening the scope of the functional analysis tools PROGENy and DoRothEA through thorough benchmarking studies.

In the past, both tools had been shown to provide valuable mechanistic insight by inferring pathway and transcription factor activities from human bulk transcriptome data (Garcia-Alonso et al., 2019; Schubert et al., 2018). Because my project within the “Liver Systems Medicine” (LiSyM) network generated many gene expression data sets from mouse models, I aimed to analyze and characterize data from this model organism with PROGENy and DoRothEA as well. However, it was not clear whether both tools could provide biologically meaningful insight from mouse transcriptome data. To address this question, I developed a systematic benchmarking pipeline and showed that the regulatory knowledge of PROGENy and DoRothEA can be transferred from human to mouse, so that mouse data can be functionally characterized as well.
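The idea of transferring footprint knowledge across species can be illustrated with a minimal sketch. It assumes that human gene weights are mapped to mouse genes through a one-to-one ortholog table and that an activity score is a plain weighted sum of expression values; this stands in for the statistics the tools actually use. All gene weights, the ortholog table, the expression values, and the function names `transfer_to_mouse` and `activity` are hypothetical and chosen only for illustration.

```python
# Toy human footprint: pathway -> {human gene symbol: weight}.
# All numbers are made up for illustration.
human_weights = {
    "TNFa": {"NFKBIA": 2.0, "TNFAIP3": 1.5, "CXCL8": 1.0},
}

# Hypothetical one-to-one ortholog table (human symbol -> mouse symbol).
# CXCL8 is deliberately left out to show how unmapped genes are handled.
human_to_mouse = {"NFKBIA": "Nfkbia", "TNFAIP3": "Tnfaip3"}


def transfer_to_mouse(weights, orthologs):
    """Rename human genes to their mouse orthologs; drop genes without one."""
    mouse = {}
    for pathway, genes in weights.items():
        mouse[pathway] = {
            orthologs[gene]: weight
            for gene, weight in genes.items()
            if gene in orthologs
        }
    return mouse


def activity(weights, expression):
    """Score each pathway as the weighted sum of the expression values.

    Genes missing from the expression profile contribute zero.
    """
    return {
        pathway: sum(w * expression.get(gene, 0.0) for gene, w in genes.items())
        for pathway, genes in weights.items()
    }


mouse_weights = transfer_to_mouse(human_weights, human_to_mouse)

# Toy mouse expression profile (e.g. log fold changes), also made up.
mouse_expression = {"Nfkbia": 1.0, "Tnfaip3": 2.0, "Alb": -0.5}
scores = activity(mouse_weights, mouse_expression)
print(scores)
```

In this sketch, genes without a mapped ortholog simply drop out of the footprint, which is one reason why a benchmark is needed to check that the transferred gene sets remain informative.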

With the emergence of scRNA-seq data, there was a growing need for functional analysis tools to analyze this novel data type. In the early days of this technology, many tools developed for bulk transcriptome analysis were readily applied to scRNA-seq data without proper justification. My benchmarking study on the robustness and applicability of transcription factor and pathway analysis tools on scRNA-seq data was one of the first attempts to systematically evaluate the performance of bulk and scRNA-seq based tools. In summary, I was able to show that PROGENy and DoRothEA i) are robust against low gene coverage, i.e. drop-outs, ii) detect experimentally perturbed TFs/pathways with moderate accuracy, iii) preserve cell-type-specific information while simultaneously reducing noise, and iv) provide biologically meaningful activity scores.

Both benchmark studies were highly dependent on collecting and curating appropriate pathway and TF perturbation experiments as ground truth. Hence, I mined the largest publicly available repositories of gene expression data, such as Gene Expression Omnibus and ArrayExpress, to identify suitable experiments. For the cross-species benchmark, this endeavor was significantly facilitated by the then recently published CREEDS database, which contains the metadata of thousands of manually curated microarray experiments covering drug and gene perturbations in humans and mice (Zichen Wang et al., 2016). Although the scRNA-seq benchmark study was far more complex in terms of included data than the cross-species benchmark, it likewise required a large collection of pathway and TF perturbation experiments. I was able to expand my previous collection, mostly with further TF perturbation experiments that had been collected and curated by Keenan et al. (2019) for the benchmark of the TF analysis tool ChEA3. In summary, the feasibility and ultimate success of both benchmark studies were primarily made possible by the scientific community, whose members made their data sets and databases freely and publicly available.

Chapter 4 of this thesis demonstrated how functional analysis tools can provide meaningful insight from transcriptome data. In particular, I studied the similarities and differences in gene expression changes of acute and chronic liver disease in humans and mice. Through a systematic analysis, I was able to identify gene sets containing i) genes similarly altered between mouse models with chronic damage and liver disease patients or ii) genes exclusively and commonly regulated in chronic and acute liver damage in mice. Each gene set was systematically characterized by applying the tools PROGENy and DoRothEA, which was made possible for the mouse-based gene sets by my previous cross-species benchmark. By integrating scRNA-seq data, I matched commonly deregulated genes in humans and mice to liver-specific cell types. In the future, I envision that research on the liver and its diseases will benefit greatly from scRNA-seq data, which make it possible to study the interplay of the individual liver and immune cell types on an unprecedented scale. The first corresponding large-scale data sets have recently been published (Cao et al., 2020; Dobie et al., 2019; Kim, Wu, Allende, & Nagy, 2021; Krenkel, Hundertmark, Ritz, Weiskirchen, & Tacke, 2019; Ramachandran et al., 2019; Segal et al., 2019).

As a by-product, the results of the scRNA-seq benchmark study suggested that the performance of TF and pathway analysis tools is more sensitive to the quality of the prior knowledge, in the form of gene sets, than to the statistic selected to analyze them. This hypothesis laid the foundation for a crowdsourced follow-up project named decoupleR, which systematically explores the impact of gene sets and statistics on the performance of functional analysis tools. Initial analyses confirm the hypothesis that well-curated gene sets are the most critical component for this type of analysis. Accordingly, to make a significant step forward in the development of pathway and TF activity analysis tools, it is crucial to improve the quality of the prior knowledge. The ever-increasing amount of generated transcriptome data promises to be a valuable data mine for this purpose. Regarding PROGENy, new pathway footprint signatures could be created, or existing ones improved, by exploiting the vast number of perturbation experiments from the Connectivity Map, which systematically generated more than 1,500,000 perturbation signatures (Lamb et al., 2006; Subramanian et al., 2017). In addition, my cross-species benchmark suggests that mouse data could be integrated, but, to avoid additional confounding factors, I recommend relying on human data if possible. DoRothEA’s regulons could be improved by integrating further data modalities, such as chromatin accessibility information generated via ATAC-seq. A recently published cell atlas of chromatin accessibility across 25 human tissues could be a precious data resource to tackle this challenge (Zhang et al., 2021).

Besides the general improvement of the consensus gene sets, there is a pressing need to derive and construct cell-type-specific gene sets as well. This is particularly important for gene regulatory networks, as different cell types can have fundamentally different gene regulatory programs. Currently, most attempts rely on reverse engineering such networks from gene expression data of specific cell types or tissues. However, these approaches are mainly based on co-expression or mutual information, so they yield many indirect and thus false-positive TF-target interactions (Barbosa, Niebel, Wolf, Mauch, & Takors, 2018). Garcia-Alonso et al. (2019) showed that a consensus gene regulatory network constructed from various tissues and cell types still outperforms purely data-driven cell-type/tissue-specific networks. However, as soon as the generation of cell-type-specific networks improves, cell-type-specific information will be the preferred resource.

In recent years, the first platforms for spatially resolved transcriptome profiling became available. This technology is referred to as spatial transcriptomics and promises to enable the study of the organization of cells in tissue in unprecedented detail. The hopes and expectations attached to spatial transcriptomics are reflected in its selection as Method of the Year 2020 by Nature Methods (“Method of the year 2020,” n.d.). In general, spatial transcriptomics lies between scRNA-seq and bulk transcriptomics in terms of the number of covered genes and the number of cells per sample. Considering that I have shown that PROGENy and DoRothEA can be applied to scRNA-seq data and, as originally intended, to bulk transcriptomics, it is reasonable to assume that they should also deliver biologically meaningful results for spatial transcriptome data, though a thorough benchmark study is still pending. Nevertheless, both tools have recently been applied successfully to one of the first spatial transcriptome data sets of human myocardial infarction, providing mechanistic insight into the differentiation of cardiac myofibroblasts (Kuppe et al., 2020).

Even though pathway and TF activities alone are meaningful readouts of the state of a cell or system, they need not be the endpoint of an analysis pipeline. Instead, they can be interpreted as features for further, more sophisticated downstream analyses. For example, A. Liu et al. (2019) utilize pathway and TF activities to identify and contextualize a causal signaling network from gene expression data using the tool CARNIVAL. Moreover, Tanevski, Ramirez Flores, Gabor, Schapiro, & Saez-Rodriguez (2020) exploit these activities either as predictor or as response variables for a machine learning model named MISTy that aims to explain inter-cellular signaling from spatial transcriptome data.

In summary, I am convinced that the feature space of pathway and TF activities can contribute significantly to deciphering the key mechanisms of diseases. For example, as the company DarwinHealth demonstrates, identifying master regulators can help to select the right drug at the right time for the right patient in personalized healthcare (Alvarez et al., 2018). Still, I am looking forward to seeing the impact of the next generation of these tools, which will rely on substantially improved and extended prior knowledge.