Chapter 1 Introduction

1.1 The central dogma of molecular biology

All living organisms are based on a fundamental principle known as the central dogma of molecular biology (Crick, 1970). This dogma describes the flow of information from a gene to a protein on the molecular level. Genes are encoded in the deoxyribonucleic acid (DNA) and serve as blueprints for transcripts. During transcription, genes are copied from the DNA to a ribonucleic acid (RNA)-based transcript referred to as messenger RNA (mRNA). Transcription is followed by the process of translation, in which the mRNA is translated into a sequence of amino acids, the building blocks of proteins. The amino acid sequence itself is linear but folds into a complex three-dimensional structure.

Molecular biology can be divided into several branches or disciplines, each aiming to analyze a different layer of biological entities. These disciplines are based on technologies that quantitatively measure the involved biological molecules at each stage of the information flow from a gene to a protein. Genomics, for instance, aims to analyze the entire genome by deciphering the base sequence of the DNA via sequencing technologies. Transcriptomics is closely related to genomics but focuses on all RNA-based transcripts, the transcriptome. Proteomics detects proteins and quantifies their abundance and modifications via mass spectrometry (Altelaar, Munoz, & Heck, 2013). This list of omics technologies is by no means exhaustive, as there exist many other branches, such as metabolomics (Patti, Yanes, & Siuzdak, 2012), lipidomics (Wenk, 2005), or epigenomics (Stricker, Köferle, & Beck, 2017), each analyzing its respective molecule class or layer of interest. The work described in this thesis focuses on transcriptomics.

1.2 Transcriptomics

1.2.1 Overview

Transcriptomics is the most widely studied field among the omics disciplines, which is most likely related to its ever-decreasing costs and the good coverage of RNAs. The objective of transcriptomics is to quantify the entire transcriptome. Hence, the analysis is not limited to mRNA but also comprises other types of RNA such as ribosomal RNA or transfer RNA. The mRNA information alone is typically referred to as a gene expression profile. These profiles have proven to be a meaningful and interpretable data type, as they can be considered a blueprint of the state of the underlying cell or tissue. Over the years, many technologies have been developed to measure genome-wide expression profiles. From the oldest to the most recent methods, all of them owe their existence to the advances in genome sequencing in the 1990s and early 2000s, particularly the sequencing of the human genome in 2001 (Lander et al., 2001).

1.2.2 Microarrays

One of the oldest but still reasonably popular methods makes use of microarrays (Hoheisel, 2006). This technique is based on a chip with attached DNA fragments complementary to the DNA sequence of the genes of interest. Isolated RNA from the sample is reverse transcribed to complementary DNA (cDNA) and labeled with fluorescent molecules. Afterward, the cDNA library is transferred to the chip, where cDNA molecules hybridize to their complementary fragments attached to the chip; cDNAs that do not bind are washed off. This setup clearly shows the main caveat of microarrays: only the expression of genes for which complementary sequences are attached to the chip can be quantified. Finally, a laser excites the fluorescence of the paired DNA sequences, and their emission serves as a proxy for gene expression. Based on these principles, the first samples were analyzed in 2003 with arrays from Affymetrix. In 2015, the microarray technology reached its peak with over 15,000 samples analyzed and deposited on Gene Expression Omnibus (GEO) annually (Lachmann et al., 2018). Afterward, RNA-sequencing (RNA-seq) replaced microarrays as the most popular method for gene expression profiling.

1.2.3 RNA-sequencing

RNA-seq has a clear advantage over microarrays, as in theory nearly any RNA molecule in a sample can be quantified without prioritizing a priori which genes or transcripts are of interest (Zhong Wang, Gerstein, & Snyder, 2009). This implies that novel or non-coding transcripts as well as splice variants can be detected and quantified. Unlike microarrays, RNA-seq is not limited by background noise and signal saturation and thus has a much higher dynamic range for quantifying transcripts (Wilhelm & Landry, 2009; Zhao, Fung-Leung, Bittner, Ngo, & Liu, 2014). Similar to the microarray technology, a typical RNA-seq protocol starts with the generation of a cDNA library by RNA isolation and cDNA synthesis via reverse transcription. After amplification of the cDNA library via polymerase chain reaction (PCR), the cDNA molecules are fragmented into smaller so-called reads with a typical length of 50-100 base pairs. This step is crucial for the subsequent sequencing of the reads, as standard sequencing machines cannot handle larger fragments, although this has recently begun to change with the emergence of long-read technologies such as Oxford Nanopore Technologies (Amarasinghe et al., 2020). After retrieving the base sequence of each read, the reads are mapped back to a representative genome of the respective species, a so-called reference genome. Since the number of reads can easily exceed 10 million per human sample (depending on the sequencing depth), this step is computationally demanding. Hence, many computationally efficient alignment tools have been developed, such as STAR (Dobin et al., 2013) or Kallisto (N. L. Bray, Pimentel, Melsted, & Pachter, 2016). Finally, the number of mapped reads per transcript is counted, which serves as a proxy for gene expression. By 2019, more than 400,000 samples had been analyzed and deposited on GEO with different versions and protocols of the basic RNA-seq pipeline (Mahi, Najafabadi, Pilarczyk, Kouril, & Medvedovic, 2019). Despite the above-mentioned advantages of RNA-seq over microarrays, microarrays are still used and co-exist with RNA-seq.
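To illustrate the final counting step, a minimal sketch is given below. It assumes a hypothetical list of aligned reads, each already assigned to a gene by the aligner, and simply tallies reads per gene; dedicated tools additionally handle overlapping features, strandedness, and multi-mapping reads.

    from collections import Counter

    # Hypothetical input: one (read_id, gene_id) pair per uniquely aligned read,
    # e.g. obtained by intersecting alignments with a gene annotation.
    aligned_reads = [
        ("read_0001", "ALB"),
        ("read_0002", "ALB"),
        ("read_0003", "CYP2E1"),
        ("read_0004", "ALB"),
    ]

    # Count mapped reads per gene; the counts serve as a proxy for gene expression.
    counts = Counter(gene for _, gene in aligned_reads)
    print(counts)  # Counter({'ALB': 3, 'CYP2E1': 1})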

While both methods made several important breakthroughs in biomedical research possible in the first place, they suffer from the same limitation: the measured expression profile is the average of the expression profiles of many different cells or cell types. This approach is therefore referred to as bulk transcriptomics. While it is intuitive that highly distinct cell types such as parenchymal and immune cells have completely different transcriptional programs, it has also been shown that the gene expression of even similar cell types is heterogeneous (Huang, Sherman, & Lempicki, 2009; Li & Clevers, 2010; Shalek et al., 2014). However, over the past decade, RNA-seq has evolved in such a way that nowadays expression profiles can be captured at the single-cell level.

1.2.4 Single-cell RNA-sequencing

First attempts at single-cell RNA-sequencing (scRNA-seq) were made in 2009, when the transcriptome of a single mouse blastomere was profiled (Tang et al., 2009). This technology promises to capture expression profiles at an unprecedented level of detail and was named method of the year 2013 by Nature Methods (“Method of the year 2013.” 2014). As the term scRNA-seq indicates, RNA-sequencing is used across the majority of technologies and protocols to profile the transcriptome. The different protocols vary in how transcriptomic profiles are unambiguously mapped back to their cell of origin, which is mostly achieved by cellular barcodes, and in the construction of the cDNA library. Depending on the experimental design, either a plate-based or a droplet-based approach may be more suitable (Baran-Gale, Chandra, & Kirschner, 2018). Different protocols inherently differ in their efficiency of capturing transcripts, which leads to varying library complexity and sensitivity in identifying target genes. Recently, the Human Cell Atlas consortium benchmarked 13 different protocols to identify the one with the greatest power to describe and distinguish cell types and states (Mereu et al., 2020). Over the years, the number of cells per study has increased exponentially due to the rapid development of the underlying technologies and protocols (Svensson, Vento-Tormo, & Teichmann, 2018). In 2017 it became possible to capture around 100,000 cells in a single run using in situ barcoding (Cao et al., 2017; Rosenberg et al., 2018). Nowadays, several million cells can be profiled, as demonstrated in a recent study of human organ development in which 4,000,000 single cells were sequenced (Cao et al., 2020).
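The role of cellular barcodes can be conveyed with a minimal sketch: assuming, hypothetically, that each sequenced read starts with a fixed-length cell barcode, reads are grouped by that barcode so that downstream counts can be assigned to their cell of origin. Real protocols additionally use unique molecular identifiers and barcode error correction, so this is only an illustration of the principle.

    from collections import defaultdict

    BARCODE_LENGTH = 8  # assumed here; the actual length is protocol-dependent

    # Hypothetical reads: the first BARCODE_LENGTH bases encode the cell of origin.
    reads = [
        "AAACCTGAGGTCTT...",
        "AAACCTGATTGACC...",
        "TTTGGTCAGGTCTT...",
    ]

    # Group reads by their cell barcode.
    reads_per_cell = defaultdict(list)
    for read in reads:
        barcode, cdna = read[:BARCODE_LENGTH], read[BARCODE_LENGTH:]
        reads_per_cell[barcode].append(cdna)

    print({bc: len(r) for bc, r in reads_per_cell.items()})
    # {'AAACCTGA': 2, 'TTTGGTCA': 1}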

Just like for bulk RNA-seq, the transcripts must be reverse transcribed to cDNA. However, the number of transcripts available in a single cell is very low compared to a bulk sample. Hence, some transcripts may be missed during reverse transcription (Kharchenko, Silberstein, & Scadden, 2014). This can have several reasons and is still not fully understood. One essential factor is the gene expression level: if a gene is lowly expressed, only a small number of its transcripts will be present in the cell, which increases the chance that they are missed during reverse transcription (Kharchenko et al., 2014; Qiu, 2020). However, the guanine-cytosine content of the transcript or the reverse transcriptase enzyme itself may also influence whether certain transcripts are reverse transcribed. Accordingly, missed genes end up in the count matrix with zero counts even though they were originally expressed in the cell; such events are referred to as drop-outs. Up to 90% of the final gene expression matrix can be zeros, and it is not possible to distinguish whether a gene with a count of 0 is a drop-out or was truly not expressed. Hence, scRNA-seq allows profiling the transcriptome of an enormous number of cells, but with limited gene coverage.
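The practical consequence is a very sparse count matrix. The sketch below, using a small hypothetical matrix of genes by cells, computes the fraction of zero entries; in real datasets this fraction can reach the roughly 90% mentioned above, and a zero alone does not reveal whether the gene was a drop-out or truly not expressed.

    import numpy as np

    # Hypothetical count matrix: rows are genes, columns are cells.
    counts = np.array([
        [0, 3, 0, 0],
        [5, 0, 0, 1],
        [0, 0, 0, 0],
    ])

    zero_fraction = np.mean(counts == 0)
    print(f"Fraction of zeros: {zero_fraction:.2f}")  # 0.75 in this toy example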

1.2.5 Selected flagship projects

Due to the affordable and continuously decreasing costs of transcriptomic studies, many large flagship projects have been established in the past two decades. The common core of these international and interdisciplinary efforts is to provide the scientific community with a comprehensive database of transcriptomic profiles of various human tissues or phenotypes measured at different resolutions. The following paragraphs briefly summarize selected flagship projects.

1.2.5.1 GTEx

GTEx stands for the Genotype-Tissue Expression Project and was launched in September 2010 by the National Institutes of Health (NIH) (Consortium, 2013). The main objective of GTEx is to provide tissue-specific gene expression profiles obtained from individual donors. In total GTEx provides these profiles for more than 30 distinct tissue types. Scientists worldwide query this database to improve the understanding of human diseases. A more concrete example of how this data is commonly used is the inference of tissue-specific gene regulatory networks via gene expression-based network reconstruction algorithms.

1.2.5.2 TCGA

TCGA stands for The Cancer Genome Atlas Program and was launched in 2006 by the National Cancer Institute and the National Human Genome Research Institute (Network et al., 2013). Similar to GTEx, TCGA focuses on individual tissue types; however, the objective is to study the transcriptomic profiles of their respective primary cancers (e.g. hepatocellular carcinoma or lung adenocarcinoma). Furthermore, TCGA also generates genomic, epigenomic, and proteomic data of primary cancers. This enormous amount of data (2.5 petabytes) is interrogated to study the development and treatment of cancer, either within individual omics layers or in a multi-omics integration fashion.

1.2.5.3 CMAP

CMAP stands for Connectivity Map and was initially released in 2006 by the Broad Institute (Lamb et al., 2006). The objective of this project is to generate bulk gene expression signatures upon chemical or genetic perturbation across various human cell lines. Many of these perturbation experiments were also performed with different doses and perturbation times. In 2017, the next generation of CMAP was released, which pushed the total number of perturbation signatures far beyond 1 million, covering more than 20,000 perturbagens including the majority of Food and Drug Administration-approved drugs (Subramanian et al., 2017). This enormous effort was facilitated by the new high-throughput transcriptomic technology L1000, which lowered the profiling costs drastically by quantifying the expression of only 978 landmark genes; the expression levels of the remaining genes are computationally inferred. The resulting large dataset enables scientists to systematically compare signatures within CMAP or with custom gene signatures, e.g. from a disease state. Identifying similar or dissimilar pairs and sets of signatures can help to identify novel drug targets or treatments for diseases such as cancer.
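As an illustration of such comparisons, the sketch below correlates a hypothetical query signature (e.g. a disease signature) with stored perturbation signatures over shared genes using Spearman correlation. The actual CMAP connectivity score is a more elaborate rank-based enrichment statistic, so this sketch, with made-up gene names and values, is only meant to convey the idea.

    from scipy.stats import spearmanr

    # Hypothetical gene-level statistics (e.g. log fold-changes) over the same genes.
    query = {"TP53": 2.1, "MYC": -1.5, "EGFR": 0.8, "STAT3": 1.2}
    perturbation_signatures = {
        "drug_A": {"TP53": 1.8, "MYC": -1.1, "EGFR": 0.5, "STAT3": 0.9},
        "drug_B": {"TP53": -2.0, "MYC": 1.4, "EGFR": -0.7, "STAT3": -1.0},
    }

    genes = sorted(query)
    for name, sig in perturbation_signatures.items():
        rho, _ = spearmanr([query[g] for g in genes], [sig[g] for g in genes])
        print(name, round(rho, 2))
    # drug_A correlates positively with the query, drug_B negatively (a reversed pattern).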

1.2.5.4 Human cell atlas

The Human Cell Atlas is the most recent of these projects and was launched in October 2016 (Regev et al., 2017). There has long been a desire to generate cellular maps of the human body. The idea is similar to the GTEx effort but at a much higher resolution. With the advent of fast-evolving single-cell RNA-seq technologies, this objective is now within reach. The Human Cell Atlas project aims to profile the transcriptome of each cell type in the human body in unprecedented detail. From this dataset we can learn how tissues are formed and identify specific cell subpopulations that drive the progression of a disease. This large-scale effort is still in its infancy, but the first single-cell datasets of various organs have been published, which will certainly be a highly valuable resource for the entire scientific community.

1.3 Functional analyses

1.3.1 Overview

In general, there are many types of analyses that can be performed with transcriptomics data. Most commonly, the objective is to identify differences in gene expression levels between groups of samples via differential gene expression analysis. Bulk transcriptomic studies are often designed as perturbation studies to compare treated and untreated samples. In the clinical context, transcriptomic profiles of patients suffering from a certain disease are compared against the profiles of healthy individuals. In studies with animal models, the effect of a drug or a specific treatment can be tested by comparing treated and untreated animals. Since scRNA-seq is still at an early stage and thus expensive, most studies do not follow a perturbation-based design, although this will likely change in the future. Instead, the individual cells of a tissue or organ are investigated. Still, comparisons can be made, e.g. by comparing the expression levels between different cell types of a tissue or organ.

Differential gene expression analysis typically yields a large list, often with more than 1,000 significantly altered genes, each with an associated p-value and effect size indicating the significance and magnitude of the change in expression. Due to the vast number of potentially interesting genes, such lists are hard to analyze and interpret one gene at a time. Functional analysis of transcriptome data is a powerful downstream approach, as it summarizes the large and noisy gene expression space into a smaller number of biologically meaningful features. The concept behind this methodology is to analyze not the change in expression of individual genes but of groups of genes, referred to as gene sets. This implies that each functional analysis tool couples a resource of gene sets with a statistical method to analyze those sets.

1.3.2 Gene set types

In principle, there is no limitation on how gene sets can be constructed. Typically, gene set members are a collection of genes that share a common biological characteristic or function, such as association with the same Gene Ontology term, position on the same chromosome, regulation by a common regulator, or encoding members of the same pathway. The latter gene set type in particular is widely used for classical pathway analysis. Many databases provide such gene sets, for example KEGG, REACTOME, PANTHER, or WikiPathways (Jassal et al., 2020; Kanehisa & Goto, 2000; Mi, Muruganujan, Ebert, Huang, & Thomas, 2019; Slenter et al., 2018). Summarizing the expression of pathway members and interpreting it as pathway activity rests on the assumption that gene expression, protein abundance, and protein activity are positively correlated. Under this assumption, if all genes of a pathway are highly expressed, the corresponding proteins are highly abundant and thus highly active; and if all individual proteins of a pathway are active, the pathway itself is assumed to be highly active. This chain of assumptions violates several well-investigated biological principles. Indeed, several studies have shown that mRNA levels can explain only ~40% of the variation in protein expression (Greenbaum, Colangelo, Williams, & Gerstein, 2003; Ideker et al., 2001; Sousa Abreu, Penalva, Marcotte, & Vogel, 2009; Washburn et al., 2003), though this correlation is higher for genes that are differentially expressed and thus under strong regulation (Koussounadis, Langdon, Um, Harrison, & Smith, 2015). Moreover, the activity of proteins is often determined by post-translational modifications rather than by their abundance (Mann & Jensen, 2003). Regardless of these weaknesses and limitations, pathway analysis with gene sets of pathway members yields reasonable results and is widely used (Huang et al., 2009; Khatri, Sirota, & Butte, 2012; Krämer, Green, Pollard, & Tugendreich, 2014; Nguyen, Shafi, Nguyen, & Draghici, 2019; Tarca et al., 2009). A recent study indicates that this approach is effective because gene set members are regulated by a common regulator, so that the inferred pathway activity actually informs about the activity of this regulator (Szalai & Saez-Rodriguez, 2020). These common regulators are typically transcription factors, which serve as another class of biologically meaningful features whose activity promises a valuable readout of the cellular state. Following the idea of classical pathway analysis, the activity of transcription factors could be inferred simply from their expression. Interestingly, this approach is rarely used, even though it violates the same principles as classical pathway analysis. Instead, observing the expression of the transcriptional targets of a transcription factor yields a much more robust estimation of transcription factor activity (Alvarez et al., 2016; Essaghir et al., 2010; Garcia-Alonso et al., 2018, 2019; Keenan et al., 2019; Kwon, Arenillas, Worsley Hunt, & Wasserman, 2012; Puente-Santamaria, Wasserman, & Del Peso, 2019; Roopra, 2020; Zhenjia Wang et al., 2018). Hence, the gene sets used to infer transcription factor activity are composed of downstream target genes, i.e. regulons. Such regulatory networks can be reconstructed in many ways, ranging from wet-lab techniques to purely in-silico generated networks, and spanning multiple omics technologies.
In a recent study, networks derived from Chromatin Immunoprecipitation Sequencing (ChIP-seq) data, transcription factor binding sites, literature reviews, and gene expression data were integrated into a single consensus network referred to as DoRothEA (Garcia-Alonso et al., 2019).
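Conceptually, a regulon can be represented as a transcription factor with signed target genes, and a simple illustrative activity estimate is the mean of the targets' gene-level statistics multiplied by the sign of regulation. DoRothEA itself is typically coupled with dedicated statistics such as VIPER, so the sketch below, with made-up targets and values, is only a minimal illustration of the footprint idea.

    # Hypothetical regulon: target genes with their mode of regulation
    # (+1 = activated by the TF, -1 = repressed by the TF).
    regulon = {"TF_X": {"GENE_A": +1, "GENE_B": +1, "GENE_C": -1}}

    # Hypothetical gene-level statistics, e.g. t-values from a differential analysis.
    gene_stat = {"GENE_A": 2.0, "GENE_B": 1.5, "GENE_C": -1.0}

    def tf_activity(targets, stats):
        """Signed mean over the targets: high if activated targets go up
        and repressed targets go down."""
        values = [sign * stats[g] for g, sign in targets.items() if g in stats]
        return sum(values) / len(values)

    print(tf_activity(regulon["TF_X"], gene_stat))  # 1.5 -> TF_X appears active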

Observing the downstream effects of a biological process to gain functional and mechanistic insight into the upstream event is referred to as footprint analysis (Dugourd & Saez-Rodriguez, 2019). This concept is not limited to transcription factors. Intuitively, it can be transferred to the estimation of kinase activity from phosphoproteomics data by exploiting the abundance of phosphorylated sites on kinase targets (Hernandez-Armenta, Ochoa, Gonçalves, Saez-Rodriguez, & Beltrao, 2017; Wiredja, Koyutürk, & Chance, 2017). However, the footprint concept can also be applied to biological processes that have only an indirect effect on, for example, gene expression, such as signaling pathways. This idea led to a new perspective on predicting pathway activities from gene expression data: instead of observing the expression of pathway members, the expression of the downstream affected genes is considered. The first large-scale tools that followed this principle are SPEED(2) and PROGENy (Parikh, Klinger, Xia, Marto, & Blüthgen, 2010; Rydenfelt, Klinger, Klünemann, & Blüthgen, 2020; Schubert et al., 2018). The limiting step of these methods is the number of pathways in the respective model, as the downstream affected genes must be identified separately for each pathway. The identification strategy of SPEED(2) as well as PROGENy relies on the manual curation of pathway perturbation experiments with corresponding expression profiles. Footprint-based pathway analysis answers a different question than classical pathway analysis: the latter tries to explain the consequences of the measured expression pattern, while footprint-based tools aim to identify the cause of the measured expression pattern (Szalai & Saez-Rodriguez, 2020).

Most gene sets are an unweighted collection of individual genes. However, it is also possible to assign a weight to each gene set member, which opens up new avenues for how those gene sets can be analyzed. In the case of transcription factor analysis, the assigned weight can denote the mode of regulation, i.e. whether a transcription factor activates or represses the expression of its target gene. Similarly, for footprint-based pathway analysis, the weight can indicate the strength and direction of regulation upon pathway perturbation.
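With such weights, a pathway or transcription factor score can be computed as a weighted sum over gene-level values, which is essentially a matrix multiplication. The sketch below uses a hypothetical weight matrix and expression matrix and is a deliberate simplification of how footprint weights are applied in practice; PROGENy, for instance, derives its weights from linear models fitted to perturbation experiments.

    import numpy as np

    genes = ["G1", "G2", "G3"]
    pathways = ["PathwayA", "PathwayB"]

    # Hypothetical footprint weights (genes x pathways): sign and strength of the
    # expression response of each gene upon perturbation of the pathway.
    W = np.array([
        [ 0.8, -0.1],
        [ 0.5,  0.0],
        [-0.3,  0.9],
    ])

    # Hypothetical expression matrix (samples x genes), e.g. normalized values.
    X = np.array([
        [ 1.2,  0.7, -0.4],  # sample 1
        [-0.5, -0.2,  1.1],  # sample 2
    ])

    scores = X @ W  # samples x pathways
    print(scores)
    # Sample 1 scores high for PathwayA, sample 2 for PathwayB.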

1.3.3 Different types of statistics to analyze gene sets

The number of available statistics to analyze gene sets together with transcriptomics data is comparable to the number of different types and sources of gene sets. The first generation of statistics tests whether gene set members are statistically over-represented in a list of differentially expressed genes and is therefore referred to as over-representation analysis (ORA). Most commonly, the test is based on the hypergeometric distribution and known as Fisher's exact test. If gene set members are significantly over-represented in a list of differentially expressed genes, it is assumed that the functional feature represented by the gene set is relevant for the underlying biological context. This strategy requires a cutoff to classify genes as differentially expressed. For instance, a gene can be considered differentially expressed if it passes a false discovery rate (FDR) \(\le\) 0.05 and an absolute log-fold change (logFC) \(\ge\) 1. However, there is no objective justification for these exact values, so any other combination could be used as well, and genes that narrowly miss the chosen thresholds are not considered at all. Consequently, the arbitrary selection of cutoffs directly impacts the results of ORA. Besides this limitation, ORA also treats gene sets as unweighted collections and thus all members equally, even though the degree and strength of regulation, reflected by significance and effect size, could be useful features for weighting individual gene set members.
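The contingency table underlying ORA and the corresponding Fisher's exact test can be sketched as follows; the gene numbers are made up and serve only to show how the test is set up.

    from scipy.stats import fisher_exact

    # Hypothetical numbers for one gene set.
    n_universe = 20000   # all measured genes
    n_de = 1000          # differentially expressed genes (after the chosen cutoffs)
    n_set = 100          # genes in the gene set
    n_overlap = 20       # gene set members that are differentially expressed

    # 2x2 contingency table: in/not in gene set vs. DE/not DE.
    table = [
        [n_overlap, n_set - n_overlap],
        [n_de - n_overlap, n_universe - n_set - n_de + n_overlap],
    ]

    odds_ratio, p_value = fisher_exact(table, alternative="greater")
    print(f"Odds ratio: {odds_ratio:.2f}, p-value: {p_value:.3g}")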

The second generation of statistics tries to overcome these limitations and is referred to as functional class scoring (FCS) (Khatri et al., 2012). As opposed to ORA, where only the top differentially expressed genes are considered, FCS takes all genes (i.e. the entire gene signature) into account, irrespective of their strength of regulation. FCS retains the assumption of ORA that strong differential expression of gene set members has a significant functional effect; in addition, however, it assumes that weaker but coordinated deregulation of gene set members is also functionally relevant.

Gene set enrichment analysis (GSEA) is the most popular and widely used statistic of the FCS generation. To detect whether gene sets are functionally relevant, GSEA first ranks the genes of a signature derived from a transcriptomic study by a gene-level statistic, which can be any quantitative metric assigned per gene. Typically, log fold-changes, t-statistics, or even p-values serve as gene-level statistics. Subsequently, GSEA tests whether a gene set is significantly enriched at the top or the bottom of the ranked list, indicating whether the functional feature of the gene set is increased or depleted in the given biological context. The original implementation is a rank-based approach built on the Kolmogorov-Smirnov statistic. Besides GSEA and similar statistics, general-purpose statistics as simple as a z-score transformation, a sum, or an arithmetic mean can be applied to analyze gene sets, operating on the chosen gene-level statistic. In the case of weighted gene sets, more complex approaches such as various types of linear models can also be applied (Schubert et al., 2018; Trescher, Münchmeyer, & Leser, 2017).
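The core of the running-sum statistic can be sketched in a few lines: genes are sorted by the gene-level statistic, the running sum is increased whenever a gene set member is encountered and decreased otherwise, and the enrichment score is the maximum deviation from zero. This is a simplified, unweighted variant for illustration only; the published GSEA method uses a weighted increment and a permutation scheme to assess significance.

    def enrichment_score(gene_stat, gene_set):
        """Simplified, unweighted running-sum enrichment score."""
        ranked = sorted(gene_stat, key=gene_stat.get, reverse=True)
        hits = [g for g in ranked if g in gene_set]
        if not hits or len(hits) == len(ranked):
            return 0.0
        hit_step = 1.0 / len(hits)                   # increment for set members
        miss_step = 1.0 / (len(ranked) - len(hits))  # decrement for non-members
        running, best = 0.0, 0.0
        for g in ranked:
            running += hit_step if g in gene_set else -miss_step
            best = running if abs(running) > abs(best) else best
        return best

    # Hypothetical gene-level statistics (e.g. t-values) and gene set.
    stats = {"G1": 3.2, "G2": 2.5, "G3": 0.1, "G4": -0.4, "G5": -2.8}
    print(enrichment_score(stats, {"G1", "G2"}))  # 1.0: the set is enriched at the top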

The statistics from the first and second generation described so far operate on either a subset of or the entire gene signature, which is typically the result of a differential expression analysis of a case-control study. However, methods have also been developed for single-sample analysis, such as ssGSEA, GSVA, PLAGE, or singscore (Barbie et al., 2009; Foroutan et al., 2018; Hänzelmann, Castelo, & Guinney, 2013; Lee, Chuang, Kim, Ideker, & Lee, 2008; Tomfohr, Lu, & Kepler, 2005). These methods make gene set analysis applicable also to studies that do not follow a case-control design.
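A minimal single-sample score in this spirit, which is not the published singscore or ssGSEA implementation, ranks the genes within each sample and averages the ranks of the gene set members, yielding one score per sample; the gene names and expression values below are hypothetical.

    import numpy as np
    from scipy.stats import rankdata

    genes = ["G1", "G2", "G3", "G4", "G5"]
    gene_set = {"G1", "G3"}

    # Hypothetical expression matrix (samples x genes).
    X = np.array([
        [5.0, 1.0, 4.0, 0.5, 2.0],  # sample 1
        [0.2, 3.0, 0.1, 4.0, 5.0],  # sample 2
    ])

    idx = [i for i, g in enumerate(genes) if g in gene_set]
    for s, sample in enumerate(X, start=1):
        ranks = rankdata(sample) / len(genes)  # ranks scaled to (0, 1]
        score = ranks[idx].mean()              # mean rank of the gene set members
        print(f"sample {s}: {score:.2f}")
    # Sample 1 expresses the gene set highly (score near 1), sample 2 does not.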

There have been attempts to combine a set of different statistics from the first and second generation to generate consensus functional analysis results (Väremo, Nielsen, & Nookaew, 2013).

Specifically for classical pathway analysis, a third generation of statistics has been developed. As mentioned above, classical pathway analysis is based on gene sets containing pathway members. However, both ORA and FCS ignore the functional relationships among pathway members. In particular, the topology of a pathway has been neglected so far, even though this information is readily accessible in numerous databases. Hence, methods have been proposed that also incorporate pathway topology (Draghici et al., 2007; Hidalgo et al., 2017; Salviato, Djordjilović, Chiogna, & Romualdi, 2019; Tarca et al., 2009). These methods assume that the position of a gene within a pathway is a meaningful feature; for example, upstream pathway members might have a larger influence on pathway activity than more downstream members or members without any downstream connections.

In summary, the suite of approaches to functionally analyze transcriptome data can be applied to decipher key mechanisms of diseases and their progression. In my thesis, I focused on liver-related diseases and disorders.

1.4 Chronic liver diseases

1.4.1 Structure of the liver

The liver is the largest solid organ in the human body, comprising 2% of the body weight under healthy conditions. Among its primary functions is the metabolism of macromolecules such as fats, proteins, and carbohydrates to maintain metabolic homeostasis. Accordingly, the liver also stores and redistributes nutrients. At the tissue level, the liver is organized into hexagon-shaped hepatic lobules. Hepatocytes, which serve as the functional cells of the liver ("the liver cells"), constitute the largest part of these lobules and are arranged circularly around the lobule center. At each corner of a lobule, there is a distinctive structure consisting of branches of the portal vein, the hepatic artery, and the bile duct. Through the portal vein, hepatocytes are supplied with nutrients coming from the spleen, stomach, and intestines. This supply constitutes around 75% of the liver's blood supply; the remaining 25% is delivered by the hepatic artery to supply hepatocytes with oxygen. The bile duct carries bile that is secreted by hepatocytes into the gallbladder (Boyer, 2013). The nutrient- and oxygen-rich blood flows toward the center of the hepatic lobules through the liver sinusoids, thereby distributing nutrients and oxygen among the cells. Finally, the nutrient- and oxygen-poor blood reaches the central vein, from where it is transported to the hepatic vein, which leads the blood back to the heart. Through the blood supply of the portal vein, the liver is continuously exposed to gut bacteria and associated endotoxins. These particles are eliminated through phagocytosis by specialized macrophages, so-called Kupffer cells, which constitute another basic cell type of the liver. Kupffer cells are part of the innate immune system and reside in the lumen of the sinusoids while being attached to the sinusoidal endothelial cells. Furthermore, the liver contains hepatic stellate cells (HSCs), which are liver-specific mesenchymal cells. They are located in the perisinusoidal space and store lipids. Under healthy conditions, they represent only 5-8% of all liver cells and remain in a quiescent state (Blouin, Bolender, & Weibel, 1977).

1.4.2 Liver damage and repair

Like any other organ, the liver can be damaged for various reasons. From a histological perspective, liver damage is reflected by necrotic and apoptotic hepatocytes. HSCs are pivotal for the wound-healing response. Following liver damage, they become activated, proliferate, and start to synthesize extracellular matrix (ECM). In the case of a minor or single injury, ECM is deposited in and around the wound, which helps regenerate functional liver tissue through the proliferation of hepatocytes. In the case of major damage, however, ECM starts to accumulate, which leads to scarring of the liver. Upon repetitive damage, ECM continues to accumulate and replaces functional liver tissue, leading to the disruption of the tissue architecture. This scarring process is referred to as fibrosis, and it is not exclusive to the liver; fibrosis can affect any organ in the body, for example as renal, pulmonary, or cardiac fibrosis (Henderson, Rieder, & Wynn, 2020). It is estimated that fibrosis is responsible for 45% of all deaths in the industrialized world. If the underlying cause of the liver damage is not removed, more and more functional tissue will be replaced by ECM over the years. This process can take anywhere from 5 to 50 years but ultimately leads to loss of function (Pellicoro, Ramachandran, Iredale, & Fallowfield, 2014). This disease stage is referred to as cirrhosis, and most patients suffering from it require liver transplantation; otherwise, they are at high risk of developing hepatocellular carcinoma (HCC), which is the third most common cause of cancer-related deaths worldwide and has an estimated incidence of more than 1,000,000 cases by 2025 (F. Bray et al., 2018; Llovet et al., 2021).

1.4.3 Etiologies of chronic liver diseases

Disorders that lead to repetitive liver damage are referred to as chronic liver diseases (CLDs) and can have manifold etiologies. In the past, chronic liver injury was predominantly induced by viral infections such as hepatitis C. In the 1980s, this virus was discovered and the first blood tests for its detection were established. These efforts were led by the scientists Harvey J. Alter, Charles M. Rice, and Michael Houghton, who were ultimately awarded the Nobel Prize in Physiology or Medicine in 2020 for their research. Nowadays, effective yet expensive therapies for hepatitis C exist. Therefore, viral infections remain only a minor cause of CLD in the industrialized world, although they are still a severe issue in developing countries.

Nevertheless, the number of chronic liver disease cases is increasing in the Western world. This is partly due to changing lifestyles with unlimited access to unhealthy food. Overnutrition ultimately leads to obesity, which is accompanied by several severe health risks. In the liver, obesity leads to a massive accumulation of fat, which is referred to as non-alcoholic fatty liver disease (NAFLD). In a subset of patients, NAFLD progresses to non-alcoholic steatohepatitis (NASH), which involves continuous damage to the liver tissue by inflammatory processes. Other etiologies include excessive alcohol consumption, autoimmune disorders, and metabolic diseases such as diabetes. If the underlying cause of CLD is removed, even a cirrhotic liver has the capability to repair itself (Pellicoro et al., 2014).

1.5 Thesis overview and aims

The incidence of chronic liver diseases and hepatocellular carcinoma is continuously increasing. Therefore, scientists around the world are trying to decipher the underlying molecular mechanisms to ultimately develop therapeutic options. It is obvious that a single branch of biology or medicine cannot accomplish this goal alone; instead, multiple disciplines must come together. During my Ph.D., I aimed to contribute to these efforts by analyzing transcriptomics data of liver diseases. Besides classical analyses on the gene level, the focus was in particular on the further development and application of the transcription factor and pathway analysis tools DoRothEA and PROGENy. On my journey I completed the following milestones:

  1. Benchmarking the transcription factor and pathway analysis tools DoRothEA and PROGENy for their application in mice (Chapter 2).
  2. Testing the robustness and applicability of the transcription factor and pathway analysis tools DoRothEA and PROGENy in single-cell RNA-sequencing data (Chapter 3).
  3. Analysis and functional characterization of acute and chronic liver disease transcriptomic data in mice and humans (Chapter 4).