Pseudo-bulk functional analysis

When cell lineage is clear (there are clear cell identity clusters), it might be beneficial to perform functional analyses at the pseudo-bulk level instead of the single-cell. By doing so, we recover lowly expressed genes that before where affected by the “drop-out” effect of single-cell. Additionaly, if there is more than one condition in our data, we can perform differential expression analysis (DEA) and use the gene statistics as input for enrichment analysis.

In this notebook we showcase how to use decoupler for pathway and transcription factor (TF) enrichment from a human data-set. The data consists of ~5k Blood myeloid cells from healthy and COVID-19 infected patients available in the Single Cell Expression Atlas here.

Note

This tutorial assumes that you already know the basics of decoupler. Else, check out the Usage tutorial first.

Loading packages

First, we need to load the relevant packages, scanpy to handle scRNA-seq data and decoupler to use statistical methods.

[1]:
import scanpy as sc
import decoupler as dc

# Only needed for processing
import numpy as np
import pandas as pd

# Needed for some plotting
import matplotlib.pyplot as plt

# Plotting options, change to your liking
sc.settings.set_figure_params(dpi=200, frameon=False)
sc.set_figure_params(dpi=200)
sc.set_figure_params(figsize=(4, 4))

Loading the data

We can download the data easily using scanpy:

[2]:
# Download data-set
adata = sc.datasets.ebi_expression_atlas("E-MTAB-9221", filter_boring=True)

# Rename meta-data
columns = ['Sample Characteristic[sex]',
           'Sample Characteristic[individual]',
           'Sample Characteristic[disease]',
           'Factor Value[inferred cell type - ontology labels]']
adata.obs = adata.obs[columns]
adata.obs.columns = ['sex','individual','disease','cell_type']
adata
[2]:
AnnData object with n_obs × n_vars = 6178 × 18958
    obs: 'sex', 'individual', 'disease', 'cell_type'

Processing

This specific data-set contains ensmbl gene ids instead of gene symbols. To be able to use decoupler we need to transform them into gene symbols:

[3]:
# Retrieve gene symbols
annot = sc.queries.biomart_annotations("hsapiens",
        ["ensembl_gene_id", "external_gene_name"],
        use_cache=False
    ).set_index("ensembl_gene_id")

# Filter genes not in annotation
adata = adata[:, adata.var.index.intersection(annot.index)]

# Assign gene symbols
adata.var['gene_symbol'] = [annot.loc[ensembl_id,'external_gene_name'] for ensembl_id in adata.var.index]
adata.var = adata.var.reset_index().rename(columns={'index': 'ensembl_gene_id'}).set_index('gene_symbol')

# Remove genes with no gene symbol
adata = adata[:, ~pd.isnull(adata.var.index)]

# Remove duplicates
adata.var_names_make_unique()

Since the meta-data of this data-set is available, we can filter cells that were not annotated:

[4]:
# Remove non-annotated cells
adata = adata[~adata.obs['cell_type'].isnull()]

We will store the raw counts in the .layers attribute so that we can use them afterwards to generate pseudo-bulk profiles.

[5]:
# Store raw counts in layers
adata.X = np.round(adata.X)
adata.layers['counts'] = adata.X

# Normalize and log-transform
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.layers['normalized'] = adata.X

We can also look how cells cluster by cell identity:

[6]:
# Identify highly variable genes
sc.pp.highly_variable_genes(adata, batch_key='individual')

# Scale the data
sc.pp.scale(adata, max_value=10)

# Generate PCA features
sc.tl.pca(adata, svd_solver='arpack', use_highly_variable=True)

# Compute distances in the PCA space, and find cell neighbors
sc.pp.neighbors(adata)

# Generate UMAP features
sc.tl.umap(adata)

# Visualize
sc.pl.umap(adata, color=['disease','cell_type'], frameon=False)
../_images/notebooks_pseudobulk_13_0.png

In this data-set we have two condition, COVID-19 and healthy, across 6 different cell types.

Generation of pseudo-bulk profiles

After the annotation of clusters into cell identities, we often would like to perform differential expression analysis (DEA) between conditions within particular cell types to further characterize them. DEA can be performed at the single-cell level, but the obtained p-values are often inflated as each cell is treated as a sample. We know that single cells within a sample are not independent of each other, since they were isolated from the same environment. If we treat cells as samples, we are not testing the variation across a population of samples, rather the variation inside an individual one. Moreover, if a sample has more cells than another it might bias the results.

The current best practice to correct for this is using a pseudo-bulk approach (Squair J.W., et al 2021), which involves the following steps:

  1. Subsetting the cell type(s) of interest to perform DEA.

  2. Extracting their raw integer counts.

  3. Summing their counts per gene into a single profile if they pass quality control.

  4. Performing DEA if at least two biological replicates per condition are available (more replicates are recommended).

We can pseudobulk using the function decoupler.get_pseudobulk. In this example, we are interested in summing the counts but other modes are available, for more information check its argument mode.

[7]:
# Get pseudo-bulk profile
pdata = dc.get_pseudobulk(
    adata,
    sample_col='individual',
    groups_col='cell_type',
    layer='counts',
    mode='sum',
    min_cells=0,
    min_counts=0
)

It has generated a profile for each sample and cell type. We can plot their quality control metrics:

[8]:
dc.plot_psbulk_samples(pdata, groupby=['individual', 'cell_type'], figsize=(12, 4))
../_images/notebooks_pseudobulk_18_0.png

There are two criteria to filter low quality samples: its number of cells (psbulk_n_cells), and its total sum of counts (psbulk_counts). In these plots it can be seen that there are some samples of platelet cells that contain less than 10 cells, we might want to remove them by using the arguments min_cells and min_counts. Note that these thresholds are arbitrary and will change depening on the dataset, but a good rule of thumb is to have at least 10 cells with 1000 accumulated counts.

[9]:
# Get filtered pseudo-bulk profile
pdata = dc.get_pseudobulk(
    adata,
    sample_col='individual',
    groups_col='cell_type',
    layer='counts',
    mode='sum',
    min_cells=10,
    min_counts=1000
)
pdata
[9]:
AnnData object with n_obs × n_vars = 30 × 17454
    obs: 'sex', 'individual', 'disease', 'cell_type', 'psbulk_n_cells', 'psbulk_counts'
    var: 'ensembl_gene_id', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection', 'mean', 'std'
    layers: 'psbulk_props'

Exploration of pseudobulk profiles

Now that we have generated the pseudobulk profiles for each patient and each cell type, let’s explore the variability between them. For that, we will first do some simple preprocessing and then do a PCA

[10]:
# Store raw counts in layers
pdata.layers['counts'] = pdata.X.copy()

# Normalize, scale and compute pca
sc.pp.normalize_total(pdata, target_sum=1e4)
sc.pp.log1p(pdata)
sc.pp.scale(pdata, max_value=10)
sc.tl.pca(pdata)

# Return raw counts to X
dc.swap_layer(pdata, 'counts', X_layer_key=None, inplace=True)
[11]:
sc.pl.pca(pdata, color=['disease', 'cell_type'], ncols=1, size=300)
sc.pl.pca_variance_ratio(pdata)
../_images/notebooks_pseudobulk_23_0.png
../_images/notebooks_pseudobulk_23_1.png

When looking at the PCA, it seems like the two first components explain most of the variance and they easily separate cell types from one another. In contrast, the principle components do not seem to be associated with disease status as such.

In order to have a better overview of the association of PCs with sample metadata, let’s perform ANOVA on each PC and see whether they are significantly associated with any technical or biological annotations of our samples

[12]:
dc.get_metadata_associations(
    pdata,
    obs_keys = ['sex', 'disease', 'cell_type', 'psbulk_n_cells', 'psbulk_counts'],  # Metadata columns to associate to PCs
    obsm_key='X_pca',  # Where the PCs are stored
    uns_key='pca_anova',  # Where the results are stored
    inplace=True,
)
[13]:
dc.plot_associations(
    pdata,
    uns_key='pca_anova',  # Summary statistics from the anova tests
    obsm_key='X_pca',  # where the PCs are stored
    stat_col='p_adj',  # Which summary statistic to plot
    obs_annotation_cols = ['disease', 'cell_type'], # which sample annotations to plot
    titles=['Principle component scores', 'Adjusted p-values from ANOVA'],
    figsize=(7, 5),
    n_factors=10,
)
../_images/notebooks_pseudobulk_26_0.png

On the PCA plots above, T and B cells seemed not to be that well separated. However when looking at the hierarchical clustering in the heatmap, one can see that the inclusion of more PCs helps to distinguish them.

When looking at the p-values from the ANOVA models, it becomes clear that the top PCs, which explain most of the observed variability, are statistically associated with the cell_type category.

Pseudo-bulk profile gene filtering

Additionally to filtering low quality samples, we can also filter noisy expressed genes in case we want to perform downstream analyses such as DEA afterwards. Note that this step should be done at the cell type level, since each cell type may express different collection of genes.

For this vignette, we will explore the effects of COVID on T cells. Let’s first select our samples of interest:

[14]:
# Select T cell profiles
tcells = pdata[pdata.obs['cell_type'] == 'T cell'].copy()

To filter genes, we will follow the strategy implemented in the function filterByExpr from edgeR. It keeps genes that have a minimum total number of reads across samples (min_total_count), and that have a minimum number of counts in a number of samples (min_count).

We can plot how many genes do we keep, you can play with the min_count and min_total_count to check how many genes would be kept when changed:

[15]:
dc.plot_filter_by_expr(tcells, group='disease', min_count=10, min_total_count=15)
../_images/notebooks_pseudobulk_31_0.png

Here we can observe the frequency of genes with different total sum of counts and number of samples. The dashed lines indicate the current thresholds, meaning that only the genes in the upper-right corner are going to be kept. Filtering parameters is completely arbitrary, but a good rule of thumb is to identify bimodal distributions and split them modifying the available thresholds. In this example, with the default values we would keep a good quantity of genes while filtering potential noisy genes.

Note

Changing the value of min_count will drastically change the distribution of “Number of samples”, not change its threshold. In case you want to lower or increase it, you need to play with the group, large_n and min_prop parameters.

Once we are content with the threshold parameters, we can perform the actual filtering:

[16]:
# Obtain genes that pass the thresholds
genes = dc.filter_by_expr(tcells, group='disease', min_count=10, min_total_count=15)

# Filter by these genes
tcells = tcells[:, genes].copy()
tcells
[16]:
AnnData object with n_obs × n_vars = 7 × 10270
    obs: 'sex', 'individual', 'disease', 'cell_type', 'psbulk_n_cells', 'psbulk_counts'
    var: 'ensembl_gene_id', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection', 'mean', 'std'
    uns: 'log1p', 'pca', 'disease_colors', 'cell_type_colors', 'pca_anova'
    obsm: 'X_pca'
    varm: 'PCs'
    layers: 'psbulk_props', 'counts'

Another filtering strategy is to filter out genes that are not expressed in a percentage of cells and samples, as implemented in decoupler.filter_by_prop.

Contrast between conditions

Once we have generated robust pseudo-bulk profiles, we can compute DEA. For this example, we will perform a simple experimental design where we compare the gene expression of T cells from diseased patients against controls. We will use the python implementation of the framework DESeq2, but we could have used any other one (limma or edgeR for example). For a better understanding how it works, check DESeq2’s documentation. Note that more complex experimental designs can be used by adding more factors to the design_factors argument.

[17]:
# Import DESeq2
from pydeseq2.dds import DeseqDataSet, DefaultInference
from pydeseq2.ds import DeseqStats
[18]:
# Build DESeq2 object
inference = DefaultInference(n_cpus=8)
dds = DeseqDataSet(
    adata=tcells,
    design_factors='disease',
    ref_level=['disease', 'normal'],
    refit_cooks=True,
    inference=inference,
)
[19]:
# Compute LFCs
dds.deseq2()
Fitting size factors...
... done in 0.00 seconds.

Fitting dispersions...
... done in 19.01 seconds.

Fitting dispersion trend curve...
... done in 0.56 seconds.

Fitting MAP dispersions...
... done in 18.73 seconds.

Fitting LFCs...
... done in 1.84 seconds.

Refitting 0 outliers.

[20]:
# Extract contrast between COVID-19 vs normal
stat_res = DeseqStats(
    dds,
    contrast=["disease", 'COVID-19', 'normal'],
    inference=inference,
)
[21]:
# Compute Wald test
stat_res.summary()
Running Wald tests...
Log2 fold change & Wald test p-value: disease COVID-19 vs normal
               baseMean  log2FoldChange     lfcSE      stat    pvalue  \
gene_symbol
A1BG          76.986992       -0.152824  0.130998 -1.166608  0.243369
A2M           36.820522       -1.264448  0.329964 -3.832084  0.000127
A2MP1         16.820137        0.434752  1.012762  0.429274  0.667724
AAAS          19.124537        0.224567  0.374654  0.599398  0.548907
AACS          27.159739        0.349969  0.328019  1.066916  0.286010
...                 ...             ...       ...       ...       ...
ZXDC          30.407370       -0.319429  0.324120 -0.985527  0.324365
ZYG11B       111.208046        0.236943  0.287587  0.823900  0.409996
ZYX           83.019058        0.318465  0.241356  1.319482  0.187008
ZZEF1        904.297363        0.018461  0.235610  0.078353  0.937547
ZZZ3          60.278961       -0.080888  0.273754 -0.295476  0.767630

                 padj
gene_symbol
A1BG         0.607021
A2M          0.007166
A2MP1        0.882245
AAAS         0.826906
AACS         0.646387
...               ...
ZXDC         0.676757
ZYG11B       0.744337
ZYX          0.542884
ZZEF1        0.983740
ZZZ3         0.922296

[10270 rows x 6 columns]
... done in 0.77 seconds.

[22]:
# Extract results
results_df = stat_res.results_df
results_df
[22]:
baseMean log2FoldChange lfcSE stat pvalue padj
gene_symbol
A1BG 76.986992 -0.152824 0.130998 -1.166608 0.243369 0.607021
A2M 36.820522 -1.264448 0.329964 -3.832084 0.000127 0.007166
A2MP1 16.820137 0.434752 1.012762 0.429274 0.667724 0.882245
AAAS 19.124537 0.224567 0.374654 0.599398 0.548907 0.826906
AACS 27.159739 0.349969 0.328019 1.066916 0.286010 0.646387
... ... ... ... ... ... ...
ZXDC 30.407370 -0.319429 0.324120 -0.985527 0.324365 0.676757
ZYG11B 111.208046 0.236943 0.287587 0.823900 0.409996 0.744337
ZYX 83.019058 0.318465 0.241356 1.319482 0.187008 0.542884
ZZEF1 904.297363 0.018461 0.235610 0.078353 0.937547 0.983740
ZZZ3 60.278961 -0.080888 0.273754 -0.295476 0.767630 0.922296

10270 rows × 6 columns

We can plot the obtained results in a volcano plot:

[23]:
dc.plot_volcano_df(
    results_df,
    x='log2FoldChange',
    y='padj',
    top=20,
    figsize=(8, 4)
)
../_images/notebooks_pseudobulk_43_0.png

After performing DEA, we can use the obtained gene level statistics to perform enrichment analysis. Any statistic can be used, but we recommend using the t-values instead of logFCs since t-values incorporate the significance of change in their value. We will transform the obtained t-values stored in stats to a wide matrix so that it can be used by decoupler:

[24]:
mat = results_df[['stat']].T.rename(index={'stat': 'T cell'})
mat
[24]:
gene_symbol A1BG A2M A2MP1 AAAS AACS AAGAB AAK1 AAMDC AAMP AAR2 ... ZUP1 ZWILCH ZWINT ZXDA ZXDB ZXDC ZYG11B ZYX ZZEF1 ZZZ3
T cell -1.166608 -3.832084 0.429274 0.599398 1.066916 -0.019375 -0.632058 0.259132 1.467107 -0.454862 ... 1.559071 1.360735 3.547976 -0.786136 -3.314158 -0.985527 0.8239 1.319482 0.078353 -0.295476

1 rows × 10270 columns

Transcription factor activity inference

The first functional analysis we can perform is to infer transcription factor (TF) activities from our transcriptomics data. We will need a gene regulatory network (GRN) and a statistical method.

CollecTRI network

CollecTRI is a comprehensive resource containing a curated collection of TFs and their transcriptional targets compiled from 12 different resources. This collection provides an increased coverage of transcription factors and a superior performance in identifying perturbed TFs compared to our previous DoRothEA network and other literature based GRNs. Similar to DoRothEA, interactions are weighted by their mode of regulation (activation or inhibition).

For this example we will use the human version (mouse and rat are also available). We can use decoupler to retrieve it from omnipath. The argument split_complexes keeps complexes or splits them into subunits, by default we recommend to keep complexes together.

Note

In this tutorial we use the network CollecTRI, but we could use any other GRN coming from an inference method such as CellOracle, pySCENIC or SCENIC+.

[25]:
# Retrieve CollecTRI gene regulatory network
collectri = dc.get_collectri(organism='human', split_complexes=False)
collectri
[25]:
source target weight PMID
0 MYC TERT 1 10022128;10491298;10606235;10637317;10723141;1...
1 SPI1 BGLAP 1 10022617
2 SMAD3 JUN 1 10022869;12374795
3 SMAD4 JUN 1 10022869;12374795
4 STAT5A IL2 1 10022878;11435608;17182565;17911616;22854263;2...
... ... ... ... ...
43173 NFKB hsa-miR-143-3p 1 19472311
43174 AP1 hsa-miR-206 1 19721712
43175 NFKB hsa-miR-21-5p 1 20813833;22387281
43176 NFKB hsa-miR-224-5p 1 23474441;23988648
43177 AP1 hsa-miR-144 1 23546882

43178 rows × 4 columns

Activity inference with Univariate Linear Model (ULM)

To infer TF enrichment scores we will run the Univariate Linear Model (ulm) method. For each sample in our dataset (mat) and each TF in our network (net), it fits a linear model that predicts the observed gene expression based solely on the TF’s TF-Gene interaction weights. Once fitted, the obtained t-value of the slope is the score. If it is positive, we interpret that the TF is active and if it is negative we interpret that it is inactive.

e848fcd626d9485f95bac42b9f50cb57

We can run ulm with a one-liner:

[26]:
# Infer pathway activities with ulm
tf_acts, tf_pvals = dc.run_ulm(mat=mat, net=collectri)
tf_acts
[26]:
ABL1 AHR AIRE AP1 APEX1 AR ARID3B ARID4A ARID5B ARNT ... ZNF350 ZNF354C ZNF362 ZNF382 ZNF384 ZNF395 ZNF436 ZNF699 ZNF76 ZNF91
T cell 0.665785 1.382473 -1.763323 -0.864442 0.057612 0.100369 -0.590489 -0.141929 1.003913 -0.455601 ... 0.651418 -1.856305 1.019936 0.053302 -1.483298 1.173904 0.410554 -0.319962 1.88082 -1.32186

1 rows × 547 columns

Let us plot the obtained scores for the top active/inactive transcription factors:

[27]:
dc.plot_barplot(
    acts=tf_acts,
    contrast='T cell',
    top=25,
    vertical=True,
    figsize=(3, 6)
)
../_images/notebooks_pseudobulk_51_0.png

In accordance to the previous pathway results, T cells seem to activate for TFs responsible for cell growth (E2F4, TFDP1, E2F1).

Like with pathways, we can explore how the target genes look like:

[28]:
# Extract logFCs and pvals
logFCs = results_df[['log2FoldChange']].T.rename(index={'log2FoldChange': 'T cell'})
pvals = results_df[['padj']].T.rename(index={'padj': 'T cell'})

# Plot
dc.plot_volcano(
    logFCs=logFCs,
    pvals=pvals,
    contrast='T cell',
    name='E2F4',
    net=collectri,
    top=10,
    sign_thr=0.05,
    lFCs_thr=0.5
)
../_images/notebooks_pseudobulk_54_0.png

We can also plot the network of interesting TFs (top and bottom by activity) and color the nodes by activity and target gene expression:

[29]:
dc.plot_network(
    net=collectri,
    obs=mat,
    act=tf_acts,
    n_sources=['MYC', 'E2F4', 'HSF1', 'GATA6'],
    n_targets=15,
    node_size=100,
    figsize=(7, 7),
    c_pos_w='darkgreen',
    c_neg_w='darkred',
    vcenter=True
)
../_images/notebooks_pseudobulk_56_0.png

Green edges are positive regulation (activation), red edges are negative regulation (inactivation):

Pathway activity inference

Another analysis we can perform is to infer pathway activities from our transcriptomics data.

PROGENy model

PROGENy is a comprehensive resource containing a curated collection of pathways and their target genes, with weights for each interaction. For this example we will use the human weights (other organisms are available) and we will use the top 500 responsive genes ranked by p-value. Here is a brief description of each pathway:

  • Androgen: involved in the growth and development of the male reproductive organs.

  • EGFR: regulates growth, survival, migration, apoptosis, proliferation, and differentiation in mammalian cells

  • Estrogen: promotes the growth and development of the female reproductive organs.

  • Hypoxia: promotes angiogenesis and metabolic reprogramming when O2 levels are low.

  • JAK-STAT: involved in immunity, cell division, cell death, and tumor formation.

  • MAPK: integrates external signals and promotes cell growth and proliferation.

  • NFkB: regulates immune response, cytokine production and cell survival.

  • p53: regulates cell cycle, apoptosis, DNA repair and tumor suppression.

  • PI3K: promotes growth and proliferation.

  • TGFb: involved in development, homeostasis, and repair of most tissues.

  • TNFa: mediates haematopoiesis, immune surveillance, tumour regression and protection from infection.

  • Trail: induces apoptosis.

  • VEGF: mediates angiogenesis, vascular permeability, and cell migration.

  • WNT: regulates organ morphogenesis during development and tissue repair.

To access it we can use decoupler.

[30]:
# Retrieve PROGENy model weights
progeny = dc.get_progeny(top=500)
progeny
[30]:
source target weight p_value
0 Androgen TMPRSS2 11.490631 0.000000e+00
1 Androgen NKX3-1 10.622551 2.242078e-44
2 Androgen MBOAT2 10.472733 4.624285e-44
3 Androgen KLK2 10.176186 1.944414e-40
4 Androgen SARG 11.386852 2.790209e-40
... ... ... ... ...
6995 p53 ZMYM4 -2.325752 1.522388e-06
6996 p53 CFDP1 -1.628168 1.526045e-06
6997 p53 VPS37D 2.309503 1.537098e-06
6998 p53 TEDC1 -2.274823 1.547037e-06
6999 p53 CCDC138 -3.205113 1.568160e-06

7000 rows × 4 columns

Activity inference with Multivariate Linear Model (MLM)

To infer pathway enrichment scores we will run the Multivariate Linear Model (mlm) method. For each sample in our dataset (adata), it fits a linear model that predicts the observed gene expression based on all pathways’ Pathway-Gene interactions weights. Once fitted, the obtained t-values of the slopes are the scores. If it is positive, we interpret that the pathway is active and if it is negative we interpret that it is inactive.

3b803ee790a04cba9f0bc4a2f7ae1c6c

We can run mlm with a one-liner:

[31]:
# Infer pathway activities with mlm
pathway_acts, pathway_pvals = dc.run_mlm(mat=mat, net=progeny)
pathway_acts
[31]:
Androgen EGFR Estrogen Hypoxia JAK-STAT MAPK NFkB PI3K TGFb TNFa Trail VEGF WNT p53
T cell -0.480535 1.286544 1.866836 -6.126904 9.053461 -1.218023 -6.120244 -0.572253 -2.101866 1.693417 0.167998 0.233807 0.458145 -1.721158

Let us plot the obtained scores:

[32]:
dc.plot_barplot(
    acts=pathway_acts,
    contrast='T cell',
    top=25,
    vertical=False,
    figsize=(6, 3)
)
../_images/notebooks_pseudobulk_63_0.png

It looks like JAK-STAT, a known immunity pathway is more active in T cells from COVID-19 patients than in controls. To further explore how the target genes of a pathway of interest behave, we can plot them in scatter plot:

[33]:
dc.plot_targets(
    data=results_df,
    stat='stat',
    source_name='JAK-STAT',
    net=progeny,
    top=15
)
../_images/notebooks_pseudobulk_65_0.png

The observed activation of JAK-STAT is due to the fact that majority of its target genes with positive weights have positive t-values (1st quadrant), and the majority of the ones with negative weights have negative t-values (3d quadrant).

Functional enrichment of biological terms

Finally, we can also infer activities for general biological terms or processes.

MSigDB gene sets

The Molecular Signatures Database (MSigDB) is a resource containing a collection of gene sets annotated to different biological processes.

[34]:
# Retrieve MSigDB resource
msigdb = dc.get_resource('MSigDB')
msigdb
[34]:
genesymbol collection geneset
0 MAFF chemical_and_genetic_perturbations BOYAULT_LIVER_CANCER_SUBCLASS_G56_DN
1 MAFF chemical_and_genetic_perturbations ELVIDGE_HYPOXIA_UP
2 MAFF chemical_and_genetic_perturbations NUYTTEN_NIPP1_TARGETS_DN
3 MAFF immunesigdb GSE17721_POLYIC_VS_GARDIQUIMOD_4H_BMDC_DN
4 MAFF chemical_and_genetic_perturbations SCHAEFFER_PROSTATE_DEVELOPMENT_12HR_UP
... ... ... ...
3838543 PRAMEF22 go_biological_process GOBP_POSITIVE_REGULATION_OF_CELL_POPULATION_PR...
3838544 PRAMEF22 go_biological_process GOBP_APOPTOTIC_PROCESS
3838545 PRAMEF22 go_biological_process GOBP_REGULATION_OF_CELL_DEATH
3838546 PRAMEF22 go_biological_process GOBP_NEGATIVE_REGULATION_OF_DEVELOPMENTAL_PROCESS
3838547 PRAMEF22 go_biological_process GOBP_NEGATIVE_REGULATION_OF_CELL_DEATH

3838548 rows × 3 columns

As an example, we will use the hallmark gene sets, but we could have used any other.

Note

To see what other collections are available in MSigDB, type: msigdb['collection'].unique().

We can filter by for hallmark:

[35]:
# Filter by hallmark
msigdb = msigdb[msigdb['collection']=='hallmark']

# Remove duplicated entries
msigdb = msigdb[~msigdb.duplicated(['geneset', 'genesymbol'])]

# Rename
msigdb.loc[:, 'geneset'] = [name.split('HALLMARK_')[1] for name in msigdb['geneset']]

msigdb
[35]:
genesymbol collection geneset
233 MAFF hallmark IL2_STAT5_SIGNALING
250 MAFF hallmark COAGULATION
270 MAFF hallmark HYPOXIA
373 MAFF hallmark TNFA_SIGNALING_VIA_NFKB
377 MAFF hallmark COMPLEMENT
... ... ... ...
1449668 STXBP1 hallmark PANCREAS_BETA_CELLS
1450315 ELP4 hallmark PANCREAS_BETA_CELLS
1450526 GCG hallmark PANCREAS_BETA_CELLS
1450731 PCSK2 hallmark PANCREAS_BETA_CELLS
1450916 PAX6 hallmark PANCREAS_BETA_CELLS

7318 rows × 3 columns

Enrichment with Over Representation Analysis (ORA)

To infer functional enrichment scores we will run the Over Representation Analysis (ora) method. As input data it accepts an expression matrix (decoupler.run_ora) or the results of differential expression analysis (decoupler.run_ora_df). For the former, by default the top 5% of expressed genes by sample are selected as the set of interest (S), and for the latter a user-defined significance filtering can be used. Once we have S, it builds a contingency table using set operations for each set stored in the gene set resource being used (net). Using the contingency table, ora performs a one-sided Fisher exact test to test for significance of overlap between sets. The final score is obtained by log-transforming the obtained p-values, meaning that higher values are more significant.

5bd72de0f56b44cfbaaaa06a0a3dd433

We can run ora with a simple one-liner:

[36]:
# Infer enrichment with ora using significant deg
top_genes = results_df[results_df['padj'] < 0.05]

# Run ora
enr_pvals = dc.get_ora_df(
    df=top_genes,
    net=msigdb,
    source='geneset',
    target='genesymbol'
)

enr_pvals.head()
[36]:
Term Set size Overlap ratio p-value FDR p-value Odds ratio Combined score Features
0 ADIPOGENESIS 200 0.030000 0.339486 0.462078 1.367224 1.477044 ACAA2;AIFM1;DECR1;NMT1;PPM1B;TALDO1
1 ALLOGRAFT_REJECTION 200 0.035000 0.199268 0.348719 1.580933 2.550212 CSK;FGR;GBP2;HLA-E;PTPN6;STAT4;TNF
2 ANDROGEN_RESPONSE 100 0.020000 0.690477 0.782571 1.045485 0.387219 HERC3;RPS6KA3
3 APICAL_JUNCTION 200 0.020000 0.702717 0.782571 0.942524 0.332524 ACTG1;CD276;CERCAM;SHC1
4 APICAL_SURFACE 44 0.022727 0.653115 0.782571 1.417684 0.603936 ADAM10

Then we can visualize the most enriched terms:

[37]:
dc.plot_dotplot(
    enr_pvals.sort_values('Combined score', ascending=False).head(15),
    x='Combined score',
    y='Term',
    s='Odds ratio',
    c='FDR p-value',
    scale=0.5,
    figsize=(3, 9)
)
../_images/notebooks_pseudobulk_74_0.png

We can also plot the running score for a given gene set:

[38]:
# Plot
dc.plot_running_score(
    df=results_df,
    stat='stat',
    net=msigdb,
    source='geneset',
    target='genesymbol',
    set_name='MYC_TARGETS_V1'
)
../_images/notebooks_pseudobulk_76_0.png