Pseudo-bulk functional analysis

When cell lineage is clear (there are clear cell identity clusters), it might be beneficial to perform functional analyses at the pseudo-bulk level instead of the single-cell. By doing so, we recover lowly expressed genes that before where affected by the “drop-out” effect of single-cell. Additionaly, if there is more than one condition in our data, we can perform differential expression analysis (DEA) and use the gene statistics as input for enrichment analysis.

In this notebook we showcase how to use decoupler for pathway and transcription factor (TF) enrichment from a human data-set. The data consists of ~5k Blood myeloid cells from healthy and COVID-19 infected patients available in the Single Cell Expression Atlas here.

Note

This tutorial assumes that you already know the basics of decoupler. Else, check out the Usage tutorial first.

Loading packages

First, we need to load the relevant packages, scanpy to handle scRNA-seq data and decoupler to use statistical methods.

[1]:

import scanpy as sc
import decoupler as dc

# Only needed for processing
import numpy as np
import pandas as pd

# Needed for some plotting
import matplotlib.pyplot as plt

# Plotting options, change to your liking
sc.settings.set_figure_params(dpi=200, frameon=False)
sc.set_figure_params(dpi=200)
sc.set_figure_params(figsize=(4, 4))

Loading the data

We can download the data easily using scanpy:

[2]:

# Download data-set
adata = sc.datasets.ebi_expression_atlas("E-MTAB-9221", filter_boring=True)

# Rename meta-data
columns = ['Sample Characteristic[sex]',
           'Sample Characteristic[individual]',
           'Sample Characteristic[disease]',
           'Factor Value[inferred cell type - ontology labels]']
adata.obs = adata.obs[columns]
adata.obs.columns = ['sex','individual','disease','cell_type']
adata

[2]:

AnnData object with n_obs × n_vars = 6178 × 18958
    obs: 'sex', 'individual', 'disease', 'cell_type'

Processing

This specific data-set contains ensmbl gene ids instead of gene symbols. To be able to use decoupler we need to transform them into gene symbols:

[3]:

# Retrieve gene symbols
annot = sc.queries.biomart_annotations("hsapiens",
        ["ensembl_gene_id", "external_gene_name"],
        use_cache=False
    ).set_index("ensembl_gene_id")

# Filter genes not in annotation
adata = adata[:, adata.var.index.intersection(annot.index)]

# Assign gene symbols
adata.var['gene_symbol'] = [annot.loc[ensembl_id,'external_gene_name'] for ensembl_id in adata.var.index]
adata.var = adata.var.reset_index().rename(columns={'index': 'ensembl_gene_id'}).set_index('gene_symbol')

# Remove genes with no gene symbol
adata = adata[:, ~pd.isnull(adata.var.index)]

# Remove duplicates
adata.var_names_make_unique()

Since the meta-data of this data-set is available, we can filter cells that were not annotated:

[4]:

# Remove non-annotated cells
adata = adata[~adata.obs['cell_type'].isnull()]

We will store the raw counts in the .layers attribute so that we can use them afterwards to generate pseudo-bulk profiles.

[5]:

# Store raw counts in layers
adata.X = np.round(adata.X)
adata.layers['counts'] = adata.X

# Normalize and log-transform
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.layers['normalized'] = adata.X

We can also look how cells cluster by cell identity:

[6]:

# Identify highly variable genes
sc.pp.highly_variable_genes(adata, batch_key='individual')

# Scale the data
sc.pp.scale(adata, max_value=10)

# Generate PCA features
sc.tl.pca(adata, svd_solver='arpack', use_highly_variable=True)

# Compute distances in the PCA space, and find cell neighbors
sc.pp.neighbors(adata)

# Generate UMAP features
sc.tl.umap(adata)

# Visualize
sc.pl.umap(adata, color=['disease','cell_type'], frameon=False)

../_images/notebooks_pseudobulk_13_0.png

In this data-set we have two condition, COVID-19 and healthy, across 6 different cell types.

Generation of pseudo-bulk profiles

After the annotation of clusters into cell identities, we often would like to perform differential expression analysis (DEA) between conditions within particular cell types to further characterize them. DEA can be performed at the single-cell level, but the obtained p-values are often inflated as each cell is treated as a sample. We know that single cells within a sample are not independent of each other, since they were isolated from the same environment. If we treat cells as samples, we are not testing the variation across a population of samples, rather the variation inside an individual one. Moreover, if a sample has more cells than another it might bias the results.

The current best practice to correct for this is using a pseudo-bulk approach (Squair J.W., et al 2021), which involves the following steps:

Subsetting the cell type(s) of interest to perform DEA.
Extracting their raw integer counts.
Summing their counts per gene into a single profile if they pass quality control.
Performing DEA if at least two biological replicates per condition are available (more replicates are recommended).

We can pseudobulk using the function decoupler.get_pseudobulk. In this example, we are interested in summing the counts but other modes are available, for more information check its argument mode.

[7]:

# Get pseudo-bulk profile
pdata = dc.get_pseudobulk(
    adata,
    sample_col='individual',
    groups_col='cell_type',
    layer='counts',
    mode='sum',
    min_cells=0,
    min_counts=0
)

It has generated a profile for each sample and cell type. We can plot their quality control metrics:

[8]:

dc.plot_psbulk_samples(pdata, groupby=['individual', 'cell_type'], figsize=(12, 4))

../_images/notebooks_pseudobulk_18_0.png

There are two criteria to filter low quality samples: its number of cells (psbulk_n_cells), and its total sum of counts (psbulk_counts). In these plots it can be seen that there are some samples of platelet cells that contain less than 10 cells, we might want to remove them by using the arguments min_cells and min_counts. Note that these thresholds are arbitrary and will change depening on the dataset, but a good rule of thumb is to have at least 10 cells with 1000 accumulated counts.

[9]:

# Get filtered pseudo-bulk profile
pdata = dc.get_pseudobulk(
    adata,
    sample_col='individual',
    groups_col='cell_type',
    layer='counts',
    mode='sum',
    min_cells=10,
    min_counts=1000
)
pdata

[9]:

AnnData object with n_obs × n_vars = 30 × 17454
    obs: 'sex', 'individual', 'disease', 'cell_type', 'psbulk_n_cells', 'psbulk_counts'
    var: 'ensembl_gene_id', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection', 'mean', 'std'
    layers: 'psbulk_props'

Exploration of pseudobulk profiles

Now that we have generated the pseudobulk profiles for each patient and each cell type, let’s explore the variability between them. For that, we will first do some simple preprocessing and then do a PCA

[10]:

# Store raw counts in layers
pdata.layers['counts'] = pdata.X.copy()

# Normalize, scale and compute pca
sc.pp.normalize_total(pdata, target_sum=1e4)
sc.pp.log1p(pdata)
sc.pp.scale(pdata, max_value=10)
sc.tl.pca(pdata)

# Return raw counts to X
dc.swap_layer(pdata, 'counts', X_layer_key=None, inplace=True)

[11]:

sc.pl.pca(pdata, color=['disease', 'cell_type'], ncols=1, size=300)
sc.pl.pca_variance_ratio(pdata)

../_images/notebooks_pseudobulk_23_0.png

../_images/notebooks_pseudobulk_23_1.png

When looking at the PCA, it seems like the two first components explain most of the variance and they easily separate cell types from one another. In contrast, the principle components do not seem to be associated with disease status as such.

In order to have a better overview of the association of PCs with sample metadata, let’s perform ANOVA on each PC and see whether they are significantly associated with any technical or biological annotations of our samples

[12]:

dc.get_metadata_associations(
    pdata,
    obs_keys = ['sex', 'disease', 'cell_type', 'psbulk_n_cells', 'psbulk_counts'],  # Metadata columns to associate to PCs
    obsm_key='X_pca',  # Where the PCs are stored
    uns_key='pca_anova',  # Where the results are stored
    inplace=True,
)

[13]:

dc.plot_associations(
    pdata,
    uns_key='pca_anova',  # Summary statistics from the anova tests
    obsm_key='X_pca',  # where the PCs are stored
    stat_col='p_adj',  # Which summary statistic to plot
    obs_annotation_cols = ['disease', 'cell_type'], # which sample annotations to plot
    titles=['Principle component scores', 'Adjusted p-values from ANOVA'],
    figsize=(7, 5),
    n_factors=10,
)

../_images/notebooks_pseudobulk_26_0.png

On the PCA plots above, T and B cells seemed not to be that well separated. However when looking at the hierarchical clustering in the heatmap, one can see that the inclusion of more PCs helps to distinguish them.

When looking at the p-values from the ANOVA models, it becomes clear that the top PCs, which explain most of the observed variability, are statistically associated with the cell_type category.

Pseudo-bulk profile gene filtering

Additionally to filtering low quality samples, we can also filter noisy expressed genes in case we want to perform downstream analyses such as DEA afterwards. Note that this step should be done at the cell type level, since each cell type may express different collection of genes.

For this vignette, we will explore the effects of COVID on T cells. Let’s first select our samples of interest:

[14]:

# Select T cell profiles
tcells = pdata[pdata.obs['cell_type'] == 'T cell'].copy()

To filter genes, we will follow the strategy implemented in the function filterByExpr from edgeR. It keeps genes that have a minimum total number of reads across samples (min_total_count), and that have a minimum number of counts in a number of samples (min_count).

We can plot how many genes do we keep, you can play with the min_count and min_total_count to check how many genes would be kept when changed:

[15]:

dc.plot_filter_by_expr(tcells, group='disease', min_count=10, min_total_count=15)

Here we can observe the frequency of genes with different total sum of counts and number of samples. The dashed lines indicate the current thresholds, meaning that only the genes in the upper-right corner are going to be kept. Filtering parameters is completely arbitrary, but a good rule of thumb is to identify bimodal distributions and split them modifying the available thresholds. In this example, with the default values we would keep a good quantity of genes while filtering potential noisy genes.

Note

Changing the value of min_count will drastically change the distribution of “Number of samples”, not change its threshold. In case you want to lower or increase it, you need to play with the group, large_n and min_prop parameters.

Once we are content with the threshold parameters, we can perform the actual filtering:

[16]:

# Obtain genes that pass the thresholds
genes = dc.filter_by_expr(tcells, group='disease', min_count=10, min_total_count=15)

# Filter by these genes
tcells = tcells[:, genes].copy()
tcells

[16]:

AnnData object with n_obs × n_vars = 7 × 10270
    obs: 'sex', 'individual', 'disease', 'cell_type', 'psbulk_n_cells', 'psbulk_counts'
    var: 'ensembl_gene_id', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection', 'mean', 'std'
    uns: 'log1p', 'pca', 'disease_colors', 'cell_type_colors', 'pca_anova'
    obsm: 'X_pca'
    varm: 'PCs'
    layers: 'psbulk_props', 'counts'

Another filtering strategy is to filter out genes that are not expressed in a percentage of cells and samples, as implemented in decoupler.filter_by_prop.

Contrast between conditions

Once we have generated robust pseudo-bulk profiles, we can compute DEA. For this example, we will perform a simple experimental design where we compare the gene expression of T cells from diseased patients against controls. We will use the python implementation of the framework DESeq2, but we could have used any other one (limma or edgeR for example). For a better understanding how it works, check DESeq2’s documentation. Note that more complex experimental designs can be used by adding more factors to the design_factors argument.

[17]:

# Import DESeq2
from pydeseq2.dds import DeseqDataSet, DefaultInference
from pydeseq2.ds import DeseqStats

[18]:

# Build DESeq2 object
inference = DefaultInference(n_cpus=8)
dds = DeseqDataSet(
    adata=tcells,
    design_factors='disease',
    ref_level=['disease', 'normal'],
    refit_cooks=True,
    inference=inference,
)

[19]:

# Compute LFCs
dds.deseq2()

Fitting size factors...
... done in 0.00 seconds.

Fitting dispersions...
... done in 19.01 seconds.

Fitting dispersion trend curve...
... done in 0.56 seconds.

Fitting MAP dispersions...
... done in 18.73 seconds.

Fitting LFCs...
... done in 1.84 seconds.

Refitting 0 outliers.

[20]:

# Extract contrast between COVID-19 vs normal
stat_res = DeseqStats(
    dds,
    contrast=["disease", 'COVID-19', 'normal'],
    inference=inference,
)

[21]:

# Compute Wald test
stat_res.summary()

Running Wald tests...

Log2 fold change & Wald test p-value: disease COVID-19 vs normal
               baseMean  log2FoldChange     lfcSE      stat    pvalue  \
gene_symbol
A1BG          76.986992       -0.152824  0.130998 -1.166608  0.243369
A2M           36.820522       -1.264448  0.329964 -3.832084  0.000127
A2MP1         16.820137        0.434752  1.012762  0.429274  0.667724
AAAS          19.124537        0.224567  0.374654  0.599398  0.548907
AACS          27.159739        0.349969  0.328019  1.066916  0.286010
...                 ...             ...       ...       ...       ...
ZXDC          30.407370       -0.319429  0.324120 -0.985527  0.324365
ZYG11B       111.208046        0.236943  0.287587  0.823900  0.409996
ZYX           83.019058        0.318465  0.241356  1.319482  0.187008
ZZEF1        904.297363        0.018461  0.235610  0.078353  0.937547
ZZZ3          60.278961       -0.080888  0.273754 -0.295476  0.767630

                 padj
gene_symbol
A1BG         0.607021
A2M          0.007166
A2MP1        0.882245
AAAS         0.826906
AACS         0.646387
...               ...
ZXDC         0.676757
ZYG11B       0.744337
ZYX          0.542884
ZZEF1        0.983740
ZZZ3         0.922296

[10270 rows x 6 columns]

... done in 0.77 seconds.

[22]:

# Extract results
results_df = stat_res.results_df
results_df

[22]:

	baseMean	log2FoldChange	lfcSE	stat	pvalue	padj
gene_symbol
A1BG	76.986992	-0.152824	0.130998	-1.166608	0.243369	0.607021
A2M	36.820522	-1.264448	0.329964	-3.832084	0.000127	0.007166
A2MP1	16.820137	0.434752	1.012762	0.429274	0.667724	0.882245
AAAS	19.124537	0.224567	0.374654	0.599398	0.548907	0.826906
AACS	27.159739	0.349969	0.328019	1.066916	0.286010	0.646387
...	...	...	...	...	...	...
ZXDC	30.407370	-0.319429	0.324120	-0.985527	0.324365	0.676757
ZYG11B	111.208046	0.236943	0.287587	0.823900	0.409996	0.744337
ZYX	83.019058	0.318465	0.241356	1.319482	0.187008	0.542884
ZZEF1	904.297363	0.018461	0.235610	0.078353	0.937547	0.983740
ZZZ3	60.278961	-0.080888	0.273754	-0.295476	0.767630	0.922296

10270 rows × 6 columns

We can plot the obtained results in a volcano plot:

[23]:

dc.plot_volcano_df(
    results_df,
    x='log2FoldChange',
    y='padj',
    top=20,
    figsize=(8, 4)
)

../_images/notebooks_pseudobulk_43_0.png

After performing DEA, we can use the obtained gene level statistics to perform enrichment analysis. Any statistic can be used, but we recommend using the t-values instead of logFCs since t-values incorporate the significance of change in their value. We will transform the obtained t-values stored in stats to a wide matrix so that it can be used by decoupler:

[24]:

mat = results_df[['stat']].T.rename(index={'stat': 'T cell'})
mat

[24]:

gene_symbol	A1BG	A2M	A2MP1	AAAS	AACS	AAGAB	AAK1	AAMDC	AAMP	AAR2	...	ZUP1	ZWILCH	ZWINT	ZXDA	ZXDB	ZXDC	ZYG11B	ZYX	ZZEF1	ZZZ3
T cell	-1.166608	-3.832084	0.429274	0.599398	1.066916	-0.019375	-0.632058	0.259132	1.467107	-0.454862	...	1.559071	1.360735	3.547976	-0.786136	-3.314158	-0.985527	0.8239	1.319482	0.078353	-0.295476

1 rows × 10270 columns

Transcription factor activity inference

The first functional analysis we can perform is to infer transcription factor (TF) activities from our transcriptomics data. We will need a gene regulatory network (GRN) and a statistical method.

CollecTRI network

CollecTRI is a comprehensive resource containing a curated collection of TFs and their transcriptional targets compiled from 12 different resources. This collection provides an increased coverage of transcription factors and a superior performance in identifying perturbed TFs compared to our previous DoRothEA network and other literature based GRNs. Similar to DoRothEA, interactions are weighted by their mode of regulation (activation or inhibition).

For this example we will use the human version (mouse and rat are also available). We can use decoupler to retrieve it from omnipath. The argument split_complexes keeps complexes or splits them into subunits, by default we recommend to keep complexes together.

Note

In this tutorial we use the network CollecTRI, but we could use any other GRN coming from an inference method such as CellOracle, pySCENIC or SCENIC+.

[25]:

# Retrieve CollecTRI gene regulatory network
collectri = dc.get_collectri(organism='human', split_complexes=False)
collectri

[25]:

	source	target	weight	PMID
0	MYC	TERT	1	10022128;10491298;10606235;10637317;10723141;1...
1	SPI1	BGLAP	1	10022617
2	SMAD3	JUN	1	10022869;12374795
3	SMAD4	JUN	1	10022869;12374795
4	STAT5A	IL2	1	10022878;11435608;17182565;17911616;22854263;2...
...	...	...	...	...
43173	NFKB	hsa-miR-143-3p	1	19472311
43174	AP1	hsa-miR-206	1	19721712
43175	NFKB	hsa-miR-21-5p	1	20813833;22387281
43176	NFKB	hsa-miR-224-5p	1	23474441;23988648
43177	AP1	hsa-miR-144	1	23546882

43178 rows × 4 columns

Activity inference with Univariate Linear Model (ULM)

To infer TF enrichment scores we will run the Univariate Linear Model (ulm) method. For each sample in our dataset (mat) and each TF in our network (net), it fits a linear model that predicts the observed gene expression based solely on the TF’s TF-Gene interaction weights. Once fitted, the obtained t-value of the slope is the score. If it is positive, we interpret that the TF is active and if it is negative we interpret that it is inactive.

78918b7943ba4dbb9ee88584308ffed5

We can run ulm with a one-liner:

[26]:

# Infer pathway activities with ulm
tf_acts, tf_pvals = dc.run_ulm(mat=mat, net=collectri)
tf_acts

[26]:

	ABL1	AHR	AIRE	AP1	APEX1	AR	ARID3B	ARID4A	ARID5B	ARNT	...	ZNF350	ZNF354C	ZNF362	ZNF382	ZNF384	ZNF395	ZNF436	ZNF699	ZNF76	ZNF91
T cell	0.665785	1.382473	-1.763323	-0.864442	0.057612	0.100369	-0.590489	-0.141929	1.003913	-0.455601	...	0.651418	-1.856305	1.019936	0.053302	-1.483298	1.173904	0.410554	-0.319962	1.88082	-1.32186

1 rows × 547 columns

Let us plot the obtained scores for the top active/inactive transcription factors:

[27]:

dc.plot_barplot(
    acts=tf_acts,
    contrast='T cell',
    top=25,
    vertical=True,
    figsize=(3, 6)
)

../_images/notebooks_pseudobulk_51_0.png

In accordance to the previous pathway results, T cells seem to activate for TFs responsible for cell growth (E2F4, TFDP1, E2F1).

Like with pathways, we can explore how the target genes look like:

[28]:

# Extract logFCs and pvals
logFCs = results_df[['log2FoldChange']].T.rename(index={'log2FoldChange': 'T cell'})
pvals = results_df[['padj']].T.rename(index={'padj': 'T cell'})

# Plot
dc.plot_volcano(
    logFCs=logFCs,
    pvals=pvals,
    contrast='T cell',
    name='E2F4',
    net=collectri,
    top=10,
    sign_thr=0.05,
    lFCs_thr=0.5
)

../_images/notebooks_pseudobulk_54_0.png

We can also plot the network of interesting TFs (top and bottom by activity) and color the nodes by activity and target gene expression:

[29]:

dc.plot_network(
    net=collectri,
    obs=mat,
    act=tf_acts,
    n_sources=['MYC', 'E2F4', 'HSF1', 'GATA6'],
    n_targets=15,
    node_size=100,
    figsize=(7, 7),
    c_pos_w='darkgreen',
    c_neg_w='darkred',
    vcenter=True
)

../_images/notebooks_pseudobulk_56_0.png

Green edges are positive regulation (activation), red edges are negative regulation (inactivation):

Pathway activity inference

Another analysis we can perform is to infer pathway activities from our transcriptomics data.

PROGENy model

PROGENy is a comprehensive resource containing a curated collection of pathways and their target genes, with weights for each interaction. For this example we will use the human weights (other organisms are available) and we will use the top 500 responsive genes ranked by p-value. Here is a brief description of each pathway:

Androgen: involved in the growth and development of the male reproductive organs.
EGFR: regulates growth, survival, migration, apoptosis, proliferation, and differentiation in mammalian cells
Estrogen: promotes the growth and development of the female reproductive organs.
Hypoxia: promotes angiogenesis and metabolic reprogramming when O2 levels are low.
JAK-STAT: involved in immunity, cell division, cell death, and tumor formation.
MAPK: integrates external signals and promotes cell growth and proliferation.
NFkB: regulates immune response, cytokine production and cell survival.
p53: regulates cell cycle, apoptosis, DNA repair and tumor suppression.
PI3K: promotes growth and proliferation.
TGFb: involved in development, homeostasis, and repair of most tissues.
TNFa: mediates haematopoiesis, immune surveillance, tumour regression and protection from infection.
Trail: induces apoptosis.
VEGF: mediates angiogenesis, vascular permeability, and cell migration.
WNT: regulates organ morphogenesis during development and tissue repair.

To access it we can use decoupler.

[30]:

# Retrieve PROGENy model weights
progeny = dc.get_progeny(top=500)
progeny

[30]:

	source	target	weight	p_value
0	Androgen	TMPRSS2	11.490631	0.000000e+00
1	Androgen	NKX3-1	10.622551	2.242078e-44
2	Androgen	MBOAT2	10.472733	4.624285e-44
3	Androgen	KLK2	10.176186	1.944414e-40
4	Androgen	SARG	11.386852	2.790209e-40
...	...	...	...	...
6995	p53	ZMYM4	-2.325752	1.522388e-06
6996	p53	CFDP1	-1.628168	1.526045e-06
6997	p53	VPS37D	2.309503	1.537098e-06
6998	p53	TEDC1	-2.274823	1.547037e-06
6999	p53	CCDC138	-3.205113	1.568160e-06

7000 rows × 4 columns

Activity inference with Multivariate Linear Model (MLM)

To infer pathway enrichment scores we will run the Multivariate Linear Model (mlm) method. For each sample in our dataset (adata), it fits a linear model that predicts the observed gene expression based on all pathways’ Pathway-Gene interactions weights. Once fitted, the obtained t-values of the slopes are the scores. If it is positive, we interpret that the pathway is active and if it is negative we interpret that it is inactive.

ccb8c072809542dd94fabb41a16b2f03

We can run mlm with a one-liner:

[31]:

# Infer pathway activities with mlm
pathway_acts, pathway_pvals = dc.run_mlm(mat=mat, net=progeny)
pathway_acts

[31]:

	Androgen	EGFR	Estrogen	Hypoxia	JAK-STAT	MAPK	NFkB	PI3K	TGFb	TNFa	Trail	VEGF	WNT	p53
T cell	-0.480535	1.286544	1.866836	-6.126904	9.053461	-1.218023	-6.120244	-0.572253	-2.101866	1.693417	0.167998	0.233807	0.458145	-1.721158

Let us plot the obtained scores:

[32]:

dc.plot_barplot(
    acts=pathway_acts,
    contrast='T cell',
    top=25,
    vertical=False,
    figsize=(6, 3)
)

../_images/notebooks_pseudobulk_63_0.png

It looks like JAK-STAT, a known immunity pathway is more active in T cells from COVID-19 patients than in controls. To further explore how the target genes of a pathway of interest behave, we can plot them in scatter plot:

[33]:

dc.plot_targets(
    data=results_df,
    stat='stat',
    source_name='JAK-STAT',
    net=progeny,
    top=15
)

../_images/notebooks_pseudobulk_65_0.png

The observed activation of JAK-STAT is due to the fact that majority of its target genes with positive weights have positive t-values (1st quadrant), and the majority of the ones with negative weights have negative t-values (3d quadrant).

Functional enrichment of biological terms

Finally, we can also infer activities for general biological terms or processes.

MSigDB gene sets

The Molecular Signatures Database (MSigDB) is a resource containing a collection of gene sets annotated to different biological processes.

[34]:

# Retrieve MSigDB resource
msigdb = dc.get_resource('MSigDB')
msigdb

[34]:

	genesymbol	collection	geneset
0	MAFF	chemical_and_genetic_perturbations	BOYAULT_LIVER_CANCER_SUBCLASS_G56_DN
1	MAFF	chemical_and_genetic_perturbations	ELVIDGE_HYPOXIA_UP
2	MAFF	chemical_and_genetic_perturbations	NUYTTEN_NIPP1_TARGETS_DN
3	MAFF	immunesigdb	GSE17721_POLYIC_VS_GARDIQUIMOD_4H_BMDC_DN
4	MAFF	chemical_and_genetic_perturbations	SCHAEFFER_PROSTATE_DEVELOPMENT_12HR_UP
...	...	...	...
3838543	PRAMEF22	go_biological_process	GOBP_POSITIVE_REGULATION_OF_CELL_POPULATION_PR...
3838544	PRAMEF22	go_biological_process	GOBP_APOPTOTIC_PROCESS
3838545	PRAMEF22	go_biological_process	GOBP_REGULATION_OF_CELL_DEATH
3838546	PRAMEF22	go_biological_process	GOBP_NEGATIVE_REGULATION_OF_DEVELOPMENTAL_PROCESS
3838547	PRAMEF22	go_biological_process	GOBP_NEGATIVE_REGULATION_OF_CELL_DEATH

3838548 rows × 3 columns

As an example, we will use the hallmark gene sets, but we could have used any other.

Note

To see what other collections are available in MSigDB, type: msigdb['collection'].unique().

We can filter by for hallmark:

[35]:

# Filter by hallmark
msigdb = msigdb[msigdb['collection']=='hallmark']

# Remove duplicated entries
msigdb = msigdb[~msigdb.duplicated(['geneset', 'genesymbol'])]

# Rename
msigdb.loc[:, 'geneset'] = [name.split('HALLMARK_')[1] for name in msigdb['geneset']]

msigdb

[35]:

	genesymbol	collection	geneset
233	MAFF	hallmark	IL2_STAT5_SIGNALING
250	MAFF	hallmark	COAGULATION
270	MAFF	hallmark	HYPOXIA
373	MAFF	hallmark	TNFA_SIGNALING_VIA_NFKB
377	MAFF	hallmark	COMPLEMENT
...	...	...	...
1449668	STXBP1	hallmark	PANCREAS_BETA_CELLS
1450315	ELP4	hallmark	PANCREAS_BETA_CELLS
1450526	GCG	hallmark	PANCREAS_BETA_CELLS
1450731	PCSK2	hallmark	PANCREAS_BETA_CELLS
1450916	PAX6	hallmark	PANCREAS_BETA_CELLS

7318 rows × 3 columns

Enrichment with Over Representation Analysis (ORA)

To infer functional enrichment scores we will run the Over Representation Analysis (ora) method. As input data it accepts an expression matrix (decoupler.run_ora) or the results of differential expression analysis (decoupler.run_ora_df). For the former, by default the top 5% of expressed genes by sample are selected as the set of interest (S), and for the latter a user-defined significance filtering can be used. Once we have S, it builds a contingency table using set operations for each set stored in the gene set resource being used (net). Using the contingency table, ora performs a one-sided Fisher exact test to test for significance of overlap between sets. The final score is obtained by log-transforming the obtained p-values, meaning that higher values are more significant.

0e899a8c9bee46c4ad18a86cb853051f

We can run ora with a simple one-liner:

[36]:

# Infer enrichment with ora using significant deg
top_genes = results_df[results_df['padj'] < 0.05]

# Run ora
enr_pvals = dc.get_ora_df(
    df=top_genes,
    net=msigdb,
    source='geneset',
    target='genesymbol'
)

enr_pvals.head()

[36]:

	Term	Set size	Overlap ratio	p-value	FDR p-value	Odds ratio	Combined score	Features
0	ADIPOGENESIS	200	0.030000	0.339486	0.462078	1.367224	1.477044	ACAA2;AIFM1;DECR1;NMT1;PPM1B;TALDO1
1	ALLOGRAFT_REJECTION	200	0.035000	0.199268	0.348719	1.580933	2.550212	CSK;FGR;GBP2;HLA-E;PTPN6;STAT4;TNF
2	ANDROGEN_RESPONSE	100	0.020000	0.690477	0.782571	1.045485	0.387219	HERC3;RPS6KA3
3	APICAL_JUNCTION	200	0.020000	0.702717	0.782571	0.942524	0.332524	ACTG1;CD276;CERCAM;SHC1
4	APICAL_SURFACE	44	0.022727	0.653115	0.782571	1.417684	0.603936	ADAM10

Then we can visualize the most enriched terms:

[37]:

dc.plot_dotplot(
    enr_pvals.sort_values('Combined score', ascending=False).head(15),
    x='Combined score',
    y='Term',
    s='Odds ratio',
    c='FDR p-value',
    scale=0.5,
    figsize=(3, 9)
)

../_images/notebooks_pseudobulk_74_0.png

We can also plot the running score for a given gene set:

[38]:

# Plot
dc.plot_running_score(
    df=results_df,
    stat='stat',
    net=msigdb,
    source='geneset',
    target='genesymbol',
    set_name='MYC_TARGETS_V1'
)

../_images/notebooks_pseudobulk_76_0.png