Spatial functional analysis

Spatial transcriptomics technologies yield many molecular readouts that are hard to interpret by themselves. One way of summarizing this information is by inferring biological activities from prior knowledge.

In this notebook we showcase how to use decoupler for transcription factor and pathway activity inference from a human data-set. The data consists of a 10X Genomics Visium slide of a human lymph node and it is available at their website.

Note

This tutorial assumes that you already know the basics of decoupler. Else, check out the Usage tutorial first.

Loading packages

First, we need to load the relevant packages, scanpy to handle RNA-seq data and decoupler to use statistical methods.

[1]:

import scanpy as sc
import decoupler as dc

# Only needed for processing
import numpy as np
import pandas as pd

# Plotting options, change to your liking
sc.settings.set_figure_params(dpi=200, frameon=False)
sc.set_figure_params(dpi=200)
sc.set_figure_params(figsize=(4, 4))

Loading the data

We can download the data easily using scanpy:

[2]:

adata = sc.datasets.visium_sge(sample_id="V1_Human_Lymph_Node")
adata.var_names_make_unique()
adata

/home/badi/miniconda3/envs/dcp39/lib/python3.9/site-packages/anndata/_core/anndata.py:1830: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")

[2]:

AnnData object with n_obs × n_vars = 4035 × 36601
    obs: 'in_tissue', 'array_row', 'array_col'
    var: 'gene_ids', 'feature_types', 'genome'
    uns: 'spatial'
    obsm: 'spatial'

QC, projection and clustering

Here we follow the standard pre-processing steps as described in the scanpy vignette. These steps carry out the selection and filtration of spots based on quality control metrics, the data normalization and scaling, and the detection of highly variable features.

Note

This is just an example, these steps should change depending on the data.

[3]:

# Basic filtering
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=10)

# Annotate the group of mitochondrial genes as 'mt'
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)

# Filter cells following standard QC criteria.
adata = adata[adata.obs.pct_counts_mt < 20, :]

# Normalize the data
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Identify highly variable genes
sc.pp.highly_variable_genes(adata)

# Filter higly variable genes
adata.raw = adata
adata = adata[:, adata.var.highly_variable]

# Scale the data
sc.pp.scale(adata, max_value=10)

/home/badi/miniconda3/envs/dcp39/lib/python3.9/site-packages/scanpy/preprocessing/_normalization.py:170: UserWarning: Received a view of an AnnData. Making a copy.
  view_to_actual(adata)
/home/badi/miniconda3/envs/dcp39/lib/python3.9/site-packages/scanpy/preprocessing/_simple.py:843: UserWarning: Received a view of an AnnData. Making a copy.
  view_to_actual(adata)

Then we group spots based on the similarity of their transcription profiles. We can visualize the obtained clusters on a UMAP or directly at the tissue:

[4]:

# Generate PCA features
sc.tl.pca(adata, svd_solver='arpack')

# Compute distances in the PCA space, and find spot neighbors
sc.pp.neighbors(adata)

# Run leiden clustering algorithm
sc.tl.leiden(adata)

# Visualize
sc.pl.spatial(adata, color=[None, 'leiden'], size=1.5, wspace=0)

Transcription factor activity inference

The first functional analysis we can perform is to infer transcription factor (TF) activities from our transcriptomics data. We will need a gene regulatory network (GRN) and a statistical method.

CollecTRI network

CollecTRI is a comprehensive resource containing a curated collection of TFs and their transcriptional targets compiled from 12 different resources. This collection provides an increased coverage of transcription factors and a superior performance in identifying perturbed TFs compared to our previous DoRothEA network and other literature based GRNs. Similar to DoRothEA, interactions are weighted by their mode of regulation (activation or inhibition).

For this example we will use the human version (mouse and rat are also available). We can use decoupler to retrieve it from omnipath. The argument split_complexes keeps complexes or splits them into subunits, by default we recommend to keep complexes together.

Note

In this tutorial we use the network CollecTRI, but we could use any other GRN coming from an inference method such as CellOracle, pySCENIC or SCENIC+.

[5]:

net = dc.get_collectri(organism='human', split_complexes=False)
net

[5]:

	source	target	weight
0	MYC	TERT	1
1	SPI1	BGLAP	1
2	SMAD3	JUN	1
3	SMAD4	JUN	1
4	STAT5A	IL2	1
...	...	...	...
43173	NFKB	hsa-miR-143-3p	1
43174	AP1	hsa-miR-206	1
43175	NFKB	hsa-miR-21-5p	1
43176	NFKB	hsa-miR-224-5p	1
43177	AP1	hsa-miR-144	1

43178 rows × 3 columns

Activity inference with Univariate Linear Model (ULM)

To infer TF enrichment scores we will run the Univariate Linear Model (ulm) method. For each spot in our slide (adata) and each TF in our network (net), it fits a linear model that predicts the observed gene expression based solely on the TF’s TF-Gene interaction weights. Once fitted, the obtained t-value of the slope is the score. If it is positive, we interpret that the TF is active and if it is negative we interpret that it is inactive.

185cfff1ffda4774bfbaff3da11e13b7

To run decoupler methods, we need an input matrix (mat), an input prior knowledge network/resource (net), and the name of the columns of net that we want to use.

[6]:

dc.run_ulm(
    mat=adata,
    net=net,
    source='source',
    target='target',
    weight='weight',
    verbose=True
)

Running ulm on mat with 4032 samples and 19813 targets for 714 sources.

The obtained scores (ulm_estimate) and p-values (ulm_pvals) are stored in the .obsm key:

[7]:

adata.obsm['ulm_estimate']

[7]:

	ABL1	AHR	AIP	AIRE	AP1	APEX1	AR	ARID1A	ARID1B	ARID3A	...	ZNF382	ZNF384	ZNF395	ZNF423	ZNF436	ZNF699	ZNF76	ZNF804A	ZNF91	ZXDC
AAACAAGTATCTCCCA-1	1.664750	3.047699	-0.592467	0.877489	8.272893	1.953551	7.063058	-0.412894	1.346660	1.669344	...	-3.154237	0.754902	0.323659	-1.135385	-1.544191	-2.494190	-0.174564	0.295525	1.596735	4.779571
AAACAATCTACTAGCA-1	2.581898	2.378814	0.229500	0.778728	7.182971	1.401599	6.705141	0.292956	0.971177	1.500558	...	-1.103691	-0.687120	-0.153904	-1.147676	-1.532484	-2.124298	1.017781	1.731435	1.530658	5.158500
AAACACCAATAACTGC-1	2.523555	2.761837	-0.631667	1.626131	9.711512	3.126717	7.564027	0.062808	1.038032	0.999655	...	-1.156439	-0.824698	0.278299	-1.249198	-1.180094	-1.638796	0.132091	2.430477	1.772135	4.483826
AAACAGAGCGACTCCT-1	0.837963	2.871503	0.518495	1.142492	6.747281	2.877284	6.709942	-0.057628	2.192226	2.020796	...	-1.263882	-0.988079	0.465172	-1.159395	-1.353688	-1.574860	0.418315	1.710886	1.780143	3.950296
AAACAGCTTTCAGAAG-1	1.339041	2.440300	-1.148554	0.170774	6.167157	4.304081	5.498827	0.590393	3.029766	3.160377	...	-2.252383	-0.707415	-0.234697	-1.098986	-0.259631	-1.141442	0.717248	1.339858	1.387293	5.086198
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
TTGTTTCACATCCAGG-1	1.237580	2.214945	-0.950439	1.618132	9.117451	3.436213	6.834700	0.474547	1.792993	0.948023	...	-0.898607	0.761866	-0.296092	-1.086729	-1.588613	-1.606002	0.556196	0.665712	1.902789	4.814034
TTGTTTCATTAGTCTA-1	2.713764	3.681611	0.060108	1.206949	9.005407	3.363005	7.625255	0.356119	1.791808	2.384459	...	-1.778395	-0.433268	-1.025911	-1.153008	-1.209028	-1.989924	0.114471	0.723853	1.998366	3.799276
TTGTTTCCATACAACT-1	2.597066	2.997469	0.514384	1.954959	9.937597	4.220734	7.798340	0.750334	3.398923	1.488187	...	-1.940908	0.891604	0.078869	-0.777850	-1.847299	-1.873346	0.386887	1.877362	0.970325	5.360460
TTGTTTGTATTACACG-1	0.708416	2.617353	1.430133	0.380854	6.830229	2.223228	6.141316	-0.026553	-0.400155	0.139582	...	-1.215524	-0.840903	0.953330	-0.400155	-0.228682	-1.461496	-0.025745	1.854268	1.899999	4.053257
TTGTTTGTGTAAATTC-1	1.334345	2.349135	-0.075793	1.334154	7.452676	1.689037	5.969805	-0.151444	1.065645	1.528148	...	-0.754954	-1.515193	-0.189025	-1.161259	-1.145015	-1.415371	0.815911	0.796343	0.757068	3.184180

4032 rows × 714 columns

Note: Each run of run_ulm overwrites what is inside of ulm_estimate and ulm_pvals. if you want to run ulm with other resources and still keep the activities inside the same AnnData object, you can store the results in any other key in .obsm with different names, for example:

[8]:

adata.obsm['collectri_ulm_estimate'] = adata.obsm['ulm_estimate'].copy()
adata.obsm['collectri_ulm_pvals'] = adata.obsm['ulm_pvals'].copy()
adata

[8]:

AnnData object with n_obs × n_vars = 4032 × 2491
    obs: 'in_tissue', 'array_row', 'array_col', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden'
    var: 'gene_ids', 'feature_types', 'genome', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'spatial', 'log1p', 'hvg', 'pca', 'neighbors', 'leiden', 'leiden_colors'
    obsm: 'spatial', 'X_pca', 'ulm_estimate', 'ulm_pvals', 'collectri_ulm_estimate', 'collectri_ulm_pvals'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'

Visualization

To visualize the obtained scores, we can re-use many of scanpy’s plotting functions. First though, we need to extract them from the adata object.

[9]:

acts = dc.get_acts(adata, obsm_key='collectri_ulm_estimate')

# We can scale the obtained activities for better visualizations
sc.pp.scale(acts)
acts

[9]:

AnnData object with n_obs × n_vars = 4032 × 714
    obs: 'in_tissue', 'array_row', 'array_col', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden'
    var: 'mean', 'std'
    uns: 'spatial', 'log1p', 'hvg', 'pca', 'neighbors', 'leiden', 'leiden_colors'
    obsm: 'spatial', 'X_pca', 'ulm_estimate', 'ulm_pvals', 'collectri_ulm_estimate', 'collectri_ulm_pvals'

dc.get_acts returns a new AnnData object which holds the obtained activities in its .X attribute, allowing us to re-use many scanpy functions, for example:

[10]:

sc.pl.spatial(acts, color=['IRF1', 'leiden'], cmap='RdBu_r', size=1.5, vcenter=0)
sc.pl.violin(acts, keys=['IRF1'], groupby='leiden')

Here we observe the activity infered for IRF1 across spots. Interestingly, IRF1 is a known TF crucial for interferons-mediated signaling pathways. The inference of activities from “foot-prints” of target genes is more informative than just looking at the molecular readouts of a given TF, as an example here is the gene expression of IRF1, which is not very informative by itself:

[11]:

sc.pl.spatial(adata, color=['IRF1', 'leiden'], size=1.5)
sc.pl.violin(adata, keys=['IRF1'], groupby='leiden')

Exploration

Let’s identify which are the top TF per cluster. We can do it by using the function dc.rank_sources_groups, which identifies marker TFs using the same statistical tests available in scanpy’s scanpy.tl.rank_genes_groups.

[12]:

df = dc.rank_sources_groups(acts, groupby='leiden', reference='rest', method='t-test_overestim_var')
df

[12]:

	group	reference	names	statistic	meanchange	pvals	pvals_adj
0	0	rest	TBXT	18.068941	0.917299	4.920475e-66	1.171073e-63
1	0	rest	XBP1	16.784688	0.846852	5.097934e-58	9.099811e-56
2	0	rest	ATF6	16.281186	0.835455	6.567713e-55	9.378695e-53
3	0	rest	NFKBIZ	13.128910	0.646914	2.403248e-37	2.451313e-35
4	0	rest	ETV6	12.829867	0.622046	8.342103e-36	6.618068e-34
...	...	...	...	...	...	...	...
7135	9	rest	TFEB	-3.292531	-0.401085	1.167719e-03	3.928805e-02
7136	9	rest	JUND	-3.508939	-0.407925	5.597394e-04	2.350905e-02
7137	9	rest	HOPX	-3.569018	-0.471087	4.407924e-04	2.248041e-02
7138	9	rest	PGR	-4.104149	-0.466410	6.080808e-05	4.044638e-03
7139	9	rest	SRSF2	-4.541369	-0.585178	9.282127e-06	1.104573e-03

7140 rows × 7 columns

We can then extract the top 3 markers per cluster:

[13]:

n_markers = 3
source_markers = df.groupby('group').head(n_markers).groupby('group')['names'].apply(lambda x: list(x)).to_dict()
source_markers

[13]:

{'0': ['TBXT', 'XBP1', 'ATF6'],
 '1': ['RFX5', 'RFXANK', 'RFXAP'],
 '2': ['STAT4', 'TRERF1', 'BRD4'],
 '3': ['JUN', 'EGR1', 'SP1'],
 '4': ['SCX', 'THRA', 'GATA6'],
 '5': ['E2F4', 'E2F1', 'E2F3'],
 '6': ['CEBPG', 'SPI1', 'FOS'],
 '7': ['ERG', 'NFIX', 'MYOCD'],
 '8': ['IRF9', 'IRF3', 'STAT2'],
 '9': ['E2F3', 'E2F1', 'MYC']}

We can plot the obtained markers:

[14]:

sc.pl.matrixplot(acts, source_markers, 'leiden', dendrogram=True,
                 colorbar_title='Z-scaled scores', vmin=-2, vmax=2, cmap='RdBu_r')

WARNING: dendrogram data not found (using key=dendrogram_leiden). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.

We can visualize again on the tissue some marker TFs:

[15]:

sc.pl.spatial(acts, color=['leiden', 'STAT1', 'RFX5', 'E2F4'], cmap='RdBu_r', size=1.5, vcenter=0)

Pathway activity inference

We can also infer pathway activities from our transcriptomics data.

PROGENy model

PROGENy is a comprehensive resource containing a curated collection of pathways and their target genes, with weights for each interaction. For this example we will use the human weights (other organisms are available) and we will use the top 500 responsive genes ranked by p-value. Here is a brief description of each pathway:

Androgen: involved in the growth and development of the male reproductive organs.
EGFR: regulates growth, survival, migration, apoptosis, proliferation, and differentiation in mammalian cells
Estrogen: promotes the growth and development of the female reproductive organs.
Hypoxia: promotes angiogenesis and metabolic reprogramming when O2 levels are low.
JAK-STAT: involved in immunity, cell division, cell death, and tumor formation.
MAPK: integrates external signals and promotes cell growth and proliferation.
NFkB: regulates immune response, cytokine production and cell survival.
p53: regulates cell cycle, apoptosis, DNA repair and tumor suppression.
PI3K: promotes growth and proliferation.
TGFb: involved in development, homeostasis, and repair of most tissues.
TNFa: mediates haematopoiesis, immune surveillance, tumour regression and protection from infection.
Trail: induces apoptosis.
VEGF: mediates angiogenesis, vascular permeability, and cell migration.
WNT: regulates organ morphogenesis during development and tissue repair.

To access it we can use decoupler.

[16]:

progeny = dc.get_progeny(organism='human', top=500)
progeny

[16]:

	source	target	weight	p_value
0	Androgen	TMPRSS2	11.490631	0.000000e+00
1	Androgen	NKX3-1	10.622551	2.242078e-44
2	Androgen	MBOAT2	10.472733	4.624285e-44
3	Androgen	KLK2	10.176186	1.944414e-40
4	Androgen	SARG	11.386852	2.790209e-40
...	...	...	...	...
6995	p53	ZMYM4	-2.325752	1.522388e-06
6996	p53	CFDP1	-1.628168	1.526045e-06
6997	p53	VPS37D	2.309503	1.537098e-06
6998	p53	TEDC1	-2.274823	1.547037e-06
6999	p53	CCDC138	-3.205113	1.568160e-06

7000 rows × 4 columns

Activity inference with Multivariate Linear Model (MLM)

To infer pathway enrichment scores we will run the Multivariate Linear Model (mlm) method. For each spot in our slide (adata), it fits a linear model that predicts the observed gene expression based on all pathways’ Pathway-Gene interactions weights. Once fitted, the obtained t-values of the slopes are the scores. If it is positive, we interpret that the pathway is active and if it is negative we interpret that it is inactive.

d1365844ab294c2a9358c25472fe3d49

We can run mlm with a simple one-liner:

[17]:

dc.run_mlm(
    mat=adata,
    net=progeny,
    source='source',
    target='target',
    weight='weight',
    verbose=True
)

# Store in new obsm keys
adata.obsm['progeny_mlm_estimate'] = adata.obsm['mlm_estimate'].copy()
adata.obsm['progeny_mlm_pvals'] = adata.obsm['mlm_pvals'].copy()

Running mlm on mat with 4032 samples and 19813 targets for 14 sources.

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.32s/it]

The obtained scores (t-values)(mlm_estimate) and p-values (mlm_pvals) are stored in the .obsm key:

[18]:

adata.obsm['progeny_mlm_estimate']

[18]:

	Androgen	EGFR	Estrogen	Hypoxia	JAK-STAT	MAPK	NFkB	PI3K	TGFb	TNFa	Trail	VEGF	WNT	p53
AAACAAGTATCTCCCA-1	0.663306	2.264166	-0.574110	2.821547	5.734972	-1.251062	-0.985325	-2.134703	0.800705	3.148244	-3.194392	0.497062	0.195314	-4.728683
AAACAATCTACTAGCA-1	-0.189999	0.706570	-1.188558	1.091502	7.026889	-1.989223	-0.815746	-3.118567	-0.691568	3.120665	-3.723227	-0.333131	0.356573	-4.122621
AAACACCAATAACTGC-1	0.878165	2.738217	-1.311710	1.427227	7.337608	-2.275766	-0.471301	-3.337509	0.412428	3.806539	-2.556233	0.013628	0.136550	-4.324521
AAACAGAGCGACTCCT-1	1.193238	1.884050	-1.380661	1.601653	27.206005	-2.138226	0.201282	-2.917499	-0.135760	2.028663	-2.531914	0.084890	0.150363	-4.480249
AAACAGCTTTCAGAAG-1	0.687317	1.209213	-1.195251	1.994144	6.689987	-2.280192	-1.393017	-3.630782	-1.515401	3.209753	-2.255108	0.030629	0.992789	-5.051840
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
TTGTTTCACATCCAGG-1	0.832172	2.937038	-0.822877	3.202930	7.324937	-2.966689	-1.880574	-2.675349	-0.611200	4.197227	-1.183812	0.645803	0.540514	-4.156600
TTGTTTCATTAGTCTA-1	1.346339	2.535902	-1.651201	3.738045	6.091091	-1.763673	-0.191602	-2.441693	-0.916086	3.459913	-2.985128	0.335326	-0.100866	-3.566472
TTGTTTCCATACAACT-1	0.360408	2.331512	-1.077590	2.127944	5.772553	-2.469587	0.458196	-2.803331	0.580796	3.329211	-3.176442	0.032048	0.308023	-2.841499
TTGTTTGTATTACACG-1	0.284011	1.694669	-0.879540	2.717103	7.985269	-1.506677	0.012385	-2.236315	0.370454	2.233846	-2.405654	-0.666762	0.509044	-3.003024
TTGTTTGTGTAAATTC-1	0.292245	1.879703	-0.908702	1.047743	7.130506	-2.432401	-0.748133	-2.873597	-1.601786	2.778569	-3.381333	0.563160	-0.086961	-3.837238

4032 rows × 14 columns

Visualization

Like in the previous section, we will extract the activities from the adata object.

[19]:

acts = dc.get_acts(adata, obsm_key='progeny_mlm_estimate')

# We can scale the obtained activities for better visualizations
sc.pp.scale(acts)
acts

[19]:

AnnData object with n_obs × n_vars = 4032 × 14
    obs: 'in_tissue', 'array_row', 'array_col', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden'
    var: 'mean', 'std'
    uns: 'spatial', 'log1p', 'hvg', 'pca', 'neighbors', 'leiden', 'leiden_colors'
    obsm: 'spatial', 'X_pca', 'ulm_estimate', 'ulm_pvals', 'collectri_ulm_estimate', 'collectri_ulm_pvals', 'mlm_estimate', 'mlm_pvals', 'progeny_mlm_estimate', 'progeny_mlm_pvals'

Once extracted we can visualize them:

[20]:

sc.pl.spatial(acts, color=['Trail', 'leiden'], cmap='RdBu_r', vcenter=0, size=1.5)
sc.pl.violin(acts, keys=['Trail'], groupby='leiden')

Here we show the activity of the pathway Trail, which is associated with apoptosis.

Exploration

We can visualize which pathways are more active in each cluster:

[21]:

sc.pl.matrixplot(acts, var_names=acts.var_names, groupby='leiden', dendrogram=True,
                 colorbar_title='Z-scaled scores', vmin=-2, vmax=2, cmap='RdBu_r')

WARNING: dendrogram data not found (using key=dendrogram_leiden). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.

We can visualize again on the tissue some cluster-specific pathways:

[22]:

sc.pl.spatial(acts, color=['leiden', 'JAK-STAT', 'PI3K', 'TGFb'], cmap='RdBu_r', size=1.5, vcenter=0)

Functional enrichment of biological terms

Finally, we can also infer activities for general biological terms or processes.

MSigDB gene sets

The Molecular Signatures Database (MSigDB) is a resource containing a collection of gene sets annotated to different biological processes.

[23]:

msigdb = dc.get_resource('MSigDB')
msigdb

[23]:

	genesymbol	collection	geneset
0	MAFF	chemical_and_genetic_perturbations	BOYAULT_LIVER_CANCER_SUBCLASS_G56_DN
1	MAFF	chemical_and_genetic_perturbations	ELVIDGE_HYPOXIA_UP
2	MAFF	chemical_and_genetic_perturbations	NUYTTEN_NIPP1_TARGETS_DN
3	MAFF	immunesigdb	GSE17721_POLYIC_VS_GARDIQUIMOD_4H_BMDC_DN
4	MAFF	chemical_and_genetic_perturbations	SCHAEFFER_PROSTATE_DEVELOPMENT_12HR_UP
...	...	...	...
3838543	PRAMEF22	go_biological_process	GOBP_POSITIVE_REGULATION_OF_CELL_POPULATION_PR...
3838544	PRAMEF22	go_biological_process	GOBP_APOPTOTIC_PROCESS
3838545	PRAMEF22	go_biological_process	GOBP_REGULATION_OF_CELL_DEATH
3838546	PRAMEF22	go_biological_process	GOBP_NEGATIVE_REGULATION_OF_DEVELOPMENTAL_PROCESS
3838547	PRAMEF22	go_biological_process	GOBP_NEGATIVE_REGULATION_OF_CELL_DEATH

3838548 rows × 3 columns

As an example, we will use the hallmark gene sets, but we could have used any other.

Note

To see what other collections are available in MSigDB, type: msigdb['collection'].unique().

We can filter by for hallmark:

[24]:

# Filter by hallmark
msigdb = msigdb[msigdb['collection']=='hallmark']

# Remove duplicated entries
msigdb = msigdb[~msigdb.duplicated(['geneset', 'genesymbol'])]

# Rename
msigdb.loc[:, 'geneset'] = [name.split('HALLMARK_')[1] for name in msigdb['geneset']]

msigdb

[24]:

	genesymbol	collection	geneset
233	MAFF	hallmark	IL2_STAT5_SIGNALING
250	MAFF	hallmark	COAGULATION
270	MAFF	hallmark	HYPOXIA
373	MAFF	hallmark	TNFA_SIGNALING_VIA_NFKB
377	MAFF	hallmark	COMPLEMENT
...	...	...	...
1449668	STXBP1	hallmark	PANCREAS_BETA_CELLS
1450315	ELP4	hallmark	PANCREAS_BETA_CELLS
1450526	GCG	hallmark	PANCREAS_BETA_CELLS
1450731	PCSK2	hallmark	PANCREAS_BETA_CELLS
1450916	PAX6	hallmark	PANCREAS_BETA_CELLS

7318 rows × 3 columns

Enrichment with Over Representation Analysis (ORA)

To infer functional enrichment scores we will run the Over Representation Analysis (ora) method. As input data it accepts an expression matrix (decoupler.run_ora) or the results of differential expression analysis (decoupler.run_ora_df). For the former, by default the top 5% of expressed genes by sample are selected as the set of interest (S*), and for the latter a user-defined significance filtering can be used. Once we have S*, it builds a contingency table using set operations for each set stored in the gene set resource being used (net). Using the contingency table, ora performs a one-sided Fisher exact test to test for significance of overlap between sets. The final score is obtained by log-transforming the obtained p-values, meaning that higher values are more significant.

40f2e4601cf54896b5f0ff69b1e63931

We can run ora with a simple one-liner:

[25]:

dc.run_ora(
    mat=adata,
    net=msigdb,
    source='geneset',
    target='genesymbol',
    verbose=True
)

# Store in a different key
adata.obsm['msigdb_ora_estimate'] = adata.obsm['ora_estimate'].copy()
adata.obsm['msigdb_ora_pvals'] = adata.obsm['ora_pvals'].copy()

Running ora on mat with 4032 samples and 19813 targets for 50 sources.

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4032/4032 [00:09<00:00, 423.10it/s]
/home/badi/miniconda3/envs/dcp39/lib/python3.9/site-packages/pandas/core/internals/blocks.py:351: RuntimeWarning: divide by zero encountered in log10
  result = func(self.values, **kwargs)

The obtained scores (-log10(p-value))(ora_estimate) and p-values (ora_pvals) are stored in the .obsm key:

[26]:

adata.obsm['msigdb_ora_estimate'].iloc[:, 0:5]

[26]:

source	ADIPOGENESIS	ALLOGRAFT_REJECTION	ANDROGEN_RESPONSE	ANGIOGENESIS	APICAL_JUNCTION
AAACAAGTATCTCCCA-1	3.242753	26.851582	2.558473	1.603317	3.921909
AAACAATCTACTAGCA-1	3.653319	23.258690	0.985763	0.626249	2.333304
AAACACCAATAACTGC-1	4.084916	23.258690	2.558473	1.066839	4.846537
AAACAGAGCGACTCCT-1	1.529659	30.628265	2.103961	1.066839	3.082310
AAACAGCTTTCAGAAG-1	3.242753	22.390085	2.103961	0.626249	1.393273
...	...	...	...	...	...
TTGTTTCACATCCAGG-1	3.242753	15.890555	3.577630	1.066839	4.373918
TTGTTTCATTAGTCTA-1	3.653319	25.031700	4.732168	2.225120	4.846537
TTGTTTCCATACAACT-1	5.008742	22.390085	3.577630	1.066839	3.491148
TTGTTTGTATTACACG-1	1.529659	23.258690	1.315297	0.626249	2.333304
TTGTTTGTGTAAATTC-1	2.144016	22.390085	1.315297	0.294479	3.491148

4032 rows × 5 columns

Visualization

Like in the previous sections, we will extract the activities from the adata object.

[27]:

acts = dc.get_acts(adata, obsm_key='msigdb_ora_estimate')

# We need to remove inf and set them to the maximum value observed
acts_v = acts.X.ravel()
max_e = np.nanmax(acts_v[np.isfinite(acts_v)])
acts.X[~np.isfinite(acts.X)] = max_e

# We can scale the obtained activities for better visualizations
sc.pp.scale(acts)
acts

[27]:

AnnData object with n_obs × n_vars = 4032 × 50
    obs: 'in_tissue', 'array_row', 'array_col', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden'
    var: 'mean', 'std'
    uns: 'spatial', 'log1p', 'hvg', 'pca', 'neighbors', 'leiden', 'leiden_colors'
    obsm: 'spatial', 'X_pca', 'ulm_estimate', 'ulm_pvals', 'collectri_ulm_estimate', 'collectri_ulm_pvals', 'mlm_estimate', 'mlm_pvals', 'progeny_mlm_estimate', 'progeny_mlm_pvals', 'ora_estimate', 'ora_pvals', 'msigdb_ora_estimate', 'msigdb_ora_pvals'

Once extracted we can visualize them:

[28]:

sc.pl.spatial(acts, color=['APICAL_JUNCTION', 'leiden'], cmap='RdBu_r', vcenter=0, size=1.5)
sc.pl.violin(acts, keys=['APICAL_JUNCTION'], groupby='leiden')

Exploration

Let’s identify which are the top terms per cluster. We can do it by using the function dc.rank_sources_groups, as shown before.

[29]:

df = dc.rank_sources_groups(acts, groupby='leiden', reference='rest', method='t-test_overestim_var')
df

[29]:

	group	reference	names	statistic	meanchange	pvals	pvals_adj
0	0	rest	UNFOLDED_PROTEIN_RESPONSE	11.508374	0.600239	2.056283e-29	1.028141e-27
1	0	rest	KRAS_SIGNALING_DN	4.851394	0.257573	1.355114e-06	5.452166e-06
2	0	rest	HYPOXIA	4.842609	0.237445	1.417563e-06	5.452166e-06
3	0	rest	MTORC1_SIGNALING	3.963683	0.207006	7.728492e-05	2.146803e-04
4	0	rest	UV_RESPONSE_UP	3.157070	0.159434	1.625530e-03	3.694386e-03
...	...	...	...	...	...	...	...
495	9	rest	TNFA_SIGNALING_VIA_NFKB	-3.060138	-0.357946	2.523829e-03	1.147195e-02
496	9	rest	APOPTOSIS	-3.354997	-0.391289	9.545146e-04	5.302859e-03
497	9	rest	COMPLEMENT	-3.406734	-0.411139	7.915183e-04	4.946989e-03
498	9	rest	KRAS_SIGNALING_UP	-3.687964	-0.423865	2.953626e-04	2.109733e-03
499	9	rest	INFLAMMATORY_RESPONSE	-4.972097	-0.543566	1.616106e-06	1.346755e-05

500 rows × 7 columns

We can then extract the top 3 terms per cluster:

[30]:

n_top = 3
term_markers = df.groupby('group').head(n_top).groupby('group')['names'].apply(lambda x: list(x)).to_dict()
term_markers

[30]:

{'0': ['UNFOLDED_PROTEIN_RESPONSE', 'KRAS_SIGNALING_DN', 'HYPOXIA'],
 '1': ['NOTCH_SIGNALING', 'MYC_TARGETS_V2', 'KRAS_SIGNALING_DN'],
 '2': ['ALLOGRAFT_REJECTION', 'WNT_BETA_CATENIN_SIGNALING', 'APOPTOSIS'],
 '3': ['TNFA_SIGNALING_VIA_NFKB', 'COAGULATION', 'COMPLEMENT'],
 '4': ['EPITHELIAL_MESENCHYMAL_TRANSITION', 'UV_RESPONSE_DN', 'MYOGENESIS'],
 '5': ['E2F_TARGETS', 'G2M_CHECKPOINT', 'MYC_TARGETS_V1'],
 '6': ['TNFA_SIGNALING_VIA_NFKB', 'COAGULATION', 'COMPLEMENT'],
 '7': ['COAGULATION', 'KRAS_SIGNALING_UP', 'APICAL_JUNCTION'],
 '8': ['INTERFERON_GAMMA_RESPONSE',
  'INTERFERON_ALPHA_RESPONSE',
  'INFLAMMATORY_RESPONSE'],
 '9': ['MYC_TARGETS_V1', 'OXIDATIVE_PHOSPHORYLATION', 'E2F_TARGETS']}

We can plot the obtained terms:

[31]:

sc.pl.matrixplot(acts, term_markers, 'leiden', dendrogram=True,
                 colorbar_title='Z-scaled scores', vmin=-2, vmax=2, cmap='RdBu_r')

WARNING: dendrogram data not found (using key=dendrogram_leiden). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.

We can visualize again on the tissue some cluster-specific terms:

[32]:

sc.pl.spatial(acts, color=['leiden', 'INTERFERON_ALPHA_RESPONSE', 'G2M_CHECKPOINT',
                           'TNFA_SIGNALING_VIA_NFKB'], cmap='RdBu_r', size=1.5, vcenter=0)