Cell type annotation from marker genes
In single-cell, we have no prior information of which cell type each cell belongs. To assign cell type labels, we first project all cells in a shared embedded space, then we find communities of cells that show a similar transcription profile and finally we check what cell type specific markers are expressed. If more than one marker gene is available, statistical methods can be used to test if a set of markers is enriched in a given cell population.
In this notebook we showcase how to use decoupler
for cell type annotation with the 3k PBMCs 10X data-set. The data consists of 3k PBMCs from a Healthy Donor and is freely available from 10x Genomics here from this webpage
Note
This tutorial assumes that you already know the basics of decoupler
. Else, check out the Usage tutorial first.
Loading packages
First, we need to load the relevant packages, scanpy
to handle scRNA-seq data and decoupler
to use statistical methods.
[1]:
import scanpy as sc
import decoupler as dc
import numpy as np
# Plotting options, change to your liking
sc.settings.set_figure_params(dpi=200, frameon=False)
sc.set_figure_params(dpi=200)
sc.set_figure_params(figsize=(4, 4))
Single-cell processing
Loading the data-set
We can download the data easily using scanpy
:
[2]:
adata = sc.datasets.pbmc3k()
adata
[2]:
AnnData object with n_obs × n_vars = 2700 × 32738
var: 'gene_ids'
QC, projection and clustering
Here we follow the standard pre-processing steps as described in the scanpy vignette. These steps carry out the selection and filtration of cells based on quality control metrics, the data normalization and scaling, and the detection of highly variable features.
Note
This is just an example, these steps should change depending on the data.
[3]:
# Basic filtering
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
# Annotate the group of mitochondrial genes as 'mt'
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
# Filter cells following standard QC criteria.
adata = adata[adata.obs.n_genes_by_counts < 2500, :]
adata = adata[adata.obs.pct_counts_mt < 5, :]
# Normalize the data
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
# Identify the 2000 most highly variable genes
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
# Filter higly variable genes
adata.raw = adata
adata = adata[:, adata.var.highly_variable]
# Regress and scale the data
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])
sc.pp.scale(adata, max_value=10)
/home/badi/miniconda3/envs/dcp39/lib/python3.9/site-packages/scanpy/preprocessing/_normalization.py:170: UserWarning: Received a view of an AnnData. Making a copy.
view_to_actual(adata)
Then we group cells based on the similarity of their transcription profiles. To visualize the communities we perform UMAP reduction.
[4]:
# Generate PCA features
sc.tl.pca(adata, svd_solver='arpack')
# Compute distances in the PCA space, and find cell neighbors
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
# Generate UMAP features
sc.tl.umap(adata)
# Run leiden clustering algorithm
sc.tl.leiden(adata)
# Visualize
sc.pl.umap(adata, color='leiden', title='RNA UMAP',
frameon=False, legend_fontweight='normal', legend_fontsize=15)
/home/badi/miniconda3/envs/dcp39/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
cax = scatter(
At this stage, we have identified communities of cells that show a similar transcriptomic profile, and we would like to know to which cell type they probably belong.
Marker genes
To annotate single cell clusters, we can use cell type specific marker genes. These are genes that are mainly expressed exclusively by a specific cell type, making them useful to distinguish heterogeneous groups of cells. Marker genes were discovered and annotated in previous studies and there are some resources that collect and curate them.
Omnipath is one of the largest available databases of curated prior knowledge. Among its resources, there is PanglaoDB, a database of cell type markers, which can be easily accessed using a wrapper to Omnipath from decoupler
.
Note
If you encounter bugs with Omnipath
, sometimes is good to just remove its cache using: rm $HOME/.cache/omnipathdb/*
[5]:
# Query Omnipath and get PanglaoDB
markers = dc.get_resource('PanglaoDB')
markers
[5]:
genesymbol | canonical_marker | cell_type | germ_layer | human | human_sensitivity | human_specificity | mouse | mouse_sensitivity | mouse_specificity | ncbi_tax_id | organ | ubiquitiousness | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | CTRB1 | False | Enterocytes | Endoderm | True | 0.0 | 0.00439422 | True | 0.00331126 | 0.0204803 | 9606 | GI tract | 0.017 |
1 | CTRB1 | True | Acinar cells | Endoderm | True | 1.0 | 0.000628931 | True | 0.957143 | 0.0159201 | 9606 | Pancreas | 0.017 |
2 | KLK1 | True | Endothelial cells | Mesoderm | True | 0.0 | 0.00841969 | True | 0.0 | 0.0149153 | 9606 | Vasculature | 0.013 |
3 | KLK1 | False | Goblet cells | Endoderm | True | 0.588235 | 0.00503937 | True | 0.903226 | 0.0124084 | 9606 | GI tract | 0.013 |
4 | KLK1 | False | Epithelial cells | Mesoderm | True | 0.0 | 0.00823306 | True | 0.225806 | 0.0137585 | 9606 | Epithelium | 0.013 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
8456 | SLC14A1 | True | Urothelial cells | Mesoderm | True | 0.0 | 0.0181704 | True | 0.0 | 0.0 | 9606 | Urinary bladder | 0.008 |
8457 | UPK3A | True | Urothelial cells | Mesoderm | True | 0.0 | 0.0 | True | 0.0 | 0.0 | 9606 | Urinary bladder | 0.0 |
8458 | UPK1A | True | Urothelial cells | Mesoderm | True | 0.0 | 0.0 | True | 0.0 | 0.0 | 9606 | Urinary bladder | 0.0 |
8459 | UPK2 | True | Urothelial cells | Mesoderm | True | 0.0 | 0.0 | True | 0.0 | 0.0 | 9606 | Urinary bladder | 0.0 |
8460 | UPK3B | True | Urothelial cells | Mesoderm | True | 0.0 | 0.0 | True | 0.0 | 0.0 | 9606 | Urinary bladder | 0.0 |
8461 rows × 13 columns
Since our data-set is from human cells, and we want best quality of the markers, we can filter by canonical_marker
and human
:
[6]:
# Filter by canonical_marker and human
markers = markers[(markers['human']=='True')&(markers['canonical_marker']=='True')]
# Remove duplicated entries
markers = markers[~markers.duplicated(['cell_type', 'genesymbol'])]
markers
[6]:
genesymbol | canonical_marker | cell_type | germ_layer | human | human_sensitivity | human_specificity | mouse | mouse_sensitivity | mouse_specificity | ncbi_tax_id | organ | ubiquitiousness | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | CTRB1 | True | Acinar cells | Endoderm | True | 1.0 | 0.000628931 | True | 0.957143 | 0.0159201 | 9606 | Pancreas | 0.017 |
2 | KLK1 | True | Endothelial cells | Mesoderm | True | 0.0 | 0.00841969 | True | 0.0 | 0.0149153 | 9606 | Vasculature | 0.013 |
5 | KLK1 | True | Principal cells | Mesoderm | True | 0.0 | 0.00814536 | True | 0.285714 | 0.0140583 | 9606 | Kidney | 0.013 |
6 | KLK1 | True | Acinar cells | Endoderm | True | 0.833333 | 0.00503145 | True | 0.314286 | 0.0128263 | 9606 | Pancreas | 0.013 |
7 | KLK1 | True | Plasmacytoid dendritic cells | Mesoderm | True | 0.0 | 0.00820189 | True | 1.0 | 0.0129136 | 9606 | Immune system | 0.013 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
8456 | SLC14A1 | True | Urothelial cells | Mesoderm | True | 0.0 | 0.0181704 | True | 0.0 | 0.0 | 9606 | Urinary bladder | 0.008 |
8457 | UPK3A | True | Urothelial cells | Mesoderm | True | 0.0 | 0.0 | True | 0.0 | 0.0 | 9606 | Urinary bladder | 0.0 |
8458 | UPK1A | True | Urothelial cells | Mesoderm | True | 0.0 | 0.0 | True | 0.0 | 0.0 | 9606 | Urinary bladder | 0.0 |
8459 | UPK2 | True | Urothelial cells | Mesoderm | True | 0.0 | 0.0 | True | 0.0 | 0.0 | 9606 | Urinary bladder | 0.0 |
8460 | UPK3B | True | Urothelial cells | Mesoderm | True | 0.0 | 0.0 | True | 0.0 | 0.0 | 9606 | Urinary bladder | 0.0 |
5180 rows × 13 columns
For this example we will use these markers, but any collection of genes could be used. To see the list of available resources inside Omnipath
, run dc.show_resources()
Enrichment with Over Representation Analysis (ORA)
To infer functional enrichment scores we will run the Over Representation Analysis (ora
) method. As input data it accepts an expression matrix (decoupler.run_ora
) or the results of differential expression analysis (decoupler.run_ora_df
). For the former, by default the top 5% of expressed genes by sample are selected as the set of interest (S*), and for the latter a user-defined significance filtering can be used. Once we have S*, it builds a contingency table using set operations for
each set stored in the gene set resource being used (net
). Using the contingency table, ora
performs a one-sided Fisher exact test to test for significance of overlap between sets. The final score is obtained by log-transforming the obtained p-values, meaning that higher values are more significant.
We can run ora
with a simple one-liner:
[7]:
dc.run_ora(
mat=adata,
net=markers,
source='cell_type',
target='genesymbol',
min_n=3,
verbose=True
)
1 features of mat are empty, they will be removed.
Running ora on mat with 2638 samples and 13713 targets for 114 sources.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2638/2638 [00:02<00:00, 1033.52it/s]
/home/badi/miniconda3/envs/dcp39/lib/python3.9/site-packages/pandas/core/internals/blocks.py:351: RuntimeWarning: divide by zero encountered in log10
result = func(self.values, **kwargs)
The obtained scores (-log10(p-value))(ora_estimate
) and p-values (ora_pvals
) are stored in the .obsm
key:
[8]:
adata.obsm['ora_estimate']
[8]:
source | Acinar cells | Adipocytes | Airway goblet cells | Alpha cells | Alveolar macrophages | Astrocytes | B cells | B cells memory | B cells naive | Basophils | ... | T cells | T follicular helper cells | T helper cells | T regulatory cells | Tanycytes | Taste receptor cells | Thymocytes | Trophoblast cells | Tuft cells | Urothelial cells |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AAACATACAACCAC-1 | 0.496369 | 0.093412 | -0.000000 | 1.443226 | 0.885021 | 0.581353 | 4.624325 | 1.459548 | 1.399835 | -0.000000 | ... | 13.523759 | -0.0 | 1.036748 | -0.000000 | -0.000000 | -0.0 | 3.325696 | -0.000000 | 1.036748 | -0.0 |
AAACATTGAGCTAC-1 | 0.496369 | 1.114303 | -0.000000 | 0.569257 | -0.000000 | 1.119542 | 5.592250 | 8.719924 | 15.198236 | 0.306499 | ... | 2.431181 | -0.0 | 0.389675 | -0.000000 | -0.000000 | -0.0 | 0.827013 | 0.885021 | 2.858756 | -0.0 |
AAACATTGATCAGC-1 | 0.496369 | 0.316389 | -0.000000 | 1.443226 | 0.885021 | 1.119542 | 0.903197 | 0.966789 | 2.580070 | 1.087856 | ... | 8.800720 | -0.0 | 1.036748 | -0.000000 | 0.663965 | -0.0 | 2.367740 | -0.000000 | 1.871866 | -0.0 |
AAACCGTGCTTCCG-1 | 1.278757 | 1.659624 | -0.000000 | 1.443226 | 0.885021 | 0.195977 | 0.903197 | 0.074112 | 0.923023 | 0.306499 | ... | 0.718715 | -0.0 | -0.000000 | -0.000000 | -0.000000 | -0.0 | 0.298770 | -0.000000 | 0.389675 | -0.0 |
AAACCGTGTATGCG-1 | -0.000000 | -0.000000 | 0.885021 | 0.569257 | 0.885021 | -0.000000 | 0.903197 | 0.074112 | 0.246287 | 1.623875 | ... | 3.162941 | -0.0 | 1.036748 | -0.000000 | 0.663965 | -0.0 | 0.298770 | -0.000000 | 0.389675 | -0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
TTTCGAACTCTCAT-1 | 1.278757 | 0.661862 | -0.000000 | 1.443226 | 0.885021 | 0.581353 | 1.469558 | 0.262017 | 1.399835 | 1.087856 | ... | 1.198738 | -0.0 | 0.389675 | 0.569257 | 0.663965 | -0.0 | 2.367740 | -0.000000 | 1.036748 | -0.0 |
TTTCTACTGAGGCA-1 | -0.000000 | -0.000000 | -0.000000 | 1.443226 | 0.885021 | 0.195977 | 2.135997 | 1.459548 | 1.954926 | 1.087856 | ... | 1.198738 | -0.0 | -0.000000 | 0.569257 | 0.663965 | -0.0 | 0.827013 | -0.000000 | -0.000000 | -0.0 |
TTTCTACTTCCTCG-1 | -0.000000 | -0.000000 | 0.885021 | 1.443226 | -0.000000 | 0.195977 | 10.042270 | 10.852223 | 15.198236 | 0.089844 | ... | 1.773449 | -0.0 | 0.389675 | -0.000000 | -0.000000 | -0.0 | 3.325696 | -0.000000 | 1.036748 | -0.0 |
TTTGCATGAGAGGC-1 | -0.000000 | 0.316389 | 0.885021 | 0.569257 | -0.000000 | 0.195977 | 6.621109 | 7.714603 | 12.817177 | 0.089844 | ... | 1.198738 | -0.0 | 0.389675 | 0.569257 | -0.000000 | -0.0 | 1.527899 | -0.000000 | 1.036748 | -0.0 |
TTTGCATGCCTCAC-1 | 0.496369 | 0.093412 | -0.000000 | -0.000000 | 0.885021 | -0.000000 | 0.454158 | 0.966789 | 1.399835 | 0.306499 | ... | 7.731358 | -0.0 | 1.036748 | -0.000000 | 1.657599 | -0.0 | 1.527899 | -0.000000 | 1.871866 | -0.0 |
2638 rows × 114 columns
Visualization
To visualize the obtained scores, we can re-use many of scanpy
’s plotting functions. First though, we need to extract them from the adata
object.
[9]:
acts = dc.get_acts(adata, obsm_key='ora_estimate')
# We need to remove inf and set them to the maximum value observed
acts_v = acts.X.ravel()
max_e = np.nanmax(acts_v[np.isfinite(acts_v)])
acts.X[~np.isfinite(acts.X)] = max_e
# We can scale the obtained activities for better visualizations
sc.pp.scale(acts)
acts
[9]:
AnnData object with n_obs × n_vars = 2638 × 114
obs: 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden'
var: 'mean', 'std'
uns: 'log1p', 'hvg', 'pca', 'neighbors', 'umap', 'leiden', 'leiden_colors'
obsm: 'X_pca', 'X_umap', 'ora_estimate', 'ora_pvals'
dc.get_acts
returns a new AnnData
object which holds the obtained activities in its .X
attribute, allowing us to re-use many scanpy
functions, for example:
[10]:
sc.pl.umap(acts, color=['NK cells', 'leiden'], cmap='RdBu_r', vcenter=0)
sc.pl.violin(acts, keys=['NK cells'], groupby='leiden')
/home/badi/miniconda3/envs/dcp39/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
cax = scatter(
The cells highlighted seem to be enriched by NK cell marker genes.
Annotation
With decoupler
we can also identify which are the top predicted cell types per cluster using the function dc.rank_sources_groups
. Here, it identifies “marker” cell types per cluster using same statistical tests available in scanpy’s scanpy.tl.rank_genes_groups
.
[11]:
df = dc.rank_sources_groups(acts, groupby='leiden', reference='rest', method='t-test_overestim_var')
df
[11]:
group | reference | names | statistic | meanchange | pvals | pvals_adj | |
---|---|---|---|---|---|---|---|
0 | 0 | rest | T cells | 30.576289 | 1.064311 | 1.352162e-168 | 7.707325e-167 |
1 | 0 | rest | Nuocytes | 25.162254 | 0.961658 | 1.503492e-120 | 3.427962e-119 |
2 | 0 | rest | T helper cells | 17.692756 | 0.711054 | 2.320398e-65 | 2.204379e-64 |
3 | 0 | rest | Epiblast cells | 15.641921 | 0.649426 | 1.262700e-51 | 9.596521e-51 |
4 | 0 | rest | Pericytes | 13.220418 | 0.541371 | 1.859599e-38 | 1.009496e-37 |
... | ... | ... | ... | ... | ... | ... | ... |
907 | 7 | rest | Satellite cells | -3.256111 | -0.930071 | 6.016945e-03 | 2.858049e-02 |
908 | 7 | rest | Microfold cells | -3.369329 | -0.972097 | 4.600581e-03 | 2.497458e-02 |
909 | 7 | rest | Thymocytes | -3.385984 | -0.972476 | 4.545345e-03 | 2.497458e-02 |
910 | 7 | rest | Monocytes | -3.506819 | -1.017492 | 3.401401e-03 | 2.040840e-02 |
911 | 7 | rest | T cells | -4.158904 | -1.167707 | 1.189031e-03 | 8.471844e-03 |
912 rows × 7 columns
We can then extract the top 3 predicted cell types per cluster:
[12]:
n_ctypes = 3
ctypes_dict = df.groupby('group').head(n_ctypes).groupby('group')['names'].apply(lambda x: list(x)).to_dict()
ctypes_dict
[12]:
{'0': ['T cells', 'Nuocytes', 'T helper cells'],
'1': ['Acinar cells', 'Neutrophils', 'Macrophages'],
'2': ['B cells naive', 'B cells', 'B cells memory'],
'3': ['NK cells', 'T cells', 'Gamma delta T cells'],
'4': ['Gamma delta T cells', 'NK cells', 'Natural killer T cells'],
'5': ['Macrophages', 'Dendritic cells', 'Monocytes'],
'6': ['Dendritic cells', 'Acinar cells', 'Microfold cells'],
'7': ['Platelets', 'Hepatic stellate cells', 'Megakaryocytes']}
We can visualize the obtained top predicted cell types:
[13]:
sc.pl.matrixplot(acts, ctypes_dict, 'leiden', dendrogram=True,
colorbar_title='Z-scaled scores', vmin=-2, vmax=2, cmap='RdBu_r')
WARNING: dendrogram data not found (using key=dendrogram_leiden). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
From this plot we see that cluster 7 belongs to Platelets, cluster 4 appear to be NK cells, custers 0 and 3 might be T-cells, cluster 2 should be some sort of B cells and that clusters 6,5 and 1 belong to the myeloid lineage.
We can check individual cell types by plotting their distributions:
[14]:
sc.pl.violin(acts, keys=['T cells', 'B cells', 'Platelets', 'Monocytes'], groupby='leiden')
The final annotation should be done manually based on the assessment of the enrichment results. However, an automatic prediction can be made by assigning the top predicted cell type per cluster. This approach does not require expertise in the tissue being studied but can be prone to errors. Nonetheless it can be useful to generate a first draft, let’s try it:
[15]:
annotation_dict = df.groupby('group').head(1).set_index('group')['names'].to_dict()
annotation_dict
[15]:
{'0': 'T cells',
'1': 'Acinar cells',
'2': 'B cells naive',
'3': 'NK cells',
'4': 'Gamma delta T cells',
'5': 'Macrophages',
'6': 'Dendritic cells',
'7': 'Platelets'}
Once we have selected the top cell type we can finally annotate:
[16]:
# Add cell type column based on annotation
adata.obs['cell_type'] = [annotation_dict[clust] for clust in adata.obs['leiden']]
# Visualize
sc.pl.umap(adata, color='cell_type')
/home/badi/miniconda3/envs/dcp39/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
cax = scatter(
Compared to the annotation obtained by the scanpy
tutorial, it is very similar but there are some errors, highlything the limitation of automatic annotation.
Note
Cell annotation should always be revised by an expert in the tissue of interest, this notebook only shows how to generate a first draft of it.