Cell type annotation from marker genes

In single-cell, we have no prior information of which cell type each cell belongs. To assign cell type labels, we first project all cells in a shared embedded space, then we find communities of cells that show a similar transcription profile and finally we check what cell type specific markers are expressed. If more than one marker gene is available, statistical methods can be used to test if a set of markers is enriched in a given cell population.

In this notebook we showcase how to use decoupler for cell type annotation with the 3k PBMCs 10X data-set. The data consists of 3k PBMCs from a Healthy Donor and is freely available from 10x Genomics here from this webpage

Note

This tutorial assumes that you already know the basics of decoupler. Else, check out the Usage tutorial first.

Loading packages

First, we need to load the relevant packages, scanpy to handle scRNA-seq data and decoupler to use statistical methods.

[1]:
import scanpy as sc
import decoupler as dc

# Only needed for visualization:
import matplotlib.pyplot as plt
import seaborn as sns

Single-cell processing

Loading the data-set

We can download the data easily using scanpy:

[2]:
adata = sc.datasets.pbmc3k()
adata
[2]:
AnnData object with n_obs × n_vars = 2700 × 32738
    var: 'gene_ids'

QC, projection and clustering

Here we follow the standard pre-processing steps as described in the scanpy vignette. These steps carry out the selection and filtration of cells based on quality control metrics, the data normalization and scaling, and the detection of highly variable features.

Note

This is just an example, these steps should change depending on the data.

[3]:
# Basic filtering
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)

# Annotate the group of mitochondrial genes as 'mt'
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)

# Filter cells following standard QC criteria.
adata = adata[adata.obs.n_genes_by_counts < 2500, :]
adata = adata[adata.obs.pct_counts_mt < 5, :]

# Normalize the data
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Identify the 2000 most highly variable genes
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)

# Filter higly variable genes
adata.raw = adata
adata = adata[:, adata.var.highly_variable]

# Regress and scale the data
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])
sc.pp.scale(adata, max_value=10)
/home/badi/miniconda3/envs/decoupler/lib/python3.8/site-packages/scanpy/preprocessing/_normalization.py:155: UserWarning: Revieved a view of an AnnData. Making a copy.
  view_to_actual(adata)

Then we group cells based on the similarity of their transcription profiles. To visualize the communities we perform UMAP reduction.

[4]:
# Generate PCA features
sc.tl.pca(adata, svd_solver='arpack')

# Compute distances in the PCA space, and find cell neighbors
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)

# Generate UMAP features
sc.tl.umap(adata)

# Run leiden clustering algorithm
sc.tl.leiden(adata)

# Visualize
sc.pl.umap(adata, color='leiden', title='RNA UMAP',
           frameon=False, legend_fontweight='normal', legend_fontsize=15)
../_images/notebooks_cell_annotation_9_0.png

At this stage, we have identified communities of cells that show a similar transcriptomic profile, and we would like to know to which cell type they probably belong.

Marker genes

To annotate single cell clusters, one can use cell type specific marker genes. These are genes that are mainly expressed exclusively by a specific cell type, making them useful to distinguish heterogeneous groups of cells. Marker genes were discovered and annotated in previous studies and there are some resources that collect and curate them.

Omnipath is one of the largest available databases of curated prior knowledge. Among its resources, there is PanglaoDB, a database of cell type markers, which can be easily accessed using a wrapper to Omnipath from decoupler.

Note

If you encounter bugs with Omnipath, sometimes is good to just remove its cache using: rm $HOME/.cache/omnipathdb/*

[5]:
# Query Omnipath and get PanglaoDB
markers = dc.get_resource('PanglaoDB')
markers
[5]:
genesymbol canonical_marker cell_type germ_layer human human_sensitivity human_specificity mouse mouse_sensitivity mouse_specificity ncbi_tax_id organ ubiquitiousness
0 CTRB1 False Enterocytes Endoderm True 0.0 0.00439422 True 0.00331126 0.0204803 9606 GI tract 0.017
1 CTRB1 True Acinar cells Endoderm True 1.0 0.000628931 True 0.957143 0.0159201 9606 Pancreas 0.017
2 KLK1 True Acinar cells Endoderm True 0.833333 0.00503145 True 0.314286 0.0128263 9606 Pancreas 0.013
3 KLK1 False Goblet cells Endoderm True 0.588235 0.00503937 True 0.903226 0.0124084 9606 GI tract 0.013
4 KLK1 False Epithelial cells Mesoderm True 0.0 0.00823306 True 0.225806 0.0137585 9606 Epithelium 0.013
... ... ... ... ... ... ... ... ... ... ... ... ... ...
8472 SLC14A1 True Urothelial cells Mesoderm True 0.0 0.0181704 True 0.0 0.0 9606 Urinary bladder 0.008
8473 UPK3A True Urothelial cells Mesoderm True 0.0 0.0 True 0.0 0.0 9606 Urinary bladder 0.0
8474 UPK1A True Urothelial cells Mesoderm True 0.0 0.0 True 0.0 0.0 9606 Urinary bladder 0.0
8475 UPK2 True Urothelial cells Mesoderm True 0.0 0.0 True 0.0 0.0 9606 Urinary bladder 0.0
8476 UPK3B True Urothelial cells Mesoderm True 0.0 0.0 True 0.0 0.0 9606 Urinary bladder 0.0

8477 rows × 13 columns

Since our data-set is from human cells, and we want best quality of the markers, we can filter by canonical_marker and human:

[6]:
# Filter by canonical_marker and human
markers = markers[(markers['human']=='True')&(markers['canonical_marker']=='True')]

# Remove duplicated entries
markers = markers[~markers.duplicated(['cell_type', 'genesymbol'])]
markers
[6]:
genesymbol canonical_marker cell_type germ_layer human human_sensitivity human_specificity mouse mouse_sensitivity mouse_specificity ncbi_tax_id organ ubiquitiousness
1 CTRB1 True Acinar cells Endoderm True 1.0 0.000628931 True 0.957143 0.0159201 9606 Pancreas 0.017
2 KLK1 True Acinar cells Endoderm True 0.833333 0.00503145 True 0.314286 0.0128263 9606 Pancreas 0.013
5 KLK1 True Principal cells Mesoderm True 0.0 0.00814536 True 0.285714 0.0140583 9606 Kidney 0.013
7 KLK1 True Plasmacytoid dendritic cells Mesoderm True 0.0 0.00820189 True 1.0 0.0129136 9606 Immune system 0.013
8 KLK1 True Endothelial cells Mesoderm True 0.0 0.00841969 True 0.0 0.0149153 9606 Vasculature 0.013
... ... ... ... ... ... ... ... ... ... ... ... ... ...
8472 SLC14A1 True Urothelial cells Mesoderm True 0.0 0.0181704 True 0.0 0.0 9606 Urinary bladder 0.008
8473 UPK3A True Urothelial cells Mesoderm True 0.0 0.0 True 0.0 0.0 9606 Urinary bladder 0.0
8474 UPK1A True Urothelial cells Mesoderm True 0.0 0.0 True 0.0 0.0 9606 Urinary bladder 0.0
8475 UPK2 True Urothelial cells Mesoderm True 0.0 0.0 True 0.0 0.0 9606 Urinary bladder 0.0
8476 UPK3B True Urothelial cells Mesoderm True 0.0 0.0 True 0.0 0.0 9606 Urinary bladder 0.0

5180 rows × 13 columns

For this example we will use these markers, but any collection of genes could be used. To see the list of available resources inside Omnipath, run dc.show_resources()

Enrichment with Over Representation Analysis

To test if a collection of genes are enriched in a given cell type, we will run Over Representation Analysis (ora), also known as Fisher exact test, but we could do it with any of the other available methods in decoupler.

ora selects the top 5% expressed genes for each cell, and tests if a collection of marker genes are enriched in the top expressed collection.

To run decoupler methods, we need an input matrix (mat), an input prior knowledge network/resource (net), and the name of the columns of net that we want to use.

[7]:
dc.run_ora(mat=adata, net=markers, source='cell_type', target='genesymbol', min_n=3, verbose=True)
1 features of mat are empty, they will be removed.
Running ora on mat with 2638 samples and 13713 targets for 114 sources.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2638/2638 [00:03<00:00, 862.79it/s]

The obtained scores (-log10(p-value))(ora_estimate) and p-values (ora_pvals) are stored in the .obsm key:

[8]:
adata.obsm['ora_estimate']
[8]:
source Acinar cells Adipocytes Airway goblet cells Alpha cells Alveolar macrophages Astrocytes B cells B cells memory B cells naive Basophils ... T cells T follicular helper cells T helper cells T regulatory cells Tanycytes Taste receptor cells Thymocytes Trophoblast cells Tuft cells Urothelial cells
AAACATACAACCAC-1 0.496369 0.093412 -0.000000 1.443226 0.885021 0.581353 4.624325 1.459548 1.399835 -0.000000 ... 13.523759 -0.0 1.036748 -0.000000 -0.000000 -0.0 3.325696 -0.000000 1.036748 -0.0
AAACATTGAGCTAC-1 0.496369 1.114303 -0.000000 0.569257 -0.000000 1.119542 5.592250 8.719924 15.198236 0.306499 ... 2.431181 -0.0 0.389675 -0.000000 -0.000000 -0.0 0.827013 0.885021 2.858756 -0.0
AAACATTGATCAGC-1 0.496369 0.316389 -0.000000 1.443226 0.885021 1.119542 0.903197 0.966789 2.580070 1.087856 ... 8.800720 -0.0 1.036748 -0.000000 0.663965 -0.0 2.367740 -0.000000 1.871866 -0.0
AAACCGTGCTTCCG-1 1.278757 1.659624 -0.000000 1.443226 0.885021 0.195977 0.903197 0.074112 0.923023 0.306499 ... 0.718715 -0.0 -0.000000 -0.000000 -0.000000 -0.0 0.298770 -0.000000 0.389675 -0.0
AAACCGTGTATGCG-1 -0.000000 -0.000000 0.885021 0.569257 0.885021 -0.000000 0.903197 0.074112 0.246287 1.623875 ... 3.162941 -0.0 1.036748 -0.000000 0.663965 -0.0 0.298770 -0.000000 0.389675 -0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
TTTCGAACTCTCAT-1 1.278757 0.661862 -0.000000 1.443226 0.885021 0.581353 1.469558 0.262017 1.399835 1.087856 ... 1.198738 -0.0 0.389675 0.569257 0.663965 -0.0 2.367740 -0.000000 1.036748 -0.0
TTTCTACTGAGGCA-1 -0.000000 -0.000000 -0.000000 1.443226 0.885021 0.195977 2.135997 1.459548 1.954926 1.087856 ... 1.198738 -0.0 -0.000000 0.569257 0.663965 -0.0 0.827013 -0.000000 -0.000000 -0.0
TTTCTACTTCCTCG-1 -0.000000 -0.000000 0.885021 1.443226 -0.000000 0.195977 10.042270 10.852223 15.198236 0.089844 ... 1.773449 -0.0 0.389675 -0.000000 -0.000000 -0.0 3.325696 -0.000000 1.036748 -0.0
TTTGCATGAGAGGC-1 -0.000000 0.316389 0.885021 0.569257 -0.000000 0.195977 6.621109 7.714603 12.817177 0.089844 ... 1.198738 -0.0 0.389675 0.569257 -0.000000 -0.0 1.527899 -0.000000 1.036748 -0.0
TTTGCATGCCTCAC-1 0.496369 0.093412 -0.000000 -0.000000 0.885021 -0.000000 0.454158 0.966789 1.399835 0.306499 ... 7.731358 -0.0 1.036748 -0.000000 1.657599 -0.0 1.527899 -0.000000 1.871866 -0.0

2638 rows × 114 columns

Visualization

To visualize the obtianed scores, we can re-use many of scanpy’s plotting functions. First though, we need to extract them from the adata object.

[9]:
acts = dc.get_acts(adata, obsm_key='ora_estimate')
acts
[9]:
AnnData object with n_obs × n_vars = 2638 × 114
    obs: 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden'
    uns: 'log1p', 'hvg', 'pca', 'neighbors', 'umap', 'leiden', 'leiden_colors'
    obsm: 'X_pca', 'X_umap', 'ora_estimate', 'ora_pvals'

dc.get_acts returns a new AnnData object which holds the obtained activities in its .X attribute, allowing us to re-use many scanpy functions, for example:

[10]:
sc.pl.umap(acts, color='NK cells')
../_images/notebooks_cell_annotation_24_0.png

The cells highlighted seem to be enriched by NK cell marker genes.

Annotation

With decoupler we can also see what is the mean enrichment per group, in our case the cluster groups obtained from clustering:

[11]:
mean_enr = dc.summarize_acts(acts, groupby='leiden', min_std=1)
mean_enr
[11]:
B cells B cells memory B cells naive Dendritic cells Gamma delta T cells Macrophages Megakaryocytes Microglia Monocytes NK cells Neutrophils Plasma cells Platelets T cells
0 1.674612 1.206118 2.126226 3.089706 1.282204 0.656916 0.558249 0.824813 1.969143 2.502024 0.733604 1.098518 2.097416 8.560976
1 1.491951 0.537041 1.358659 12.386445 0.392409 5.205774 0.994419 3.438604 6.913848 0.876738 4.510537 1.040281 2.978631 1.319387
2 6.927173 7.202431 12.459538 7.860511 0.424341 0.726596 0.453553 1.126456 1.839738 1.016634 0.781631 4.108602 2.022158 2.391670
3 1.528638 1.003478 1.688443 3.882292 5.015036 1.082678 0.747354 1.013499 1.990285 8.912252 0.738189 1.171692 2.690642 10.510677
4 1.058014 0.690444 1.214375 3.754120 12.026393 1.798656 1.006925 1.552309 2.541472 16.074585 1.316749 1.296880 2.992357 6.312927
5 2.046917 0.715149 1.472404 12.163185 0.667618 5.303366 0.701650 3.465839 6.493895 0.895001 2.756551 0.970993 3.152135 1.381764
6 1.753542 0.827478 2.325148 13.919129 0.783363 3.639544 0.616080 2.822582 4.251034 0.765348 2.401234 1.879233 2.796829 1.497420
7 0.801249 0.592690 1.186940 3.111120 0.701021 1.905558 5.641654 1.086733 0.595006 0.993758 1.055535 1.049253 23.893217 1.044401

We can visualize which cell types are more enriched for each cluster using seaborn:

[12]:
sns.clustermap(mean_enr, xticklabels=mean_enr.columns, cmap='viridis')
plt.show()
../_images/notebooks_cell_annotation_29_0.png

From this plot we see that cluster 7 belongs to Platelets, cluster 4 appear to be NK cells, custers 0 and 3 might be T-cells, cluster 2 should be some sort of B cells and that clusters 6,5 and 1 are Dendritic cells or Monocytes

We can also get what is the maximum value per group:

[13]:
annotation_dict = dc.assign_groups(mean_enr)
annotation_dict
[13]:
{'0': 'T cells',
 '1': 'Dendritic cells',
 '2': 'B cells naive',
 '3': 'T cells',
 '4': 'NK cells',
 '5': 'Dendritic cells',
 '6': 'Dendritic cells',
 '7': 'Platelets'}

Which we can use to annotate our data-set:

[14]:
# Add cell type column based on annotation
adata.obs['cell_type'] = [annotation_dict[clust] for clust in adata.obs['leiden']]

# Visualize
sc.pl.umap(adata, color='cell_type')
/home/badi/miniconda3/envs/decoupler/lib/python3.8/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Reordering categories will always return a new Categorical object.
  c.reorder_categories(natsorted(c.categories), inplace=True)
... storing 'cell_type' as categorical
../_images/notebooks_cell_annotation_33_1.png

Compared to the annotation obtained by the scanpy tutorial, it is very simmilar.

image0

Note

Cell annotation should always be revised by an expert in the tissue of interest, this notebook only shows how to generate a first draft of it.