Conversion to other organisms

Most of the prior knowledge stored in Omnipath is derived from human data, so its resources use human gene names. Nevertheless, using homology we can convert these gene names to those of other organisms.

To showcase how to do this inside decoupler, we will load the MSigDB database and convert its human gene symbols into mouse and fly gene symbols.

[1]:
import decoupler as dc

msigdb = dc.get_resource('MSigDB')
msigdb
[1]:
genesymbol collection geneset
0 MSC oncogenic_signatures PKCA_DN.V1_DN
1 MSC mirna_targets MIR12123
2 MSC chemical_and_genetic_perturbations NIKOLSKY_BREAST_CANCER_8Q12_Q22_AMPLICON
3 MSC immunologic_signatures GSE32986_UNSTIM_VS_GMCSF_AND_CURDLAN_LOWDOSE_S...
4 MSC chemical_and_genetic_perturbations BENPORATH_PRC2_TARGETS
... ... ... ...
2407729 OR2W5P immunologic_signatures GSE22601_DOUBLE_NEGATIVE_VS_CD8_SINGLE_POSITIV...
2407730 OR2W5P immunologic_signatures KANNAN_BLOOD_2012_2013_TIV_AGE_65PLS_REVACCINA...
2407731 OR52L2P immunologic_signatures GSE22342_CD11C_HIGH_VS_LOW_DECIDUAL_MACROPHAGE...
2407732 CSNK2A3 immunologic_signatures OCONNOR_PBMC_MENVEO_ACWYVAX_AGE_30_70YO_7DY_AF...
2407733 AQP12B immunologic_signatures MATSUMIYA_PBMC_MODIFIED_VACCINIA_ANKARA_VACCIN...

2407734 rows × 3 columns

For this example we will filter by the hallmark gene sets collection:

[2]:
# Filter by hallmark
msigdb = msigdb[msigdb['collection']=='hallmark']

# Remove duplicated entries
msigdb = msigdb[~msigdb.duplicated(['geneset', 'genesymbol'])]
msigdb
[2]:
genesymbol collection geneset
11 MSC hallmark HALLMARK_TNFA_SIGNALING_VIA_NFKB
149 ICOSLG hallmark HALLMARK_TNFA_SIGNALING_VIA_NFKB
223 ICOSLG hallmark HALLMARK_INFLAMMATORY_RESPONSE
270 ICOSLG hallmark HALLMARK_ALLOGRAFT_REJECTION
398 FOSL2 hallmark HALLMARK_HYPOXIA
... ... ... ...
878342 FOXO1 hallmark HALLMARK_PANCREAS_BETA_CELLS
878418 GCG hallmark HALLMARK_PANCREAS_BETA_CELLS
878512 PDX1 hallmark HALLMARK_PANCREAS_BETA_CELLS
878605 INS hallmark HALLMARK_PANCREAS_BETA_CELLS
878785 SRP9 hallmark HALLMARK_PANCREAS_BETA_CELLS

7318 rows × 3 columns
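To make the filtering and de-duplication step concrete, here is a minimal sketch on a toy table. The rows are made up for illustration (only the column names and a few labels are borrowed from the output above), and the per-gene-set size count at the end is an optional sanity check, not something decoupler requires.

```python
import pandas as pd

# Toy stand-in for the MSigDB table; the rows are illustrative, not real MSigDB content
toy = pd.DataFrame({
    'genesymbol': ['MSC', 'MSC', 'ICOSLG', 'ICOSLG', 'FOSL2'],
    'collection': ['hallmark', 'hallmark', 'hallmark', 'hallmark',
                   'oncogenic_signatures'],
    'geneset': ['HALLMARK_TNFA_SIGNALING_VIA_NFKB',
                'HALLMARK_TNFA_SIGNALING_VIA_NFKB',
                'HALLMARK_TNFA_SIGNALING_VIA_NFKB',
                'HALLMARK_INFLAMMATORY_RESPONSE',
                'PKCA_DN.V1_DN'],
})

# Keep only the hallmark collection
hallmark = toy[toy['collection'] == 'hallmark']

# Drop repeated (geneset, genesymbol) pairs, keeping the first occurrence
dedup = hallmark[~hallmark.duplicated(['geneset', 'genesymbol'])]

# Number of distinct genes per gene set after de-duplication
sizes = dedup.groupby('geneset')['genesymbol'].nunique()
print(sizes)
```

The same gene may still appear in several gene sets after this step; only exact repeats of a gene within one gene set are removed.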

Then, we can easily translate the obtained resource into mouse genes. Organisms can be specified by their common name, Latin name or NCBI Taxonomy identifier.
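The three ways of naming an organism all refer to the same NCBI Taxonomy entry. The sketch below illustrates this with a small hand-written lookup; `TAXIDS` and `to_taxid` are hypothetical names introduced here, and decoupler resolves organism names internally in its own way, which may differ.

```python
# Hypothetical lookup table for illustration only; not part of decoupler
TAXIDS = {
    'human': 9606, 'homo sapiens': 9606,
    'mouse': 10090, 'mus musculus': 10090,
    'rat': 10116, 'rattus norvegicus': 10116,
    'fruit fly': 7227, 'drosophila melanogaster': 7227,
}

def to_taxid(organism):
    """Return the NCBI Taxonomy identifier for a common name, Latin name or ID."""
    if isinstance(organism, int):
        return organism
    key = str(organism).strip().lower()
    if key.isdigit():
        return int(key)
    return TAXIDS[key]  # raises KeyError for names missing from the toy table

print(to_taxid('mouse'), to_taxid('Mus musculus'), to_taxid(10090))
```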

Note

Translating to an organism for the first time might take a while (around 15 minutes). Since the data is then stored in a cache, subsequent runs will be faster. If you need to reset the cache, run rm -r .pypath/cache/.
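If you prefer to reset the cache from Python rather than from the shell, something along these lines should work. `clear_cache` is a hypothetical helper written for this example, not a decoupler or pypath function, and the directory is passed explicitly because its location depends on your setup.

```python
import shutil
from pathlib import Path

def clear_cache(cache_dir):
    """Delete cache_dir and its contents; return True if something was removed."""
    path = Path(cache_dir)
    if path.is_dir():
        shutil.rmtree(path)
        return True
    return False

# Example call, using the directory from the note above:
# clear_cache('.pypath/cache')
```

The next translation after clearing the cache will download the data again, so expect the long first-run time once more.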

[3]:
# Translate targets
mouse_msigdb = dc.translate_net(msigdb, target_organism = 'mouse', unique_by = ('geneset', 'genesymbol'))
mouse_msigdb
[2022-11-28 18:51:30] [curl] Module `pysftp` not available. Only downloading of a small number of resources relies on this module. Please install by PIP if it is necessary for you.
[3]:
genesymbol collection geneset
0 Msc hallmark HALLMARK_TNFA_SIGNALING_VIA_NFKB
1 Fosl2 hallmark HALLMARK_HYPOXIA
2 Fosl2 hallmark HALLMARK_TNFA_SIGNALING_VIA_NFKB
3 Relb hallmark HALLMARK_TNFA_SIGNALING_VIA_NFKB
4 Plau hallmark HALLMARK_TNFA_SIGNALING_VIA_NFKB
... ... ... ...
7684 Gcg hallmark HALLMARK_PANCREAS_BETA_CELLS
7685 Pdx1 hallmark HALLMARK_PANCREAS_BETA_CELLS
7686 Ins1 hallmark HALLMARK_PANCREAS_BETA_CELLS
7687 Ins2 hallmark HALLMARK_PANCREAS_BETA_CELLS
7688 Srp9 hallmark HALLMARK_PANCREAS_BETA_CELLS

7551 rows × 3 columns

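Note that when performing homology conversion we might gain or lose some genes from one organism to another. For example, human INS maps to both Ins1 and Ins2 in mouse, while genes without a known homolog are dropped.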
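The change in row count can be understood as the result of joining the network against a homology table. The following is a minimal pandas sketch of that idea, not the actual implementation of translate_net. The homology table is hand-written: the INS to Ins1/Ins2 mapping mirrors the output above, while `HUMAN_ONLY` is a made-up placeholder for a gene with no mouse homolog.

```python
import pandas as pd

# Toy human gene set; HUMAN_ONLY is a made-up placeholder gene
human = pd.DataFrame({
    'genesymbol': ['INS', 'GCG', 'HUMAN_ONLY'],
    'geneset': ['HALLMARK_PANCREAS_BETA_CELLS'] * 3,
})

# Hand-written homology table (assumption): one-to-many for INS, nothing for HUMAN_ONLY
homology = pd.DataFrame({
    'genesymbol': ['INS', 'INS', 'GCG'],
    'target': ['Ins1', 'Ins2', 'Gcg'],
})

mouse = (
    human
    .merge(homology, on='genesymbol', how='inner')   # genes without homologs are lost
    .drop(columns='genesymbol')
    .rename(columns={'target': 'genesymbol'})
    .drop_duplicates(['geneset', 'genesymbol'])      # same role as unique_by
    .reset_index(drop=True)
)
print(mouse)
```

Here one gene is gained (INS becomes two mouse genes) and one is lost, so the row count happens to stay the same while the content changes.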

Let us try the fruit fly (7227) now:

[4]:
# Translate targets
fly_msigdb = dc.translate_net(msigdb, target_organism = 7227, unique_by = ('genesymbol', 'geneset'))
fly_msigdb
[4]:
genesymbol collection geneset
0 CG12648 hallmark HALLMARK_TNFA_SIGNALING_VIA_NFKB
1 HLH54F hallmark HALLMARK_TNFA_SIGNALING_VIA_NFKB
2 twi hallmark HALLMARK_TNFA_SIGNALING_VIA_NFKB
3 Hand hallmark HALLMARK_TNFA_SIGNALING_VIA_NFKB
4 dl hallmark HALLMARK_TNFA_SIGNALING_VIA_NFKB
... ... ... ...
6594 Dmel\CG17930 hallmark HALLMARK_PANCREAS_BETA_CELLS
6598 G6P hallmark HALLMARK_PANCREAS_BETA_CELLS
6599 amon hallmark HALLMARK_PANCREAS_BETA_CELLS
6600 Cbp53E hallmark HALLMARK_PANCREAS_BETA_CELLS
6601 Srp9 hallmark HALLMARK_PANCREAS_BETA_CELLS

5868 rows × 3 columns
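Because the number of rows changes with each translation, it can be useful to check how much each gene set grew or shrank. The helper below is a sketch written for this purpose; `compare_sizes` is a hypothetical name, not a decoupler function, and the toy tables only stand in for the real ones.

```python
import pandas as pd

def compare_sizes(original, translated, by='geneset', gene='genesymbol'):
    """Count distinct genes per gene set before and after translation."""
    sizes = pd.concat(
        [
            original.groupby(by)[gene].nunique().rename('original'),
            translated.groupby(by)[gene].nunique().rename('translated'),
        ],
        axis=1,
    ).fillna(0).astype(int)  # a gene set missing on one side counts as size 0
    sizes['change'] = sizes['translated'] - sizes['original']
    return sizes

# Toy data: set A loses a gene, set B gains one
orig_toy = pd.DataFrame({'genesymbol': ['G1', 'G2', 'G3', 'G4'],
                         'geneset': ['A', 'A', 'A', 'B']})
trans_toy = pd.DataFrame({'genesymbol': ['g1', 'g2', 'g4a', 'g4b'],
                          'geneset': ['A', 'A', 'B', 'B']})
sizes_toy = compare_sizes(orig_toy, trans_toy)
print(sizes_toy)
```

With the tables from this notebook, the call would be compare_sizes(msigdb, fly_msigdb). Gene sets that shrink substantially may be less informative in the target organism.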

The translate_net function provides finer control, but in most cases it’s enough to pass the name of the desired organism to the functions that download the data:

[5]:
spw = dc.get_resource('SignaLink_pathway', organism = 'rat')
spw
[5]:
genesymbol pathway
0 Tab2 JAK/STAT
1 Tab2 Receptor tyrosine kinase
2 Tab2 Toll-like receptor
3 Tab2 Innate immune pathways
4 Map3k7 Toll-like receptor
... ... ...
937 Sit1 T-cell receptor
938 Nfatc2 T-cell receptor
939 Nfatc2 B-cell receptor
940 Rasgrp1 T-cell receptor
941 Rasgrp1 B-cell receptor

942 rows × 2 columns

PROGENy and DoRothEA have their own dedicated functions, which work in a similar way:

[6]:
dc.get_progeny(organism = 'Mus musculus')
[6]:
source target weight p_value
0 Androgen Tmprss2 11.490631 0.000000e+00
1 Androgen Nkx3-1 10.622551 2.242078e-44
2 NFkB Nkx3-1 2.372983 5.589476e-32
3 TNFa Nkx3-1 2.871633 1.044050e-27
4 Androgen Mboat2 10.472733 4.624285e-44
... ... ... ... ...
1431 p53 Carns1 4.538734 4.730570e-13
1432 p53 Ccdc150 -3.174527 7.396252e-13
1433 p53 Trem2 4.101937 9.739648e-13
1434 p53 Gdf9 3.355741 1.087433e-12
1435 p53 Nhlh2 2.201638 1.651582e-12

1395 rows × 4 columns

[7]:
dc.get_dorothea(organism = 'mouse')
[7]:
source confidence target weight
0 E2f4 A Mycl 1.000000
1 Tp53 A Ogg1 1.000000
2 E2f4 A Bach1 1.000000
3 Hif1a A Mif 1.000000
4 E2f4 A Aurkb 1.000000
... ... ... ... ...
30885 Irf4 C Jak1 0.333333
30886 Irf4 C Il16 0.333333
30887 Irf4 C Ikzf3 0.333333
30888 Irf4 C Lpp 0.333333
30889 Znf740 C Znf687 0.333333

30890 rows × 4 columns