Conversion to other organisms
Most of the prior knowledge stored in OmniPath is derived from human data, so its resources use human gene symbols. Using homology, however, we can convert these gene symbols to those of other organisms.
To showcase how to do this in decoupler, we will load the MSigDB database and convert its human gene symbols into mouse and fly gene symbols.
[1]:
import decoupler as dc
msigdb = dc.get_resource('MSigDB')
msigdb
[1]:
| | genesymbol | collection | geneset |
---|---|---|---|
0 | MAFF | chemical_and_genetic_perturbations | BOYAULT_LIVER_CANCER_SUBCLASS_G56_DN |
1 | MAFF | chemical_and_genetic_perturbations | ELVIDGE_HYPOXIA_UP |
2 | MAFF | chemical_and_genetic_perturbations | NUYTTEN_NIPP1_TARGETS_DN |
3 | MAFF | immunesigdb | GSE17721_POLYIC_VS_GARDIQUIMOD_4H_BMDC_DN |
4 | MAFF | chemical_and_genetic_perturbations | SCHAEFFER_PROSTATE_DEVELOPMENT_12HR_UP |
... | ... | ... | ... |
3838543 | PRAMEF22 | go_biological_process | GOBP_POSITIVE_REGULATION_OF_CELL_POPULATION_PR... |
3838544 | PRAMEF22 | go_biological_process | GOBP_APOPTOTIC_PROCESS |
3838545 | PRAMEF22 | go_biological_process | GOBP_REGULATION_OF_CELL_DEATH |
3838546 | PRAMEF22 | go_biological_process | GOBP_NEGATIVE_REGULATION_OF_DEVELOPMENTAL_PROCESS |
3838547 | PRAMEF22 | go_biological_process | GOBP_NEGATIVE_REGULATION_OF_CELL_DEATH |
3838548 rows × 3 columns
For this example, we will keep only the hallmark gene set collection:
[2]:
# Filter by hallmark
msigdb = msigdb[msigdb['collection']=='hallmark']
# Remove duplicated entries
msigdb = msigdb[~msigdb.duplicated(['geneset', 'genesymbol'])]
msigdb
[2]:
| | genesymbol | collection | geneset |
---|---|---|---|
233 | MAFF | hallmark | HALLMARK_IL2_STAT5_SIGNALING |
250 | MAFF | hallmark | HALLMARK_COAGULATION |
270 | MAFF | hallmark | HALLMARK_HYPOXIA |
373 | MAFF | hallmark | HALLMARK_TNFA_SIGNALING_VIA_NFKB |
377 | MAFF | hallmark | HALLMARK_COMPLEMENT |
... | ... | ... | ... |
1449668 | STXBP1 | hallmark | HALLMARK_PANCREAS_BETA_CELLS |
1450315 | ELP4 | hallmark | HALLMARK_PANCREAS_BETA_CELLS |
1450526 | GCG | hallmark | HALLMARK_PANCREAS_BETA_CELLS |
1450731 | PCSK2 | hallmark | HALLMARK_PANCREAS_BETA_CELLS |
1450916 | PAX6 | hallmark | HALLMARK_PANCREAS_BETA_CELLS |
7318 rows × 3 columns
Next, we can translate the filtered resource into mouse gene symbols. Organisms can be specified by their common name, Latin name, or NCBI Taxonomy identifier.
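To illustrate the three ways of naming an organism, the sketch below maps each form to the same NCBI Taxonomy identifier. The `ORGANISMS` table and the `to_taxid` helper are hypothetical and not part of decoupler, which accepts these forms directly; the sketch only shows that the three spellings are equivalent.

```python
# Hypothetical lookup table: NCBI Taxonomy ID -> (common name, Latin name).
ORGANISMS = {
    9606: ("human", "Homo sapiens"),
    10090: ("mouse", "Mus musculus"),
    10116: ("rat", "Rattus norvegicus"),
    7227: ("fruit fly", "Drosophila melanogaster"),
}


def to_taxid(organism):
    """Return the NCBI Taxonomy ID for a common name, Latin name, or ID."""
    # Integers are treated as taxonomy IDs and only checked against the table.
    if isinstance(organism, int):
        if organism in ORGANISMS:
            return organism
        raise ValueError(f"Unknown taxonomy ID: {organism}")
    # Strings are matched case-insensitively against both kinds of name.
    key = str(organism).strip().lower()
    for taxid, names in ORGANISMS.items():
        if key in (name.lower() for name in names):
            return taxid
    raise ValueError(f"Unknown organism: {organism!r}")


# The three spellings refer to the same organism.
mouse_ids = [to_taxid("mouse"), to_taxid("Mus musculus"), to_taxid(10090)]
print(mouse_ids)
```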
Note
Translating to an organism for the first time might take a while (~15 minutes). Because the data is stored in a cache, subsequent runs are faster. If you need to reset the cache, run rm -r .pypath/cache/.
[3]:
# Translate targets
mouse_msigdb = dc.translate_net(msigdb, target_organism = 'mouse', unique_by = ('geneset', 'genesymbol'))
mouse_msigdb
[2023-06-01 11:17:48] [curl] Module `pysftp` not available. Only downloading of a small number of resources relies on this module. Please install by PIP if it is necessary for you.
[3]:
| | genesymbol | collection | geneset |
---|---|---|---|
0 | Maff | hallmark | HALLMARK_IL2_STAT5_SIGNALING |
1 | Maff | hallmark | HALLMARK_COAGULATION |
2 | Maff | hallmark | HALLMARK_HYPOXIA |
3 | Maff | hallmark | HALLMARK_TNFA_SIGNALING_VIA_NFKB |
4 | Maff | hallmark | HALLMARK_COMPLEMENT |
... | ... | ... | ... |
7683 | Stxbp1 | hallmark | HALLMARK_PANCREAS_BETA_CELLS |
7684 | Elp4 | hallmark | HALLMARK_PANCREAS_BETA_CELLS |
7685 | Gcg | hallmark | HALLMARK_PANCREAS_BETA_CELLS |
7686 | Pcsk2 | hallmark | HALLMARK_PANCREAS_BETA_CELLS |
7687 | Pax6 | hallmark | HALLMARK_PANCREAS_BETA_CELLS |
7550 rows × 3 columns
Note that homology conversion can gain or lose genes between organisms, since a human gene may map to several orthologs or to none. Here, the 7318 human entries became 7550 mouse entries.
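Why the row count changes can be seen with a toy example. The sketch below is not decoupler's implementation; it joins a made-up network against a made-up orthology table with pandas, and all gene symbols and set names in it are hypothetical.

```python
import pandas as pd

# Toy human network (hypothetical symbols, not real data).
net = pd.DataFrame({
    "genesymbol": ["GENE_A", "GENE_B", "GENE_C"],
    "geneset": ["SET_1", "SET_1", "SET_2"],
})

# Toy orthology table: GENE_A has two orthologs, GENE_B one, GENE_C none.
orthologs = pd.DataFrame({
    "human": ["GENE_A", "GENE_A", "GENE_B"],
    "target": ["Gene_a1", "Gene_a2", "Gene_b"],
})

# Inner join: genes without an ortholog are dropped,
# while one-to-many mappings add rows.
translated = (
    net.merge(orthologs, left_on="genesymbol", right_on="human", how="inner")
    .drop(columns=["genesymbol", "human"])
    .rename(columns={"target": "genesymbol"})
    .drop_duplicates(["geneset", "genesymbol"])
    .reset_index(drop=True)
)
print(translated)
```

In this toy case SET_1 grows from two to three genes, and SET_2 disappears because its only member has no ortholog.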
Let us try the fruit fly (NCBI Taxonomy identifier 7227) now:
[4]:
# Translate targets
fly_msigdb = dc.translate_net(msigdb, target_organism = 7227, unique_by = ('genesymbol', 'geneset'))
fly_msigdb
[4]:
| | genesymbol | collection | geneset |
---|---|---|---|
0 | Eato | hallmark | HALLMARK_TNFA_SIGNALING_VIA_NFKB |
1 | Eato | hallmark | HALLMARK_PROTEIN_SECRETION |
2 | Eato | hallmark | HALLMARK_ADIPOGENESIS |
3 | Eato | hallmark | HALLMARK_BILE_ACID_METABOLISM |
4 | Eato | hallmark | HALLMARK_INFLAMMATORY_RESPONSE |
... | ... | ... | ... |
6594 | sut4 | hallmark | HALLMARK_PANCREAS_BETA_CELLS |
6595 | G6P | hallmark | HALLMARK_PANCREAS_BETA_CELLS |
6596 | Cbp53E | hallmark | HALLMARK_PANCREAS_BETA_CELLS |
6598 | Rop | hallmark | HALLMARK_PANCREAS_BETA_CELLS |
6599 | amon | hallmark | HALLMARK_PANCREAS_BETA_CELLS |
5866 rows × 3 columns
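Since the fly translation has fewer rows than the human resource (5866 versus 7318), it can be useful to check which gene sets changed in size. The helper below, `geneset_sizes`, is a hypothetical name and not a decoupler function; it is a minimal sketch that assumes both tables have a `geneset` column, shown here on toy data.

```python
import pandas as pd


def geneset_sizes(before, after, by="geneset"):
    """Count rows per gene set before and after translation, side by side."""
    sizes = pd.concat(
        [
            before.groupby(by).size().rename("before"),
            after.groupby(by).size().rename("after"),
        ],
        axis=1,
    )
    # Sets missing from one table get a count of zero.
    sizes = sizes.fillna(0).astype(int)
    sizes["change"] = sizes["after"] - sizes["before"]
    return sizes.sort_values("change")


# Toy stand-ins for the original and translated networks (hypothetical symbols).
human = pd.DataFrame({
    "genesymbol": ["A", "B", "C", "D", "E"],
    "geneset": ["SET_1", "SET_1", "SET_1", "SET_2", "SET_2"],
})
fly = pd.DataFrame({
    "genesymbol": ["a", "b", "d1", "d2", "e"],
    "geneset": ["SET_1", "SET_1", "SET_2", "SET_2", "SET_2"],
})

sizes = geneset_sizes(human, fly)
print(sizes)
```

With the objects from this tutorial, the equivalent call would be `geneset_sizes(msigdb, fly_msigdb)`.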
The translate_net function provides finer control, but in most cases it is enough to pass the name of the desired organism to the functions that download the data:
[5]:
spw = dc.get_resource('SignaLink_pathway', organism = 'rat')
spw
[5]:
| | genesymbol | pathway |
---|---|---|
0 | Tab2 | Toll-like receptor |
1 | Tab2 | Innate immune pathways |
2 | Tab2 | JAK/STAT |
3 | Tab2 | Receptor tyrosine kinase |
4 | Map3k7 | TNF pathway |
... | ... | ... |
937 | Sit1 | T-cell receptor |
938 | Nfatc2 | B-cell receptor |
939 | Nfatc2 | T-cell receptor |
940 | Rasgrp1 | B-cell receptor |
941 | Rasgrp1 | T-cell receptor |
942 rows × 2 columns
PROGENy and CollecTRI have their own dedicated functions, which work in a similar way:
[6]:
dc.get_progeny(organism = 'Mus musculus')
[6]:
| | source | target | weight | p_value |
---|---|---|---|---|
0 | Androgen | Tmprss2 | 11.490631 | 0.000000e+00 |
1 | Androgen | Nkx3-1 | 10.622551 | 2.242078e-44 |
2 | NFkB | Nkx3-1 | 2.372983 | 5.589476e-32 |
3 | TNFa | Nkx3-1 | 2.871633 | 1.044050e-27 |
4 | Androgen | Mboat2 | 10.472733 | 4.624285e-44 |
... | ... | ... | ... | ... |
1389 | p53 | Carns1 | 4.538734 | 4.730570e-13 |
1390 | p53 | Ccdc150 | -3.174527 | 7.396252e-13 |
1391 | p53 | Trem2 | 4.101937 | 9.739648e-13 |
1392 | p53 | Gdf9 | 3.355741 | 1.087433e-12 |
1393 | p53 | Nhlh2 | 2.201638 | 1.651582e-12 |
1394 rows × 4 columns
[7]:
dc.get_collectri(organism = 'mouse')
[7]:
| | source | target | weight |
---|---|---|---|
0 | Myc | Tert | 1 |
1 | Spi1 | Bglap2 | 1 |
2 | Spi1 | Bglap | 1 |
3 | Spi1 | Bglap3 | 1 |
4 | Smad3 | Jun | 1 |
... | ... | ... | ... |
38660 | Runx1 | Lcp2 | 1 |
38661 | Runx1 | Prr5l | 1 |
38662 | Twist1 | Gli1 | 1 |
38663 | Usf1 | Nup188 | 1 |
38664 | Znf148 | Rnls | 1 |
38665 rows × 3 columns