Conversion to other organisms

Most of the prior knowledge stored in OmniPath is derived from human data and therefore uses human gene names. Nevertheless, using homology we can convert these gene names to other organisms.

To show how to do this in decoupler, we will load the MSigDB resource and convert its gene symbols to mouse and fly.

[1]:
import decoupler as dc

msigdb = dc.get_resource('MSigDB')
msigdb
[1]:
         genesymbol  collection                          geneset
0        MAFF        chemical_and_genetic_perturbations  BOYAULT_LIVER_CANCER_SUBCLASS_G56_DN
1        MAFF        chemical_and_genetic_perturbations  ELVIDGE_HYPOXIA_UP
2        MAFF        chemical_and_genetic_perturbations  NUYTTEN_NIPP1_TARGETS_DN
3        MAFF        immunesigdb                         GSE17721_POLYIC_VS_GARDIQUIMOD_4H_BMDC_DN
4        MAFF        chemical_and_genetic_perturbations  SCHAEFFER_PROSTATE_DEVELOPMENT_12HR_UP
...      ...         ...                                 ...
3838543  PRAMEF22    go_biological_process               GOBP_POSITIVE_REGULATION_OF_CELL_POPULATION_PR...
3838544  PRAMEF22    go_biological_process               GOBP_APOPTOTIC_PROCESS
3838545  PRAMEF22    go_biological_process               GOBP_REGULATION_OF_CELL_DEATH
3838546  PRAMEF22    go_biological_process               GOBP_NEGATIVE_REGULATION_OF_DEVELOPMENTAL_PROCESS
3838547  PRAMEF22    go_biological_process               GOBP_NEGATIVE_REGULATION_OF_CELL_DEATH

3838548 rows × 3 columns

For this example we will filter by the hallmark gene sets collection:

[2]:
# Filter by hallmark
msigdb = msigdb[msigdb['collection']=='hallmark']

# Remove duplicated entries
msigdb = msigdb[~msigdb.duplicated(['geneset', 'genesymbol'])]
msigdb
[2]:
         genesymbol  collection  geneset
233      MAFF        hallmark    HALLMARK_IL2_STAT5_SIGNALING
250      MAFF        hallmark    HALLMARK_COAGULATION
270      MAFF        hallmark    HALLMARK_HYPOXIA
373      MAFF        hallmark    HALLMARK_TNFA_SIGNALING_VIA_NFKB
377      MAFF        hallmark    HALLMARK_COMPLEMENT
...      ...         ...         ...
1449668  STXBP1      hallmark    HALLMARK_PANCREAS_BETA_CELLS
1450315  ELP4        hallmark    HALLMARK_PANCREAS_BETA_CELLS
1450526  GCG         hallmark    HALLMARK_PANCREAS_BETA_CELLS
1450731  PCSK2       hallmark    HALLMARK_PANCREAS_BETA_CELLS
1450916  PAX6        hallmark    HALLMARK_PANCREAS_BETA_CELLS

7318 rows × 3 columns

Next, we can translate the filtered resource into mouse gene symbols. Organisms can be specified by their common name, Latin name or NCBI Taxonomy identifier.
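To make the three naming options concrete, here is a minimal sketch of how a common name, a Latin name or an NCBI Taxonomy identifier can all point to the same organism. The `ORGANISMS` table and the `to_taxid` helper are hypothetical and written only for illustration; they are not part of decoupler, which resolves organism names internally.

```python
# Hypothetical lookup table for illustration only (not decoupler's own code).
# Keys are NCBI Taxonomy identifiers, values are (common name, Latin name).
ORGANISMS = {
    9606: ("human", "Homo sapiens"),
    10090: ("mouse", "Mus musculus"),
    10116: ("rat", "Rattus norvegicus"),
    7227: ("fruit fly", "Drosophila melanogaster"),
}


def to_taxid(organism):
    """Return the NCBI Taxonomy ID for a common name, Latin name or ID."""
    # Integers are treated as taxonomy identifiers.
    if isinstance(organism, int):
        if organism in ORGANISMS:
            return organism
        raise KeyError(f"Unknown taxonomy identifier: {organism}")

    key = str(organism).strip().lower()

    # Strings made only of digits are also treated as identifiers.
    if key.isdigit():
        return to_taxid(int(key))

    # Otherwise compare against common and Latin names, ignoring case.
    for taxid, names in ORGANISMS.items():
        if key in (name.lower() for name in names):
            return taxid
    raise KeyError(f"Unknown organism: {organism}")


# The three ways of naming the mouse resolve to the same identifier.
print(to_taxid("mouse"), to_taxid("Mus musculus"), to_taxid(10090))
```

All three calls return 10090, which is why the organism arguments in this notebook can mix names and identifiers freely.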

Note

Translating to an organism for the first time might take a while (~ 15 minutes). Since the data is stored in a cache, subsequent runs will be faster. If you need to reset the cache, run rm -r .pypath/cache/.
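If you prefer to reset the cache from Python rather than from the shell, a minimal sketch could look like the following. The `reset_cache` helper is hypothetical (it is not a decoupler or pypath function), and the default path simply mirrors the one in the note above; adjust it if your cache lives elsewhere.

```python
import shutil
from pathlib import Path


def reset_cache(cache_dir=".pypath/cache"):
    """Delete the cache directory; return True if it existed beforehand.

    Hypothetical helper: the default path is an assumption copied from the
    shell command in the note, not a value queried from pypath.
    """
    path = Path(cache_dir)
    existed = path.is_dir()
    # ignore_errors=True makes this a no-op when the directory is missing.
    shutil.rmtree(path, ignore_errors=True)
    return existed


# Example usage (commented out so nothing is deleted by accident):
# reset_cache()
```

After the cache is removed, the next translation has to download the homology data again, so expect the long first-run time once more.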

[3]:
# Translate targets
mouse_msigdb = dc.translate_net(msigdb, target_organism = 'mouse', unique_by = ('geneset', 'genesymbol'))
mouse_msigdb
[2023-06-01 11:17:48] [curl] Module `pysftp` not available. Only downloading of a small number of resources relies on this module. Please install by PIP if it is necessary for you.
[3]:
      genesymbol  collection  geneset
0     Maff        hallmark    HALLMARK_IL2_STAT5_SIGNALING
1     Maff        hallmark    HALLMARK_COAGULATION
2     Maff        hallmark    HALLMARK_HYPOXIA
3     Maff        hallmark    HALLMARK_TNFA_SIGNALING_VIA_NFKB
4     Maff        hallmark    HALLMARK_COMPLEMENT
...   ...         ...         ...
7683  Stxbp1      hallmark    HALLMARK_PANCREAS_BETA_CELLS
7684  Elp4        hallmark    HALLMARK_PANCREAS_BETA_CELLS
7685  Gcg         hallmark    HALLMARK_PANCREAS_BETA_CELLS
7686  Pcsk2       hallmark    HALLMARK_PANCREAS_BETA_CELLS
7687  Pax6        hallmark    HALLMARK_PANCREAS_BETA_CELLS

7550 rows × 3 columns

Note that when performing homology conversion we might gain or lose some genes from one organism to another.
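The change in row count (7318 human rows became 7550 mouse rows above) follows from how homology tables are shaped: one human gene may map to several homologs, or to none. The toy sketch below illustrates the idea with made-up gene and gene set names. The `translate_rows` function is a hypothetical simplification and should not be read as decoupler's actual implementation of `translate_net`.

```python
# Toy human -> mouse homology table. All names are made up for illustration.
HOMOLOGY = {
    "GENE_A": ["Gene_a"],             # one-to-one: row count unchanged
    "GENE_B": ["Gene_b1", "Gene_b2"],  # one-to-many: rows are gained
    "GENE_D": ["Gene_d"],
    # "GENE_C" has no entry: its rows are lost
}


def translate_rows(rows, homology):
    """Translate (geneset, gene) rows, keeping each translated pair once."""
    seen = set()
    translated = []
    for geneset, gene in rows:
        # Genes without a homolog yield an empty list and are dropped.
        for homolog in homology.get(gene, []):
            key = (geneset, homolog)
            # Mimic de-duplication on the (geneset, genesymbol) pair.
            if key not in seen:
                seen.add(key)
                translated.append(key)
    return translated


human_rows = [
    ("SET_1", "GENE_A"),
    ("SET_1", "GENE_B"),
    ("SET_1", "GENE_C"),
    ("SET_2", "GENE_B"),
    ("SET_2", "GENE_D"),
]

mouse_rows = translate_rows(human_rows, HOMOLOGY)
print(len(human_rows), "->", len(mouse_rows))  # prints "5 -> 6"
```

Here one row is lost (GENE_C) and two are gained (GENE_B appears in two sets and has two homologs), so the net effect is one extra row. Depending on the target organism the balance can go either way.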

Let us try the fruit fly (NCBI Taxonomy identifier 7227) now:

[4]:
# Translate targets
fly_msigdb = dc.translate_net(msigdb, target_organism = 7227, unique_by = ('genesymbol', 'geneset'))
fly_msigdb
[4]:
      genesymbol  collection  geneset
0     Eato        hallmark    HALLMARK_TNFA_SIGNALING_VIA_NFKB
1     Eato        hallmark    HALLMARK_PROTEIN_SECRETION
2     Eato        hallmark    HALLMARK_ADIPOGENESIS
3     Eato        hallmark    HALLMARK_BILE_ACID_METABOLISM
4     Eato        hallmark    HALLMARK_INFLAMMATORY_RESPONSE
...   ...         ...         ...
6594  sut4        hallmark    HALLMARK_PANCREAS_BETA_CELLS
6595  G6P         hallmark    HALLMARK_PANCREAS_BETA_CELLS
6596  Cbp53E      hallmark    HALLMARK_PANCREAS_BETA_CELLS
6598  Rop         hallmark    HALLMARK_PANCREAS_BETA_CELLS
6599  amon        hallmark    HALLMARK_PANCREAS_BETA_CELLS

5866 rows × 3 columns

The translate_net function provides finer control, but in most cases it is enough to pass the name of the desired organism to the functions that download the data:

[5]:
spw = dc.get_resource('SignaLink_pathway', organism = 'rat')
spw
[5]:
     genesymbol  pathway
0    Tab2        Toll-like receptor
1    Tab2        Innate immune pathways
2    Tab2        JAK/STAT
3    Tab2        Receptor tyrosine kinase
4    Map3k7      TNF pathway
...  ...         ...
937  Sit1        T-cell receptor
938  Nfatc2      B-cell receptor
939  Nfatc2      T-cell receptor
940  Rasgrp1     B-cell receptor
941  Rasgrp1     T-cell receptor

942 rows × 2 columns

PROGENy and CollecTRI have their own dedicated functions, which work in a similar way:

[6]:
dc.get_progeny(organism = 'Mus musculus')
[6]:
      source    target      weight  p_value
0     Androgen  Tmprss2  11.490631  0.000000e+00
1     Androgen  Nkx3-1   10.622551  2.242078e-44
2     NFkB      Nkx3-1    2.372983  5.589476e-32
3     TNFa      Nkx3-1    2.871633  1.044050e-27
4     Androgen  Mboat2   10.472733  4.624285e-44
...   ...       ...            ...  ...
1389  p53       Carns1    4.538734  4.730570e-13
1390  p53       Ccdc150  -3.174527  7.396252e-13
1391  p53       Trem2     4.101937  9.739648e-13
1392  p53       Gdf9      3.355741  1.087433e-12
1393  p53       Nhlh2     2.201638  1.651582e-12

1394 rows × 4 columns

[7]:
dc.get_collectri(organism = 'mouse')
[7]:
       source  target  weight
0      Myc     Tert    1
1      Spi1    Bglap2  1
2      Spi1    Bglap   1
3      Spi1    Bglap3  1
4      Smad3   Jun     1
...    ...     ...     ...
38660  Runx1   Lcp2    1
38661  Runx1   Prr5l   1
38662  Twist1  Gli1    1
38663  Usf1    Nup188  1
38664  Znf148  Rnls    1

38665 rows × 3 columns