ICGRC Omics API Demo¶
This Jupyter notebook demonstrates the utility of the ICGRC Omics API (https://icgrc.info/api_doc) to serve multi-omics datasets from a Tripal database. The API is designed to merge multi-omics datasets from multiple data sources. The use-cases here include:
- Merging phenotype data from multiple sources, transforming them into compatible units, and plotting their distribution
- Batch effect correction and imputation
- Weighted Gene Co-expression Network Analysis (WGCNA) to list top Trait-Gene pairs using Phenotype and Expression data
- Matrix eQTL to list top SNP-Gene pairs using Variant and Expression data
- mGWAS to list top SNP-Metabolite pairs using Variant and Phenotype data
- Combining the results of eQTL, WGCNA and mGWAS to find shared genes and traits and trace multiple evidence paths, as sketched below
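The last use-case is essentially a table intersection. The sketch below is a minimal, hypothetical illustration using pandas: the DataFrames and column names (snp, gene, trait) are placeholders for illustration only, not the actual top_df* tables built later in this notebook.

import pandas as pd

# Hypothetical top results from the three tools (placeholder values only)
top_eqtl = pd.DataFrame({'snp': ['snp1', 'snp2'], 'gene': ['gene1', 'gene2'], 'p_eqtl': [1e-8, 1e-6]})
top_wgcna = pd.DataFrame({'gene': ['gene1', 'gene3'], 'trait': ['traitA', 'traitB'], 'cor_wgcna': [0.81, 0.75]})
top_mgwas = pd.DataFrame({'snp': ['snp1'], 'trait': ['traitA'], 'p_mgwas': [1e-22]})

# SNP -> gene links (eQTL) joined with gene -> trait links (WGCNA)
gene_trait = top_eqtl.merge(top_wgcna, on='gene')
# keep only SNP-trait pairs also supported by mGWAS, giving a multi-evidence path
evidence_paths = gene_trait.merge(top_mgwas, on=['snp', 'trait'])
print(evidence_paths)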
This demo uses the omics_api library, designed to interface with the ICGRC Omics API web service. The following Python libraries are used in all use-cases: matplotlib, numpy, wand. Additional software, such as plink, R, and the Python packages required by specific use-cases, needs to be installed by the client as described in the sections below.
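Before running the cells below, it can help to confirm that the core Python dependencies import cleanly. This is a hedged convenience check, not part of omics_api; the module names are assumptions taken from the list above, plus itables used for table display, and omics_api itself is the helper library shipped with this demo.

import importlib

# Module names assumed from the text above; adjust to your environment.
for module in ['matplotlib', 'numpy', 'wand', 'itables', 'omics_api']:
    try:
        importlib.import_module(module)
        print(module, ': available')
    except ImportError:
        print(module, ': missing, install it (e.g. via pip) before running the cells that need it')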
Goal-oriented notebooks
This notebook is a short demo focused on the API features; the results presented may not be statistically robust. The following notebooks are goal-oriented examples with reliable results.
- Analyses using single dataset for each tool run
- Comparison of different imputation and batch correction tools
- Batch run using large pre-downloaded datasets
- SNP cluster and sample Hemp/Drug information comparison using cannabinoid genes
- SNP cluster and sample Hemp/Drug information comparison using cannabinoid genes, unfiltered SNPs
- GWAS analysis of cannabinoids using IPK samples
%matplotlib inline
%load_ext autoreload
import os
import requests
from IPython.display import display, Image, FileLink
#from jupyter_datatables import init_datatables_mode
#from ipydatagrid import DataGrid
from itables import init_notebook_mode
#init_notebook_mode(all_interactive=True)
init_notebook_mode(connected=True)
import itables.options as itablesopt
# display table options
#itablesopt.lengthMenu = [2, 5, 10, 20, 50, 100, 200, 500]
#itablesopt.maxBytes = 10000 , 0 to disable
#itablesopt.maxRows
#itablesopt.maxColumns
#%load_ext jupyter_require
#%requirejs d3 https://d3js.org/d3.v5.min
#init_datatables_mode()
%autoreload 2
import omics_api
from omics_api import *
# Flags to enable the examples below
others=True
PLOT_PHEN=others
WGCNA=others
GWAS=others
MATRIXEQT=others
VERBOSE=True
MERGE=others
REQUERY_URL=False
CHECK_BE=True # check and correct batch effect
# imputation strategy from https://doi.org/10.1038/s41598-023-30084-2
IMPUTE_M1=True # global mean
IMPUTE_M2=True # batch mean
DEBUGGNG=False
# strict setting
#maxp=1e-5
# threshold settings
maxpgwas=1e-20
maxpeqtl=1e-5
maxpwgcna=1e-5
mincorwgcna=0.7
cortopnwgcna=None
gstopnwgcna=None
top_dfgwas=[]
top_dfwgcna=[]
top_dfeqtl=[]
myunit='percent_of_dw'
#SHOW_ALL
#SHOW_DEBUG
#SHOW_ERRORONLY
if DEBUGGNG:
    setVariables(loglevel=getSetting('SHOW_DEBUG'),keep_unit=myunit)
else:
    setVariables(loglevel=getSetting('SHOW_ERRORONLY'),keep_unit=myunit)
#setVariables(loglevel=getSetting('SHOW_ERRORONLY'))
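As a side note on the IMPUTE_M1 (global mean) and IMPUTE_M2 (batch mean) flags set above, the toy example below contrasts the two strategies on a hypothetical trait-by-sample table. The values, sample names, and batch labels are placeholders; the imputation in this notebook is actually performed through check_batcheffects with on_missing='mean', whose internals may differ.

import numpy as np
import pandas as pd

# Toy phenotype matrix: rows are traits, columns are samples (hypothetical values)
toy = pd.DataFrame({'s1': [1.0, np.nan], 's2': [2.0, 4.0], 's3': [np.nan, 6.0]},
                   index=['traitA', 'traitB'])
sample_batch = {'s1': 'batchA', 's2': 'batchA', 's3': 'batchB'}

# M1, global mean: fill a missing value with the trait mean over all samples
m1 = toy.apply(lambda row: row.fillna(row.mean()), axis=1)

# M2, batch mean: fill a missing value with the trait mean within the sample's batch
m2 = toy.copy()
for batch in sorted(set(sample_batch.values())):
    cols = [s for s, b in sample_batch.items() if b == batch]
    m2[cols] = m2[cols].apply(lambda row: row.fillna(row.mean()), axis=1)

print(m1)  # M1: traitA s3 filled with 1.5, traitB s1 filled with 5.0
print(m2)  # M2: traitB s1 filled with 4.0; traitA s3 stays NaN because batchB has no observed value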
Multiple data sources¶
This demo plots the distribution of datasets from multiple sources.
Available options for phenotypes:
data = requests.get('https://icgrc.info/api/user/phenotype/dataset').json()
print('Phenotype datasets (phends) options')
display_table(data,columns=['phen_dataset','unit_name','samples_count','trait_count'])
Phenotype datasets (phends) options
[Interactive table: phen_dataset, unit_name, samples_count, trait_count]
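The JSON listing above can also be used to compose phenotype query URLs programmatically. The sketch below assumes the response is a list of records each carrying a phen_dataset field, as the displayed columns suggest; the phends and with_stdunits parameters mirror the URLs used in the next cell.

# Hedged sketch: list the dataset names and build an example query URL from them.
datasets = sorted({rec['phen_dataset'] for rec in data if isinstance(rec, dict) and 'phen_dataset' in rec})
print('Available phenotype datasets:', datasets)
if datasets:
    example_url = 'https://icgrc.info/api/user/phenotype/all?phends=' + ','.join(datasets[:2]) + '&with_stdunits=1'
    print(example_url)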
#api_url = ["https://icgrc.info/api/user/phenotype/all?phends=Booth2020&byacc=1&hassnp=1&with_stdunits=1","https://icgrc.info/api/user/phenotype/all?phends=Zager2019&private=1&byacc=1&hassnp=1&with_stdunits=1","https://icgrc.info/api/user/phenotype/all?phends=GloerfeltTarp2023&private=1&byacc=1&hassnp=1&with_stdunits=1","https://icgrc.info/api/user/phenotype/all?phends=Booth2020,Zager2019,GloerfeltTarp2023&private=1&byacc=1&hassnp=1&with_stdunits=1"]
#api_url = ["https://icgrc.info/api/user/phenotype/all?phends=Booth2020&with_stdunits=1","https://icgrc.info/api/user/phenotype/all?phends=Zager2019&private=1&with_stdunits=1","https://icgrc.info/api/user/phenotype/all?phends=GloerfeltTarp2023&with_stdunits=1","https://icgrc.info/api/user/phenotype/all?phends=Booth2020,Zager2019,GloerfeltTarp2023&with_stdunits=1"]
#label_url=['Booth2020','Zager2019','GloerfeltTarp2023','BGZphenotypes']
api_url = ["https://icgrc.info/api/user/phenotype/all?phends=Booth2020&with_stdunits=1","https://icgrc.info/api/user/phenotype/all?phends=Zager2019&private=1&with_stdunits=1","https://icgrc.info/api/user/phenotype/all?phends=Booth2020,Zager2019&with_stdunits=1"]
label_url=['Booth2020','Zager2019','BZphenotypes']
label2color=dict()
label2color['Azman2023']='red'
label2color['Booth2020']='blue'
label2color['GloerfeltTarp2023']='green'
label2color['Zager2019']='orange'
label2color['BZphenotypes']='yellow'
keep_unit='percent_of_dw'
#keep_unit='ug_per_gdw'
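# Editorial note (hedged): percent_of_dw and ug_per_gdw differ only by a constant factor,
# since 1 ug/g DW = 1e-6 g/g DW = 1e-4 percent of DW. convert_units_to() below handles the
# conversion; a manual equivalent for a single value would be this illustrative helper.
def ug_per_gdw_to_percent_of_dw(value):
    # not part of omics_api, for illustration only
    return value * 1e-4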
n_bins=10
icnt=0
dfs=[]
dfs_imputed_m2=[]
phenall=set()
keep_samples=set()
keep_phenotypes=set()
if PLOT_PHEN:
    for iurl in api_url:
        print(label_url[icnt])
        df_raw= read_url(iurl, label_url[icnt],requery=REQUERY_URL)
        if VERBOSE and icnt==0:
            display('in original units')
            display(drop_allnazero_rowcol(df_raw.loc[df_raw['datatype'].str.startswith('3 PHEN')]))
        (c_converted_phenunits, c_converted_phenunits_values)=convert_units_to(df_raw,to_unit='percent_of_dw')
        phenall.update(set(get_phenotypes(c_converted_phenunits_values)))
        plot_histograms(c_converted_phenunits_values, 'changed_to_' + keep_unit, label_url[icnt]) #,label2color=label2color)
        if VERBOSE and icnt==0:
            display('converted to ' + keep_unit)
            display(drop_allnazero_rowcol(c_converted_phenunits.loc[c_converted_phenunits['datatype'].str.startswith('3 PHEN')]))
            display('converted to (values only) ' + keep_unit)
            display(drop_allnazero_rowcol(c_converted_phenunits_values.loc[c_converted_phenunits_values['datatype'].str.startswith('3 PHEN')]))
        dfs.append(c_converted_phenunits_values)
        if IMPUTE_M2:
            (df_corrected,dummy)=check_batcheffects(c_converted_phenunits_values, label_url[icnt], TYPE_PHEN, whiten=True, batch_id='phenotype_dataset',on_missing='mean',correct_BE=False) #properties=['phenotype_dataset']) #properties=['NCBI BioProject'])
            dfs_imputed_m2.append(df_corrected)
        icnt+=1
Booth2020
'in original units'
[Interactive table: datatype, property, and values for samples SAMN13750438–SAMN13750452, in the original units]
'converted to percent_of_dw'
[Interactive table: the same samples, with values converted to percent_of_dw]
'converted to (values only) percent_of_dw'
[Interactive table: the same samples, converted values only]