Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


Wednesday, September 3, 2014

mapping gene ids with mygene

Mapping gene IDs is one of the routine jobs in bioinformatics. I was aware of several ways to do it, including BioMart.

Update on 10/30/14: a mygene Bioconductor package is now online at http://bioconductor.org/packages/release/bioc/html/mygene.html

Recently I got to know mygene, a Python wrapper for the mygene.info services that maps gene IDs.
I found it very handy for converting gene IDs. See the gist below.

#! /usr/bin/env python
# ID mapping using mygene
# https://pypi.python.org/pypi/mygene
# http://nbviewer.ipython.org/gist/newgene/6771106
# http://mygene-py.readthedocs.org/en/latest/
# 08/30/14
__author__ = 'tommy'
import mygene
import fileinput
import sys

mg = mygene.MyGeneInfo()

# Map gene symbols to Entrez gene ids and Ensembl gene ids.
# fileinput loops through all the lines of the files named in the command-line arguments,
# or through standard input if no arguments are given.

# Build a list from an input file with one gene name on each line.
def get_gene_symbols():
    gene_symbols = []
    for line in fileinput.input():
        gene_symbol = line.strip()  # assume each line contains only one gene symbol
        gene_symbols.append(gene_symbol)
    fileinput.close()
    return gene_symbols

Entrez_ids = mg.querymany(get_gene_symbols(), scopes='symbol', fields='entrezgene,ensembl.gene', species='human',
                          as_dataframe=True, verbose=False)
# as_dataframe=True returns a pandas DataFrame (pandas must be installed);
# verbose=False suppresses messages like "finished".

# Entrez_ids.to_csv(sys.stdout, sep="\t")  # writes the dataframe to stdout, but symbols with
# no match show empty fields instead of NaN
sys.stdout.write(Entrez_ids.to_string())  # sys.stdout.write() expects a string; to_string() keeps the NaNs visible
# Entrez_ids.to_csv("Entrez_ids.txt", sep="\t")  # write the pandas dataframe to a tab-separated file
To use it, run cat input.txt | python geneSymbol2Entrez.py > output.txt
or python geneSymbol2Entrez.py input.txt > output.txt, where input.txt contains one gene symbol per line. Pretty neat!
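
If you would rather not depend on pandas, here is a minimal sketch of how the same output could be built from the plain list of dicts that querymany returns when as_dataframe is left off. It is not part of the original gist. The helper names read_symbols and flatten_hit are made up for illustration, and the sample records are hand-written stand-ins for an API response. They assume the usual result shape: a "query" key, a "notfound" flag for misses, and an "ensembl" field that can be either a dict or a list of dicts. Check that against what your version of mygene actually returns.

```python
import io


def read_symbols(handle):
    """Return unique, non-empty gene symbols from a text stream, in input order."""
    seen = set()
    symbols = []
    for line in handle:
        symbol = line.strip()
        if symbol and symbol not in seen:  # skip blank lines and repeats
            seen.add(symbol)
            symbols.append(symbol)
    return symbols


def flatten_hit(hit):
    """Turn one querymany-style result dict into a (symbol, entrez, ensembl) row."""
    if hit.get("notfound"):
        return (hit["query"], "NA", "NA")
    ensembl = hit.get("ensembl", [])
    if isinstance(ensembl, dict):  # assumed: a single Ensembl record arrives as a dict
        ensembl = [ensembl]
    ensembl_ids = ",".join(record["gene"] for record in ensembl) or "NA"
    return (hit["query"], str(hit.get("entrezgene", "NA")), ensembl_ids)


# Hand-written sample records standing in for a real API response (assumption).
sample_hits = [
    {"query": "CDK2", "entrezgene": "1017", "ensembl": {"gene": "ENSG00000123374"}},
    {"query": "NOTAGENE", "notfound": True},
]

# io.StringIO stands in for an input file or sys.stdin.
symbols = read_symbols(io.StringIO("CDK2\n\nNOTAGENE\nCDK2\n"))
rows = [flatten_hit(hit) for hit in sample_hits]
for row in rows:
    print("\t".join(row))
```

In a real script, sample_hits would be replaced by the list returned from mg.querymany(symbols, scopes='symbol', fields='entrezgene,ensembl.gene', species='human'). Writing "NA" explicitly keeps unmatched symbols visible in the tab-separated output.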
