Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


Friday, September 26, 2014

everyone will write a blog post on dplyr

dplyr is a nice R package for manipulating large data sets. There are several good tutorials I came across:

http://www.dataschool.io/dplyr-tutorial-for-faster-data-manipulation-in-r/
http://stat545-ubc.github.io/block009_dplyr-intro.html
http://stat545-ubc.github.io/bit001_dplyr-cheatsheet.html
http://gettinggeneticsdone.blogspot.com/2014/08/do-your-data-janitor-work-like-boss.html

I followed the dplyr example in the early edition of Bioinformatics Data Skills and put a gist below. The example uses dplyr to manipulate and summarize the human annotation GTF file downloaded from http://useast.ensembl.org/info/data/ftp/index.html.

I downloaded the Ensembl release 74 build, which is different from the release 75 build used in the book's example.
I believe the GENCODE v19 annotation file is based on Ensembl 74.

I also included comments in the R code describing how to do the same job on the Linux command line.
library(dplyr)
setwd("/home/tommy/annotations/human/ensemble/")
# set the colClasses for faster reading in of the data
gtf_cols <- c(seqname="factor", source="factor", feature="factor",
              start="integer", end="integer", score="character",
              strand="factor", frame="factor", attribute="character")
hs_gtf <- read.delim('Homo_sapiens.GRCh37.74.gtf.gz', header=FALSE,
                     col.names=names(gtf_cols), colClasses=gtf_cols,
                     comment.char="#")
# load all the data into memory; this gz file is around 28 Mb, and around 900 Mb when gunzipped
# very slow: it took me 2-3 mins to finish loading... try data.table. usually I work with gtf files
# on the command line. of course, specialized tools such as the python libraries HTSeq and gffutils, bioawk etc.
# also exist for more complicated analysis.
head(hs_gtf)
dim(hs_gtf) # how many rows and columns 2 million rows
hs_df<- tbl_df(hs_gtf) #convert to a local dataframe, when you print, it will not
# print out all the rows. similar to head()
head(hs_df$attribute, n=3)
# gene_id values are quoted in the GTF attribute field, so strip the quotes as well
extractGeneID <- function(x) gsub('gene_id "?([^";]+)"?;.*', "\\1", x)
# on the linux command line (it is all about writing regular expressions!)
# generally, awk does not support backreferences http://ubuntuforums.org/showthread.php?t=2198616
# but the gensub function in gawk (GNU awk) does support them
# ex. $ echo 'noon' | awk '{print gensub(/(.)(.)/,"\\2\\1","g")}'   output: onno
# zcat Homo_sapiens.GRCh37.74.gtf.gz | head | awk -F'\t' '{ a=gensub(/gene_id "([^;]+)";.*/, "\\1", "g", $9); print $0"\t"a}'
# note that the gene_id values are quoted in the attribute field
# use sed instead:
# zcat Homo_sapiens.GRCh37.74.gtf.gz | awk -F '\t' '{print $9}' | sed 's/gene_id "\([^;]\+\)";.*/\1/'
# append this column to the original file:
# paste <(zcat Homo_sapiens.GRCh37.74.gtf.gz) <(zcat Homo_sapiens.GRCh37.74.gtf.gz | awk -F '\t' '{print $9}' | sed 's/gene_id "\([^;]\+\)";.*/\1/')
# this is a long command; I constructed it bit by bit, making sure every step works.
# use mutate from dplyr to add a column based on original column
# add a column with GeneID
hs_df<- mutate(hs_df, gene_id = extractGeneID(attribute)) # takes several seconds
# select only the columns we are interested in
select(hs_df, seqname, feature, start, end, gene_id)
select(hs_df, seqname, source, feature, start, end, gene_id)
table(hs_df$source) # on the command line: zcat Homo_sapiens.GRCh37.74.gtf.gz | datamash -s -g 2 count 2
# use filter to keep the rows with source == "protein_coding"
pc_df <- filter(hs_df, source == "protein_coding") # fast
pc_df # zcat Homo_sapiens.GRCh37.74.gtf.gz | awk '$2=="protein_coding"' | wc -l returns 1667560
# multiple filter rules
filter(hs_df, feature == "exon", strand == "+") # Ensembl 74 differs from Ensembl 75, which has a "gene" feature
# and comment lines
# use %>% (pronounced as "then") to get the exon number of each gene
# SQL can do the same job, see my previous post here http://crazyhottommy.blogspot.com/2013/11/mysql-rocks.html
# extract the exon number (exon_number values may also be quoted, so allow optional quotes)
extractExonNumber <- function(x) as.integer(gsub('.*exon_number "?(\\d+)"?;.*', "\\1", x))
pce_df <- filter(hs_df, source == "protein_coding", feature == "exon") %>%
  mutate(exon_number=extractExonNumber(attribute))
pce_df %>% filter(feature=="exon") %>%
  group_by(gene_id) %>%
  summarise(num_exons = max(exon_number)) %>%
  arrange(desc(num_exons))
# Source: local data frame [22,642 x 2]
# gene_id num_exons
# 1 ENSG00000155657 363
# 2 ENSG00000183091 182
# 3 ENSG00000131018 147
# 4 ENSG00000114270 118
# 5 ENSG00000054654 116
# 6 ENSG00000154358 116
# 7 ENSG00000203832 111
# 8 ENSG00000127481 107
# 9 ENSG00000143341 107
# 10 ENSG00000198626 107
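
The comments in the gist suggest trying data.table for faster loading. Here is a minimal sketch of the read step with data.table::fread in place of read.delim; it assumes the data.table package is installed and that the GTF has been gunzipped first, so treat it as an illustration rather than part of the original gist.

library(data.table)
# fread is much faster than read.delim on a ~900 Mb file;
# decompress first, e.g. zcat Homo_sapiens.GRCh37.74.gtf.gz > Homo_sapiens.GRCh37.74.gtf
# (unlike release 75, the 74 GTF has no leading '#' comment lines, per the note above)
hs_dt <- fread("Homo_sapiens.GRCh37.74.gtf", header=FALSE, sep="\t")
setnames(hs_dt, c("seqname", "source", "feature", "start", "end",
                  "score", "strand", "frame", "attribute"))
hs_df <- tbl_df(as.data.frame(hs_dt)) # then the dplyr code above applies unchanged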

Thursday, September 11, 2014

make a soft link

I got to know the ln command when I was reading The Linux Command Line. At the time, I did not see why I would need a command that makes links to the same data or file.

One situation I faced is that I have a lot of bam files with long names in one folder, and when I execute a program from another folder, I have to pass it the full path to each bam file.
Although TAB auto-completion helps most of the time, it is still too much to type the full path plus the long names (ugly). So I want the bam files to appear in the current working directory, without actually copying them, to save me from typing the full path. This is where the ln command comes to the rescue.

The other scenario is when you have multiple executables in one folder, say ~/myprogram/bin, and you want to run them from anywhere: you can link them into /usr/local/bin, copy them to /usr/local/bin, or add ~/myprogram/bin to $PATH. See the discussion here: http://unix.stackexchange.com/questions/116508/adding-to-path-vs-linking-from-bin

Let me demonstrate the usage of ln 

look at the man page
man ln
# I have two files in the playground directory and make two directories 

mkdir dir1 dir2
ls

output: tommy@tommy-ThinkPad-T420[playground] ls                              [ 2:32PM]
dir1  dir2  genes_h.txt  genes.txt


I then move the two txt files into dir1, change into dir2, and make soft links to the files in dir1:
mv *txt dir1
cd dir2
ln -s ../dir1/*txt .
ls
If instead we go to dir1 and make the soft links from there:
cd ../dir1
ln -s ./*txt  ../dir2

The links will not work (you get a "Too many levels of symbolic links" error when you try to use them, because each link stores the relative target, e.g. ./genes.txt, which is resolved relative to dir2 and so points back to the link itself). Instead, you need to specify the full path:

ln -s $PWD/*txt  ../dir2


See more examples here: http://www.thegeekstuff.com/2010/10/linux-ln-command-examples/

Wednesday, September 10, 2014

converting gene ids using bioconductor with biomaRt and annotation packages

I had a post on using mygene to convert gene ids; Bioconductor can do the same job.
I put a gist on GitHub below.

##### use bioconductor annotation packages #######
source("http://Bioconductor.org/biocLite.R")
biocLite("org.Hs.eg.db")
biocLite(c("GenomicFeatures", "AnnotationDbi"))
library("org.Hs.eg.db")
library("AnnotationDbi")
library("GenomicFeatures")
# all the possible mappings
ls("package:org.Hs.eg.db")
# convert Entrez_ids to gene_symbols
myEntrez_ids <- c("1","10","100","1000","37690")
mySymbols<- mget(myEntrez_ids, org.Hs.egSYMBOL, ifnotfound=NA)
mySymbols
unlist(mySymbols)
# convert gene_symbols to Entrez_ids
mySymbols_2 <- c("VEGFA","CTCF", "SNAI1","KDM1A")
myEntrez_ids_2<- mget(mySymbols_2, org.Hs.egSYMBOL2EG, ifnotfound=NA)
unlist(myEntrez_ids_2)
?AnnotationDbi::mget # get help
# or use the select function
?AnnotationDbi::select
head(keys(org.Hs.eg.db))
keytypes(org.Hs.eg.db)
select(org.Hs.eg.db, keys = mySymbols_2, columns=c("SYMBOL","REFSEQ","GENENAME","ENTREZID"),keytype="SYMBOL")
select(org.Hs.eg.db, keys = myEntrez_ids, columns=c("SYMBOL","REFSEQ","GENENAME","ENTREZID"),keytype="ENTREZID")
# How many gene symbols
symbol <- keys(org.Hs.eg.db, "SYMBOL")
length(symbol)
############### use biomart ###################
library(biomaRt)
mart<- useMart(biomart = 'ensembl', dataset = 'hsapiens_gene_ensembl')
# get sequences
seq <- getSequence(id = 'BRCA1', type='hgnc_symbol',seqType="3utr", mart = mart) # pretty slow...
show(seq)
seq2 <-getSequence(id="ENST00000520540",type='ensembl_transcript_id',seqType='gene_flank', upstream =30, mart=mart)
show(seq2)
# convert gene ids gene symbol to refseq
geneList<- c("VEGFA","CTCF", "SNAI1","KDM1A")
results<- getBM(attributes = c("refseq_mrna","hgnc_symbol"), filters="hgnc_symbol", values=geneList, mart=mart)
results
?getBM
For more examples see posts from Dave Tang:
http://davetang.org/muse/2013/12/16/bioconductor-annotation-packages/
http://davetang.org/muse/2013/05/23/using-the-bioconductor-annotation-packages/
http://davetang.org/muse/2013/11/25/thoughts-converting-gene-identifiers/
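
One related task the gist does not cover is going the other way: from Ensembl gene ids (such as the ENSG ids in the dplyr post above) back to gene symbols. Below is a minimal sketch reusing the mart object defined above; the two ids are just examples, and the commented-out org.Hs.eg.db route should work as well.

# map Ensembl gene ids back to HGNC symbols with the same mart object
ens_ids <- c("ENSG00000155657", "ENSG00000141510") # TTN and TP53, as examples
getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
      filters = "ensembl_gene_id", values = ens_ids, mart = mart)
# or with the annotation package, using keytype "ENSEMBL":
# select(org.Hs.eg.db, keys = ens_ids, columns = c("SYMBOL", "ENTREZID"), keytype = "ENSEMBL")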

Wednesday, September 3, 2014

mapping gene ids with mygene

Mapping gene ids is one of the routine jobs in bioinformatics. I was aware of several ways to do it, including Biomart.

Update on 10/30/14: a mygene Bioconductor package is now online http://bioconductor.org/packages/release/bioc/html/mygene.html (see the R sketch at the end of this post).

Recently I got to know mygene, a Python wrapper for the mygene.info services, for mapping gene ids.
I found it very handy for converting gene ids. See the gist below.

#! /usr/bin/env python
# ID mapping using mygene
# https://pypi.python.org/pypi/mygene
# http://nbviewer.ipython.org/gist/newgene/6771106
# http://mygene-py.readthedocs.org/en/latest/
# 08/30/14
__author__ = 'tommy'
import mygene
import fileinput
import sys
mg = mygene.MyGeneInfo()
# mapping gene symbols to Entrez gene ids and Ensembl gene ids.
# fileinput will loop through all the lines in the input specified as file names given in command-line arguments,
# or the standard input if no arguments are provided.

# build a list from an input file with one gene name in each line
def get_gene_symbols():
    gene_symbols = []
    for line in fileinput.input():
        gene_symbol = line.strip()  # assume each line contains only one gene symbol
        gene_symbols.append(gene_symbol)
    fileinput.close()
    return gene_symbols

Entrez_ids = mg.querymany(get_gene_symbols(), scopes='symbol', fields='entrezgene, ensembl.gene', species='human',
                          as_dataframe=True, verbose=False)
# setting as_dataframe to True returns a pandas dataframe object; verbose=False suppresses messages like "finished".
# Entrez_ids.to_csv(sys.stdout, sep="\t") # writes the dataframe to stdout, but will not show NaNs
# if no matches were found
sys.stdout.write(Entrez_ids.to_string())  # sys.stdout.write() expects a character buffer object
# Entrez_ids.to_csv("Entrez_ids.txt", sep="\t") # write the pandas dataframe to a tab-delimited file
To use it: cat input.txt | python geneSymbol2Entrez.py > output.txt
or: python geneSymbol2Entrez.py input.txt > output.txt, where input.txt contains one gene name per line. Pretty neat!
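
Following up on the update above: the mygene Bioconductor package can do the same mapping from within R. Here is a minimal sketch; I am assuming its queryMany() function mirrors the Python client's querymany() (check the package vignette for the exact interface), and the symbols are the same example genes used in the biomaRt post.

library(mygene)
gene_symbols <- c("VEGFA", "CTCF", "SNAI1", "KDM1A") # example gene symbols
res <- queryMany(gene_symbols, scopes="symbol",
                 fields=c("entrezgene", "ensembl.gene"), species="human")
res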