Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Monday, November 10, 2014

csvkit to manipulate csv at command line, Rio to interact with R at command line

I came across csvkit long time ago, but I just began to use it and found it is very handy. It is a python utility to manipulate csv files at command line. you can install it by pip:

pip install csvkit

I played around with the iris data.
csvlook gives you a database view of the data by adding grids.
$ cat iris.csv | head | csvlook
|---------------+-------------+--------------+-------------+--------------|
|  sepal_length | sepal_width | petal_length | petal_width | species      |
|---------------+-------------+--------------+-------------+--------------|
|  5.1          | 3.5         | 1.4          | 0.2         | Iris-setosa  |
|  4.9          | 3.0         | 1.4          | 0.2         | Iris-setosa  |
|  4.7          | 3.2         | 1.3          | 0.2         | Iris-setosa  |
|  4.6          | 3.1         | 1.5          | 0.2         | Iris-setosa  |
|  5.0          | 3.6         | 1.4          | 0.2         | Iris-setosa  |
|  5.4          | 3.9         | 1.7          | 0.4         | Iris-setosa  |
|  4.6          | 3.4         | 1.4          | 0.3         | Iris-setosa  |
|  5.0          | 3.4         | 1.5          | 0.2         | Iris-setosa  |
|  4.4          | 2.9         | 1.4          | 0.2         | Iris-setosa  |
|---------------+-------------+--------------+-------------+--------------|

csvstat can output summary statistics for specific columns with -c flag
-c can accept both the column names and the column numbers as arguments

$ cat iris.csv | csvstat -c sepal_length,sepal_width
  1. sepal_length
<type 'float'>
Nulls: False
Min: 4.3
Max: 7.9
Sum: 876.5
Mean: 5.84333333333
Median: 5.8
Standard Deviation: 0.825301291785
Unique values: 35
5 most frequent values:
5.0: 10
6.3: 9
5.1: 9
6.7: 8
5.7: 8
  2. sepal_width
<type 'float'>
Nulls: False
Min: 2.0
Max: 4.4
Sum: 458.6
Mean: 3.05733333333
Median: 3.0
Standard Deviation: 0.434410967735
Unique values: 23
5 most frequent values:
3.0: 26
2.8: 14
3.2: 13
3.4: 12
3.1: 11

Row count: 150

you can also output only the max, min, mean by adding --max, --min etc
$ cat iris.csv | csvstat -c sepal_length,sepal_width --max
  1. sepal_length: 7.9
  2. sepal_width: 4.4

csvcut can cut out specific columns and reorder the column while the unix cut command can
not reorder the column, one needs to use awk to achieve that.

$ cat iris.csv | csvcut -c5,4,2 | csvlook | head
|------------------+-------------+--------------|
|  species         | petal_width | sepal_width  |
|------------------+-------------+--------------|
|  Iris-setosa     | 0.2         | 3.5          |
|  Iris-setosa     | 0.2         | 3.0          |
|  Iris-setosa     | 0.2         | 3.2          |
|  Iris-setosa     | 0.2         | 3.1          |
|  Iris-setosa     | 0.2         | 3.6          |
|  Iris-setosa     | 0.4         | 3.9          |
|  Iris-setosa     | 0.3         | 3.4          |

csvgrep can grep based on a specific column:
$ cat iris.csv | csvgrep -c species -m setosa | wc -l
51

other commands like csvsort, csvjoin, csvsql are also useful, please refer to the link above to see more examples.

I want to plot some nice figure at command line, but I am too lazy to fire Rstudio. by using Rio, you can leverage the power of R at the command line. you can then visualize it by using the display command from imagemagic
-g to load the ggplot2 package 
-e to execute the command

update on 11/19/2014, Rio has changed the flag to load dplry and tidyr
https://github.com/jeroenjanssens/data-science-at-the-command-line/commit/66fc1fda604f2d420de9490934259093a76fb105

boxplot of sepal_length for different species :



histogram for sepal_length for all species:


 you can change the color and fill of the bars:


you can also plot histogram for different species with a facet:


scatter plot of sepal_width for different species:


command line is awesome! I love it! It brings me more power!





Thursday, October 30, 2014

Test significance of overlapping genes from 2 gene sets

The question is very straightforward:
I have three microarray or RNA-seq data sets (control, treatment1, treatment2, each should have several replicates, otherwise you can not get a pvalue for differential expression!) , I then get two sets of differentially up-regulated genes ( treatment1 vs control, treatment2 vs control).  There are some overlapping genes which are both up-regulated by treatment1 and treatment2. How significant the overlapping is?

The easiest way is to use an online program http://nemates.org/MA/progs/overlap_stats.html
The program calculates the p value by using Hypergeometric_distribution.

see below for a python and an R script for this problem.

read  posts here http://stats.stackexchange.com/questions/16247/calculating-the-probability-of-gene-list-overlap-between-an-rna-seq-and-a-chip-c
http://blog.nextgenetics.net/?e=94
https://www.biostars.org/p/90662/

list1=3000, list2=400, Total gene number (population)=15000,
overlapping between list1 and list2 =100
#! /usr/bin/env python
import sys
import scipy.stats as stats
#The result will be
# a p-value where by random chance number of genes with both condition A and B will be <= to your number with condition A and B
# a p-value where by random chance number of genes with both condition A and B will be >= to your number with condition A and B
# The second p-value is probably what you want.
if len(sys.argv) < 5:
print '''use it by: python gene_sets_hypergeomtric_test.py [total number of genes in the list] [total number of genes in the list with condition A] [total number of genes in the list with condition B] [number of genes with both condition A and B]'''
sys.exit()
print
print 'total number in population: ' + sys.argv[1]
print 'total number with condition in population: ' + sys.argv[2]
print 'number in subset: ' + sys.argv[3]
print 'number with condition in subset: ' + sys.argv[4]
print
print 'p-value <= ' + sys.argv[4] + ': ' + str(stats.hypergeom.cdf(int(sys.argv[4]) + 1,int(sys.argv[1]),int(sys.argv[2]),int(sys.argv[3])))
print
print 'p-value >= ' + sys.argv[4] + ': ' + str(stats.hypergeom.sf(int(sys.argv[4]) - 1,int(sys.argv[1]),int(sys.argv[2]),int(sys.argv[3])))
print
# phyper(overlap-1,list1,PopSize-list1,list2,lower.tail = FALSE, log.p = FALSE)
phyper(100-1, 3000, 15000-3000, 400,lower.tail = FALSE, log.p = FALSE)
#0.00784
From the stackexchange link:
"The p-value you want is the probability of getting 100 or more white balls in a sample of size 400 from an urn with 3000 white balls and 12000 black balls. Here are four ways to calculate it.
sum(dhyper(100:400, 3000, 12000, 400))
1 - sum(dhyper(0:99, 3000, 12000, 400))
phyper(99, 3000, 12000, 400, lower.tail=FALSE)
1-phyper(99, 3000, 12000, 400)
These give 0.0078.
dhyper(x, m, n, k) gives the probability of drawing exactly x. In the first line, we sum up the probabilities for 100 – 400; in the second line, we take 1 minus the sum of the probabilities of 0 – 99.
phyper(x, m, n, k) gives the probability of getting x or fewer, so phyper(x, m, n, k) is the same as sum(dhyper(0:x, m, n, k)).
The lower.tail=FALSE is a bit confusing. phyper(x, m, n, k, lower.tail=FALSE) is the same as 1-phyper(x, m, n, k), and so is the probability of x+1 or more "

If I run with the python script, I got the same p-value:
tommy@tommy-ThinkPad-T420 ~/scripts_general_use
$ ./hyper_geometric.py 15000 3000 400 100

total number in population: 15000
total number with condition in population: 3000
number in subset: 400
number with condition in subset: 100

p-value <= 100: 0.996049535673

p-value >= 100: 0.00784035207977
The bottom line is that we need more statistics!!

Thursday, October 23, 2014

convert bam file to bigwig file and visualize in UCSC genome browser in a Box (GBiB)

I just came across UCSC genome browser in a box Gbib, and I want to try it out.

what is Gbib?
from the link above:

"Genome Browser in a Box (GBiB) is a "virtual machine" of the entire UCSC Genome Browser website that is designed to run on most PCs (Windows, Mac OSX or Linux). GBiB allows you to access much of the UCSC Genome Browser's functionality from the comfort of your own computer. It is particularly directed at individuals who want to use the Genome Browser's functionality to view protected data. With the anticipation that the majority of protected data use will focus on the human genomes (primarily GRCh37/hg19, and transitioning to GRCh38/hg38), the current version of GBiB has been optimized for use with the hg19 assembly. Many other recent genome assemblies can also be viewed, but without mirroring of additional data, they may be slow."

Download the compressed file from here https://genome-store.ucsc.edu/ and follow the installation instructions. Open virtual box and start the virtual machine and then direct your web browser to 127.0.0.1:1234, you should see the UCSC genome browser page.

It looks the same as the UCSC genome browser page https://genome.ucsc.edu/. You can then go to the browser and do what you usually do in the genome browser. when I searched a gene, and errors came out:
  • nextRow failed
  • mySQL error 1194: Table 'rmsk' is marked as crashed and should be repaired (profile=db, host=localhost, db=hg19)

I asked the UCSC mail list, and I typed the command below in the console:
 sudo mysqlcheck --all-databases --auto-repair --fast
it works like a charm now!

I have some bam files and I want to visualize them in the UCSC genome browser. I would first convert them to bigwig http://genome.ucsc.edu/goldenpath/help/bigWig.html and then upload to some open http https or ftp locations.  see my old post:
hosting bigwig by Dropbox for UCSC

However, with GBiB one only needs to share a local folder containing the bigwig files with virtual box.

"Loading Local Big Data Tracks
Your computer can share directories with GBiB so that you can load big files without the need to upload them to a web server. The big file formats are compressed, indexed, binary files, and include bigBed,bigWigBAM, and VCF files. Normally, one would have to place these types files onto a publicly accessible web server to upload them as a custom track. However, GBiB acts as its own web server, meaning that you can share local files with GBiB for easy uploading as a custom track."

I will start from converting bam to bigwig.
several posts you may want to have a look:
https://www.biostars.org/p/64495/#64680
https://www.biostars.org/p/2699/
http://onetipperday.blogspot.com/2012/07/three-ways-to-convert-bambed-file-to.html

I used genomeCoverageBed from bedtools to convert bam to bedgraph first, and then used UCSC utilities bedClip and bedGraphToBigWig http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/ to do the conversion.

bedtools requires the bam files be sorted. so samtools sort was used to sort the bam file first.
I used a script from Tao Liu (the author of famous MACS ChIP-seq calling software) to do the conversion after I got the bedgraph files. I then put the bigwig files in a folder and shared it with virtual box. Follow the instructions http://genome.ucsc.edu/goldenPath/help/gbib#LocalTracks. I successfully visualized a bunch of bigwig files!

#! /bin/bash
for bam in *bam
do
echo $bam
genomeCoverageBed -ibam $bam -bg -g hg19.genome.info > $(basename $bam .bam).bdg
done
for bdg in *bdg
do
echo $bdg
bdg2bw $bdg hg19.genome.info
done
view raw bam2bw.sh hosted with ❤ by GitHub
#!/bin/bash
# this script is from Tao Liu https://gist.github.com/taoliu/2469050
# check commands: slopBed, bedGraphToBigWig and bedClip
which bedtools &>/dev/null || { echo "bedtools not found! Download bedTools: <http://code.google.com/p/bedtools/>"; exit 1; }
which bedGraphToBigWig &>/dev/null || { echo "bedGraphToBigWig not found! Download: <http://hgdownload.cse.ucsc.edu/admin/exe/>"; exit 1; }
which bedClip &>/dev/null || { echo "bedClip not found! Download: <http://hgdownload.cse.ucsc.edu/admin/exe/>"; exit 1; }
# end of checking
if [ $# -lt 2 ];then
echo "Need 2 parameters! <bedgraph> <chrom info>"
exit
fi
F=$1
G=$2
bedtools slop -i ${F} -g ${G} -b 0 | bedClip stdin ${G} ${F}.clip
bedGraphToBigWig ${F}.clip ${G} ${F/bdg/bw}
rm -f ${F}.clip
view raw bdg2bw.sh hosted with ❤ by GitHub


Wednesday, October 22, 2014

speed up grep

I had a list of gene names in txt file. There are around 500 genes with one gene name in one line, and I want to filter the gtf file from ensemble Homo_sapines.GRCh37.74gtf.gz

the gtf file contains 2244857 lines. I used grep to do it, but it takes very long (~1 hour).

what I used:
zcat Homo_sapines.GRCh37.74gtf.gz | grep -f gene_names.txt -w > my_genes.gtf

I searched on line, and found several posts in stackoverflow to speed up grep:
http://stackoverflow.com/questions/14602963/faster-grep-function-for-big-27gb-files
http://stackoverflow.com/questions/13913014/grepping-a-huge-file-80gb-any-way-to-speed-it-up
http://stackoverflow.com/questions/9066609/fastest-possible-grep

options to speed up:


1) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8.
2) Use fgrep because you're searching for a fixed string, not a regular expression

I then used:
zcat Homo_sapines.GRCh37.74gtf.gz | LC_ALL=C fgrep -f gene_names.txt -w > my_genes.gtf


It runs much faster!

Friday, September 26, 2014

everyone will write a blog post on dplyr

dply is a nice R package to manipulate big data. There are several good tutorials I came across:

http://www.dataschool.io/dplyr-tutorial-for-faster-data-manipulation-in-r/
http://stat545-ubc.github.io/block009_dplyr-intro.html
http://stat545-ubc.github.io/bit001_dplyr-cheatsheet.html
http://gettinggeneticsdone.blogspot.com/2014/08/do-your-data-janitor-work-like-boss.html

I followed the example in the early edition of Bioinformatics Data Skills on the dplyr part and put a gist below. The example used dplyr to manipulate and summarize the human annotation gff file downloaded here http://useast.ensembl.org/info/data/ftp/index.html.

I downloaded the ensemble 74 build which is different from the 75 build in the example of the book.
I believe the gencode v19 annotation file is based on ensemble 74.

I also included comments in the R code describing how to do the same job with linux command line.
library(dplyr)
setwd("/home/tommy/annotations/human/ensemble/")
# set the colClasses for faster reading in the data
gtf_cols <- c(seqname="factor", source="factor", feature="factor",
start="integer", end="integer", score="character",
strand="factor", frame="factor", attribute="character")
hs_gtf <- read.delim('Homo_sapiens.GRCh37.74.gtf.gz', header=FALSE,
col.names=names(gtf_cols), comment.char="#")
# load all the data into memory, this gz file is around 28Mb, if gunzip it will be around 900Mb
# very slow, it took me 2-3 mins to finish loading.....try data.table. usually I work with gtf file
# on command line. of course, specialized tools such as python library HTSeq and gffutils, bioawk etc are
# also existed there for more complicated analysis.
head(hs_gtf)
dim(hs_gtf) # how many rows and columns 2 million rows
hs_df<- tbl_df(hs_gtf) #convert to a local dataframe, when you print, it will not
# print out all the rows. similar to head()
head(hs_df$attribute, n=3)
extractGeneID<- function(x) gsub("gene_id ([^;]+);.*", "\\1", x)
# on linux command (it is all about writing regular expression!)
# generally,awk does not support back reference http://ubuntuforums.org/showthread.php?t=2198616
# but gensub function supports in gawk (Gnu awk)
# ex. $ echo 'noon' | awk '{print gensub(/(.)(.)/,"\\2\\1","g")}' output:onno
# zcat Homo_sapiens.GRCh37.74.gtf.gz| head |awk -F'\t' '{ a=gensub(/gene_id "([^;]+)";.*/, "\\1", "g",$9); print $0"\t"a}'
# the gene_id has quotes
# use sed instead:
# zcat Homo_sapiens.GRCh37.74.gtf.gz | awk -F '\t' '{print $9}' | sed 's/gene_id "\([^;]\+\)";.*/\1/'
# append this column to the original file:
# paste <(zcat Homo_sapiens.GRCh37.74.gtf.gz) <(zcat Homo_sapiens.GRCh37.74.gtf.gz | awk -F '\t' '{print $9}' | sed 's/gene_id "\([^;]\+\)";.*/\1/')
# this is a long linux command, I constructed bit by bit, make sure every step works.
# use mutate from dplyr to add a column based on original column
# add a column with GeneID
hs_df<- mutate(hs_df, gene_id = extractGeneID(attribute)) # takes several seconds
# select only the columns we are interested
select(hs_df, seqname, feature, start, end, gene_id)
select(hs_df, seqname, source, feature, start, end, gene_id)
table(hs_df$source) # on linux command zcat Homo_sapiens.GRCh37.74.gtf.gz | datamash -s -g 2 count 2
# use filer to keep the rows with source == "protein_coding"
pc_df<- filter(hs_df, source == "protein_coding") #fast
pc_df # zcat Homo_sapiens.GRCh37.74.gtf.gz | awk '$2=="protein_coding"'| wc -l returns 1667560
# multiple filter rules
filter(hs_df, feature == "exon", strand == "+") #ensemble 74 is different from ensemble 75 in which has a feature "gene"
# and comment lines
# use %>% (pronounced as "then") to get the exon number of each gene
# SQL can do the same job, see my previous post here http://crazyhottommy.blogspot.com/2013/11/mysql-rocks.html
# extract exon number
extractExonNumber <- function(x) as.integer(gsub(".*exon_number (\\d+);.*", "\\1", x))
pce_df <- filter(hs_df, source == "protein_coding", feature == "exon") %>%
mutate(exon_number=extractExonNumber(attribute))
pce_df %>% filter(feature=="exon") %>%
group_by(gene_id) %>%
summarise(num_exons = max(exon_number)) %>%
arrange(desc(num_exons))
# Source: local data frame [22,642 x 2]
# gene_id num_exons
# 1 ENSG00000155657 363
# 2 ENSG00000183091 182
# 3 ENSG00000131018 147
# 4 ENSG00000114270 118
# 5 ENSG00000054654 116
# 6 ENSG00000154358 116
# 7 ENSG00000203832 111
# 8 ENSG00000127481 107
# 9 ENSG00000143341 107
# 10 ENSG00000198626 107
view raw dplyr_gff.r hosted with ❤ by GitHub

Thursday, September 11, 2014

make a soft link

I got to know the ln linux command when I was reading The linux command line. I did not realize why I need this command to make links to the same data or file.

One situation I faced is that I have a lot of bam files  with long names in one folder, and when I execute a program in another folder, I have to pass the program a full path to the bam files.
while you think the TAB auto-completion helps most of the time,  it is still too much to specify the full path + long names (ugly). So I want to make a copy of the bam files in the current working director to save me from typing the full path. This is where ln command comes to rescue.

The other scenario is when you have multiple executable files in one folder ~/myprogram/bin, you want to execute the program from anywhere, you could link it to /usr/local/bin, or you can just copy the executable to /usr/local/bin or you can add ~/myprogram/bin to $PATH. see a link here http://unix.stackexchange.com/questions/116508/adding-to-path-vs-linking-from-bin

Let me demonstrate the usage of ln 

look at the man page
man ln
# I have two files in the playground directory and make two directories 

mkdir dir1 dir2
ls

output: tommy@tommy-ThinkPad-T420[playground] ls                              [ 2:32PM]
dir1  dir2  genes_h.txt  genes.txt


I then move the two txt files into dir1 and redirect to dir2 and make soft links of files in dir1 to dir2
mv *txt dir1
cd dir2
ln -s ../dir1/*txt .
ls
if we go to dir1 and make a softlink from there
cd ../dir1
ln -s ./*txt  ../dir2

The links will not work (too many levels of symbolic levels error message if you want to use the link ), instead you need to specify the full path

ln -s $PWD/*txt  ../dir2


see links here http://www.thegeekstuff.com/2010/10/linux-ln-command-examples/
and http://www.thegeekstuff.com/2010/10/linux-ln-command-examples/

Wednesday, September 10, 2014

converting gene ids using bioconductor with biomaRt and annotation packages

I had a post using  mygene to convert gene ids. Bioconductor can do the same job.
I put a gist on github.

##### use bioconductor annotation packages #######
source("http://Bioconductor.org/biocLite.R")
biocLite("org.Hs.eg.db")
biocLite(c("GenomicFeatures", "AnnotationDbi"))
library("org.Hs.eg.db")
library("AnnotationDbi")
library("GenomicFeatures")
# all the possible mappings
ls("package:org.Hs.eg.db")
# convert Entrez_ids to gene_symbols
myEntrez_ids <- c("1","10","100","1000","37690")
mySymbols<- mget(myEntrez_ids, org.Hs.egSYMBOL, ifnotfound=NA)
mySymbols
unlist(mySymbols)
# convert gene_symbols to Entrez_ids
mySymbols_2 <- c("VEGFA","CTCF", "SNAI1","KDM1A")
myEntrez_ids_2<- mget(mySymbols_2, org.Hs.egSYMBOL2EG, ifnotfound=NA)
unlist(myEntrez_ids_2)
?AnnotationDbi::mget # get help
# or use the select function
?AnnotationDbi::select
head(keys(org.Hs.eg.db))
keytypes(org.Hs.eg.db)
select(org.Hs.eg.db, keys = mySymbols_2, columns=c("SYMBOL","REFSEQ","GENENAME","ENTREZID"),keytype="SYMBOL")
select(org.Hs.eg.db, keys = myEntrez_ids, columns=c("SYMBOL","REFSEQ","GENENAME","ENTREZID"),keytype="ENTREZID")
# How many gene symbols
symbol <- keys(org.Hs.eg.db, "SYMBOL")
length(symbol)
############### use biomart ###################
library(biomaRt)
mart<- useMart(biomart = 'ensembl', dataset = 'hsapiens_gene_ensembl')
# get sequences
seq <- getSequence(id = 'BRCA1', type='hgnc_symbol',seqType="3utr", mart = mart) # pretty slow...
show(seq)
seq2 <-getSequence(id="ENST00000520540",type='ensembl_transcript_id',seqType='gene_flank', upstream =30, mart=mart)
show(seq2)
# convert gene ids gene symbol to refseq
geneList<- c("VEGFA","CTCF", "SNAI1","KDM1A")
results<- getBM(attributes = c("refseq_mrna","hgnc_symbol"), filters="hgnc_symbol", values=geneList, mart=mart)
results
?getBM
view raw convert_ids.r hosted with ❤ by GitHub
For more examples see posts from Dave Tang:
http://davetang.org/muse/2013/12/16/bioconductor-annotation-packages/
http://davetang.org/muse/2013/05/23/using-the-bioconductor-annotation-packages/
http://davetang.org/muse/2013/11/25/thoughts-converting-gene-identifiers/

Wednesday, September 3, 2014

mapping gene ids with mygene

Mapping gene ids is one of the routine jobs for bioinformatics. I was aware of several ways to do it including Biomart.

Update on 10/30/14, a mygene bioconductor package is online http://bioconductor.org/packages/release/bioc/html/mygene.html

Recently I got to know mygene, a python wrapper for the mygene.info services to map gene ids.
I found it very handy to convert gene ids.  see a gist below.

#! /usr/bin/env python
# ID mapping using mygene
# https://pypi.python.org/pypi/mygene
# http://nbviewer.ipython.org/gist/newgene/6771106
# http://mygene-py.readthedocs.org/en/latest/
# 08/30/14
__author__ = 'tommy'
import mygene
import fileinput
import sys
mg = mygene.MyGeneInfo()
# mapping gene symbols to Entrez gene ids and Ensemble gene ids.
# fileinput will loop through all the lines in the input specified as file names given in command-line arguments,
# or the standard input if no arguments are provided.
# build a list from an input file with one gene name in each line
def get_gene_symbols():
gene_symbols = []
for line in fileinput.input():
gene_symbol = line.strip() # assume each line contains only one gene symbol
gene_symbols.append(gene_symbol)
fileinput.close()
return gene_symbols
Entrez_ids = mg.querymany(get_gene_symbols(), scopes='symbol', fields='entrezgene, ensembl.gene', species='human',
as_dataframe=True, verbose=False)
# set as_dataframe to True will return a pandas dataframe object, verbose=False suppress the messages like "finished".
# Entrez_ids.to_csv(sys.stdout, sep="\t") # write the dataframe to stdout, but will not have NaNs on the screen
# if no matches were found
sys.stdout.write(Entrez_ids.to_string()) # sys.stdout.write() expects the character buffer object
# Entrez_ids.to_csv("Entrez_ids.txt", sep="\t") # write the pandas dataframe to csv
To use it,  cat input.txt | python geneSymbol2Entrez.py > output.txt
or python geneSymbol2Entrez.py input.txt > output.txt  where input.txt contains one gene name in each line. pretty neat!

Thursday, August 14, 2014

rename a bunch of files with bash by regular expression

I am at the MSU NGS 2014 course. My colleague wanted to follow the khmer protocol with his own data, but one of the steps has to use a certain file name convention.

In the protocol it requires fastq files listed as:   *001_R1.fastq.gz
001 is the replicate number, it can be 002 or 003 or any number of replicates you have. ( for RNA-seq, sequence as many as biological samples as possible !)
R1 is the pair-end reads 1, it can be R2

What he has is something like:
1_egg_r1_01_sub.fastq.gz

1 is the stage of the egg. He sequenced  4 eggs, so he has 1_egg, 2_egg., 3_egg and 4_egg
r1 is the pair-end reads 1
01 is the first replicates. He has two replicates for each egg.

Basically, he wants to rename these files to the khmer convention.

This problem gets down to writing a regular expression.
To recapture the problem, I made some dummy files:

mkdir foo && cd foo

I have a txt file contains the names of the file:
foo$ cat files.txt 
1egg_r1_01_sub.fastq.gz
1egg_r2_01_sub.fastq.gz
1egg_r1_02_sub.fastq.gz
1egg_r2_02_sub.fastq.gz
2egg_r1_01_sub.fastq.gz
2egg_r2_01_sub.fastq.gz
2egg_r1_02_sub.fastq.gz
2egg_r2_02_sub.fastq.gz
3egg_r1_01_sub.fastq.gz
3egg_r2_01_sub.fastq.gz
3egg_r1_02_sub.fastq.gz
3egg_r2_02_sub.fastq.gz
4egg_r1_01_sub.fastq.gz
4egg_r2_01_sub.fastq.gz
4egg_r1_02_sub.fastq.gz
4egg_r2_02_sub.fastq.gz

Now I want to make dummy files with the names in this file.
one can make the dummy files in a fly also.

=====update on 08/26/14======
one can use the {} expansion to create the dummy files

tommy@tommy-ThinkPad-T420[foo] touch {1,2,3,4}_r{1,2}_0{1,2}_sub.fastq.gz
tommy@tommy-ThinkPad-T420[foo] ls                                     [ 3:45PM]
1_r1_01_sub.fastq.gz  2_r2_01_sub.fastq.gz  4_r1_01_sub.fastq.gz
1_r1_02_sub.fastq.gz  2_r2_02_sub.fastq.gz  4_r1_02_sub.fastq.gz
1_r2_01_sub.fastq.gz  3_r1_01_sub.fastq.gz  4_r2_01_sub.fastq.gz
1_r2_02_sub.fastq.gz  3_r1_02_sub.fastq.gz  4_r2_02_sub.fastq.gz
2_r1_01_sub.fastq.gz  3_r2_01_sub.fastq.gz
2_r1_02_sub.fastq.gz  3_r2_02_sub.fastq.gz




========================
!# /usr/bin/bash
while read name
do
echo "Name read from file - $name"
touch $name
done < $1
while read name
do
echo "Name read from file - $name"
touch $name
done < files.txt
for i in 1 2 3 4
do
for j in 1 2
do
for k in 1 2
do
touch $i\_egg\_r$j\_0$k\_sub.fastq.gz
done
done
done
The difference of make_dummy_file.sh and make_dummy_file_1.sh is that I specified shebang line in the make_dummy_file.sh script to tell the bash that it is a bash script, to invoke it: ./make_dummy_file.sh files.txt

In contrast, to invoke the other two which I did not specify the shebang: bash make_dummy_file_1.sh bash make_dummy_file_2.sh

Rename the files with regular expression by either using sed or rename command
for fspec1 in *.gz
do
#echo $fspec1
fspec2=$(echo ${fspec1} | sed "s/\([1-4]egg\)_r\([1-2]\)_0\([1-2]\)_sub.fastq.gz/\1_R\3_00\2.fastq.gz/")
echo $fspec2
mv ${fspec1} ${fspec2}
done
view raw rename.sh hosted with ❤ by GitHub
rename "s/([1-4]egg)_r([1-2])_0([1-2])_sub.fastq.gz/\$1_R\$3_00\$2.fastq.gz/" *fastq.gz

the rename command use the perl regular expression. use \ to escape $.
the sed command need to escape the () which are used to capture the back reference
before:
tommy@tommy-ThinkPad-T420:~/foo$ ls 1_egg_r1_01_sub.fastq.gz 2_egg_r1_01_sub.fastq.gz 3_egg_r1_01_sub.fastq.gz 4_egg_r1_01_sub.fastq.gz copy make_dummy_file_1.sh 1_egg_r1_02_sub.fastq.gz 2_egg_r1_02_sub.fastq.gz 3_egg_r1_02_sub.fastq.gz 4_egg_r1_02_sub.fastq.gz dummy make_dummy_file_2.sh 1_egg_r2_01_sub.fastq.gz 2_egg_r2_01_sub.fastq.gz 3_egg_r2_01_sub.fastq.gz 4_egg_r2_01_sub.fastq.gz files.txt rename.sh 1_egg_r2_02_sub.fastq.gz 2_egg_r2_02_sub.fastq.gz 3_egg_r2_02_sub.fastq.gz 4_egg_r2_02_sub.fastq.gz make_dummy_file.sh rename_one_liner.sh 

after:
tommy@tommy-ThinkPad-T420:~/foo$ ls 1egg_R1_001.fastq.gz 2egg_R1_001.fastq.gz 3egg_R1_001.fastq.gz 4egg_R1_001.fastq.gz copy make_dummy_file_1.sh 1egg_R1_002.fastq.gz 2egg_R1_002.fastq.gz 3egg_R1_002.fastq.gz 4egg_R1_002.fastq.gz dummy make_dummy_file_2.sh 1egg_R2_001.fastq.gz 2egg_R2_001.fastq.gz 3egg_R2_001.fastq.gz 4egg_R2_001.fastq.gz files.txt rename.sh 1egg_R2_002.fastq.gz 2egg_R2_002.fastq.gz 3egg_R2_002.fastq.gz 4egg_R2_002.fastq.gz make_dummy_file.sh rename_one_liner.sh

References: http://stackoverflow.com/questions/399078/what-special-characters-must-be-escaped-in-regular-expressions

http://stackoverflow.com/questions/10929453/bash-scripting-read-file-line-by-line
https://www.cs.tut.fi/~jkorpela/perl/regexp.html

Friday, August 8, 2014

R commands basics

we are on day 5 of the MSU NGS course. This morning, Ian Dworkin introduced some basic R.
I found it refreshing and put a gist below.
Pick one language, and learn it well!
pick up a dataset, and play with it!  Happy coding!
By the way, the food here at KBS is amazing, I am gaining weight :)


#2014 MSU NGS R basics tutorial
#http://angus.readthedocs.org/en/2014/R_Introductory_tutorial_2014.html
#https://github.com/jrherr/quick_basic_R_tutorial/blob/master/R_tutorial.md
#pick one language, and learn it well!
#pick up a dataset, play with it!
#object-oriented programming
#functional programming
#deal with big data in R: (R holds all the data in memory)
#http://theodi.org/blog/fig-data-11-tips-how-handle-big-data-r-and-1-bad-pun
#http://r-pbd.org/
#packages: plyr, dplyr, ggplot2, reshape2, data.table (fread function)
# commands start here!!
q() # quit R
getwd() # get working directory
setwd() # set working directory
y<- 2 # assign a variable
x<- 3
x + y # treat it as a calculator
x * y
x/y
x %% y
x == 3 # equal sign, will reture a logical vector: True or False
# in R, True and False have numerical values: True resolves to 1 , False resolves to 0
# exponents ** or ^
2**2 # returns 4
2^2 # returns 4
log(2.7) # natural log returns 0.99325
log(4,2) # returns 2
a<- c(2,3,6,8) # assign a vector use c denotes concatenate
b<- c(3,5,6,7)
a + b #
a * b #
# when length of a and b are different, R will recycle the longer one and gives a warning
length(a) # length of a, returns 4
new_varaible <- c(a,b) # concatenate two variables
crap<- rep(1:100) # index starts at 1
rm (crap) # remove this variable
?lm # get help for linear regression model function
# simple functions
a<- c(2,3,6,8)
mean(a)
sum(a)
var(a)
b<- c(3,5,6,7)
cor(a,b) # pearson correlation for two vectors
m<- cbind(a,b) # column bind the vector to a matrix
m
#> m
# a b
# [1,] 2 3
# [2,] 3 5
# [3,] 6 6
# [4,] 8 7
cor(m) # pearson correlation for columns of a matrix
cor(m, method="spearman") # spearman correlation of columns of a matrix
?cor
mode (a) # numeric
class (a) # numeric
class (m) # matrix
typeof(m)
str(m) # structure of m, try it out in your R console
dim(m) # dimension of m: 4 2
nrow(m) # number of rows 4
ncol(m) # number of columns 2
length(m) # 8
is.matrix(m) # True
# create a matrix from scratch
#> m1<- matrix(1:12,3,4)
#> m1
# [,1] [,2] [,3] [,4]
#[1,] 1 4 7 10
#[2,] 2 5 8 11
#[3,] 3 6 9 12
# strings
cities<- c("E.lansing", "Gainesville", "Shanghai", "Yichun")
class(cities)
#[1] "character"
length(cities) # 4 cities in the vector
nchar(cities) # number of characters for each city
#[1] 9 11 8 6
sum(nchar(cities))
#[1] 34
rivers<- c("Red Cedar", "swamp", "Huang Pu", "Long He")
cities_rivers<- cbind (cities,rivers)
cities_rivers
# cities rivers
#[1,] "E.lansing" "Red Cedar"
#[2,] "Gainesville" "swamp"
#[3,] "Shanghai" "Huang Pu"
#[4,] "Yichun" "Long He"
class(cities_rivers) # matrix
mode (citeis_rivers) # character
model_1<- y ~ x1 + x2 + x1:x2
model_1
#> class(model_1)
#[1] "formula"
counts_transcript_a <- c(250, 157, 155, 300, 125, 100, 153, 175)
genotype <- gl(n=2, k=4, labels = c("wild_type", "mutant"))
#> genotype
#[1] wild_type wild_type wild_type wild_type mutant
#[6] mutant mutant mutant
#Levels: wild_type mutant
#alternative to gl function, one can
genotype1<- factor(rep(c("wild_type","mutant"),each=4))
#> genotype1
#[1] wild_type wild_type wild_type wild_type mutant
#[6] mutant mutant mutant
#Levels: mutant wild_type
#notice the use of the "each" argument
genotype1<- factor(rep(c("wild_type","mutant"),4))
#> genotype1
#[1] wild_type mutant wild_type mutant wild_type
#[6] mutant wild_type mutant
#Levels: mutant wild_type
#also, notice that the levels are different with that generated by gl function
#we want the wild_type to be the base level. Instead, do:
genotype1<- factor(rep(c("wild_type","mutant"),each=4), levels=c("wild_type","mutant"))
#> genotype1
#[1] wild_type wild_type wild_type wild_type mutant mutant mutant mutant
#Levels: wild_type mutant
?relevel # try it also
expression_data <- data.frame(counts_transcript_a, genotype)
#> expression_data
# counts_transcript_a genotype
#1 250 wild_type
#2 157 wild_type
#3 155 wild_type
#4 300 wild_type
#5 125 mutant
#6 100 mutant
#7 153 mutant
#8 175 mutant
expression_data$counts_transcript_a # access a column of the dataframe
ls() # objects in the enviroment
rm(list=ls()) # remove all the objects in the enviroment
### write functions
StdErr <- function(vector) {
sd(vector)/sqrt(length(vector))
}
CoefVar<- function(vector){
sd(vector)/mean(vector)
}
# apply families http://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/
# with
> with(expression_data, tapply(X=counts_transcript_a, INDEX=genotype, FUN=mean))
wild_type mutant
215.50 138.25
# some commonly used functions, try ?to understand them
head() # print the first 6 lines, different with linux (default 10 lines)
table()
rownames()
colnames()
nrow()
ncol()
by()
with()
rowSums()
rowMeans()
summary()
# construct sequences
one_to_20<- 1:20
twenty_to_1<- 20:1
seq1<- seq(from =1, to = 20, by 0.5)
# or seq1<- seq(1,20,0.5)
# repeat numbers
many_2<- rep(2, times=20)
many_a<- rep("a", times=20)
seq_rep<- rep(1:10, times=2)
rep_3_times<- rep(c(1,2,3), times=3)
# different
rep_each_3_times<- rep(c(1,2,3), each=3)
# to do: subsetting for vectors and matrix
# R mark-down

Thursday, August 7, 2014

Understanding the Forward strand and Reverse strand and the coordinates systems

we are on day 4 of the MSU NGS course. In the morning, instructor Istvan introduced Genomic Intervals. To understand the coordinates system, one needs to understand the strandness of DNA.

sense strand is the coding strand
anti-sense strand is the reverse-complementary strand of the coding strand
see details below:
https://www.biostars.org/p/3423/
https://www.biostars.org/p/3908/
again, everything is on biostar :)


I drew a picture to better understand it


remember:
coordinates are reported 5'---> 3'  forward strand
transcription occurs from 5' to 3'
forward/plus strand and reverse/reverse strand are designated arbitrarily.
Imagine that you can flip over the example I drew, then gene A would be in minus strand.

# 0-based and 1-based coordinates system

0 based and 1 based coordinates  cheat sheet
https://www.biostars.org/p/84686/

various formats: http://genome.ucsc.edu/FAQ/FAQformat.html
GFF3 specification: http://www.sequenceontology.org/gff3.shtml
0-based formats:BED, wiggle, BEDGRAPH
1-based formats: GFF, GTF, GBK (genebank file), SAM, VCF


# lift over coordinates
lift-over between different versions of genome https://genome.ucsc.edu/util.html
Generally do not do it, just map to the right version of interest.
By the way, the latest human genome GRCh38 is released: http://www.ensembl.info/blog/2014/08/07/ensembl-76-has-been-released/

Wednesday, August 6, 2014

linux commands basics

I am attending the NGS course at MSU. This is a great course with great instructors and friendly colleagues.
I highly recommend this course to everyone. http://bioinformatics.msu.edu/ngs-summer-course-2014

This morning, we learned SNP calling by samtools and sam file specification (I will write another blog for the SNP calling) .in the night , TA Elijah gave an awesome introduction to linux commands.
personally, I think this should be taught in the first day of the course. ( I am already pretty familiar with basic linux commands, but it does cause a lot of frustrations for beginners).

I took the notes, and put the commands that taught in a gist, see below and enjoy linux commands!

#linux commands basics
#http://software-carpentry.org/v5/novice/shell/index.html
# practise, practise, practise, google, google, google and you will get it :)
pwd # print working directory
cd # change directory
sudo # super user privilege
chmod 775 # change the privileges http://en.wikipedia.org/wiki/Chmod
git clone # version control! get to know git and github! http://git-scm.com/
sudo bash # bad habit
whoami # who you are
exit
logout
echo # it's like print
ls -sh # give the size information in MB etc
ls -F # color your folders vs files
ls -l # long format
ls -a # list everything including invisible files
ls /home/tommy/data # list files in the data folder by specifying a full path
clear # clear the screen
# remember the tab magic auto-fill
# clarify the filesystem: slash / denotes the root directory ~ denotes the home directory
cd - # change directory to your previous directory
cd ~ or just cd # change directory to your home directory
cd .. # change directory one level up
cd ../.. # change directory to two levels up
cp /home/tommy/data.txt . # copy data.txt to the current directory which denoted by .
# create things txt editors!
nano # use this one, because it is easy
emacs # do not use it until later stage
vi # 90% of the people do not know how to exit vim :)
cat mydata.txt # print the content of mydata.txt to the stand output(the screen)
cat mydata1.txt mydata2.txt > total_data.txt # concatenate two files to a new file
ctrl + D # quit (indicate cat that this is the last line) when you do something like: cat > mydata.txt
head -15 # print the first 15 lines to the screen
tail # print the last 10 lines to the screen by default
less # use less to peek txt files. less is more
#get help
man # man cat will show you all the flags (arguments) for cat command
info
# control + c to quit any command
rm # dangerous remove files (permanetly gone)
rm -fr #remove directory
move # rename files
move mydata.txt my_new_data.txt # rename mydata.txt to my_new_data.txt
cp # copy files. do not put space in your file names in linux, just remember it.
#############################
# text manipulation again use man to explore the flags for all the commands below or google it!
sed # a book can be dedicated to explain the usage of it http://www.thegeekstuff.com/sed-awk-101-hacks-ebook/
awk # a book can be dedicated to explain the usage of it http://www.thegeekstuff.com/sed-awk-101-hacks-ebook/
cut # one of the most frequently commands I use, cut the columns out of a txt file
cut -f1,3 mydata.txt # cut out the first and third columns out and print it to screen
paste # paste two files side by side
comm # look common lines of two files, files need to be sorted first.
comm -12 # will supress the unique lines in each file
diff # look different lines of two files.
wc # word count
wc -l # line number
grep ATCG mydata.txt | head # find lines containing ATCG in mydata.txt and look at it by head
grep -v ATCG mydata.txt > no_ATCG.txt # find lines not containing ATCG and redirect to a new file named no_ATCG.txt
nl # number lines of your output
sort -k2,2 -nr # reverse numerical sort based on my second column
###############################
#up-arrow to reuse your previous command
control + r #reverse search your command
history # your command history
# redirect with >, instead print the output in the command screen, redirect it to a new file
head -n 1000 mydata.txt > newdata.txt
# redirect the first 1000 lines to a new file named newdata.txt
ls . > file_names.txt
# redirect all the file names in the current directory to a file named file_names.txt
##########################
# pipes | where the power comes
history | less # look at your history by less, type q to quit less
cat mydata.txt | sort -k2,2 -nr > whole_sorted_data.txt # I am a useless cat fan
# the upper one equals to
sort -k2,2 -nr mydata.txt> whole_sorted_data.txt
head -n 1000 mydata.txt | sort -k2,2 -nr > my_selected_sorted_data.txt
###########################
# use wildcard characters * ? .
# again regular expressions! http://regexone.com/
##################### loops
for file in *.xml; do head -3 $file; done
for file in *.xml; do head -3 $file >> all_head3_xml.txt; done
for file in *.xml; do head -3 ${file} > ${file}.txt; done # rename the files by adding txt suffix keeping the old name of the *.xml files.
Creative Commons License
linux basics by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Friday, July 25, 2014

run a local blast

I was checking the quality of a ChIP-seq fastq with fastqc, and the overrepresented sequence showed that the data contain Illumina Single End Adapter 1 contamination.

I have a file which contains all the possible illumina library sequences in fasta format (got from here), and I manually looked it and found that the single End Adapter 1 sequence did not match the report. I then was wondering  which sequence is the contamination. To do that, I have to do a local blast of  the overrepresented sequence "GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTAGATCGGAAGAGCTCGTATGCC"
on the illumina library sequences.

Luckily, the ncbi_blast was installed on the HPC. One can follow the link here run local blast
first, use the formatdb command to convert the file to a database

$ module load ncbi_blast

# check what arguments are for the command
$ formatdb -

formatdb 2.2.26   arguments:

  -t  Title for database file [String]  Optional
  -i  Input file(s) for formatting [File In]  Optional
  -l  Logfile name: [File Out]  Optional
    default = formatdb.log
  -p  Type of file
         T - protein
         F - nucleotide [T/F]  Optional
    default = T
  -o  Parse options
         T - True: Parse SeqId and create indexes.
         F - False: Do not parse SeqId. Do not create indexes.
 [T/F]  Optional
    default = F
  -a  Input file is database in ASN.1 format (otherwise FASTA is expected)
         T - True,
         F - False.
 [T/F]  Optional
    default = F
  -b  ASN.1 database in binary mode
         T - binary,
         F - text mode.
 [T/F]  Optional
    default = F
  -e  Input is a Seq-entry [T/F]  Optional
    default = F
  -n  Base name for BLAST files [String]  Optional
  -v  Database volume size in millions of letters [Integer]  Optional
    default = 4000
  -s  Create indexes limited only to accessions - sparse [T/F]  Optional
    default = F
  -V  Verbose: check for non-unique string ids in the database [T/F]  Optional
    default = F
  -L  Create an alias file with this name
        use the gifile arg (below) if set to calculate db size
        use the BLAST db specified with -i (above) [File Out]  Optional
  -F  Gifile (file containing list of gi's) [File In]  Optional
  -B  Binary Gifile produced from the Gifile specified above [File Out]  Optional
  -T  Taxid file to set the taxonomy ids in ASN.1 deflines [File In]  Optional


$ formatdb -i adaptors_list.fa -o F -p F

several other files with names ended with nhr, nin and nsq are generated in the same directory.
then run blast:

# check argument for balstall
$ blastall -
blastall 2.2.26   arguments:

  -p  Program Name [String]
  -d  Database [String]
    default = nr
  -i  Query File [File In]
    default = stdin
  -e  Expectation value (E) [Real]
    default = 10.0
  -m  alignment view options:
0 = pairwise,
1 = query-anchored showing identities,
2 = query-anchored no identities,
3 = flat query-anchored, show identities,
4 = flat query-anchored, no identities,
5 = query-anchored no identities and blunt ends,
6 = flat query-anchored, no identities and blunt ends,
7 = XML Blast output,
8 = tabular,
9 tabular with comment lines
10 ASN, text
11 ASN, binary [Integer]
    default = 0
    range from 0 to 11
  -o  BLAST report Output File [File Out]  Optional
    default = stdout
  -F  Filter query sequence (DUST with blastn, SEG with others) [String]
    default = T
  -G  Cost to open a gap (-1 invokes default behavior) [Integer]
    default = -1
  -E  Cost to extend a gap (-1 invokes default behavior) [Integer]
    default = -1
  -X  X dropoff value for gapped alignment (in bits) (zero invokes default behavior)
      blastn 30, megablast 20, tblastx 0, all others 15 [Integer]
    default = 0
  -I  Show GI's in deflines [T/F]
    default = F
  -q  Penalty for a nucleotide mismatch (blastn only) [Integer]
    default = -3
  -r  Reward for a nucleotide match (blastn only) [Integer]
    default = 1
  -v  Number of database sequences to show one-line descriptions for (V) [Integer]
    default = 500
  -b  Number of database sequence to show alignments for (B) [Integer]
    default = 250
  -f  Threshold for extending hits, default if zero
      blastp 11, blastn 0, blastx 12, tblastn 13
      tblastx 13, megablast 0 [Real]
    default = 0
  -g  Perform gapped alignment (not available with tblastx) [T/F]
    default = T
  -Q  Query Genetic code to use [Integer]
    default = 1
  -D  DB Genetic code (for tblast[nx] only) [Integer]
    default = 1
  -a  Number of processors to use [Integer]
    default = 1
  -O  SeqAlign file [File Out]  Optional
  -J  Believe the query defline [T/F]
    default = F
  -M  Matrix [String]
    default = BLOSUM62
..........
..........

$ blastall -i overrepresented_sequence.fa -d adaptors_list.fa -p blastn -e 1e-6 

look at the result, I found that Adapter 2 is the contamination 
>Illumina_Single_End_Apapter_2
          Length = 34

 Score = 63.9 bits (32), Expect = 1e-14
 Identities = 32/32 (100%)
 Strand = Plus / Minus


Query: 1  gatcggaagagctcgtatgccgtcttctgctt 32
          ||||||||||||||||||||||||||||||||
Sbjct: 33 gatcggaagagctcgtatgccgtcttctgctt 2


I then feed Trimmomatic with this fasta file and removed the adaptor.



Tuesday, July 22, 2014

use python to change the header of a fasta file based on a dictionary in another file

I saw on SEQanwser, someone asked this question here, and I used python to resolve this problem.
I have not written in python for so long, I've been using R for a long time. This provide me a good practice opportunity.

The basic problem is to format some files, and I know it can be done by other tools like sed and awk ( see an awk solution below), but python just gives you much more flexibility in terms of what you can do and gives you more readability.
myfasta file

>comp1558_c0_seq1 len=204 path=[4976:0-203]
ATTGTACTTAATCTA.....
>comp8142_c0_seq1 len=222 path=[8835:0-221]
GTACAATCACGGCACATT......
>comp8570_c0_seq1 len=232 path=[3344:0-232]
GCATCGCTTATCGGTTTA......

annotation file:

comp1558_c0_seq1 repressor protein
comp8142_c0_seq1 aspartate aminotransferase
comp8357_c0_seq1 l-xylulose reductase
comp8387_c0_seq1 succinyl- synthetase alpha
comp8570_c0_seq1 fatty acid synthase beta subunit dehydratase

see the code below:
with open ("C:/Users/Tang Ming/Desktop/anotation.txt", "r") as annotation:
anotation_dict = {}
for line in annotation:
line = line.split()
if line: #test whether it is an empty line
anotation_dict[line[0]]=line[1:]
else:
continue
# really should not parse the fasta file by myself. there are
# many parsers already there. you can use module from Biopython
ofile = open ("C:/Users/Tang Ming/Desktop/output.txt", "w")
with open ("C:/Users/Tang Ming/Desktop/my_fasta.txt", "r") as myfasta:
for line in myfasta:
if line.startswith (">"):
line = line[1:] # skip the ">" character
line = line.split()
if line[0] in anotation_dict:
new_line = ">" + str(line[0]) + " " + " ".join(anotation_dict[line[0]])
ofile.write ( new_line + "\n")
else:
ofile.write ( ">"+ "".join(line) + "\n")
else:
ofile.write(line +"\n")
ofile.close() # always remember to close the file.




output:
>comp1558_c0_seq1 repressor protein
ATTGTACTTAATCTA.....
>comp8142_c0_seq1 aspartate aminotransferase
GTACAATCACGGCACATT......
>comp8570_c0_seq1 fatty acid synthase beta subunit dehydratase
GCATCGCTTATCGGTTTA......

if the dictionary file has only two records for each line, this awk should work:
http://stackoverflow.com/questions/11678939/replace-text-based-on-a-dictionary
NR == FNR {
  rep[$1] = $2
  next
} 

{
    for (key in rep) {
      gsub(key, rep[key])
    }
    print
}

Friday, July 4, 2014

get an idea of a gene expression value across samples by GEOquery

My next-door PI wanted to look at the TRIM29 expression levels in a series of tumor xenografts from a microarray data. I used GEOquery bioconductor package to get the log2 transformed values and plot a boxplot for him. see the code below. Also, I had a previous post on GEOquery http://crazyhottommy.blogspot.com/2013/12/geoquery-to-access-geo-datasets.html

library(Biobase)
library(GEOquery)
# load series and platform data from GEO
gset <- getGEO("GSE34412", GSEMatrix =TRUE)
gset<- gset[[1]]
# make proper column names to match toptable
fvarLabels(gset) <- make.names(fvarLabels(gset))
featureNames(gset)[1:10]
sampleNames(gset)[1:10]
# log2 transform
ex <- exprs(gset)
qx <- as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))
LogC <- (qx[5] > 100) ||
(qx[6]-qx[1] > 50 && qx[2] > 0) ||
(qx[2] > 0 && qx[2] < 1 && qx[4] > 1 && qx[4] < 2)
if (LogC) { ex[which(ex <= 0)] <- NaN
exprs(gset) <- log2(ex) }
annotation(gset)
gpl <- getGEO('GPL8269')
#get annotation tabole
head(gpl)
gpl_ann<-Table(gpl)
names(gpl_ann)
head(gpl_ann)
gpl_ann[gpl_ann$"Blast Gene Symbol"=="TRIM29",]
exprs(gset)[1:4,1:4]
rownames(exprs(gset))<- gpl_ann$"Blast Gene Symbol"
index<- rownames(exprs(gset))=="TRIM29"
exprs(gset)[index,][1:4,1:4]
#only the first 25 xenografts data
Trim29_exprs<- exprs(gset)[index,][,1:25]
Trim29_exprs
#set the margin
par(mar=c(8,8,4,2))
boxplot(Trim29_exprs, ylab="Trim29 log2 transformed expression level", las=2) #make the lable vertical by specifying las=2
#############################################
names(pData(gset))
pData(gset)[,c(1,2,28)]
names(fData(gset))
fData(gset)$ID[1:20]
fData(gset)$GB_ACC[1:20]
fData(gset)$Blast.Gene.Symbol[1:20]
"TRIM29" %in% fData(gset)$Blast.Gene.Symbol
which (fData(gset)$Blast.Gene.Symbol=="TRIM29")


after got the Expression set object, I looked the names of the object:
> names(pData(gset))
 [1] "title"                    "geo_accession"            "status"                   "submission_date"        
 [5] "last_update_date"         "type"                     "channel_count"            "source_name_ch1"        
 [9] "organism_ch1"             "characteristics_ch1"      "biomaterial_provider_ch1" "molecule_ch1"          
[13] "extract_protocol_ch1"     "label_ch1"                "label_protocol_ch1"       "taxid_ch1"              
[17] "source_name_ch2"          "organism_ch2"             "characteristics_ch2"      "biomaterial_provider_ch2"
[21] "molecule_ch2"             "extract_protocol_ch2"     "label_ch2"                "label_protocol_ch2"    
[25] "taxid_ch2"                "hyb_protocol"             "scan_protocol"            "description"            
[29] "data_processing"          "platform_id"              "contact_name"             "contact_email"          
[33] "contact_phone"            "contact_department"       "contact_institute"        "contact_address"        
[37] "contact_city"             "contact_state"            "contact_zip/postal_code"  "contact_country"        
[41] "supplementary_file"       "data_row_count"

then, only look at the meta data:
> pData(gset)[,c(1,2,28)]
                                 title geo_accession  description
GSM847887                  ML-5998-TG1     GSM847887    Xenograft
GSM847888              Baylor 2147 TG6     GSM847888    Xenograft
GSM847890             Baylor 2665A TG6     GSM847890    Xenograft
GSM847891              Baylor 3107 TG5     GSM847891    Xenograft
GSM847892              Baylor 3143 TG5     GSM847892    Xenograft
GSM847893              Baylor 3204 TG5     GSM847893    Xenograft
GSM847894              Baylor 3561 TG5     GSM847894    Xenograft
GSM847895              Baylor 3611 TG5     GSM847895    Xenograft
GSM847896             Baylor 3613A TG5     GSM847896    Xenograft
GSM847898              Baylor 3807 TG5     GSM847898    Xenograft
GSM847901              Baylor 3887 TG5     GSM847901    Xenograft
GSM847902              Baylor 3904 TG5     GSM847902    Xenograft
GSM847903              Baylor 3936 TG5     GSM847903    Xenograft
GSM847904              Baylor 3963 TG5     GSM847904    Xenograft
GSM847905              Baylor 4013 TG1     GSM847905    Xenograft
GSM847907              Baylor 4175 TG1     GSM847907    Xenograft
GSM847908              Baylor 4195 TG4     GSM847908    Xenograft
GSM847909              Baylor 4272 TG1     GSM847909    Xenograft
GSM847911              Baylor 4664 TG1     GSM847911    Xenograft
GSM847914              Baylor 4888 TG1     GSM847914    Xenograft
GSM847915              Baylor 4913 TG1     GSM847915    Xenograft
GSM847917                  ML-4189-TG2     GSM847917    Xenograft
GSM847919                  ML-5097-TG2     GSM847919    Xenograft
GSM847920                  ML-5156-TG2     GSM847920    Xenograft
GSM847921                  ML-5471-TG2     GSM847921    Xenograft
GSM847922 9830-000060B NEW PROTOCOL V5     GSM847922 Tumor sample
GSM847923            9830-000094B-244K     GSM847923 Tumor sample
GSM847924          9830-000424B-244Kv5     GSM847924 Tumor sample
GSM847925          9830-000517B-244Kv5     GSM847925 Tumor sample
GSM848100 9830-010118B NEW PROTOCOL V5     GSM848100 Tumor sample
GSM848101 9830-010130B NEW PROTOCOL V5     GSM848101 Tumor sample
GSM848102 9830-010214B NEW PROTOCOL V5     GSM848102 Tumor sample
GSM848103 9830-010255B NEW PROTOCOL V5     GSM848103 Tumor sample
GSM848104 9830-010384B NEW PROTOCOL V5     GSM848104 Tumor sample
GSM848105          9830-010461B-244Kv5     GSM848105 Tumor sample
GSM848106        9830-020018B-244K2008     GSM848106 Tumor sample
GSM848107          9830-020025B-244Kv5     GSM848107 Tumor sample
GSM848108          9830-020039B-244Kv5     GSM848108 Tumor sample
GSM848109        9830-020185B-244K2008     GSM848109 Tumor sample
GSM848110          9830-020310B-244Kv5     GSM848110 Tumor sample
GSM848111 9830-020340B NEW PROTOCOL V5     GSM848111 Tumor sample
GSM848112 9830-020416B NEW PROTOCOL V5     GSM848112 Tumor sample
GSM848113        9830-030267B-244K2008     GSM848113 Tumor sample
GSM848114          9830-030446B-244Kv5     GSM848114 Tumor sample
GSM848115        9830-030597B-244K2008     GSM848115 Tumor sample
GSM848116         UNC-000279B-244K2008     GSM848116 Tumor sample
GSM848117         UNC-010208B-244K2008     GSM848117 Tumor sample
GSM848118           UNC-010224B-244Kv5     GSM848118 Tumor sample
GSM848119           UNC-010304B-244Kv5     GSM848119 Tumor sample
GSM848120         UNC-010509B-244K2008     GSM848120 Tumor sample
GSM848121           UNC-020155B-244Kv5     GSM848121 Tumor sample
GSM848122             UNC-020320B-244K     GSM848122 Tumor sample
GSM848123         UNC-020578B-244k-DGE     GSM848123 Tumor sample
GSM848124         UNC-030065B-244K2008     GSM848124 Tumor sample
GSM848125         UNC-030183B-244K2008     GSM848125 Tumor sample
GSM848126         UNC-030370B-244K2008     GSM848126 Tumor sample
GSM848127         UNC-030528B-244K2008     GSM848127 Tumor sample
GSM848128         UNC-040011B-244K2008     GSM848128 Tumor sample
GSM848129           UNC-960028B-244Kv5     GSM848129 Tumor sample
GSM848130           WashU-15720-244Kv5     GSM848130 Tumor sample

there are 10 probes for TRIM29 gene:
> gpl_ann[gpl_ann$"Blast Gene Symbol"=="TRIM29",]
           ID    GB_ACC SPOT_ID Public id       Probe NAME Blast Gene ID Blast Refseq ID Blast Gene Symbol
35972   35972 NM_012101         NM_012101 NM_012101_2_2683         23650     NM_012101.3            TRIM29
39854   39854 NM_012101         NM_012101     A_23_P340123         23650     NM_012101.3            TRIM29
76447   76447 NM_012101         NM_012101     A_23_P340123         23650     NM_012101.3            TRIM29
94361   94361 NM_012101         NM_012101     A_23_P203267         23650     NM_012101.3            TRIM29
97602   97602 NM_012101         NM_012101 NM_012101_2_2608         23650     NM_012101.3            TRIM29
111151 111151 NM_012101         NM_012101 NM_012101_2_2683         23650     NM_012101.3            TRIM29
120477 120477 NM_012101         NM_012101     A_23_P203260         23650     NM_012101.3            TRIM29
132274 132274 NM_012101         NM_012101     A_23_P203260         23650     NM_012101.3            TRIM29
187541 187541 NM_012101         NM_012101 NM_012101_2_2608         23650     NM_012101.3            TRIM29
190531 190531 NM_012101         NM_012101     A_23_P203267         23650     NM_012101.3            TRIM29
               Blast Gene Description Blast Chromosome Map Location
35972  tripartite motif-containing 29                     11q22-q23
39854  tripartite motif-containing 29                     11q22-q23
76447  tripartite motif-containing 29                     11q22-q23
94361  tripartite motif-containing 29                     11q22-q23
97602  tripartite motif-containing 29                     11q22-q23
111151 tripartite motif-containing 29                     11q22-q23
120477 tripartite motif-containing 29                     11q22-q23
132274 tripartite motif-containing 29                     11q22-q23
187541 tripartite motif-containing 29                     11q22-q23
190531 tripartite motif-containing 29                     11q22-q23

the boxplot of the TRIM29 expression levels across different samples: