Diving into Genetics and Genomics

This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Thursday, October 30, 2014

Test significance of overlapping genes from 2 gene sets

The question is very straightforward:
I have three microarray or RNA-seq data sets (control, treatment1, treatment2, each should have several replicates, otherwise you can not get a pvalue for differential expression!) , I then get two sets of differentially up-regulated genes ( treatment1 vs control, treatment2 vs control). There are some overlapping genes which are both up-regulated by treatment1 and treatment2. How significant the overlapping is?

The easiest way is to use an online program http://nemates.org/MA/progs/overlap_stats.html
The program calculates the p value by using Hypergeometric_distribution.

see below for a python and an R script for this problem.

read posts here http://stats.stackexchange.com/questions/16247/calculating-the-probability-of-gene-list-overlap-between-an-rna-seq-and-a-chip-c
http://blog.nextgenetics.net/?e=94
https://www.biostars.org/p/90662/

list1=3000, list2=400, Total gene number (population)=15000,
overlapping between list1 and list2 =100
From the stackexchange link:
"The p-value you want is the probability of getting 100 or more white balls in a sample of size 400 from an urn with 3000 white balls and 12000 black balls. Here are four ways to calculate it.

sum(dhyper(100:400, 3000, 12000, 400))
1 - sum(dhyper(0:99, 3000, 12000, 400))
phyper(99, 3000, 12000, 400, lower.tail=FALSE)
1-phyper(99, 3000, 12000, 400)

These give 0.0078.

dhyper(x, m, n, k) gives the probability of drawing exactly x. In the first line, we sum up the probabilities for 100 – 400; in the second line, we take 1 minus the sum of the probabilities of 0 – 99.

phyper(x, m, n, k) gives the probability of getting x or fewer, so phyper(x, m, n, k) is the same as sum(dhyper(0:x, m, n, k)).

The lower.tail=FALSE is a bit confusing. phyper(x, m, n, k, lower.tail=FALSE) is the same as 1-phyper(x, m, n, k), and so is the probability of x+1 or more "

If I run with the python script, I got the same p-value:

tommy@tommy-ThinkPad-T420 ~/scripts_general_use
$ ./hyper_geometric.py 15000 3000 400 100

total number in population: 15000
total number with condition in population: 3000
number in subset: 400
number with condition in subset: 100

p-value <= 100: 0.996049535673

p-value >= 100: 0.00784035207977

The bottom line is that we need more statistics!!

Thursday, October 23, 2014

convert bam file to bigwig file and visualize in UCSC genome browser in a Box (GBiB)

I just came across UCSC genome browser in a box Gbib, and I want to try it out.

what is Gbib?
from the link above:

"Genome Browser in a Box (GBiB) is a "virtual machine" of the entire UCSC Genome Browser website that is designed to run on most PCs (Windows, Mac OSX or Linux). GBiB allows you to access much of the UCSC Genome Browser's functionality from the comfort of your own computer. It is particularly directed at individuals who want to use the Genome Browser's functionality to view protected data. With the anticipation that the majority of protected data use will focus on the human genomes (primarily GRCh37/hg19, and transitioning to GRCh38/hg38), the current version of GBiB has been optimized for use with the hg19 assembly. Many other recent genome assemblies can also be viewed, but without mirroring of additional data, they may be slow."

Download the compressed file from here https://genome-store.ucsc.edu/ and follow the installation instructions. Open virtual box and start the virtual machine and then direct your web browser to 127.0.0.1:1234, you should see the UCSC genome browser page.

It looks the same as the UCSC genome browser page https://genome.ucsc.edu/. You can then go to the browser and do what you usually do in the genome browser. when I searched a gene, and errors came out:

nextRow failed
mySQL error 1194: Table 'rmsk' is marked as crashed and should be repaired (profile=db, host=localhost, db=hg19)

I asked the UCSC mail list, and I typed the command below in the console:
sudo mysqlcheck --all-databases --auto-repair --fast
it works like a charm now!

I have some bam files and I want to visualize them in the UCSC genome browser. I would first convert them to bigwig http://genome.ucsc.edu/goldenpath/help/bigWig.html and then upload to some open http https or ftp locations. see my old post:
hosting bigwig by Dropbox for UCSC

However, with GBiB one only needs to share a local folder containing the bigwig files with virtual box.

"Loading Local Big Data Tracks

Your computer can share directories with GBiB so that you can load big files without the need to upload them to a web server. The big file formats are compressed, indexed, binary files, and include bigBed,bigWig, BAM, and VCF files. Normally, one would have to place these types files onto a publicly accessible web server to upload them as a custom track. However, GBiB acts as its own web server, meaning that you can share local files with GBiB for easy uploading as a custom track."

I will start from converting bam to bigwig.
several posts you may want to have a look:
https://www.biostars.org/p/64495/#64680
https://www.biostars.org/p/2699/
http://onetipperday.blogspot.com/2012/07/three-ways-to-convert-bambed-file-to.html

I used genomeCoverageBed from bedtools to convert bam to bedgraph first, and then used UCSC utilities bedClip and bedGraphToBigWig http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/ to do the conversion.

bedtools requires the bam files be sorted. so samtools sort was used to sort the bam file first.
I used a script from Tao Liu (the author of famous MACS ChIP-seq calling software) to do the conversion after I got the bedgraph files. I then put the bigwig files in a folder and shared it with virtual box. Follow the instructions http://genome.ucsc.edu/goldenPath/help/gbib#LocalTracks. I successfully visualized a bunch of bigwig files!

Wednesday, October 22, 2014

speed up grep

I had a list of gene names in txt file. There are around 500 genes with one gene name in one line, and I want to filter the gtf file from ensemble Homo_sapines.GRCh37.74gtf.gz

the gtf file contains 2244857 lines. I used grep to do it, but it takes very long (~1 hour).

what I used:

zcat Homo_sapines.GRCh37.74gtf.gz | grep -f gene_names.txt -w > my_genes.gtf

I searched on line, and found several posts in stackoverflow to speed up grep:
http://stackoverflow.com/questions/14602963/faster-grep-function-for-big-27gb-files
http://stackoverflow.com/questions/13913014/grepping-a-huge-file-80gb-any-way-to-speed-it-up
http://stackoverflow.com/questions/9066609/fastest-possible-grep

options to speed up:

1) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8.

2) Use fgrep because you're searching for a fixed string, not a regular expression

I then used:

zcat Homo_sapines.GRCh37.74gtf.gz | LC_ALL=C fgrep -f gene_names.txt -w > my_genes.gtf

It runs much faster!

Friday, September 26, 2014

everyone will write a blog post on dplyr

dply is a nice R package to manipulate big data. There are several good tutorials I came across:

http://www.dataschool.io/dplyr-tutorial-for-faster-data-manipulation-in-r/
http://stat545-ubc.github.io/block009_dplyr-intro.html
http://stat545-ubc.github.io/bit001_dplyr-cheatsheet.html
http://gettinggeneticsdone.blogspot.com/2014/08/do-your-data-janitor-work-like-boss.html

I followed the example in the early edition of Bioinformatics Data Skills on the dplyr part and put a gist below. The example used dplyr to manipulate and summarize the human annotation gff file downloaded here http://useast.ensembl.org/info/data/ftp/index.html.

I downloaded the ensemble 74 build which is different from the 75 build in the example of the book.
I believe the gencode v19 annotation file is based on ensemble 74.

I also included comments in the R code describing how to do the same job with linux command line.

Thursday, September 11, 2014

make a soft link

I got to know the ln linux command when I was reading The linux command line. I did not realize why I need this command to make links to the same data or file.

One situation I faced is that I have a lot of bam files with long names in one folder, and when I execute a program in another folder, I have to pass the program a full path to the bam files.

while you think the TAB auto-completion helps most of the time, it is still too much to specify the full path + long names (ugly). So I want to make a copy of the bam files in the current working director to save me from typing the full path. This is where ln command comes to rescue.

The other scenario is when you have multiple executable files in one folder ~/myprogram/bin, you want to execute the program from anywhere, you could link it to /usr/local/bin, or you can just copy the executable to /usr/local/bin or you can add ~/myprogram/bin to $PATH. see a link here http://unix.stackexchange.com/questions/116508/adding-to-path-vs-linking-from-bin

Let me demonstrate the usage of ln

look at the man page

man ln

# I have two files in the playground directory and make two directories

mkdir dir1 dir2
ls

output: tommy@tommy-ThinkPad-T420[playground] ls [ 2:32PM]
dir1 dir2 genes_h.txt genes.txt

I then move the two txt files into dir1 and redirect to dir2 and make soft links of files in dir1 to dir2

mv *txt dir1
cd dir2
ln -s ../dir1/*txt .
ls

if we go to dir1 and make a softlink from there

cd ../dir1
ln -s ./*txt ../dir2

The links will not work (too many levels of symbolic levels error message if you want to use the link ), instead you need to specify the full path

ln -s $PWD/*txt ../dir2

see links here http://www.thegeekstuff.com/2010/10/linux-ln-command-examples/
and http://www.thegeekstuff.com/2010/10/linux-ln-command-examples/

Wednesday, September 10, 2014

converting gene ids using bioconductor with biomaRt and annotation packages

I had a post using mygene to convert gene ids. Bioconductor can do the same job.
I put a gist on github.

For more examples see posts from Dave Tang:
http://davetang.org/muse/2013/12/16/bioconductor-annotation-packages/
http://davetang.org/muse/2013/05/23/using-the-bioconductor-annotation-packages/
http://davetang.org/muse/2013/11/25/thoughts-converting-gene-identifiers/

Wednesday, September 3, 2014

mapping gene ids with mygene

Mapping gene ids is one of the routine jobs for bioinformatics. I was aware of several ways to do it including Biomart.

Update on 10/30/14, a mygene bioconductor package is online http://bioconductor.org/packages/release/bioc/html/mygene.html

Recently I got to know mygene, a python wrapper for the mygene.info services to map gene ids.
I found it very handy to convert gene ids. see a gist below.

To use it, cat input.txt | python geneSymbol2Entrez.py > output.txt
or python geneSymbol2Entrez.py input.txt > output.txt where input.txt contains one gene name in each line. pretty neat!

Thursday, August 14, 2014

rename a bunch of files with bash by regular expression

I am at the MSU NGS 2014 course. My colleague wanted to follow the khmer protocol with his own data, but one of the steps has to use a certain file name convention.

In the protocol it requires fastq files listed as: *001_R1.fastq.gz

001 is the replicate number, it can be 002 or 003 or any number of replicates you have. ( for RNA-seq, sequence as many as biological samples as possible !)

R1 is the pair-end reads 1, it can be R2

What he has is something like:

1_egg_r1_01_sub.fastq.gz

1 is the stage of the egg. He sequenced 4 eggs, so he has 1_egg, 2_egg., 3_egg and 4_egg

r1 is the pair-end reads 1

01 is the first replicates. He has two replicates for each egg.

Basically, he wants to rename these files to the khmer convention.

This problem gets down to writing a regular expression.

To recapture the problem, I made some dummy files:

mkdir foo && cd foo

I have a txt file contains the names of the file:

foo$ cat files.txt

1egg_r1_01_sub.fastq.gz

1egg_r2_01_sub.fastq.gz

1egg_r1_02_sub.fastq.gz

1egg_r2_02_sub.fastq.gz

2egg_r1_01_sub.fastq.gz

2egg_r2_01_sub.fastq.gz

2egg_r1_02_sub.fastq.gz

2egg_r2_02_sub.fastq.gz

3egg_r1_01_sub.fastq.gz

3egg_r2_01_sub.fastq.gz

3egg_r1_02_sub.fastq.gz

3egg_r2_02_sub.fastq.gz

4egg_r1_01_sub.fastq.gz

4egg_r2_01_sub.fastq.gz

4egg_r1_02_sub.fastq.gz

4egg_r2_02_sub.fastq.gz

Now I want to make dummy files with the names in this file.

one can make the dummy files in a fly also.

=====update on 08/26/14======
one can use the {} expansion to create the dummy files

tommy@tommy-ThinkPad-T420[foo] touch {1,2,3,4}_r{1,2}_0{1,2}_sub.fastq.gz
tommy@tommy-ThinkPad-T420[foo] ls [ 3:45PM]
1_r1_01_sub.fastq.gz 2_r2_01_sub.fastq.gz 4_r1_01_sub.fastq.gz
1_r1_02_sub.fastq.gz 2_r2_02_sub.fastq.gz 4_r1_02_sub.fastq.gz
1_r2_01_sub.fastq.gz 3_r1_01_sub.fastq.gz 4_r2_01_sub.fastq.gz
1_r2_02_sub.fastq.gz 3_r1_02_sub.fastq.gz 4_r2_02_sub.fastq.gz
2_r1_01_sub.fastq.gz 3_r2_01_sub.fastq.gz
2_r1_02_sub.fastq.gz 3_r2_02_sub.fastq.gz

========================

The difference of make_dummy_file.sh and make_dummy_file_1.sh is that I specified shebang line in the make_dummy_file.sh script to tell the bash that it is a bash script, to invoke it: ./make_dummy_file.sh files.txt

In contrast, to invoke the other two which I did not specify the shebang: bash make_dummy_file_1.sh bash make_dummy_file_2.sh

Rename the files with regular expression by either using sed or rename command

the rename command use the perl regular expression. use \ to escape $.
the sed command need to escape the () which are used to capture the back reference
before:
tommy@tommy-ThinkPad-T420:~/foo$ ls 1_egg_r1_01_sub.fastq.gz 2_egg_r1_01_sub.fastq.gz 3_egg_r1_01_sub.fastq.gz 4_egg_r1_01_sub.fastq.gz copy make_dummy_file_1.sh 1_egg_r1_02_sub.fastq.gz 2_egg_r1_02_sub.fastq.gz 3_egg_r1_02_sub.fastq.gz 4_egg_r1_02_sub.fastq.gz dummy make_dummy_file_2.sh 1_egg_r2_01_sub.fastq.gz 2_egg_r2_01_sub.fastq.gz 3_egg_r2_01_sub.fastq.gz 4_egg_r2_01_sub.fastq.gz files.txt rename.sh 1_egg_r2_02_sub.fastq.gz 2_egg_r2_02_sub.fastq.gz 3_egg_r2_02_sub.fastq.gz 4_egg_r2_02_sub.fastq.gz make_dummy_file.sh rename_one_liner.sh

after:
tommy@tommy-ThinkPad-T420:~/foo$ ls 1egg_R1_001.fastq.gz 2egg_R1_001.fastq.gz 3egg_R1_001.fastq.gz 4egg_R1_001.fastq.gz copy make_dummy_file_1.sh 1egg_R1_002.fastq.gz 2egg_R1_002.fastq.gz 3egg_R1_002.fastq.gz 4egg_R1_002.fastq.gz dummy make_dummy_file_2.sh 1egg_R2_001.fastq.gz 2egg_R2_001.fastq.gz 3egg_R2_001.fastq.gz 4egg_R2_001.fastq.gz files.txt rename.sh 1egg_R2_002.fastq.gz 2egg_R2_002.fastq.gz 3egg_R2_002.fastq.gz 4egg_R2_002.fastq.gz make_dummy_file.sh rename_one_liner.sh

References: http://stackoverflow.com/questions/399078/what-special-characters-must-be-escaped-in-regular-expressions

http://stackoverflow.com/questions/10929453/bash-scripting-read-file-line-by-line
https://www.cs.tut.fi/~jkorpela/perl/regexp.html

Friday, August 8, 2014

R commands basics

we are on day 5 of the MSU NGS course. This morning, Ian Dworkin introduced some basic R.
I found it refreshing and put a gist below.
Pick one language, and learn it well!
pick up a dataset, and play with it! Happy coding!
By the way, the food here at KBS is amazing, I am gaining weight :)

Thursday, August 7, 2014

Understanding the Forward strand and Reverse strand and the coordinates systems

we are on day 4 of the MSU NGS course. In the morning, instructor Istvan introduced Genomic Intervals. To understand the coordinates system, one needs to understand the strandness of DNA.

sense strand is the coding strand
anti-sense strand is the reverse-complementary strand of the coding strand
see details below:
https://www.biostars.org/p/3423/
https://www.biostars.org/p/3908/
again, everything is on biostar :)

I drew a picture to better understand it

remember:
coordinates are reported 5'---> 3' forward strand
transcription occurs from 5' to 3'
forward/plus strand and reverse/reverse strand are designated arbitrarily.
Imagine that you can flip over the example I drew, then gene A would be in minus strand.

# 0-based and 1-based coordinates system

0 based and 1 based coordinates cheat sheet
https://www.biostars.org/p/84686/

various formats: http://genome.ucsc.edu/FAQ/FAQformat.html
GFF3 specification: http://www.sequenceontology.org/gff3.shtml
0-based formats:BED, wiggle, BEDGRAPH
1-based formats: GFF, GTF, GBK (genebank file), SAM, VCF

# lift over coordinates
lift-over between different versions of genome https://genome.ucsc.edu/util.html
Generally do not do it, just map to the right version of interest.
By the way, the latest human genome GRCh38 is released: http://www.ensembl.info/blog/2014/08/07/ensembl-76-has-been-released/