A wet-dry hybrid biologist's take on genetics and genomics. Mostly is about Linux, R, python, reproducible research, open science and NGS. Grab my book to transform yourself to a computational biologist https://divingintogeneticsandgenomics.ck.page/
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
It is a very simple script but several things I want to mention:
1. what is the fast string concatenation method in python?
in the first function, we tend to write something like:
out_seq = ' '
for base in reversed(seq):
out_seq += seq_dict[base]
but it turned out to be very slow. Read this post for more details http://www.skymind.com/~ocrow/python_string/
the list comprehension is the fastest and most elegant way to concatenate a string.
Nucleotide Code: Base:
---------------- -----
A.................Adenine
C.................Cytosine
G.................Guanine
T (or U)..........Thymine (or Uracil)
R.................A or G
Y.................C or T
S.................G or C
W.................A or T
K.................G or T
M.................A or C
B.................C or G or T
D.................A or G or T
H.................A or C or T
V.................A or C or G
N.................any base
. or -............gap
GRO-seq (Global run on sequencing) is a technique to capture nascent RNAs in a global manner. Different from RNA-seq which sequences the matured mRNA to infer gene expression level, GRO-seq sequences the newly generated short RNAs to infer RNApol2 binding or the production of enhancer RNAs.
It was first carried out by John T.Lis at Cornell University. See their science paper in 2008: Nascent RNA Sequencing Reveals Widespread Pausing and Divergent Initiation at Human Promoters http://www.ncbi.nlm.nih.gov/pubmed/19056941
Since then, many labs use this technique to study their own subjects. Michael G. Rosenfeldfrom UCSD just published a nature paper in June,
Functional roles of enhancer RNAs for oestrogen-dependent transcriptional activation
although they have a bigwig file there which one can direct upload to UCSC genome browser http://www.biostars.org/p/42844/ or local IGV to visualize, the data were mapped to hg18.
I want to have a hg19 version. So, I went to download the raw sequence sra file
Essentially, the sequences are mapped back to genomic DNA as you see.
It took me around 30 mins to finish the mapping which is pretty fast compared with RNA-seq data.
using Tophat, same configuration as the pbs file above, it takes me 50hrs to finish mapping. RNA-seq data are bigger (~25Gb fastq) and more complicated when mapping (figure out exon-exon junctions).
Now, I have the sam file after bowtie mapping.
Next, I want to generate a bedGraph file for visualization.
I strongly recommend you read the tutorials to understand the various sequencing techniques. Although the formatting is not that great, it is still pretty informative.
first make Tag directory
makeTagDirectory MCF7_gro_seq_ctrl/ SRR817000.sam
then make the bedgraph file for plus and minus strand respectively:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
The first track is the minus-strand bedgraph got from Homer
the second track is the plus-strand bedgraph got from Homer
the third track is the plus-strand bedgraph got from bedtools genomecov
I see clearly short RNAs spiking at the promoter of VEGFa, suggesting that RNApol2 paused at promoter and does not undergo productive elongation. This gene is actively transcribed with full-length mRNA, but the promoter is still paused. ( read the 2008 science GRO-seq paper for more details of different groups of genes)
one thing I do noticed is that the bedgraph file from bedtools is bigger than the one from Homer 87M vs 16M, and if I zoom in I saw some small differences:
It looks like the bedgraph from bedtools exhibits a more detailed view of the data, but overall either bedgraph file would be fine for a quick inspection of interested genes.( I guess Homer uses some bigger bin for calculating or merge bins with small difference together to save space).
I have been using ubuntu 12.04LTS for one year, and I really love it and do not look back at windows at all. Currently, I have dual booting system for linux and windows, but now I barely fire the windows.
Most commonly used ones for me are:
head, tail, wc, tr, sort, uniq, cat, cut, paste, join, grep(or the more powerful ack), find, xargs, comm, diff, awk, sed etc
Everyday, I learn new stuffs.
yesterday, from twitter, I learned this one:
print hidden characters ( cat -v is not what I want) tommy@tommy-ThinkPad-T420:~$ sed -n l contig >contig1$ ATCGGGTC$ >contig2$ GCTCGTTCAA$ >contig3$ TACGGGGT$
it can also print out the tabs as \t
I have a bed file that contains the Transcription start sites information, some genes have multiple records because they have isoforms, but I only want to use the unique TSSs for my downstream analysis. of course, you can do cut -f1-3 | sort | uniq, but I just show you another way to do it by awk
I think this could be solved using a perl script. I am not able to do this since i am a beginner. Can someone help me out? A simple perl script is good enough for me^^ "
The question is on very basic text manipulation, and I quickly wrote a python script for that: import csv reader = csv.reader(open("GO.txt","r"), delimiter="\t") new={} for row in reader: if row[0] not in new.keys(): new[row[0]] = [row[1]] else: new[row[0]].append(row[1]) with open("wego.txt","w") as f: for key, value in sorted(new.items()): f.write(key+"\t"+"\t".join(value)+"\n")
update 10/19/13, I used the gist in github to display the code
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
the question is a little bit different from the one above:
"I have a large file containing data like this:
a 23
b 8
a 22
b 1
I want to be able to get this:
a 45
b 9
"
awk array is very handy for that purpose see a post here http://www.thegeekstuff.com/2010/03/awk-arrays-explained-with-5-practical-examples/:
let's first creat the file: tommy@tommy-ThinkPad-T420:~$ cat > foo.txt a23 b8 a22 b1 #control+D to indicate cat this is the end of the file. tommy@tommy-ThinkPad-T420:~$ cat foo.txt | awk '{new[$1]+=$2}END{for (key in new) print key, new[key]}' a 45 b 9 #or you can redirect the result to a new file
I do not know whether the array value can hold multiple strings like a list in python.
-------------------------------------------
updated on 10/21/13, an awk one liner can solve this problem
tommy@tommy-ThinkPad-T420:~$ cat foo.txt | awk '{ if (new[$1]) new[$1]=new[$1]"\t"$2; else new[$1]=$2;} END { for (i in new) print i, new[i]}' OFS="\t" a2322 b81
5 layers of expertise in bioinformatics were mentioned:
"Layer 1 – Using web to analyze biological data
Layer 2 – Ability to install and run new programs
Layer 3 – Writing own scripts for analysis in PERL, python or R
Layer 4 – High level coding in C/C++/Java for implementing existing algorithms or modifying existing codes for new functionality
Layer 5 – Thinking mathematically, developing own algorithms and implementing in C/C++/Java"
level 1-3 are required for all the biologists, while level 4-5 are left to experts. I am trying to accomplish level 3, and that will be good enough for my research:)
Box 1: Does every new biology PhD student need to learn how to program?
Durbin: That's a little like asking is it good for a molecular biologist to know chemistry. I would say that computation is now as important to biology as chemistry is. Both are useful background knowledge. Data manipulation and use of information are part of the technology of biology research now. Knowing how to program also gives people some idea about what's going on inside data analysis. It helps them appreciate what they can and can't expect from data analysis software.
Trapnell: It's probably not just that experimental biologists need to program, but it's also enormously helpful when computational folks learn how to do experiments. For me, for example, coming from a computer science background, the opposite way of thinking was hard to learn. How do I learn to argue with wet-lab data? How do I learn what to trust, what to distrust, how to cross-validate things? That's a radically different way of thinking when you're used to proofs and writing code and validating it on a computer.
Krzywinski: To some, the answer might be “no” because that's left to the experts, to the people downstairs who sit in front of a computer. But a similar question would be: does every graduate student in biology need to learn grammar? Clearly, yes. Do they all need to learn to speak? Clearly, yes. We just don't leave it to the literature experts. That's because we need to communicate. Do students need to tie their shoes? Yes. It has now come to the point where using a computer is as essential as brushing your teeth. If you want some kind of a competitive edge, you're going to want to make as much use of that computer as you can. The complexity of the task at hand will mean that canned solutions don't exist. It means that if you're using a canned solution, you're not at the edge of research.
Robinson: Yes. Even if they don't program in their research, they will have to use software and likely will communicate with software developers. It helps tremendously to have some basic knowledge. Additionally, in the research environment, the ability to do basic tasks in the Linux/Unix environment is essential.
Rasband: All scientists should learn how to program."
I have a bed file which contains promoter regions. I want to get 1000 random promoters from it. First I think I would write a python script using the random module, but before I do it, I did a quick google search since it is such an ordinary job, there may be tools out there already.
I have done a lot of ChIP-seq data analysis, but most of them are just mapping to reference genome and then calling peaks (use MACS from Shirley liu's lab at Harvard). A little bit more is to intersect different peak bed files with bedtools and to make heatmaps.
The reads mapping is much simpler and faster than RNA-seq data. The data are smaller (~5Gb compare to RNA-seq ~25Gb). I recently asked myself how to compare different ChIP-seq data for different conditions ( control vs knockdown, different developmental stages etc).
The available packages are: "- ChIPDiff, which you already mentioned but which I haven't tried, but it was actually developed for histone marks so I'd be surprised if it didn't work for those?! I recall someone saying that there was some other issue with it (sorry I can't be more specific)
- You can also use edgeR and DESeq directly. The DESeq paper shows you how to re-analyze differential TF binding data in the Kasowski et al Science paper (http://www.sciencemag.org/content/328/5975/232.short). "
"The fact that in the above example and many others throughout the genome that there are overlapping exons. If we don't collapse them, then reads assigned to those overlapping regions could be counted multiple times and provide an inaccurate profile of expression.
The solution is to merge overlapping exons right? Well yes, but there is another major caveat in that exons from different genes can sometimes have overlapping exons and we can't collapse exons from different genes, otherwise we wouldn't be able to distinguish them. The solution is to merge exons which only belong to the same gene. If there is a read which aligns to a region with more than one annotation, then it is counted twice because we can't differentiate between the two genes at that position."
I am going to stick to the nature protocol using htseq-count (Simon Anders wrote this python package HTSeq), but please keep that in mind. Different sources of gtf files are a little bit different.
GFF attribute to be used as feature ID. Several GFF lines with the same feature ID will be considered as parts of the same feature. The feature ID is used to identity the counts in the output table. The default, suitable for RNA-SEq and Ensembl GTF files, is gene_id.
on the HPC cluster:
module load htseq
htseq-count -s no -a 10 -t exon -i gene_name my_sn.sam gencode.v18.annotation.gtf > my.count
the sam file is around 25Gb, it took around 1.5 hr to finish the counting for each file.
Finally, I got the count table for all of them.
first several lines of the count file
0R7H2P 20
5S_rRNA 54
7SK 170
A1BG 14
A1BG-AS1 64
A1CF 292
A2M 252
A2M-AS1 14
A2ML1 352
A2ML1-AS1 14
A2ML1-AS2 4
A2MP1 91
Now, I am ready to use DESeq R package for the downstream analysis. I will post later.