Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Tuesday, August 25, 2015

sort with header maintained

We usually want to maintain header when we do something for the body of the text.
You can find a body function here which can only accept streams and assumes only one line header.

Use subshells and awk can do the same job and potentially more flexible.



The original credits go to Aaron Quinlan. see a gist below to sort vcf files in natural chromosome order :chr1 chr2 chr3.... rather than chr1 chr10 chr11....

Thursday, August 13, 2015

2 cents on coding from a bioinformatics beginner


I was trained as a wet biologist and I started learning coding in 2012 April with my first ever python book: python programming for absolute beginners.  I still remember the days that after work I would sit down in front of the computer and go through the book until 10pm everyday.  

It was not that practical in terms of translating what I have learned to what I want to analyze in the lab, but still I have entered into a new world!

In the Fall semester of 2012, I took a beginner bioinformatics course at University of Florida using practical computing for biologists as a reference book.  It is a great book and it taught me regular expression, Unix commands and some python stuffs that directly related to biology. I was deeply attracted by the beauty of codes and was surprised/satisfied that how useful learning coding can be.

Lessons I learned from that class:
Regular expression is extremely useful! At least one needs to know the basics and you can then always google and find solutions there. 

Bioinformatics is a field that evolves so fast that many tools you use may become obsolete tomorrow. However, unix skills will never fade. I urge every wet biologist like me to learn Unix commands first. It will take time for you to be fluent in the terminal. It took me 2 years to feel really confortable working in the terminal, so stop worrying and take your time. 

Statistical programming language R is very popular in the bioinformatics field. I started using R because I can take advantage of the rich packages in bioconductor.  I started from the basics with The art of R programming. After getting the basics, learn to use packages like dplyr, ggplot2 will greatly reduce the complexity of your code and enhancer your productivity. Surprisingly, all these awesome packages were developed by the same person: Hadley Wikham.

Learn some git.  Git is a version control system that tracks your code. I am still a beginner, but I realized how important it is to version control my codes.  For this reason, I have a github repo where I put my codes.  I am still learning git everyday.

When the project grows big, you need to well manage it. There are several resources that I recommend you to read before any project:

      
2.    Designing project by Vince Buffalo Vince Buffalo has a book which I highly recommend for everyone: Bioinformatics data skills. It covers many points that I want to say in this post. I might write a review on it after finishing all the chapters.


The take home message for me is that it is not enough for you to just run the code, get some results and then publish them.

One needs to be aware that:
1.   Computers make mistakes. They can give you non-sense results and exit without error, so make extensive tests before running your code. 

2.   Share your codesEven your codes are correct, you need to share them so that other people can look at them and may improve them.

3.   Make your codes reusable. Do not hard code your scripts. If it takes a file path as input, make it as an argument in your scripts.

4.   Modulate your scripts. Data could come in different stage of formats.  Take ChIP-sequencing data analysis as an example, if you have a script that starts processing the data from fastq to the final peaks.  You may want to modulate your scripts to two modules: one for mapping fastq to bam, and the other for bam to peaks. Modulate your scripts so that one can use your script when the data come in a bam format.

5.  Heavily comment your scripts. It will not only make other people to understand your codes better, but also help the future you to understand what you did.


6.   You need to make your analysis reproducible. Each step of your analysis should be documented in a markdown file. I say every step, yes, every command that you strike in the terminal getting the intermediate files need to be taken down. Moreover, how, when and where did you download the data need to be documented. This will save the future you! Many experienced programmers overlook this point.

I am glad that I have come to this stage. I love what I am doing now and feel satisfied when I learn new things everyday. I want to encourage all the wet biologists: believe you can program as well :)


Tuesday, August 4, 2015

linux command join with header

See the ipython notebook here 

https://github.com/crazyhottommy/scripts-general-use/blob/master/Shell/Join_command.ipynb

linux commands are very powerful.

They are small but can do small things well.
Today, I am going to demonstrate a very useful linux command join.
join is used to join two files based on a common "ID" column.
This is a common task when you have information stored in two files
and have a common column to link them together.
I first created two dummy tab delimited files: A.txt and B.txt
Let's have a look at the files:
In [1]:
cat A.txt
Jack TX 5
John CA 3
Luke WA 4
Tammy FL 2
Tommy FL 1
In [2]:
cat B.txt
1 450
2 300
3 400
5 150
6 600
column 3 in file A and column 1 in file B are common IDs that link these two files together.
Let's do an inner join:
In [3]:
join -1 3 -2 1 A.txt B.txt
5 Jack TX 150
-1 means file 1, which is A file. 3 means the third column.
-2 means file 2, which is B file. 1 means the first column.
The output is not we expected. A file and B file should have 4 rows linked together.
**It turns out files need to be sorted first according to the common
field for join to work.** Let's sort them first and join them:
In [4]:
sort -k3,3 A.txt > A.sorted.txt
sort -k1,1 B.txt > B.sorted.txt
join -1 3 -2 1 A.sorted.txt B.sorted.txt
1 Tommy FL 450
2 Tammy FL 300
3 John CA 400
5 Jack TX 150
we can also do right outer join (-a 1) :
In [5]:
join -1 3 -2 1 -a 1 A.sorted.txt B.sorted.txt
1 Tommy FL 450
2 Tammy FL 300
3 John CA 400
Luke WA 4
5 Jack TX 150
The default input seprator is space or tab, and output seprator is space.
we can change the output seprator to tab:
In [6]:
join -t$'\t' -1 3 -2 1 -a 1 -o auto A.sorted.txt B.sorted.txt
1 Tommy FL 450
2 Tammy FL 300
3 John CA 400
4 Luke WA 
5 Jack TX 150
Let's do a left outer join (-a 2) and fill the empty field with NA.
Note that only GNU join has the auto flag.
In [7]:
join -t$'\t' -e "NA" -1 3 -2 1 -a 2 -o auto A.sorted.txt B.sorted.txt
1 Tommy FL 450
2 Tammy FL 300
3 John CA 400
5 Jack TX 150
6 NA NA 600
You may think the intermediate sorted file is annoying. One can avoid them using
process substitution:
In [8]:
join -1 3 -2 1 -t $'\t' <(sort -k3,3 A.txt) <(sort -k1,1 B.txt)
1 Tommy FL 450
2 Tammy FL 300
3 John CA 400
5 Jack TX 150
Wow! It is so magic! Using process substitution can speed up your workflow.
So far, the files we have are without headers. What if they contain headers?
Let's make some dummy files first:
In [9]:
printf "name\tstate\tid\n" | cat - A.txt > A_header.txt
printf "id\tsalary\n" | cat - B.txt > B_header.txt

In [10]:
cat A_header.txt
name state id
Jack TX 5
John CA 3
Luke WA 4
Tammy FL 2
Tommy FL 1
In [11]:
cat B_header.txt
id salary
1 450
2 300
3 400
5 150
6 600
Let's join them together:
In [12]:
join -1 3 -2 1 -t $'\t' <(sort -k3,3 A_header.txt) <(sort -k1,1 B_header.txt)
1 Tommy FL 450
2 Tammy FL 300
3 John CA 400
5 Jack TX 150
id name state salary
Unfortunately, after sorting, the header went to the bottom of the file. Many times, the header may just go to the middle of the file. We can always delete the header first,
join the file and then add the header back, but we will need to have intermediate files.
I will use body function which will ignore the header.
In [13]:
join -1 3 -2 1 -t $'\t' <(cat A_header.txt | body sort -k3,3) <(cat B_header.txt | body sort -k1,1)
id name state salary
1 Tommy FL 450
2 Tammy FL 300
3 John CA 400
5 Jack TX 150