We usually want to keep the header in place when we do something to the body of a text file. You can find a body function here, which only accepts streams and assumes a one-line header. Subshells and awk can do the same job and are potentially more flexible. The original credit goes to Aaron Quinlan; see the gist below to sort VCF files in natural chromosome order (chr1, chr2, chr3... rather than chr1, chr10, chr11...).
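The gist's trick, as I understand it, can be sketched like this (the filenames are placeholders; -V is GNU sort's natural "version" order):

```shell
# a tiny dummy VCF for illustration
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\nchr10\t100\nchr2\t50\nchr1\t200\n' > in.vcf

# keep the '#' header lines first, then version-sort the records so that
# chr2 sorts before chr10 (GNU sort's -V natural order), then by position
(grep '^#' in.vcf; grep -v '^#' in.vcf | sort -k1,1V -k2,2n) > in.sorted.vcf
cat in.sorted.vcf
```

The subshell keeps the two grep streams in one redirection, so the header never gets mixed into the sort.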
A wet-dry hybrid biologist's take on genetics and genomics. It is mostly about Linux, R, Python, reproducible research, open science and NGS. Grab my book to transform yourself into a computational biologist: https://divingintogeneticsandgenomics.ck.page/
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Tuesday, August 25, 2015
Thursday, August 13, 2015
2 cents on coding from a bioinformatics beginner
I was trained as a wet biologist, and I started learning to code in April 2012 with my first ever Python book: Python Programming for Absolute Beginners. I still remember the days when, after work, I would sit in front of the computer and go through the book until 10 pm every day.
It was not that practical in terms of translating what I had learned into what I wanted to analyze in the lab, but still, I had entered a new world!
In the fall semester of 2012, I took a beginner bioinformatics course at the University of Florida, using Practical Computing for Biologists as a reference book. It is a great book: it taught me regular expressions, Unix commands, and some Python directly related to biology. I was deeply attracted by the beauty of code and was surprised/satisfied by how useful learning to code can be.
Lessons I learned from that class:
Regular expressions are extremely useful! At a minimum, learn the basics; you can then always google and find solutions from there.
Bioinformatics is a field that evolves so fast that many tools you use today may become obsolete tomorrow. However, Unix skills will never fade. I urge every wet biologist like me to learn Unix commands first. It will take time to become fluent in the terminal. It took me two years to feel really comfortable working there, so stop worrying and take your time.
The statistical programming language R is very popular in bioinformatics. I started using R so that I could take advantage of the rich packages in Bioconductor. I started from the basics with The Art of R Programming. After getting the basics, learning to use packages like dplyr and ggplot2 will greatly reduce the complexity of your code and enhance your productivity. Remarkably, these awesome packages were developed by the same person: Hadley Wickham.
Learn some git. Git is a version control system that tracks your code. I am still a beginner, but I have realized how important it is to version control my code. For this reason, I have a GitHub repo where I put my code. I am still learning git every day.
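The day-to-day git loop is small; a sketch (the file name, identity, and commit message are made up):

```shell
# initialize a repository once
git init my-analysis && cd my-analysis
git config user.name  "Tommy"             # set your identity once per machine
git config user.email "tommy@example.com"

# then repeat this add/commit cycle as you work
echo 'library(dplyr)' > analysis.R
git add analysis.R
git commit -m "add first analysis script"
git log --oneline                         # one line per commit
```

Each commit is a snapshot you can always return to, which is exactly the safety net a beginner needs.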
When a project grows big, you need to manage it well. There are several resources that I recommend reading before starting any project:
2. Designing project by Vince Buffalo. Vince Buffalo also has a book which I highly recommend to everyone: Bioinformatics Data Skills. It covers many of the points I want to make in this post. I might write a review of it after finishing all the chapters.
The take-home message for me is that it is not enough to just run the code, get some results, and publish them.
One needs to be aware that:
1. Computers make mistakes. They can give you nonsense results and exit without error, so test extensively before running your code.
2. Share your code. Even if your code is correct, you should share it so that other people can look at it and perhaps improve it.
3. Make your code reusable. Do not hard-code your scripts. If a script takes a file path as input, make it an argument to the script.
4. Modularize your scripts. Data can come in at different stages of processing. Take ChIP-seq data analysis as an example: if your script processes the data all the way from fastq to the final peaks, you may want to split it into two modules, one for mapping fastq to bam and the other for calling peaks from bam. Modularize your scripts so that someone whose data come in bam format can still use them.
5. Comment your scripts heavily. It will not only help other people understand your code better, but also help the future you understand what you did.
6. Make your analysis reproducible. Each step of your analysis should be documented in a markdown file. I mean every step: yes, every command you type in the terminal to produce an intermediate file needs to be written down. Moreover, how, when, and where you downloaded the data needs to be documented. This will save the future you! Many experienced programmers overlook this point.
I am glad that I have come this far. I love what I am doing now and feel satisfied when I learn new things every day. I want to encourage all the wet biologists out there: believe that you can program too :)
Tuesday, August 4, 2015
linux command join with header
See the ipython notebook here
https://github.com/crazyhottommy/scripts-general-use/blob/master/Shell/Join_command.ipynb
Linux commands are very powerful. They are small, but they do small things well.
Today, I am going to demonstrate a very useful Linux command: join.
join is used to join two files based on a common "ID" column. This is a common task when you have information stored in two files with a common column linking them together.
I first created two dummy tab-delimited files: A.txt and B.txt.
Let's have a look at the files:
In [1]:
cat A.txt
Jack TX 5
John CA 3
Luke WA 4
Tammy FL 2
Tommy FL 1
In [2]:
cat B.txt
1 450
2 300
3 400
5 150
6 600
column 3 in file A and column 1 in file B are common IDs that link these two files together.
Let's do an inner join:
In [3]:
join -1 3 -2 1 A.txt B.txt
5 Jack TX 150
-1 means file 1, which is A.txt; 3 means the third column.
-2 means file 2, which is B.txt; 1 means the first column.
The output is not what we expected: A.txt and B.txt should have 4 rows linked together.
**It turns out the files need to be sorted on the common field first for join to work.** Let's sort them first and then join:
In [4]:
sort -k3,3 A.txt > A.sorted.txt
sort -k1,1 B.txt > B.sorted.txt
join -1 3 -2 1 A.sorted.txt B.sorted.txt
1 Tommy FL 450
2 Tammy FL 300
3 John CA 400
5 Jack TX 150
We can also do a left outer join (-a 1), which also prints the unpairable lines from file 1:
In [5]:
join -1 3 -2 1 -a 1 A.sorted.txt B.sorted.txt
1 Tommy FL 450
2 Tammy FL 300
3 John CA 400
Luke WA 4
5 Jack TX 150
The default input separator is whitespace (space or tab), and the output separator is a space.
We can change the output separator to tab:
In [6]:
join -t$'\t' -1 3 -2 1 -a 1 -o auto A.sorted.txt B.sorted.txt
1 Tommy FL 450
2 Tammy FL 300
3 John CA 400
4 Luke WA
5 Jack TX 150
Let's do a right outer join (-a 2) and fill the empty fields with NA.
Note that only GNU join has the -o auto option.
In [7]:
join -t$'\t' -e "NA" -1 3 -2 1 -a 2 -o auto A.sorted.txt B.sorted.txt
1 Tommy FL 450
2 Tammy FL 300
3 John CA 400
5 Jack TX 150
6 NA NA 600
You may find the intermediate sorted files annoying. One can avoid them using process substitution:
In [8]:
join -1 3 -2 1 -t $'\t' <(sort -k3,3 A.txt) <(sort -k1,1 B.txt)
1 Tommy FL 450
2 Tammy FL 300
3 John CA 400
5 Jack TX 150
Wow! It is like magic! Using process substitution can speed up your workflow.
So far, the files we have used have no headers. What if they contain headers?
Let's make some dummy files first:
In [9]:
printf "name\tstate\tid\n" | cat - A.txt > A_header.txt
printf "id\tsalary\n" | cat - B.txt > B_header.txt
In [10]:
cat A_header.txt
name state id
Jack TX 5
John CA 3
Luke WA 4
Tammy FL 2
Tommy FL 1
In [11]:
cat B_header.txt
id salary
1 450
2 300
3 400
5 150
6 600
Let's join them together:
In [12]:
join -1 3 -2 1 -t $'\t' <(sort -k3,3 A_header.txt) <(sort -k1,1 B_header.txt)
1 Tommy FL 450
2 Tammy FL 300
3 John CA 400
5 Jack TX 150
id name state salary
Unfortunately, after sorting, the header went to the bottom of the file (many times it ends up somewhere in the middle instead). We could always delete the header first, join the files, and then add the header back, but that requires intermediate files. Instead, I will use the body function, which applies a command to everything except the header.
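The body function is not a built-in; here is a minimal sketch of how such a helper can be written (my reconstruction, assuming a one-line header arriving on stdin):

```shell
# body: print the first line (the header) untouched, then run the given
# command on the rest of the stream, e.g. `cat file | body sort -k1,1`
body() {
    IFS= read -r header     # consume the header line from stdin
    printf '%s\n' "$header" # pass it through unchanged
    "$@"                    # run the command on the remaining lines
}
```

Because read consumes only the first line, whatever command follows never sees the header, so sorting leaves it in place.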
In [13]:
join -1 3 -2 1 -t $'\t' <(cat A_header.txt | body sort -k3,3) <(cat B_header.txt | body sort -k1,1)
id name state salary
1 Tommy FL 450
2 Tammy FL 300
3 John CA 400
5 Jack TX 150