Diving into Genetics and Genomics: August 2014

This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Thursday, August 14, 2014

rename a bunch of files with bash by regular expression

I am at the MSU NGS 2014 course. My colleague wanted to follow the khmer protocol with his own data, but one of the steps requires a specific file naming convention.

The protocol requires fastq files named like: *_R1_001.fastq.gz

001 is the replicate number; it can be 002, 003 and so on, for as many replicates as you have. (For RNA-seq, sequence as many biological samples as possible!)

R1 marks read 1 of a paired-end run; it can also be R2.

What he has is something like:

1_egg_r1_01_sub.fastq.gz

1 is the stage of the egg. He sequenced 4 eggs, so he has 1_egg, 2_egg, 3_egg and 4_egg.

r1 marks read 1 of the paired-end run.

01 is the first replicate. He has two replicates for each egg.

Basically, he wants to rename these files to the khmer convention.

This problem comes down to writing a regular expression.

To reproduce the problem, I made some dummy files:

mkdir foo && cd foo

I have a txt file that contains the names of the files:

foo$ cat files.txt

1egg_r1_01_sub.fastq.gz

1egg_r2_01_sub.fastq.gz

1egg_r1_02_sub.fastq.gz

1egg_r2_02_sub.fastq.gz

2egg_r1_01_sub.fastq.gz

2egg_r2_01_sub.fastq.gz

2egg_r1_02_sub.fastq.gz

2egg_r2_02_sub.fastq.gz

3egg_r1_01_sub.fastq.gz

3egg_r2_01_sub.fastq.gz

3egg_r1_02_sub.fastq.gz

3egg_r2_02_sub.fastq.gz

4egg_r1_01_sub.fastq.gz

4egg_r2_01_sub.fastq.gz

4egg_r1_02_sub.fastq.gz

4egg_r2_02_sub.fastq.gz

Now I want to make dummy files with the names in this file.
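
The original script is not reproduced here, so what follows is only a minimal sketch of one way to do it: read the list line by line and touch each name. It assumes a scratch directory made with mktemp and a shortened two-line stand-in for files.txt; both are illustrative choices, not part of the original setup.

```bash
#!/bin/bash
# Sketch: create one empty file per name listed in a text file.
# Everything happens in a scratch directory (an assumption for safety).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Shortened stand-in for files.txt
printf '%s\n' 1egg_r1_01_sub.fastq.gz 1egg_r2_01_sub.fastq.gz > files.txt

# Read the list line by line; IFS= and -r keep each name exactly as written
while IFS= read -r name; do
    if [ -n "$name" ]; then
        touch "$name"
    fi
done < files.txt

ls
```

Redirecting the file into the loop with "done < files.txt" avoids a separate cat process and keeps the loop in the current shell.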

One can also make the dummy files on the fly.

=====update on 08/26/14======
One can use bash brace expansion ({}) to create the dummy files:

tommy@tommy-ThinkPad-T420[foo] touch {1,2,3,4}_r{1,2}_0{1,2}_sub.fastq.gz
tommy@tommy-ThinkPad-T420[foo] ls
1_r1_01_sub.fastq.gz 2_r2_01_sub.fastq.gz 4_r1_01_sub.fastq.gz
1_r1_02_sub.fastq.gz 2_r2_02_sub.fastq.gz 4_r1_02_sub.fastq.gz
1_r2_01_sub.fastq.gz 3_r1_01_sub.fastq.gz 4_r2_01_sub.fastq.gz
1_r2_02_sub.fastq.gz 3_r1_02_sub.fastq.gz 4_r2_02_sub.fastq.gz
2_r1_01_sub.fastq.gz 3_r2_01_sub.fastq.gz
2_r1_02_sub.fastq.gz 3_r2_02_sub.fastq.gz

========================

make_dummy_file.sh and make_dummy_file_1.sh differ in one respect: make_dummy_file.sh starts with a shebang line, which tells the system to run it with bash. Once the script is executable, invoke it directly: ./make_dummy_file.sh files.txt

The other two scripts have no shebang line, so pass them to bash explicitly: bash make_dummy_file_1.sh and bash make_dummy_file_2.sh
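
The two ways of invoking a script can be tried out with throwaway files. This is a small sketch, not the original scripts: with_shebang.sh and no_shebang.sh are hypothetical names, and the scratch directory is an assumption.

```bash
# Sketch: run one script through its shebang line and one through bash.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# with_shebang.sh (hypothetical name) declares its interpreter on line 1
printf '%s\n' '#!/bin/bash' 'echo "ran with shebang"' > with_shebang.sh
# no_shebang.sh (hypothetical name) has no interpreter line
printf '%s\n' 'echo "ran without shebang"' > no_shebang.sh

# A script must be executable before it can be invoked as ./name
chmod +x with_shebang.sh
out1=$(./with_shebang.sh)

# Without a shebang, hand the file to bash explicitly
out2=$(bash no_shebang.sh)

echo "$out1"
echo "$out2"
```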

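Rename the files with a regular expression, using either the sed or the rename command

The rename command uses Perl regular expressions. Escape $ with \ where the shell would otherwise expand it, for example when the Perl expression sits inside double quotes.
With sed, the parentheses that capture groups for back references must be escaped as \( and \) in its default basic-regex mode.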
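
The original rename scripts are not shown here, so the following is only a sketch of the sed approach. It assumes input names of the form 1_egg_r1_01_sub.fastq.gz and target names of the form 1egg_R1_001.fastq.gz, and it works on two dummy files in a scratch directory.

```bash
# Sketch: rename files by computing each new name with sed, then calling mv.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch 1_egg_r1_01_sub.fastq.gz 1_egg_r2_02_sub.fastq.gz

for old in *_egg_r[12]_[0-9][0-9]_sub.fastq.gz; do
    # Basic regex: \( \) capture groups; \1 \2 \3 are the back references
    # \1 = egg stage, \2 = read number, \3 = replicate number
    new=$(printf '%s\n' "$old" |
          sed 's/^\([0-9]*\)_egg_r\([12]\)_\([0-9]*\)_sub\.fastq\.gz$/\1egg_R\2_0\3.fastq.gz/')
    # Only rename when the pattern actually changed the name
    if [ "$new" != "$old" ]; then
        mv -- "$old" "$new"
    fi
done

ls
```

With the Perl-based rename, the same capture groups would be written with unescaped parentheses and referenced as $1, $2 and $3.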
before:
tommy@tommy-ThinkPad-T420:~/foo$ ls
1_egg_r1_01_sub.fastq.gz  2_egg_r1_01_sub.fastq.gz  3_egg_r1_01_sub.fastq.gz  4_egg_r1_01_sub.fastq.gz  copy                make_dummy_file_1.sh
1_egg_r1_02_sub.fastq.gz  2_egg_r1_02_sub.fastq.gz  3_egg_r1_02_sub.fastq.gz  4_egg_r1_02_sub.fastq.gz  dummy               make_dummy_file_2.sh
1_egg_r2_01_sub.fastq.gz  2_egg_r2_01_sub.fastq.gz  3_egg_r2_01_sub.fastq.gz  4_egg_r2_01_sub.fastq.gz  files.txt           rename.sh
1_egg_r2_02_sub.fastq.gz  2_egg_r2_02_sub.fastq.gz  3_egg_r2_02_sub.fastq.gz  4_egg_r2_02_sub.fastq.gz  make_dummy_file.sh  rename_one_liner.sh

after:
tommy@tommy-ThinkPad-T420:~/foo$ ls
1egg_R1_001.fastq.gz  2egg_R1_001.fastq.gz  3egg_R1_001.fastq.gz  4egg_R1_001.fastq.gz  copy                make_dummy_file_1.sh
1egg_R1_002.fastq.gz  2egg_R1_002.fastq.gz  3egg_R1_002.fastq.gz  4egg_R1_002.fastq.gz  dummy               make_dummy_file_2.sh
1egg_R2_001.fastq.gz  2egg_R2_001.fastq.gz  3egg_R2_001.fastq.gz  4egg_R2_001.fastq.gz  files.txt           rename.sh
1egg_R2_002.fastq.gz  2egg_R2_002.fastq.gz  3egg_R2_002.fastq.gz  4egg_R2_002.fastq.gz  make_dummy_file.sh  rename_one_liner.sh

References:
http://stackoverflow.com/questions/399078/what-special-characters-must-be-escaped-in-regular-expressions
http://stackoverflow.com/questions/10929453/bash-scripting-read-file-line-by-line
https://www.cs.tut.fi/~jkorpela/perl/regexp.html

Friday, August 8, 2014

R commands basics

We are on day 5 of the MSU NGS course. This morning, Ian Dworkin introduced some basic R.
I found it refreshing and put a gist below.
Pick one language, and learn it well!
Pick a dataset, and play with it! Happy coding!
By the way, the food here at KBS is amazing, I am gaining weight :)

Thursday, August 7, 2014

Understanding the forward strand, the reverse strand and the coordinate systems

We are on day 4 of the MSU NGS course. In the morning, instructor Istvan introduced genomic intervals. To understand the coordinate systems, one needs to understand the strandedness of DNA.

The sense strand is the coding strand.
The anti-sense strand is the reverse complement of the coding strand.
See details below:
https://www.biostars.org/p/3423/
https://www.biostars.org/p/3908/
Again, everything is on Biostars :)
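
As a small illustration of "reverse complement", the anti-sense strand can be derived from the sense strand on the command line. This is a sketch only: the six-base sequence is made up, and it assumes uppercase A, C, G and T with no ambiguity codes.

```bash
# Hypothetical coding-strand (sense) sequence, written 5' to 3'
sense="ATGGCC"

# rev reverses the string; tr swaps each base for its complement.
# The result is the anti-sense strand, also read 5' to 3'.
antisense=$(printf '%s\n' "$sense" | rev | tr 'ACGT' 'TGCA')

echo "$antisense"   # prints GGCCAT
```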

I drew a picture to better understand it

Remember:
Coordinates are reported 5' ---> 3' on the forward strand.
Transcription occurs from 5' to 3'.
The forward/plus strand and the reverse/minus strand are designated arbitrarily.
Imagine flipping over the example I drew: gene A would then be on the minus strand.

# 0-based and 1-based coordinates system

0 based and 1 based coordinates cheat sheet
https://www.biostars.org/p/84686/

various formats: http://genome.ucsc.edu/FAQ/FAQformat.html
GFF3 specification: http://www.sequenceontology.org/gff3.shtml
0-based formats: BED, bedGraph
1-based formats: GFF, GTF, GBK (GenBank file), SAM, VCF, wiggle (variableStep and fixedStep)
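
The practical consequence shows up when converting between the two families. The sketch below turns one made-up GFF-style record (1-based, end included) into a BED-style record (0-based, end excluded): the start drops by one and the end stays the same. The record and the choice of BED columns are assumptions for illustration.

```bash
# Hypothetical GFF-style record: bases 11 through 20, 1-based and end-inclusive
gff_line=$(printf 'chr1\tsrc\tgene\t11\t20\t.\t+\t.\tID=geneA')

# BED is 0-based and end-exclusive: subtract 1 from the start, keep the end.
# Output columns: chrom, start, end, name, score, strand
bed_line=$(printf '%s\n' "$gff_line" |
    awk 'BEGIN { FS = OFS = "\t" } { print $1, $4 - 1, $5, $9, $6, $7 }')

echo "$bed_line"
```

Both records describe the same 10 bases: 20 - 11 + 1 in the 1-based form and 20 - 10 in the 0-based form.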

# lift over coordinates
Lift-over converts coordinates between different versions of a genome: https://genome.ucsc.edu/util.html
Generally, avoid it; just map your reads to the genome version you are interested in.
By the way, the latest human genome assembly, GRCh38, is now available in Ensembl 76: http://www.ensembl.info/blog/2014/08/07/ensembl-76-has-been-released/

Wednesday, August 6, 2014

linux commands basics

I am attending the NGS course at MSU. This is a great course with great instructors and friendly colleagues.
I highly recommend this course to everyone. http://bioinformatics.msu.edu/ngs-summer-course-2014

This morning, we learned SNP calling with samtools and the SAM file specification (I will write another post on SNP calling). In the evening, TA Elijah gave an awesome introduction to Linux commands.
Personally, I think this should be taught on the first day of the course. (I am already pretty familiar with basic Linux commands, but they do cause a lot of frustration for beginners.)

I took notes and put the commands that were taught in a gist; see below and enjoy Linux commands!

linux basics by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
