Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Thursday, August 14, 2014

rename a bunch of files with bash by regular expression

I am at the MSU NGS 2014 course. My colleague wanted to follow the khmer protocol with his own data, but one of the steps has to use a certain file name convention.

In the protocol it requires fastq files listed as:   *001_R1.fastq.gz
001 is the replicate number, it can be 002 or 003 or any number of replicates you have. ( for RNA-seq, sequence as many as biological samples as possible !)
R1 is the pair-end reads 1, it can be R2

What he has is something like:
1_egg_r1_01_sub.fastq.gz

1 is the stage of the egg. He sequenced  4 eggs, so he has 1_egg, 2_egg., 3_egg and 4_egg
r1 is the pair-end reads 1
01 is the first replicates. He has two replicates for each egg.

Basically, he wants to rename these files to the khmer convention.

This problem gets down to writing a regular expression.
To recapture the problem, I made some dummy files:

mkdir foo && cd foo

I have a txt file contains the names of the file:
foo$ cat files.txt 
1egg_r1_01_sub.fastq.gz
1egg_r2_01_sub.fastq.gz
1egg_r1_02_sub.fastq.gz
1egg_r2_02_sub.fastq.gz
2egg_r1_01_sub.fastq.gz
2egg_r2_01_sub.fastq.gz
2egg_r1_02_sub.fastq.gz
2egg_r2_02_sub.fastq.gz
3egg_r1_01_sub.fastq.gz
3egg_r2_01_sub.fastq.gz
3egg_r1_02_sub.fastq.gz
3egg_r2_02_sub.fastq.gz
4egg_r1_01_sub.fastq.gz
4egg_r2_01_sub.fastq.gz
4egg_r1_02_sub.fastq.gz
4egg_r2_02_sub.fastq.gz

Now I want to make dummy files with the names in this file.
one can make the dummy files in a fly also.

=====update on 08/26/14======
one can use the {} expansion to create the dummy files

tommy@tommy-ThinkPad-T420[foo] touch {1,2,3,4}_r{1,2}_0{1,2}_sub.fastq.gz
tommy@tommy-ThinkPad-T420[foo] ls                                     [ 3:45PM]
1_r1_01_sub.fastq.gz  2_r2_01_sub.fastq.gz  4_r1_01_sub.fastq.gz
1_r1_02_sub.fastq.gz  2_r2_02_sub.fastq.gz  4_r1_02_sub.fastq.gz
1_r2_01_sub.fastq.gz  3_r1_01_sub.fastq.gz  4_r2_01_sub.fastq.gz
1_r2_02_sub.fastq.gz  3_r1_02_sub.fastq.gz  4_r2_02_sub.fastq.gz
2_r1_01_sub.fastq.gz  3_r2_01_sub.fastq.gz
2_r1_02_sub.fastq.gz  3_r2_02_sub.fastq.gz




========================
!# /usr/bin/bash
while read name
do
echo "Name read from file - $name"
touch $name
done < $1
while read name
do
echo "Name read from file - $name"
touch $name
done < files.txt
for i in 1 2 3 4
do
for j in 1 2
do
for k in 1 2
do
touch $i\_egg\_r$j\_0$k\_sub.fastq.gz
done
done
done
The difference of make_dummy_file.sh and make_dummy_file_1.sh is that I specified shebang line in the make_dummy_file.sh script to tell the bash that it is a bash script, to invoke it: ./make_dummy_file.sh files.txt

In contrast, to invoke the other two which I did not specify the shebang: bash make_dummy_file_1.sh bash make_dummy_file_2.sh

Rename the files with regular expression by either using sed or rename command
for fspec1 in *.gz
do
#echo $fspec1
fspec2=$(echo ${fspec1} | sed "s/\([1-4]egg\)_r\([1-2]\)_0\([1-2]\)_sub.fastq.gz/\1_R\3_00\2.fastq.gz/")
echo $fspec2
mv ${fspec1} ${fspec2}
done
view raw rename.sh hosted with ❤ by GitHub
rename "s/([1-4]egg)_r([1-2])_0([1-2])_sub.fastq.gz/\$1_R\$3_00\$2.fastq.gz/" *fastq.gz

the rename command use the perl regular expression. use \ to escape $.
the sed command need to escape the () which are used to capture the back reference
before:
tommy@tommy-ThinkPad-T420:~/foo$ ls 1_egg_r1_01_sub.fastq.gz 2_egg_r1_01_sub.fastq.gz 3_egg_r1_01_sub.fastq.gz 4_egg_r1_01_sub.fastq.gz copy make_dummy_file_1.sh 1_egg_r1_02_sub.fastq.gz 2_egg_r1_02_sub.fastq.gz 3_egg_r1_02_sub.fastq.gz 4_egg_r1_02_sub.fastq.gz dummy make_dummy_file_2.sh 1_egg_r2_01_sub.fastq.gz 2_egg_r2_01_sub.fastq.gz 3_egg_r2_01_sub.fastq.gz 4_egg_r2_01_sub.fastq.gz files.txt rename.sh 1_egg_r2_02_sub.fastq.gz 2_egg_r2_02_sub.fastq.gz 3_egg_r2_02_sub.fastq.gz 4_egg_r2_02_sub.fastq.gz make_dummy_file.sh rename_one_liner.sh 

after:
tommy@tommy-ThinkPad-T420:~/foo$ ls 1egg_R1_001.fastq.gz 2egg_R1_001.fastq.gz 3egg_R1_001.fastq.gz 4egg_R1_001.fastq.gz copy make_dummy_file_1.sh 1egg_R1_002.fastq.gz 2egg_R1_002.fastq.gz 3egg_R1_002.fastq.gz 4egg_R1_002.fastq.gz dummy make_dummy_file_2.sh 1egg_R2_001.fastq.gz 2egg_R2_001.fastq.gz 3egg_R2_001.fastq.gz 4egg_R2_001.fastq.gz files.txt rename.sh 1egg_R2_002.fastq.gz 2egg_R2_002.fastq.gz 3egg_R2_002.fastq.gz 4egg_R2_002.fastq.gz make_dummy_file.sh rename_one_liner.sh

References: http://stackoverflow.com/questions/399078/what-special-characters-must-be-escaped-in-regular-expressions

http://stackoverflow.com/questions/10929453/bash-scripting-read-file-line-by-line
https://www.cs.tut.fi/~jkorpela/perl/regexp.html

Friday, August 8, 2014

R commands basics

we are on day 5 of the MSU NGS course. This morning, Ian Dworkin introduced some basic R.
I found it refreshing and put a gist below.
Pick one language, and learn it well!
pick up a dataset, and play with it!  Happy coding!
By the way, the food here at KBS is amazing, I am gaining weight :)


#2014 MSU NGS R basics tutorial
#http://angus.readthedocs.org/en/2014/R_Introductory_tutorial_2014.html
#https://github.com/jrherr/quick_basic_R_tutorial/blob/master/R_tutorial.md
#pick one language, and learn it well!
#pick up a dataset, play with it!
#object-oriented programming
#functional programming
#deal with big data in R: (R holds all the data in memory)
#http://theodi.org/blog/fig-data-11-tips-how-handle-big-data-r-and-1-bad-pun
#http://r-pbd.org/
#packages: plyr, dplyr, ggplot2, reshape2, data.table (fread function)
# commands start here!!
q() # quit R
getwd() # get working directory
setwd() # set working directory
y<- 2 # assign a variable
x<- 3
x + y # treat it as a calculator
x * y
x/y
x %% y
x == 3 # equal sign, will reture a logical vector: True or False
# in R, True and False have numerical values: True resolves to 1 , False resolves to 0
# exponents ** or ^
2**2 # returns 4
2^2 # returns 4
log(2.7) # natural log returns 0.99325
log(4,2) # returns 2
a<- c(2,3,6,8) # assign a vector use c denotes concatenate
b<- c(3,5,6,7)
a + b #
a * b #
# when length of a and b are different, R will recycle the longer one and gives a warning
length(a) # length of a, returns 4
new_varaible <- c(a,b) # concatenate two variables
crap<- rep(1:100) # index starts at 1
rm (crap) # remove this variable
?lm # get help for linear regression model function
# simple functions
a<- c(2,3,6,8)
mean(a)
sum(a)
var(a)
b<- c(3,5,6,7)
cor(a,b) # pearson correlation for two vectors
m<- cbind(a,b) # column bind the vector to a matrix
m
#> m
# a b
# [1,] 2 3
# [2,] 3 5
# [3,] 6 6
# [4,] 8 7
cor(m) # pearson correlation for columns of a matrix
cor(m, method="spearman") # spearman correlation of columns of a matrix
?cor
mode (a) # numeric
class (a) # numeric
class (m) # matrix
typeof(m)
str(m) # structure of m, try it out in your R console
dim(m) # dimension of m: 4 2
nrow(m) # number of rows 4
ncol(m) # number of columns 2
length(m) # 8
is.matrix(m) # True
# create a matrix from scratch
#> m1<- matrix(1:12,3,4)
#> m1
# [,1] [,2] [,3] [,4]
#[1,] 1 4 7 10
#[2,] 2 5 8 11
#[3,] 3 6 9 12
# strings
cities<- c("E.lansing", "Gainesville", "Shanghai", "Yichun")
class(cities)
#[1] "character"
length(cities) # 4 cities in the vector
nchar(cities) # number of characters for each city
#[1] 9 11 8 6
sum(nchar(cities))
#[1] 34
rivers<- c("Red Cedar", "swamp", "Huang Pu", "Long He")
cities_rivers<- cbind (cities,rivers)
cities_rivers
# cities rivers
#[1,] "E.lansing" "Red Cedar"
#[2,] "Gainesville" "swamp"
#[3,] "Shanghai" "Huang Pu"
#[4,] "Yichun" "Long He"
class(cities_rivers) # matrix
mode (citeis_rivers) # character
model_1<- y ~ x1 + x2 + x1:x2
model_1
#> class(model_1)
#[1] "formula"
counts_transcript_a <- c(250, 157, 155, 300, 125, 100, 153, 175)
genotype <- gl(n=2, k=4, labels = c("wild_type", "mutant"))
#> genotype
#[1] wild_type wild_type wild_type wild_type mutant
#[6] mutant mutant mutant
#Levels: wild_type mutant
#alternative to gl function, one can
genotype1<- factor(rep(c("wild_type","mutant"),each=4))
#> genotype1
#[1] wild_type wild_type wild_type wild_type mutant
#[6] mutant mutant mutant
#Levels: mutant wild_type
#notice the use of the "each" argument
genotype1<- factor(rep(c("wild_type","mutant"),4))
#> genotype1
#[1] wild_type mutant wild_type mutant wild_type
#[6] mutant wild_type mutant
#Levels: mutant wild_type
#also, notice that the levels are different with that generated by gl function
#we want the wild_type to be the base level. Instead, do:
genotype1<- factor(rep(c("wild_type","mutant"),each=4), levels=c("wild_type","mutant"))
#> genotype1
#[1] wild_type wild_type wild_type wild_type mutant mutant mutant mutant
#Levels: wild_type mutant
?relevel # try it also
expression_data <- data.frame(counts_transcript_a, genotype)
#> expression_data
# counts_transcript_a genotype
#1 250 wild_type
#2 157 wild_type
#3 155 wild_type
#4 300 wild_type
#5 125 mutant
#6 100 mutant
#7 153 mutant
#8 175 mutant
expression_data$counts_transcript_a # access a column of the dataframe
ls() # objects in the enviroment
rm(list=ls()) # remove all the objects in the enviroment
### write functions
StdErr <- function(vector) {
sd(vector)/sqrt(length(vector))
}
CoefVar<- function(vector){
sd(vector)/mean(vector)
}
# apply families http://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/
# with
> with(expression_data, tapply(X=counts_transcript_a, INDEX=genotype, FUN=mean))
wild_type mutant
215.50 138.25
# some commonly used functions, try ?to understand them
head() # print the first 6 lines, different with linux (default 10 lines)
table()
rownames()
colnames()
nrow()
ncol()
by()
with()
rowSums()
rowMeans()
summary()
# construct sequences
one_to_20<- 1:20
twenty_to_1<- 20:1
seq1<- seq(from =1, to = 20, by 0.5)
# or seq1<- seq(1,20,0.5)
# repeat numbers
many_2<- rep(2, times=20)
many_a<- rep("a", times=20)
seq_rep<- rep(1:10, times=2)
rep_3_times<- rep(c(1,2,3), times=3)
# different
rep_each_3_times<- rep(c(1,2,3), each=3)
# to do: subsetting for vectors and matrix
# R mark-down

Thursday, August 7, 2014

Understanding the Forward strand and Reverse strand and the coordinates systems

we are on day 4 of the MSU NGS course. In the morning, instructor Istvan introduced Genomic Intervals. To understand the coordinates system, one needs to understand the strandness of DNA.

sense strand is the coding strand
anti-sense strand is the reverse-complementary strand of the coding strand
see details below:
https://www.biostars.org/p/3423/
https://www.biostars.org/p/3908/
again, everything is on biostar :)


I drew a picture to better understand it


remember:
coordinates are reported 5'---> 3'  forward strand
transcription occurs from 5' to 3'
forward/plus strand and reverse/reverse strand are designated arbitrarily.
Imagine that you can flip over the example I drew, then gene A would be in minus strand.

# 0-based and 1-based coordinates system

0 based and 1 based coordinates  cheat sheet
https://www.biostars.org/p/84686/

various formats: http://genome.ucsc.edu/FAQ/FAQformat.html
GFF3 specification: http://www.sequenceontology.org/gff3.shtml
0-based formats:BED, wiggle, BEDGRAPH
1-based formats: GFF, GTF, GBK (genebank file), SAM, VCF


# lift over coordinates
lift-over between different versions of genome https://genome.ucsc.edu/util.html
Generally do not do it, just map to the right version of interest.
By the way, the latest human genome GRCh38 is released: http://www.ensembl.info/blog/2014/08/07/ensembl-76-has-been-released/

Wednesday, August 6, 2014

linux commands basics

I am attending the NGS course at MSU. This is a great course with great instructors and friendly colleagues.
I highly recommend this course to everyone. http://bioinformatics.msu.edu/ngs-summer-course-2014

This morning, we learned SNP calling by samtools and sam file specification (I will write another blog for the SNP calling) .in the night , TA Elijah gave an awesome introduction to linux commands.
personally, I think this should be taught in the first day of the course. ( I am already pretty familiar with basic linux commands, but it does cause a lot of frustrations for beginners).

I took the notes, and put the commands that taught in a gist, see below and enjoy linux commands!

#linux commands basics
#http://software-carpentry.org/v5/novice/shell/index.html
# practise, practise, practise, google, google, google and you will get it :)
pwd # print working directory
cd # change directory
sudo # super user privilege
chmod 775 # change the privileges http://en.wikipedia.org/wiki/Chmod
git clone # version control! get to know git and github! http://git-scm.com/
sudo bash # bad habit
whoami # who you are
exit
logout
echo # it's like print
ls -sh # give the size information in MB etc
ls -F # color your folders vs files
ls -l # long format
ls -a # list everything including invisible files
ls /home/tommy/data # list files in the data folder by specifying a full path
clear # clear the screen
# remember the tab magic auto-fill
# clarify the filesystem: slash / denotes the root directory ~ denotes the home directory
cd - # change directory to your previous directory
cd ~ or just cd # change directory to your home directory
cd .. # change directory one level up
cd ../.. # change directory to two levels up
cp /home/tommy/data.txt . # copy data.txt to the current directory which denoted by .
# create things txt editors!
nano # use this one, because it is easy
emacs # do not use it until later stage
vi # 90% of the people do not know how to exit vim :)
cat mydata.txt # print the content of mydata.txt to the stand output(the screen)
cat mydata1.txt mydata2.txt > total_data.txt # concatenate two files to a new file
ctrl + D # quit (indicate cat that this is the last line) when you do something like: cat > mydata.txt
head -15 # print the first 15 lines to the screen
tail # print the last 10 lines to the screen by default
less # use less to peek txt files. less is more
#get help
man # man cat will show you all the flags (arguments) for cat command
info
# control + c to quit any command
rm # dangerous remove files (permanetly gone)
rm -fr #remove directory
move # rename files
move mydata.txt my_new_data.txt # rename mydata.txt to my_new_data.txt
cp # copy files. do not put space in your file names in linux, just remember it.
#############################
# text manipulation again use man to explore the flags for all the commands below or google it!
sed # a book can be dedicated to explain the usage of it http://www.thegeekstuff.com/sed-awk-101-hacks-ebook/
awk # a book can be dedicated to explain the usage of it http://www.thegeekstuff.com/sed-awk-101-hacks-ebook/
cut # one of the most frequently commands I use, cut the columns out of a txt file
cut -f1,3 mydata.txt # cut out the first and third columns out and print it to screen
paste # paste two files side by side
comm # look common lines of two files, files need to be sorted first.
comm -12 # will supress the unique lines in each file
diff # look different lines of two files.
wc # word count
wc -l # line number
grep ATCG mydata.txt | head # find lines containing ATCG in mydata.txt and look at it by head
grep -v ATCG mydata.txt > no_ATCG.txt # find lines not containing ATCG and redirect to a new file named no_ATCG.txt
nl # number lines of your output
sort -k2,2 -nr # reverse numerical sort based on my second column
###############################
#up-arrow to reuse your previous command
control + r #reverse search your command
history # your command history
# redirect with >, instead print the output in the command screen, redirect it to a new file
head -n 1000 mydata.txt > newdata.txt
# redirect the first 1000 lines to a new file named newdata.txt
ls . > file_names.txt
# redirect all the file names in the current directory to a file named file_names.txt
##########################
# pipes | where the power comes
history | less # look at your history by less, type q to quit less
cat mydata.txt | sort -k2,2 -nr > whole_sorted_data.txt # I am a useless cat fan
# the upper one equals to
sort -k2,2 -nr mydata.txt> whole_sorted_data.txt
head -n 1000 mydata.txt | sort -k2,2 -nr > my_selected_sorted_data.txt
###########################
# use wildcard characters * ? .
# again regular expressions! http://regexone.com/
##################### loops
for file in *.xml; do head -3 $file; done
for file in *.xml; do head -3 $file >> all_head3_xml.txt; done
for file in *.xml; do head -3 ${file} > ${file}.txt; done # rename the files by adding txt suffix keeping the old name of the *.xml files.
Creative Commons License
linux basics by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.