Diving into Genetics and Genomics: April 2014

This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Monday, April 14, 2014

order a matrix by a pre-defined group and then by the rowSums in R

I will lay down the R code below

#create a matrix

> mat<- matrix(sample.int(10, size=30, replace=T), ncol=3)
> mat
      [,1] [,2] [,3]
 [1,]    4    7    6
 [2,]    5    6    1
 [3,]    4    3    1
 [4,]    2    2    9
 [5,]    4    5    4
 [6,]    8    9    8
 [7,]    5    9    7
 [8,]    8    6    7
 [9,]    3    8    5
[10,]    4    3    6

# column bind the mat and the sum of the rows together

> mat1<- cbind(mat, rowSums(mat))
> mat1
      [,1] [,2] [,3] [,4]
 [1,]    4    7    6   17
 [2,]    5    6    1   12
 [3,]    4    3    1    8
 [4,]    2    2    9   13
 [5,]    4    5    4   13
 [6,]    8    9    8   25
 [7,]    5    9    7   21
 [8,]    8    6    7   21
 [9,]    3    8    5   16
[10,]    4    3    6   13
# order the matrix by the rowSums(mat)
> mat[order(rowSums(mat)),]
      [,1] [,2] [,3]
 [1,]    4    3    1
 [2,]    5    6    1
 [3,]    2    2    9
 [4,]    4    5    4
 [5,]    4    3    6
 [6,]    3    8    5
 [7,]    4    7    6
 [8,]    5    9    7
 [9,]    8    6    7
[10,]    8    9    8
# compare with the order by looking at the last column
> mat1[order(rowSums(mat)),]
      [,1] [,2] [,3] [,4]
 [1,]    4    3    1    8
 [2,]    5    6    1   12
 [3,]    2    2    9   13
 [4,]    4    5    4   13
 [5,]    4    3    6   13
 [6,]    3    8    5   16
 [7,]    4    7    6   17
 [8,]    5    9    7   21
 [9,]    8    6    7   21
[10,]    8    9    8   25
# order the mat matrix in a decreasing order by adding a minus sign 
> mat[order(-rowSums(mat)),]
      [,1] [,2] [,3]
 [1,]    8    9    8
 [2,]    5    9    7
 [3,]    8    6    7
 [4,]    4    7    6
 [5,]    3    8    5
 [6,]    2    2    9
 [7,]    4    5    4
 [8,]    4    3    6
 [9,]    5    6    1
[10,]    4    3    1
# label the rows into different groups
> rownames(mat)<- c("f","a","d","c","b","d","e","a","f","f")
> mat
  [,1] [,2] [,3]
f    4    7    6
a    5    6    1
d    4    3    1
c    2    2    9
b    4    5    4
d    8    9    8
e    5    9    7
a    8    6    7
f    3    8    5
f    4    3    6
> rownames(mat1)<- c("f","a","d","c","b","d","e","a","f","f")
> mat1
  [,1] [,2] [,3] [,4]
f    4    7    6   17
a    5    6    1   12
d    4    3    1    8
c    2    2    9   13
b    4    5    4   13
d    8    9    8   25
e    5    9    7   21
a    8    6    7   21
f    3    8    5   16
f    4    3    6   13
> groups<- factor(rownames(mat))
> groups
 [1] f a d c b d e a f f
Levels: a b c d e f
# order the matrix by group first, and then within the subgroups, order by the rowSums
> mat1[order(groups,rowSums(mat)),]
  [,1] [,2] [,3] [,4]
a    5    6    1   12
a    8    6    7   21
b    4    5    4   13
c    2    2    9   13
d    4    3    1    8
d    8    9    8   25
e    5    9    7   21
f    4    3    6   13
f    3    8    5   16
f    4    7    6   17

It is extremely useful when I want to order the matrix by a pre-defined way and then plot the heatmap with heatmap2 in gplots package.

Friday, April 11, 2014

rehead a bam file

I downloaded some ChIP-seq bam files from the UCSC genome browser http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/

and I found some bam files headers are without chrY information

samtools view -H my.sorted.bam
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
@SQ SN:chr3 LN:198022430
@SQ SN:chr4 LN:191154276
@SQ SN:chr5 LN:180915260
@SQ SN:chr6 LN:171115067
@SQ SN:chr7 LN:159138663
@SQ SN:chr8 LN:146364022
@SQ SN:chr9 LN:141213431
@SQ SN:chr10 LN:135534747
@SQ SN:chr11 LN:135006516
@SQ SN:chr12 LN:133851895
@SQ SN:chr13 LN:115169878
@SQ SN:chr14 LN:107349540
@SQ SN:chr15 LN:102531392
@SQ SN:chr16 LN:90354753
@SQ SN:chr17 LN:81195210
@SQ SN:chr18 LN:78077248
@SQ SN:chr19 LN:59128983
@SQ SN:chr20 LN:63025520
@SQ SN:chr21 LN:48129895
@SQ SN:chr22 LN:51304566
@SQ SN:chrX LN:155270560
@SQ SN:chrM LN:16571

for some downstream analysis, I have to rehead this bam file:

samtools view -H my.bam | sed '24 a @SQ SN:chrY LN:59373566' | samtools reheader - my.bam > my.reheaded.bam

sed to append a line "@SQ SN:chrY LN:59373566" at the 24th line

ctr+v+tab to insert a tab in the terminal.

Friday, April 4, 2014

liftover wig file

I was looking at a public ChIP-seq data set, but the author did not put the raw sequence data in the GEO database. Instead, they uploaded a wig file of the ChIP-seq signal file and it was mapped to human hg18 reference genome.

All the other data sets I am analyzing are mapped to hg19, so I have to liftover this wig file to hg19 also.

Solution: CrossMap! http://crossmap.sourceforge.net/

Convert Wiggle/BigWig format files

Wiggle (WIG) format is useful for displaying continuous data such as GC content and reads intensity of high-throughput sequencing data. BigWig is a self-indexed binary-format Wiggle file, and has the advantage of supporting random access. This means only regions that need to be displayed are retrieved by genome browser, and it dramatically reduces the time needed for data transferring (Kent et al., 2010). Input wiggle data can be in variableStep (for data with irregular intervals) or fixedStep (for data with regular intervals). Regardless of the input, the output will always in bedGraph format. bedGraph format is similar to wiggle format and can be converted into BigWig format using UCSC wigToBigWig tool. We export files in bedGraph because it is usually much smaller than file in wiggle format, and more importantly, CrossMap internally transforms wiggle into bedGraph to increase running speed.

If an input file is in BigWig format, the output is BigWig format if UCSC’s ‘wigToBigWig‘ executable can be found; otherwise, the output file will be in bedGraph format.

Typing command without any arguments will print help message:

$ python2.7 CrossMap.py  wig

Screen output:

Usage:
  CrossMap.py wig input_chain_file input_wig_file output_prefix

Description:
  "input_chain_file" can be regular or compressed (*.gz, *.Z, *.z, *.bz, *.bz2,
  *.bzip2) file, local file or URL (http://, https://, ftp://) pointing to remote
  file.  Both "variableStep" and "fixedStep" wiggle lines are supported. Wiggle
  format: http://genome.ucsc.edu/goldenPath/help/wiggle.html

Example:
  CrossMapy.py wig hg18ToHg19.over.chain.gz test.hg18.wig test.hg19

It is very easy to use, after install it following the instruction I did:
CrossMap.py wig hg18ToHg19.over.chain.gz my.wig my_hg19

it generates a bigwig file, a sorted bedgraph and a unsorted bedgraph file.

it took me 36mins to convert a 1.4Gb wig file on my desktop with 4Gb ram.

Diving into Genetics and Genomics

My github papge