Thursday, October 20, 2016

why not stringsAsFactors: my personal experience

If you are using R, most likely you have encountered stringsAsFactors when reading in files. Functions such as read.table() set stringsAsFactors to TRUE by default, which may cause various problems.
If you want to know the history of this argument, you may want to read a post by Roger Peng.
I just had an unexpected experience with stringsAsFactors. I will put down my notes below. This is also my first attempt to use R Notebook in RStudio :)

## dummy examples
library(dplyr)
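# note: data.frame() sets stringsAsFactors = TRUE by default, so the
# character column "type" below is silently converted to a factor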
df<- data.frame(chr1=c(1,2,3), start1 = c(10,20,30), end1 = c(30,40,50), chr2=c(1,2,5), 
                type = c("BND", "BND", "BND"))
df
  chr1 start1 end1 chr2   type
 <dbl>  <dbl> <dbl> <dbl> <fctr>
1    1     10   30    1    BND
2    2     20   40    2    BND
3    3     30   50    5    BND
Now, I want to create a new column type2: if chr1 is the same as chr2, I set it to foldbackInversion; if not, I keep it the same as type.

df %>% mutate(type2 = ifelse(chr1==chr2, "foldbackInversion", type))
  chr1 start1 end1 chr2   type             type2
 <dbl>  <dbl> <dbl> <dbl> <fctr>            <chr>
1    1     10   30    1    BND foldbackInversion
2    2     20   40    2    BND foldbackInversion
3    3     30   50    5    BND                 1
Did you see that in row 3, type2 becomes 1?!
This is because type is stored as a factor, and internally R uses integers to represent the levels to save space. "BND" is the first (and only) level, so it is stored as the integer 1, and ifelse() drops the factor attributes and returns the bare code. If you use dplyr's if_else() function, which is stricter in checking types, you will get an error instead.

df %>% mutate(type2 = if_else(chr1==chr2, "foldbackInversion", type))
Error: `false` has type 'integer' not 'character'
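The error makes sense once you look at how the factor is stored; a quick check on a toy factor shows the integer codes that ifelse() exposed:

f <- factor(c("BND", "BND", "BND"))
levels(f)
## [1] "BND"
as.integer(f)
## [1] 1 1 1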
How to fix it? Change the factor to characters!

df$type<- as.character(df$type)
df %>% mutate(type2 = if_else(chr1==chr2, "foldbackInversion", type))
  chr1 start1 end1 chr2  type             type2
 <dbl>  <dbl> <dbl> <dbl> <chr>            <chr>
1    1     10   30    1   BND foldbackInversion
2    2     20   40    2   BND foldbackInversion
3    3     30   50    5   BND               BND
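Better yet, avoid the problem at creation (or read-in) time. A minimal sketch, passing stringsAsFactors = FALSE explicitly; the same argument works for read.table() and friends:

df <- data.frame(chr1 = c(1, 2, 3), start1 = c(10, 20, 30), end1 = c(30, 40, 50),
                 chr2 = c(1, 2, 5), type = c("BND", "BND", "BND"),
                 stringsAsFactors = FALSE)
str(df$type)
##  chr [1:3] "BND" "BND" "BND"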

Sunday, October 2, 2016

Cutting out 500 columns from a 26G file using the command line

I have a 26G tsv file with several thousand columns. I want to extract 500 of them based on the column names listed in another file.

How should I do it? Reading it into R may take forever, though one might recommend data.table's fread() to read in the data faster. However, R is notorious for having to read the whole data set into memory; 26G is very big, and my desktop does not have that much. Handling large data sets in R gives some alternatives for working with big data in R.
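For completeness, a sketch of the data.table route; fread() has a select argument that reads in only the wanted columns (assuming the column names file holds one name per line, matching the header exactly):

library(data.table)
## read only the columns named in LIST.TXT instead of all several thousand
wanted <- readLines("LIST.TXT")
dt <- fread("DATA.tsv", select = wanted)

Even then, for a 26G file the Unix route below is hard to beat on memory.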

I decided to turn to the almighty Unix commands.

A dummy example for testing

cat DATA.tsv 
ID	head1	head2	head3	head4
1	25.5	1364.0	22.5	13.2
2	10.1	215.56	1.15	22.2

cat LIST.TXT 
ID
head1
head4

I need to extract the columns ID, head1 and head4 from DATA.tsv.

## the column numbers to be extracted

head -1 DATA.tsv | tr "\t" "\n" | grep -nf LIST.TXT |  sed 's/:.*$//'
1
2
5
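Step by step: head -1 keeps the header line, tr turns the tabs into newlines so there is one column name per line, grep -nf prints the line number of every name that matches a pattern in LIST.TXT, and sed strips everything after the colon, leaving just the numbers. One caveat: grep -f does substring matching, so a name like head1 would also match a column called head10; adding -x (whole-line match) avoids that:

head -1 DATA.tsv | tr "\t" "\n" | grep -nxf LIST.TXT | sed 's/:.*$//'
1
2
5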

### save to a variable and format it to 1,2,5 for cut command

cols=$(head -1 DATA.tsv | tr "\t" "\n" | grep -nf LIST.TXT | sed 's/:.*$//' | tr "\n" "," | sed 's/,$//')
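A quick sanity check of the variable:

echo "$cols"
1,2,5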

## cut out

cut -f "${cols}" DATA.tsv 
ID	head1	head4
1	25.5	13.2
2	10.1	22.2

benchmarking on my 26G file:

time cut -f "${cols}" myfile.tsv > mysubset.txt

real    32m10.947s
user    31m42.511s
sys     0m26.686s


## memory usage very low!
top -M

top - 17:03:17 up 86 days,  4:43, 56 users,  load average: 13.99, 13.72, 13.05
Tasks: 754 total,   2 running, 742 sleeping,   5 stopped,   5 zombie
Cpu(s): 13.8%us,  5.2%sy,  0.0%ni, 80.3%id,  0.0%wa,  0.0%hi,  0.7%si,  0.0%st
Mem:    31.354G total, 6535.461M used,   24.971G free,  274.668M buffers
Swap:   32.000G total, 2132.094M used,   29.918G free, 1367.434M cached

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                         
18042 mtang1    20   0  102m 4808  604 R 100.0  0.0   5:41.71 cut                                                                       
Since I may use it often, I made it into a shell script; one can specify the separator to be comma or tab.
#! /bin/bash
set -e
set -u
set -o pipefail
#### Author: Ming Tang (Tommy)
#### Date 09/29/2016
#### I got the idea from this stackOverflow post http://stackoverflow.com/questions/11098189/awk-extract-columns-from-file-based-on-header-selected-from-2nd-file
# show help
show_help(){
cat << EOF
This is a wrapper for extracting columns of a (big) dataframe based on a list of column names in another
file. The column names must be one per line. The output goes to stdout. For small files (< 2G), one
can load the data into R and do it easily, but when the file is big (> 10G), R is quite cumbersome.
Using unix commands, on the other hand, is better because the file does not have to be loaded into memory at once.
e.g. subsetting a 26G file for 700 columns takes around 30 mins. The memory footprint is very low (~4MB).
usage: ${0##*/} -f < a dataframe > -c < colNames> -d <delimiter of the file>
-h display this help and exit.
-f the file you want to extract columns from. must contain a header with column names.
-c a file with one column name per line.
-d delimiter of the dataframe: , or \t. default is tab.
e.g.
for tsv file:
${0##*/} -f mydata.tsv -c colnames.txt -d $'\t', or simply omit the -d; the default is tab.
for csv file: Note you have to specify -d , if your file is csv; otherwise cut treats each whole line as a single field and the entire file is returned.
${0##*/} -f mydata.csv -c colnames.txt -d ,
EOF
}
## if there are no arguments provided, show help
if [[ $# == 0 ]]; then show_help; exit 1; fi
while getopts ":hf:c:d:" opt; do
case "$opt" in
h) show_help;exit 0;;
f) File2extract=$OPTARG;;
c) colNames=$OPTARG;;
d) delim=$OPTARG;;
'?') echo "Invalid option $OPTARG"; show_help >&2; exit 1;;
esac
done
## set up the default delimiter to be tab, Note the way I specify tab
delim=${delim:-$'\t'}
## get the numbers of the columns in the data frame that match the column names in the colNames file.
## grep -x forces whole-name matches so that e.g. head1 does not also pull out head10.
## change the output to 2,5,6,22,... and get rid of the last comma so cut -f can be used
cols=$(head -1 "${File2extract}" | tr "${delim}" "\n" | grep -nxf "${colNames}" | sed 's/:.*$//' | tr "\n" "," | sed 's/,$//')
## cut out the columns
cut -d"${delim}" -f"${cols}" "${File2extract}"
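Saved as, say, extractColumns.sh (the name is up to you) and made executable, the script reproduces the dummy example above:

chmod +x extractColumns.sh
./extractColumns.sh -f DATA.tsv -c LIST.TXT
ID	head1	head4
1	25.5	13.2
2	10.1	22.2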


Again, Unix commands are awesome!