This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Thursday, October 20, 2016

why not stringsAsFactors: my personal experience

If you are using R, you will most likely encounter stringsAsFactors when reading in files. Functions such as read.table and data.frame set stringsAsFactors to TRUE by default (in R versions before 4.0.0), which may cause various problems.
If you want to know the history of this argument, you may want to read the post by Roger Peng on the topic.
I just had an unexpected experience with stringsAsFactors, and I will put down my notes below. This is also my first attempt to use an R Notebook in RStudio :)

## dummy examples
library(dplyr)
df<- data.frame(chr1=c(1,2,3), start1 = c(10,20,30), end1 = c(30,40,50), chr2=c(1,2,5), 
                type = c("BND", "BND", "BND"))
df
chr1   start1  end1   chr2   type
<dbl>  <dbl>   <dbl>  <dbl>  <fctr>
1      10      30     1      BND
2      20      40     2      BND
3      30      50     5      BND

Now, I want to create a new column, type2. If chr1 is the same as chr2, I set it to foldbackInversion; if not, I keep it the same as type.

df %>% mutate(type2 = ifelse(chr1==chr2, "foldbackInversion", type))
chr1   start1  end1   chr2   type    type2
<dbl>  <dbl>   <dbl>  <dbl>  <fctr>  <chr>
1      10      30     1      BND     foldbackInversion
2      20      40     2      BND     foldbackInversion
3      30      50     5      BND     1

Did you see that in row 3, type2 becomes 1!?
This is because type is stored as a factor, and internally R uses integers to represent the levels to save space. ifelse() drops the factor attributes, so the integer code for BND (1) is returned instead of the label. If you use dplyr’s own if_else() function, which is stricter in checking types, you will get an error.

df %>% mutate(type2 = if_else(chr1==chr2, "foldbackInversion", type))
Error: `false` has type 'integer' not 'character'
How to fix it? Change the factors to characters!

df$type<- as.character(df$type)
df %>% mutate(type2 = if_else(chr1==chr2, "foldbackInversion", type))
chr1   start1  end1   chr2   type   type2
<dbl>  <dbl>   <dbl>  <dbl>  <chr>  <chr>
1      10      30     1      BND    foldbackInversion
2      20      40     2      BND    foldbackInversion
3      30      50     5      BND    BND

Sunday, October 2, 2016

Cutting out 500 columns from a 26G file using command line

I have a 26 G tsv file with several thousand columns. I want to extract 500 columns from it based on the column names in another file.

How should I do it? Reading it into R may take forever, although one may recommend using fread from data.table to save some time. However, R is notorious for having to read all the data into memory. 26G is very big and my desktop does not have that much memory. The post "Handling large data sets in R" may give you some alternatives for working with big data in R.

I decided to turn to the almighty Unix commands.
Since I may need this very often, I made it into a shell script in which one can specify the separator to be a comma or a tab.
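
A minimal sketch of how such a script could look, assuming the wanted column names are listed one per line in a separate file and the big file has a header row. The function name cut_columns and its argument order are made up for illustration, and this is not necessarily the script used for the post. awk streams the file line by line, so the 26G never has to fit in memory.

```shell
#!/bin/sh
# cut_columns NAMES_FILE DATA_FILE [tab|comma]
# Hypothetical helper for illustration only.
# Assumes: one column name per line in NAMES_FILE, a header row in DATA_FILE,
# and no quoted fields that contain the separator.
cut_columns() {
  case "${3:-tab}" in
    tab)   sep=$(printf '\t') ;;
    comma) sep=',' ;;
    *) echo "separator must be 'tab' or 'comma'" >&2; return 1 ;;
  esac

  awk -F"$sep" -v OFS="$sep" '
    # First file: remember every wanted column name.
    NR == FNR { wanted[$0] = 1; next }
    # Header of the data file: record the positions of the wanted columns.
    FNR == 1 {
      for (i = 1; i <= NF; i++) if ($i in wanted) keep[++n] = i
      if (n == 0) { print "no matching columns found" > "/dev/stderr"; exit 1 }
    }
    # Every line, header included: print only the kept fields, in file order.
    {
      line = ""
      for (j = 1; j <= n; j++) line = line (j > 1 ? OFS : "") $keep[j]
      print line
    }
  ' "$1" "$2"
}
```

With placeholder file names, a call would look like `cut_columns wanted_columns.txt big_table.tsv tab > subset.tsv`. Columns come out in the order they appear in the big file, not in the order of the names file.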


Again, Unix commands are awesome!