Thursday, October 20, 2016

why not stringsAsFactors: my personal experience

If you are using R, most likely you have encountered stringsAsFactors when reading in files. Functions such as read.table() set stringsAsFactors to TRUE by default, which may cause various problems.
If you want to know the history of this argument, you may want to read a post by Roger Peng.
I just had an unexpected experience with stringsAsFactors. I will put down my notes below. This is also my first attempt to use R Notebook in RStudio :)

## dummy examples
library(dplyr)
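# note: data.frame() sets stringsAsFactors = TRUE by default, so the
# character column "type" below is silently converted to a factor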
df<- data.frame(chr1=c(1,2,3), start1 = c(10,20,30), end1 = c(30,40,50), chr2=c(1,2,5), 
                type = c("BND", "BND", "BND"))
df
  chr1 start1 end1 chr2   type
 <dbl>  <dbl> <dbl> <dbl> <fctr>
1    1     10   30    1    BND
2    2     20   40    2    BND
3    3     30   50    5    BND
Now, I want to create a new column type2: if chr1 is the same as chr2, I set it to foldbackInversion; if not, I keep it the same as type.

df %>% mutate(type2 = ifelse(chr1==chr2, "foldbackInversion", type))
  chr1 start1 end1 chr2   type             type2
 <dbl>  <dbl> <dbl> <dbl> <fctr>            <chr>
1    1     10   30    1    BND foldbackInversion
2    2     20   40    2    BND foldbackInversion
3    3     30   50    5    BND                 1
Did you see that in row 3, type2 becomes 1?!
This is because type is stored as a factor, and internally R uses integers to represent the levels to save space. "BND" is the first (and only) level, so it is stored as the integer 1, and ifelse() drops the factor attributes and returns the bare code. If you use dplyr's if_else() function, which is stricter in checking types, you will get an error instead.

df %>% mutate(type2 = if_else(chr1==chr2, "foldbackInversion", type))
Error: `false` has type 'integer' not 'character'
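The error makes sense once you look at how the factor is stored; a quick check on a toy factor shows the integer codes that ifelse() exposed:

f <- factor(c("BND", "BND", "BND"))
levels(f)
## [1] "BND"
as.integer(f)
## [1] 1 1 1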
How to fix it? Change the factor to characters!

df$type<- as.character(df$type)
df %>% mutate(type2 = if_else(chr1==chr2, "foldbackInversion", type))
  chr1 start1 end1 chr2  type             type2
 <dbl>  <dbl> <dbl> <dbl> <chr>            <chr>
1    1     10   30    1   BND foldbackInversion
2    2     20   40    2   BND foldbackInversion
3    3     30   50    5   BND               BND
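Better yet, avoid the problem at creation (or read-in) time. A minimal sketch, passing stringsAsFactors = FALSE explicitly; the same argument works for read.table() and friends:

df <- data.frame(chr1 = c(1, 2, 3), start1 = c(10, 20, 30), end1 = c(30, 40, 50),
                 chr2 = c(1, 2, 5), type = c("BND", "BND", "BND"),
                 stringsAsFactors = FALSE)
str(df$type)
##  chr [1:3] "BND" "BND" "BND"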

Sunday, October 2, 2016

Cutting out 500 columns from a 26G file using the command line

I have a 26G tsv file with several thousand columns. I want to extract 500 of them based on the column names listed in another file.

How should I do it? Reading it into R may take forever, though one might recommend data.table's fread() to read in the data faster. However, R is notorious for having to read the whole data set into memory; 26G is very big, and my desktop does not have that much. Handling large data sets in R gives some alternatives for working with big data in R.
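For completeness, a sketch of the data.table route; fread() has a select argument that reads in only the wanted columns (assuming the column names file holds one name per line, matching the header exactly):

library(data.table)
## read only the columns named in LIST.TXT instead of all several thousand
wanted <- readLines("LIST.TXT")
dt <- fread("DATA.tsv", select = wanted)

Even then, for a 26G file the Unix route below is hard to beat on memory.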

I decided to turn to the almighty Unix commands.

A dummy example for testing

cat DATA.tsv 
ID	head1	head2	head3	head4
1	25.5	1364.0	22.5	13.2
2	10.1	215.56	1.15	22.2

cat LIST.TXT 
ID
head1
head4

I need to extract the columns ID, head1 and head4 from DATA.tsv.

## the column numbers to be extracted

head -1 DATA.tsv | tr "\t" "\n" | grep -nf LIST.TXT |  sed 's/:.*$//'
1
2
5
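Step by step: head -1 keeps the header line, tr turns the tabs into newlines so there is one column name per line, grep -nf prints the line number of every name that matches a pattern in LIST.TXT, and sed strips everything after the colon, leaving just the numbers. One caveat: grep -f does substring matching, so a name like head1 would also match a column called head10; adding -x (whole-line match) avoids that:

head -1 DATA.tsv | tr "\t" "\n" | grep -nxf LIST.TXT | sed 's/:.*$//'
1
2
5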

### save to a variable and format it to 1,2,5 for cut command

cols=$(head -1 DATA.tsv | tr "\t" "\n" | grep -nf LIST.TXT | sed 's/:.*$//' | tr "\n" "," | sed 's/,$//')
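A quick sanity check of the variable:

echo "$cols"
1,2,5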

## cut out

cut -f "${cols}" DATA.tsv 
ID	head1	head4
1	25.5	13.2
2	10.1	22.2

benchmarking on my 26G file:

time cut -f "${cols}" myfile.tsv > mysubset.txt

real    32m10.947s
user    31m42.511s
sys     0m26.686s


## memory usage very low!
top -M

top - 17:03:17 up 86 days,  4:43, 56 users,  load average: 13.99, 13.72, 13.05
Tasks: 754 total,   2 running, 742 sleeping,   5 stopped,   5 zombie
Cpu(s): 13.8%us,  5.2%sy,  0.0%ni, 80.3%id,  0.0%wa,  0.0%hi,  0.7%si,  0.0%st
Mem:    31.354G total, 6535.461M used,   24.971G free,  274.668M buffers
Swap:   32.000G total, 2132.094M used,   29.918G free, 1367.434M cached

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                         
18042 mtang1    20   0  102m 4808  604 R 100.0  0.0   5:41.71 cut                                                                       
Since I may use it often, I made it into a shell script; one can specify the separator to be comma or tab.
#! /bin/bash
set -e
set -u
set -o pipefail
#### Author: Ming Tang (Tommy)
#### Date 09/29/2016
#### I got the idea from this stackOverflow post http://stackoverflow.com/questions/11098189/awk-extract-columns-from-file-based-on-header-selected-from-2nd-file
# show help
show_help(){
cat << EOF
This is a wrapper for extracting columns of a (big) dataframe based on a list of column names in another
file. The column names must be one per line. The output goes to stdout. For small files (< 2G), one
can load the data into R and do it easily, but when the file is big (> 10G), R is quite cumbersome.
Using unix commands, on the other hand, is better because the file does not have to be loaded into memory at once.
e.g. subsetting a 26G file for 700 columns takes around 30 mins. The memory footprint is very low (~4MB).
usage: ${0##*/} -f < a dataframe > -c < colNames> -d <delimiter of the file>
-h display this help and exit.
-f the file you want to extract columns from. must contain a header with column names.
-c a file with one column name per line.
-d delimiter of the dataframe: , or \t. default is tab.
e.g.
for tsv file:
${0##*/} -f mydata.tsv -c colnames.txt -d $'\t', or simply omit the -d; the default is tab.
for csv file: Note you have to specify -d , if your file is csv; otherwise cut treats each whole line as a single field and the entire file is returned.
${0##*/} -f mydata.csv -c colnames.txt -d ,
EOF
}
## if there are no arguments provided, show help
if [[ $# == 0 ]]; then show_help; exit 1; fi
while getopts ":hf:c:d:" opt; do
case "$opt" in
h) show_help;exit 0;;
f) File2extract=$OPTARG;;
c) colNames=$OPTARG;;
d) delim=$OPTARG;;
'?') echo "Invalid option $OPTARG"; show_help >&2; exit 1;;
esac
done
## set up the default delimiter to be tab, Note the way I specify tab
delim=${delim:-$'\t'}
## get the numbers of the columns in the data frame that match the column names in the colNames file.
## grep -x forces whole-name matches so that e.g. head1 does not also pull out head10.
## change the output to 2,5,6,22,... and get rid of the last comma so cut -f can be used
cols=$(head -1 "${File2extract}" | tr "${delim}" "\n" | grep -nxf "${colNames}" | sed 's/:.*$//' | tr "\n" "," | sed 's/,$//')
## cut out the columns
cut -d"${delim}" -f"${cols}" "${File2extract}"
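Saved as, say, extractColumns.sh (the name is up to you) and made executable, the script reproduces the dummy example above:

chmod +x extractColumns.sh
./extractColumns.sh -f DATA.tsv -c LIST.TXT
ID	head1	head4
1	25.5	13.2
2	10.1	22.2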


Again, Unix commands are awesome!