This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Sunday, October 2, 2016

Cutting out 500 columns from a 26G file using command line

I have a 26 GB TSV file with several thousand columns. I want to extract 500 columns from it, based on the column names listed in another file.

How should I do it? Reading the file into R could take forever, although one might recommend using data.table's fread to save some time. The bigger problem is that R is notorious for having to hold the whole data set in memory. 26 GB is very big, and my desktop does not have that much memory. "Handling large data sets in R" may give you some alternatives for working with big data in R.

I decided to turn to the almighty Unix commands.
Since I may use this very often, I turned it into a shell script in which one can specify the separator as a comma or a tab.
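
A minimal sketch of how such a script could look, assuming the names file lists one column name per line and the first line of the data file is the header. The function name `cut_columns` and its argument order are made up for illustration. The idea is to read only the header, turn the wanted names into column numbers, and hand those numbers to `cut`, which streams the file instead of loading it into memory.

```shell
#!/usr/bin/env bash
# Sketch only: cut_columns is a hypothetical helper name.
# Usage: cut_columns <data_file> <names_file> [delimiter]
# names_file holds one wanted column name per line; delimiter defaults to tab.
cut_columns() {
    local data="$1" names="$2" delim="${3:-$'\t'}"
    local cols

    # Look only at the header line: put one column name per line, then keep
    # the line numbers (= column indices) of the names that exactly match a
    # whole line in names_file, and join them with commas for cut.
    cols=$(head -n 1 "$data" | tr "$delim" '\n' \
        | grep -n -x -F -f "$names" | cut -d: -f1 | paste -sd, -)

    if [ -z "$cols" ]; then
        echo "no matching column names found" >&2
        return 1
    fi

    # Stream the file once and print only the selected columns.
    cut -d "$delim" -f "$cols" "$data"
}
```

It could be called as `cut_columns big.tsv wanted_names.txt > subset.tsv` for a tab-separated file, or with `,` as a third argument for a comma-separated one (the file names here are placeholders). Two caveats: `cut` prints the columns in the order they appear in the data file, not in the order of the names file, and a plain comma split will not cope with quoted CSV fields that contain commas.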


Again, Unix commands are awesome!
