Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


Sunday, October 2, 2016

Cutting out 500 columns from a 26G file using command line

I have a 26G TSV file with several thousand columns, and I want to extract 500 of them based on the column names listed in another file.

How should I do it? Reading it into R may take forever, although one may recommend data.table's fread to save some time. However, R is notorious for having to read the whole data set into memory; 26G is very big, and my desktop does not have that much RAM. Handling large data sets in R gives some alternatives for working with big data in R.

I decided to turn to the all-mighty unix commands.

A dummy example for testing

cat DATA.tsv 
ID	head1	head2	head3	head4
1	25.5	1364.0	22.5	13.2
2	10.1	215.56	1.15	22.2

cat LIST.TXT 
ID
head1
head4

I need to extract the columns ID, head1 and head4 from DATA.tsv.

## the column numbers to be extracted

head -1 DATA.tsv | tr "\t" "\n" | grep -nf LIST.TXT |  sed 's/:.*$//'
1
2
5
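One thing to watch: grep -nf treats each name in LIST.TXT as a regex and matches substrings, so a name like head1 would also hit a column called head10. Adding -x (whole-line match) and -F (fixed strings) avoids that; a small sketch on the dummy files above:

```shell
# recreate the dummy files from above
printf 'ID\thead1\thead2\thead3\thead4\n' > DATA.tsv
printf 'ID\nhead1\nhead4\n' > LIST.TXT

# -F: treat the names as fixed strings, not regexes
# -x: a name must match the whole line, so head1 cannot match head10
head -1 DATA.tsv | tr '\t' '\n' | grep -nxFf LIST.TXT | sed 's/:.*$//'
```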

### save to a variable and format it as 1,2,5 for the cut command

cols=$(head -1 DATA.tsv | tr "\t" "\n" | grep -nf LIST.TXT | sed 's/:.*$//' | tr "\n" "," | sed 's/,$//')
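As an aside, the tr/sed tail that joins the line numbers with commas can be replaced by paste -s (serial paste), which joins all input lines with the given delimiter and leaves no trailing comma to strip; a sketch on the same dummy files:

```shell
# recreate the dummy files from above
printf 'ID\thead1\thead2\thead3\thead4\n' > DATA.tsv
printf 'ID\nhead1\nhead4\n' > LIST.TXT

# paste -s joins all lines into one, -d, uses a comma between them,
# so there is no trailing comma to clean up afterwards
cols=$(head -1 DATA.tsv | tr '\t' '\n' | grep -nf LIST.TXT | sed 's/:.*$//' | paste -sd, -)
echo "$cols"    # 1,2,5
```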

## cut out

cut -f "${cols}" DATA.tsv 
ID	head1	head4
1	25.5	13.2
2	10.1	22.2
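The Stack Overflow post I got the idea from does the same job with awk in a single pass over both files. A rough sketch on the dummy data (recreated inline; not how I ran it on the 26G file):

```shell
# recreate the dummy files from above
printf 'ID\thead1\thead2\thead3\thead4\n1\t25.5\t1364.0\t22.5\t13.2\n' > DATA.tsv
printf 'ID\nhead1\nhead4\n' > LIST.TXT

# first pass (NR == FNR) records the wanted names from LIST.TXT;
# on the header line of DATA.tsv we note which column numbers carry
# those names, then print only those fields for every row
awk -F'\t' -v OFS='\t' '
    NR == FNR { want[$1] = 1; next }
    FNR == 1  { for (i = 1; i <= NF; i++) if ($i in want) keep[++n] = i }
    { line = ""
      for (j = 1; j <= n; j++) line = line (j > 1 ? OFS : "") $(keep[j])
      print line }
' LIST.TXT DATA.tsv
```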

Benchmarking on my 26G file:

time cut -f "${cols}" myfile.tsv > mysubset.txt

real    32m10.947s
user    31m42.511s
sys     0m26.686s


## memory usage very low!
top -M

top - 17:03:17 up 86 days,  4:43, 56 users,  load average: 13.99, 13.72, 13.05
Tasks: 754 total,   2 running, 742 sleeping,   5 stopped,   5 zombie
Cpu(s): 13.8%us,  5.2%sy,  0.0%ni, 80.3%id,  0.0%wa,  0.0%hi,  0.7%si,  0.0%st
Mem:    31.354G total, 6535.461M used,   24.971G free,  274.668M buffers
Swap:   32.000G total, 2132.094M used,   29.918G free, 1367.434M cached

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                         
18042 mtang1    20   0  102m 4808  604 R 100.0  0.0   5:41.71 cut                                                                       
Since I may use this very often, I made it into a shell script, and one can specify the separator as comma or tab.
#! /bin/bash
set -e
set -u
set -o pipefail
#### Author: Ming Tang (Tommy)
#### Date 09/29/2016
#### I got the idea from this stackOverflow post http://stackoverflow.com/questions/11098189/awk-extract-columns-from-file-based-on-header-selected-from-2nd-file
# show help
show_help(){
cat << EOF
This is a wrapper for extracting columns of a (big) dataframe based on a list of column names in
another file, one column name per line. The output goes to stdout. For small files (< 2G) one
can load the data into R and subset it easily, but when the file is big (> 10G) R is quite cumbersome.
Using unix commands, on the other hand, is better because the file does not have to be loaded into memory all at once.
e.g. subsetting a 26G file for 700 columns takes around 30 mins, and the memory footprint is very low (~4MB).
usage: ${0##*/} -f < a dataframe > -c < colNames> -d <delimiter of the file>
-h display this help and exit.
-f the file you want to extract columns from. must contain a header with column names.
-c a file with one column name per line.
-d delimiter of the dataframe: , or \t. default is tab.
e.g.
for tsv file:
${0##*/} -f mydata.tsv -c colnames.txt -d $'\t', or simply omit -d; the default is tab.
for csv file: Note you must specify -d , if your file is csv, otherwise the whole line is treated as one column and printed unchanged.
${0##*/} -f mydata.csv -c colnames.txt -d ,
EOF
}
## if there are no arguments provided, show help
if [[ $# == 0 ]]; then show_help; exit 1; fi
while getopts ":hf:c:d:" opt; do
case "$opt" in
h) show_help;exit 0;;
f) File2extract=$OPTARG;;
c) colNames=$OPTARG;;
d) delim=$OPTARG;;
'?') echo "Invalid option $OPTARG"; show_help >&2; exit 1;;
esac
done
## set up the default delimiter to be tab, Note the way I specify tab
delim=${delim:-$'\t'}
## get the number of columns in the data frame that match the column names in the colNames file.
## change the output to 2,5,6,22,... and get rid of the last comma so cut -f can be used
cols=$(head -1 "${File2extract}" | tr "${delim}" "\n" | grep -nf "${colNames}" | sed 's/:.*$//' | tr "\n" "," | sed 's/,$//')
## cut out the columns
cut -d"${delim}" -f"${cols}" "${File2extract}"
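One caveat worth knowing: cut always emits fields in the order they appear in the file, no matter how the -f list is written, so this script selects columns but cannot rearrange them. A quick check on the dummy data:

```shell
# recreate the dummy file from above
printf 'ID\thead1\thead2\thead3\thead4\n1\t25.5\t1364.0\t22.5\t13.2\n' > DATA.tsv

# both commands print the same thing: cut ignores the requested
# order and outputs fields in file order (1, 2, 5)
cut -f 1,2,5 DATA.tsv
cut -f 5,2,1 DATA.tsv
```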


Again, Unix commands are awesome!
