How should I do it? Reading into R may take forever, although one may recommend using data.table to fread in the data to save some time. However, R is notorious for having to read in the data into memory. 26G is very big and my desktop does not have that power. Handling large data sets in R may give you some alternatives to work with big data in R.
I decided to turn to the all-mighty unix commands.
A dummy example for testing
cat DATA.tsv
ID head1 head2 head3 head4
1 25.5 1364.0 22.5 13.2
2 10.1 215.56 1.15 22.2
cat LIST.TXT
ID
head1
head4
I need to extract column ID
, head1
and head4
from DATA.tsv.
## the column number to be extracted
head -1 DATA.tsv | tr "\t" "\n" | grep -nf LIST.TXT | sed 's/:.*$//'
1
2
5
### save to a variable and format it to 1,2,5 for cut command
cols=$(head -1 DATA.tsv | tr "\t" "\n" | grep -nf LIST.TXT | sed 's/:.*$//' | tr "\n" "," | sed 's/,$//')
## cut out
cut -f "${cols}" DATA.tsv
ID head1 head4
1 25.5 13.2
2 10.1 22.2
benchmarking for my 26G file:
time cut -f "${cols}" myfile.tsv > mysubset.txt
real 32m10.947s
user 31m42.511s
sys 0m26.686s
## memory usage very low!
top -M
top - 17:03:17 up 86 days, 4:43, 56 users, load average: 13.99, 13.72, 13.05
Tasks: 754 total, 2 running, 742 sleeping, 5 stopped, 5 zombie
Cpu(s): 13.8%us, 5.2%sy, 0.0%ni, 80.3%id, 0.0%wa, 0.0%hi, 0.7%si, 0.0%st
Mem: 31.354G total, 6535.461M used, 24.971G free, 274.668M buffers
Swap: 32.000G total, 2132.094M used, 29.918G free, 1367.434M cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18042 mtang1 20 0 102m 4808 604 R 100.0 0.0 5:41.71 cut
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#! /bin/bash | |
set -e | |
set -u | |
set -o pipefail | |
#### Author: Ming Tang (Tommy) | |
#### Date 09/29/2016 | |
#### I got the idea from this stackOverflow post http://stackoverflow.com/questions/11098189/awk-extract-columns-from-file-based-on-header-selected-from-2nd-file | |
# show help | |
show_help(){ | |
cat << EOF | |
This is a wrapper extracting columns of a (big) dataframe based on a list of column names in another | |
file. The column names must be one per line. The output will be stdout. For small files < 2G, one | |
can load it into R and do it easily, but when the file is big > 10G. R is quite cubersome. | |
Using unix commands on the other hand is better because files do not have to be loaded into memory at once. | |
e.g. subset a 26G size file for 700 columns takes around 30 mins. Memory footage is very low ~4MB. | |
usage: ${0##*/} -f < a dataframe > -c < colNames> -d <delimiter of the file> | |
-h display this help and exit. | |
-f the file you want to extract columns from. must contain a header with column names. | |
-c a file with the one column name per line. | |
-d delimiter of the dataframe: , or \t. default is tab. | |
e.g. | |
for tsv file: | |
${0##*/} -f mydata.tsv -c colnames.txt -d $'\t' or simply ommit the -d, default is tab. | |
for csv file: Note you have to specify -d , if your file is csv, otherwise all columns will be cut out. | |
${0##*/} -f mydata.csv -c colnames.txt -d , | |
EOF | |
} | |
## if there are no arguments provided, show help | |
if [[ $# == 0 ]]; then show_help; exit 1; fi | |
while getopts ":hf:c:d:" opt; do | |
case "$opt" in | |
h) show_help;exit 0;; | |
f) File2extract=$OPTARG;; | |
c) colNames=$OPTARG;; | |
d) delim=$OPTARG;; | |
'?') echo "Invalid option $OPTARG"; show_help >&2; exit 1;; | |
esac | |
done | |
## set up the default delimiter to be tab, Note the way I specify tab | |
delim=${delim:-$'\t'} | |
## get the number of columns in the data frame that match the column names in the colNames file. | |
## change the output to 2,5,6,22,... and get rid of the last comma so cut -f can be used | |
cols=$(head -1 "${File2extract}" | tr "${delim}" "\n" | grep -nf "${colNames}" | sed 's/:.*$//' | tr "\n" "," | sed 's/,$//') | |
## cut out the columns | |
cut -d"${delim}" -f"${cols}" "${File2extract}" |
No comments:
Post a Comment