Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


Tuesday, August 25, 2015

sort with header maintained

We usually want to keep the header in place when we process the body of a text file.
You can find a body function here, but it only accepts streams and assumes a one-line header.
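
For readers who have not seen such a function, here is a minimal sketch of what a `body` helper of this kind might look like. This is my own illustration and not necessarily the linked implementation; the name `body` and the one-line-header assumption are taken from the description above.

```shell
# body: print the first line of stdin unchanged, then run the given
# command on the remaining lines. Assumes exactly one header line.
body() {
    # Read the header without trimming whitespace or interpreting backslashes.
    IFS= read -r header
    printf '%s\n' "$header"
    # The command given as arguments reads whatever is left on stdin.
    "$@"
}
```

It would be used in a pipeline such as `cat your_file | body sort -k1,1`. Because it reads from stdin, it only works on streams, which is the limitation mentioned above.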

Subshells and awk can do the same job and are potentially more flexible.


## imagine we have a file with one line header, and we want to keep the header after sorting
## use subshells http://bash.cyberciti.biz/guide/What_is_a_Subshell%3F
(sed -n '1p' your_file; cat your_file | sed '1d' | sort) > sort_header.txt
## if you have two header lines and want to keep both of them:
(sed -n '1,2p' your_file; cat your_file | sed '1,2d' | sort) > sort_header.txt
## if you have many lines starting with "#" as header, like vcf files
(grep "^#" my_vcf; grep -v "^#" my_vcf | sort -k1,1V -k2,2n) > sorted.vcf
## one can also use awk
cat my_vcf | awk '$0~"^#" { print $0; next } { print $0 | "LC_ALL=C sort -k1,1V -k2,2n" }'
## I am a useless cat user :) http://stackoverflow.com/questions/11710552/useless-use-of-cat
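
The subshell one-liners above can be wrapped into a small reusable function. This is only a sketch: the name `sort_keep_header` and its argument order are hypothetical, and it uses `head` and `tail` in place of the two `sed` calls, which should behave the same for this purpose.

```shell
# sort_keep_header N FILE [sort options...]
# Print the first N lines of FILE unchanged, then the rest sorted.
# The function body is a subshell, so its variables do not leak out.
sort_keep_header() (
    n=$1
    file=$2
    shift 2
    # Header: the first N lines, as-is.
    head -n "$n" "$file"
    # Body: everything from line N+1 on, sorted with any extra options.
    tail -n +"$((n + 1))" "$file" | sort "$@"
)
```

For example, `sort_keep_header 2 your_file -k1,1 > sort_header.txt` would keep two header lines and sort the body on the first column.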

The original credit goes to Aaron Quinlan. See the gist below for sorting VCF files in natural chromosome order (chr1, chr2, chr3, ...) rather than lexicographic order (chr1, chr10, chr11, ...).
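
To make the difference between the two orderings concrete, here is a tiny demonstration on three made-up chromosome names. It assumes GNU sort, since the `-V` (version sort) key modifier is not available in every sort implementation.

```shell
# Three chromosome names in arbitrary order (made-up input).
chroms='chr10
chr2
chr1'

# Plain sort compares character by character, so chr10 lands before chr2.
lexical=$(printf '%s\n' "$chroms" | LC_ALL=C sort -k1,1)

# Version sort (-V) compares the embedded numbers numerically.
natural=$(printf '%s\n' "$chroms" | LC_ALL=C sort -k1,1V)

printf 'lexical order:\n%s\n' "$lexical"   # chr1, chr10, chr2
printf 'natural order:\n%s\n' "$natural"   # chr1, chr2, chr10
```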

chmod a+x vcfsort.sh
vcfsort.sh trio.trim.vep.vcf.gz
## sort a VCF and keep (only the first) header as-is:
awk 'BEGIN{x=0;} $0 ~/^#/{ if(x==0) {print;} next}{x=1; print $0 | "sort -k1,1 -k2,2n"}'
#!/bin/bash
# vcfsort.sh
# Faster, but can't handle streams
[ $# -eq 0 ] && { echo "Sorts a VCF file in natural chromosome order";\
echo "Usage: $0 [my.vcf | my.vcf.gz]"; exit 1;
}
# cheers, @michaelhoffman
# Print the header lines first, then the sorted body; the file is read twice.
if (zless "$1" | grep '^#'; zless "$1" | grep -v '^#' | LC_ALL=C sort -k1,1V -k2,2n);
then
exit 0
else
# Report to stderr so the message does not end up in the sorted output.
printf 'sort failed. Does your version of sort support the -V option?\n' >&2
printf 'If not, you should update sort with the latest from GNU coreutils:\n' >&2
printf 'git clone git://git.sv.gnu.org/coreutils\n' >&2
exit 1
fi
#!/bin/bash
# vcfsort2.sh
# Slower, but handles streams.
[ $# -eq 0 ] && { echo "Sorts a VCF file in natural chromosome order";\
echo "Usage: $0 [my.vcf | my.vcf.gz]"; \
echo "Usage: cat [my.vcf | my.vcf.gz] | $0"; \
exit 1;
}
# cheers, @colbychiang
# awk prints header lines directly and pipes every other line to sort.
if zless "$1" | awk '$0~"^#" { print $0; next } { print $0 | "LC_ALL=C sort -k1,1V -k2,2n" }';
then
exit 0
else
# Report to stderr so the message does not end up in the sorted output.
printf 'sort failed. Does your version of sort support the -V option?\n' >&2
printf 'If not, you should update sort with the latest from GNU coreutils:\n' >&2
printf 'git clone git://git.sv.gnu.org/coreutils\n' >&2
exit 1
fi
