Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Tuesday, August 4, 2015

linux command join with header

See the ipython notebook here 

https://github.com/crazyhottommy/scripts-general-use/blob/master/Shell/Join_command.ipynb

linux commands are very powerful.

They are small but can do small things well.
Today, I am going to demonstrate a very useful linux command join.
join is used to join two files based on a common "ID" column.
This is a common task when you have information stored in two files
and have a common column to link them together.
I first created two dummy tab delimited files: A.txt and B.txt
Let's have a look at the files:
In [1]:
cat A.txt
Jack TX 5
John CA 3
Luke WA 4
Tammy FL 2
Tommy FL 1
In [2]:
cat B.txt
1 450
2 300
3 400
5 150
6 600
column 3 in file A and column 1 in file B are common IDs that link these two files together.
Let's do an inner join:
In [3]:
join -1 3 -2 1 A.txt B.txt
5 Jack TX 150
-1 means file 1, which is A file. 3 means the third column.
-2 means file 2, which is B file. 1 means the first column.
The output is not we expected. A file and B file should have 4 rows linked together.
**It turns out files need to be sorted first according to the common
field for join to work.** Let's sort them first and join them:
In [4]:
sort -k3,3 A.txt > A.sorted.txt
sort -k1,1 B.txt > B.sorted.txt
join -1 3 -2 1 A.sorted.txt B.sorted.txt
1 Tommy FL 450
2 Tammy FL 300
3 John CA 400
5 Jack TX 150
we can also do right outer join (-a 1) :
In [5]:
join -1 3 -2 1 -a 1 A.sorted.txt B.sorted.txt
1 Tommy FL 450
2 Tammy FL 300
3 John CA 400
Luke WA 4
5 Jack TX 150
The default input seprator is space or tab, and output seprator is space.
we can change the output seprator to tab:
In [6]:
join -t$'\t' -1 3 -2 1 -a 1 -o auto A.sorted.txt B.sorted.txt
1 Tommy FL 450
2 Tammy FL 300
3 John CA 400
4 Luke WA 
5 Jack TX 150
Let's do a left outer join (-a 2) and fill the empty field with NA.
Note that only GNU join has the auto flag.
In [7]:
join -t$'\t' -e "NA" -1 3 -2 1 -a 2 -o auto A.sorted.txt B.sorted.txt
1 Tommy FL 450
2 Tammy FL 300
3 John CA 400
5 Jack TX 150
6 NA NA 600
You may think the intermediate sorted file is annoying. One can avoid them using
process substitution:
In [8]:
join -1 3 -2 1 -t $'\t' <(sort -k3,3 A.txt) <(sort -k1,1 B.txt)
1 Tommy FL 450
2 Tammy FL 300
3 John CA 400
5 Jack TX 150
Wow! It is so magic! Using process substitution can speed up your workflow.
So far, the files we have are without headers. What if they contain headers?
Let's make some dummy files first:
In [9]:
printf "name\tstate\tid\n" | cat - A.txt > A_header.txt
printf "id\tsalary\n" | cat - B.txt > B_header.txt

In [10]:
cat A_header.txt
name state id
Jack TX 5
John CA 3
Luke WA 4
Tammy FL 2
Tommy FL 1
In [11]:
cat B_header.txt
id salary
1 450
2 300
3 400
5 150
6 600
Let's join them together:
In [12]:
join -1 3 -2 1 -t $'\t' <(sort -k3,3 A_header.txt) <(sort -k1,1 B_header.txt)
1 Tommy FL 450
2 Tammy FL 300
3 John CA 400
5 Jack TX 150
id name state salary
Unfortunately, after sorting, the header went to the bottom of the file. Many times, the header may just go to the middle of the file. We can always delete the header first,
join the file and then add the header back, but we will need to have intermediate files.
I will use body function which will ignore the header.
In [13]:
join -1 3 -2 1 -t $'\t' <(cat A_header.txt | body sort -k3,3) <(cat B_header.txt | body sort -k1,1)
id name state salary
1 Tommy FL 450
2 Tammy FL 300
3 John CA 400
5 Jack TX 150

No comments:

Post a Comment