Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Tuesday, July 22, 2014

use python to change the header of a fasta file based on a dictionary in another file

I saw on SEQanwser, someone asked this question here, and I used python to resolve this problem.
I have not written in python for so long, I've been using R for a long time. This provide me a good practice opportunity.

The basic problem is to format some files, and I know it can be done by other tools like sed and awk ( see an awk solution below), but python just gives you much more flexibility in terms of what you can do and gives you more readability.
myfasta file

>comp1558_c0_seq1 len=204 path=[4976:0-203]
ATTGTACTTAATCTA.....
>comp8142_c0_seq1 len=222 path=[8835:0-221]
GTACAATCACGGCACATT......
>comp8570_c0_seq1 len=232 path=[3344:0-232]
GCATCGCTTATCGGTTTA......

annotation file:

comp1558_c0_seq1 repressor protein
comp8142_c0_seq1 aspartate aminotransferase
comp8357_c0_seq1 l-xylulose reductase
comp8387_c0_seq1 succinyl- synthetase alpha
comp8570_c0_seq1 fatty acid synthase beta subunit dehydratase

see the code below:
with open ("C:/Users/Tang Ming/Desktop/anotation.txt", "r") as annotation:
anotation_dict = {}
for line in annotation:
line = line.split()
if line: #test whether it is an empty line
anotation_dict[line[0]]=line[1:]
else:
continue
# really should not parse the fasta file by myself. there are
# many parsers already there. you can use module from Biopython
ofile = open ("C:/Users/Tang Ming/Desktop/output.txt", "w")
with open ("C:/Users/Tang Ming/Desktop/my_fasta.txt", "r") as myfasta:
for line in myfasta:
if line.startswith (">"):
line = line[1:] # skip the ">" character
line = line.split()
if line[0] in anotation_dict:
new_line = ">" + str(line[0]) + " " + " ".join(anotation_dict[line[0]])
ofile.write ( new_line + "\n")
else:
ofile.write ( ">"+ "".join(line) + "\n")
else:
ofile.write(line +"\n")
ofile.close() # always remember to close the file.




output:
>comp1558_c0_seq1 repressor protein
ATTGTACTTAATCTA.....
>comp8142_c0_seq1 aspartate aminotransferase
GTACAATCACGGCACATT......
>comp8570_c0_seq1 fatty acid synthase beta subunit dehydratase
GCATCGCTTATCGGTTTA......

if the dictionary file has only two records for each line, this awk should work:
http://stackoverflow.com/questions/11678939/replace-text-based-on-a-dictionary
NR == FNR {
  rep[$1] = $2
  next
} 

{
    for (key in rep) {
      gsub(key, rep[key])
    }
    print
}

1 comment: