I have not written in python for so long, I've been using R for a long time. This provide me a good practice opportunity.
The basic problem is to format some files, and I know it can be done by other tools like sed and awk ( see an awk solution below), but python just gives you much more flexibility in terms of what you can do and gives you more readability.
myfasta file
>comp1558_c0_seq1 len=204 path=[4976:0-203]
ATTGTACTTAATCTA.....
>comp8142_c0_seq1 len=222 path=[8835:0-221]
GTACAATCACGGCACATT......
>comp8570_c0_seq1 len=232 path=[3344:0-232]
GCATCGCTTATCGGTTTA......
annotation file:
comp1558_c0_seq1 repressor protein
comp8142_c0_seq1 aspartate aminotransferase
comp8357_c0_seq1 l-xylulose reductase
comp8387_c0_seq1 succinyl- synthetase alpha
comp8570_c0_seq1 fatty acid synthase beta subunit dehydratase
see the code below:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
with open ("C:/Users/Tang Ming/Desktop/anotation.txt", "r") as annotation: | |
anotation_dict = {} | |
for line in annotation: | |
line = line.split() | |
if line: #test whether it is an empty line | |
anotation_dict[line[0]]=line[1:] | |
else: | |
continue | |
# really should not parse the fasta file by myself. there are | |
# many parsers already there. you can use module from Biopython | |
ofile = open ("C:/Users/Tang Ming/Desktop/output.txt", "w") | |
with open ("C:/Users/Tang Ming/Desktop/my_fasta.txt", "r") as myfasta: | |
for line in myfasta: | |
if line.startswith (">"): | |
line = line[1:] # skip the ">" character | |
line = line.split() | |
if line[0] in anotation_dict: | |
new_line = ">" + str(line[0]) + " " + " ".join(anotation_dict[line[0]]) | |
ofile.write ( new_line + "\n") | |
else: | |
ofile.write ( ">"+ "".join(line) + "\n") | |
else: | |
ofile.write(line +"\n") | |
ofile.close() # always remember to close the file. |
output:
>comp1558_c0_seq1 repressor protein
ATTGTACTTAATCTA.....
>comp8142_c0_seq1 aspartate aminotransferase
GTACAATCACGGCACATT......
>comp8570_c0_seq1 fatty acid synthase beta subunit dehydratase
GCATCGCTTATCGGTTTA......
if the dictionary file has only two records for each line, this awk should work:
http://stackoverflow.com/questions/11678939/replace-text-based-on-a-dictionary
Usage:
http://www.gnu.org/software/gawk/manual/html_node/String-Functions.html
http://www.gnu.org/software/gawk/manual/html_node/Arrays.html
awk -f foo.awk dict.dat user.dat
http://www.gnu.org/software/gawk/manual/html_node/String-Functions.html
http://www.gnu.org/software/gawk/manual/html_node/Arrays.html
NR == FNR {
rep[$1] = $2
next
}
{
for (key in rep) {
gsub(key, rep[key])
}
print
}
happy it is useful!
ReplyDelete