see post here:
Exons, introns and intergenic regions
I'm interested in defining the exons, introns and intergenic regions of the genome based on GENCODE. So I thought about a fuss free way to do this, and BEDTools would be perfect.
Download the latest GENCODE annotations (version 14):
1
| wget ftp : //ftp .sanger.ac.uk /pub/gencode/release_14/gencode .v14.annotation.gtf.gz |
I will use mergeBed, subtractBed and complementBed. See here for usage examples.
The GTF file defines exons, so I will find all exons, output as a minimal bed file, sort the bed file and merge the overlapping exons:
1
2
3
4
| zcat gencode.v14.annotation.gtf.gz | awk 'BEGIN{OFS="\t";} $3=="exon" {print $1,$4,$5}' > gencode_v14_exon.bed sortBed -i gencode_v14_exon.bed > gencode_v14_exon_temp.bed mv -f gencode_v14_exon_temp.bed gencode_v14_exon.bed mergeBed -i gencode_v14_exon.bed > gencode_v14_exon_merged.bed |
To define intronic regions, we need to define the gene i.e. obtain the gene coordinates, and subtract the exons from this.
1
2
3
4
| zcat gencode.v14.annotation.gtf.gz | awk 'BEGIN{OFS="\t";} $3=="gene" {print $1,$4,$5}' > gencode_v14_gene.bed sortBed -i gencode_v14_gene.bed > gencode_v14_gene_temp.bed mv -f gencode_v14_gene_temp.bed gencode_v14_gene.bed subtractBed -a gencode_v14_gene.bed -b gencode_v14_exon_merged.bed > gencode_v14_intron.bed |
And finally to define intergenic regions, we use complementBed to find regions not covered by genes. To create a hg19_chrom_info.txt file, use the fetchChromSizes executable available at http://hgdownload.cse.ucsc.edu/admin/exe/ to create the hg19_chrom_info.txt file for hg19.
1
2
| . /fetchChromSizes hg19 > hg19_chrom_info.txt complementBed -i gencode_v14_gene.bed -g hg19_chrom_info.txt > gencode_v14_intergenic.bed |
And that's it.
Figure 1. The genome is a black box.
I have copied your method line by line, but it appears that I am still getting intergenic regions in your "introns" bed file. What could cause this? I am using gencode v19. Specifically I am overlapping H3K1 summits obtained from MACS2 and then overlapping them using bedtools intersect with the intron file, a large majority of the summits are being called back even though in IGV these regions are distal intergenic.
ReplyDeletecould you please paste your commands?
ReplyDelete2) Seperated the gencode annotation file into several genomic regions
Delete### exons
gzcat gencode.v19.annotation.gtf.gz | awk 'BEGIN{OFS="\t";} $3=="exon" {print $1,$4,$5}' > gencode_v19_exon.bed
sort -k 1,1 -k2,2n -i gencode_v19_exon.bed > gencode_v19_exon_temp.bed
mv -f gencode_v19_exon_temp.bed gencode_v19_exon.bed
mergeBed -i gencode_v19_exon.bed > gencode_v19_exon_merged.bed
### introns
gzcat gencode.v19.annotation.gtf.gz | awk 'BEGIN{OFS="\t";} $3=="gene" {print $1,$4,$5}' > gencode_v19_gene.bed
sort -k 1,1 -k2,2n -i gencode_v19_gene.bed > gencode_v19_gene_temp.bed
mv -f gencode_v19_gene_temp.bed gencode_v19_gene.bed
subtractBed -a gencode_v19_gene.bed -b gencode_v19_exon_merged.bed > gencode_v19_intron.bed
### intergenic
complementBed -i gencode_v19_gene.bed -g hg19.chrom.sizes > gencode_v19_intergenic.bed
Are you visualizing in IGV? IGV uses refgene as annotations. the distal intergenic may be a non-coding intron in gencode annotations. You can upload the gencode_v19_intron.bed to IGV to see whether they show up in "intergenic" regions as defined by refgene.
DeleteSorry for the late response.
Tommy
Bug Note: GTF files are 1-based and inclusive on both sides of the interval and BED files are 0-based and non-inclusive on the right. Thus to convert GTF intervals to BED intervals you need to do {print $1,$4-1,$5} in the awk code in the first steps. So the awk code should be written as:
ReplyDeleteawk 'BEGIN{OFS="\t";} $3=="exon" {print $1,$4-1,$5}'
Thanks! I am aware of that :) see http://crazyhottommy.blogspot.com/2018/03/three-gotchas-when-using-r-for-genomic.html
DeleteMy goal was to get intronic regions from GTF file and I followed each step just the way you've explained. However, I saw there were some chromosomal positions with same coordinates but different gene name (annotation).
ReplyDeletechr1 123 456 abc
chr1 123 456 def
How is that possible?? What does that mean. I would be glad if you could input your thoughts.
Thanks!