Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Friday, May 17, 2013

Find exons introns and intergenic regions


see post here:

Exons, introns and intergenic regions

I'm interested in defining the exons, introns and intergenic regions of the genome based on GENCODE. So I thought about a fuss free way to do this, and BEDTools would be perfect.
Download the latest GENCODE annotations (version 14):
1
wget ftp://ftp.sanger.ac.uk/pub/gencode/release_14/gencode.v14.annotation.gtf.gz
I will use mergeBed, subtractBed and complementBed. See here for usage examples.
The GTF file defines exons, so I will find all exons, output as a minimal bed file, sort the bed file and merge the overlapping exons:
1
2
3
4
zcat gencode.v14.annotation.gtf.gz | awk 'BEGIN{OFS="\t";} $3=="exon" {print $1,$4,$5}' > gencode_v14_exon.bed
sortBed -i gencode_v14_exon.bed > gencode_v14_exon_temp.bed
mv -f gencode_v14_exon_temp.bed gencode_v14_exon.bed
mergeBed -i gencode_v14_exon.bed > gencode_v14_exon_merged.bed
To define intronic regions, we need to define the gene i.e. obtain the gene coordinates, and subtract the exons from this.
1
2
3
4
zcat gencode.v14.annotation.gtf.gz | awk 'BEGIN{OFS="\t";} $3=="gene" {print $1,$4,$5}' > gencode_v14_gene.bed
sortBed -i gencode_v14_gene.bed > gencode_v14_gene_temp.bed
mv -f gencode_v14_gene_temp.bed gencode_v14_gene.bed
subtractBed -a gencode_v14_gene.bed -b gencode_v14_exon_merged.bed > gencode_v14_intron.bed
And finally to define intergenic regions, we use complementBed to find regions not covered by genes. To create a hg19_chrom_info.txt file, use the fetchChromSizes executable available at http://hgdownload.cse.ucsc.edu/admin/exe/ to create the hg19_chrom_info.txt file for hg19.
1
2
./fetchChromSizes hg19 > hg19_chrom_info.txt
complementBed -i gencode_v14_gene.bed -g hg19_chrom_info.txt > gencode_v14_intergenic.bed
And that's it.
bedtools
Figure 1. The genome is a black box.

7 comments:

  1. I have copied your method line by line, but it appears that I am still getting intergenic regions in your "introns" bed file. What could cause this? I am using gencode v19. Specifically I am overlapping H3K1 summits obtained from MACS2 and then overlapping them using bedtools intersect with the intron file, a large majority of the summits are being called back even though in IGV these regions are distal intergenic.

    ReplyDelete
  2. could you please paste your commands?

    ReplyDelete
    Replies
    1. 2) Seperated the gencode annotation file into several genomic regions

      ### exons
      gzcat gencode.v19.annotation.gtf.gz | awk 'BEGIN{OFS="\t";} $3=="exon" {print $1,$4,$5}' > gencode_v19_exon.bed

      sort -k 1,1 -k2,2n -i gencode_v19_exon.bed > gencode_v19_exon_temp.bed

      mv -f gencode_v19_exon_temp.bed gencode_v19_exon.bed

      mergeBed -i gencode_v19_exon.bed > gencode_v19_exon_merged.bed

      ### introns
      gzcat gencode.v19.annotation.gtf.gz | awk 'BEGIN{OFS="\t";} $3=="gene" {print $1,$4,$5}' > gencode_v19_gene.bed

      sort -k 1,1 -k2,2n -i gencode_v19_gene.bed > gencode_v19_gene_temp.bed

      mv -f gencode_v19_gene_temp.bed gencode_v19_gene.bed

      subtractBed -a gencode_v19_gene.bed -b gencode_v19_exon_merged.bed > gencode_v19_intron.bed

      ### intergenic
      complementBed -i gencode_v19_gene.bed -g hg19.chrom.sizes > gencode_v19_intergenic.bed

      Delete
    2. Are you visualizing in IGV? IGV uses refgene as annotations. the distal intergenic may be a non-coding intron in gencode annotations. You can upload the gencode_v19_intron.bed to IGV to see whether they show up in "intergenic" regions as defined by refgene.

      Sorry for the late response.

      Tommy

      Delete
  3. Bug Note: GTF files are 1-based and inclusive on both sides of the interval and BED files are 0-based and non-inclusive on the right. Thus to convert GTF intervals to BED intervals you need to do {print $1,$4-1,$5} in the awk code in the first steps. So the awk code should be written as:
    awk 'BEGIN{OFS="\t";} $3=="exon" {print $1,$4-1,$5}'

    ReplyDelete
    Replies
    1. Thanks! I am aware of that :) see http://crazyhottommy.blogspot.com/2018/03/three-gotchas-when-using-r-for-genomic.html

      Delete
  4. My goal was to get intronic regions from GTF file and I followed each step just the way you've explained. However, I saw there were some chromosomal positions with same coordinates but different gene name (annotation).
    chr1 123 456 abc
    chr1 123 456 def

    How is that possible?? What does that mean. I would be glad if you could input your thoughts.

    Thanks!

    ReplyDelete