Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Wednesday, November 22, 2017

run gistic2 with sequenza segmentation output from whole exome sequencing

Convert sequenza output to gistic input

Gistic was designed for SNP6 array data. I saw many papers use it for whole exome sequencing data as well.
I have the segment files from sequenza and want to convert them to the gistic input.

Input format for gistic:

segment file:
(1) Sample (sample name)
(2) Chromosome (chromosome number)
(3) Start Position (segment start position, in bases)
(4) End Position (segment end position, in bases)
(5) Num markers (number of markers in segment)
(6) Seg.CN (log2() -1 of copy number)
  1. The conversion should be log2 (logarithm base 2) - 1, so that copy number 2 is 0.
  2. Every segment start and end in the segments file should appear in the markers file, not the other way around.
when the copy number is 0 (a homozygous deletion of both copies). You can’t do a log2(0)-1, just put a small number e.g. -5
(1) Marker Name
(2) Chromosome
(3) Marker Position (in bases)
Note gistic2 does not require a marker file anymore.

output of sequenza

sequenza gives a segment file. Segmentation was done by copynumberbioconductor package.
13 columns of the *segments.txt file
"chromosome" "start.pos" "end.pos" "Bf" "N.BAF" "sd.BAF" "depth.ratio" "N.ratio" "sd.ratio" "CNt" "A" "B" "LPP"
We only need the chromosomestart.posend.posN.BAF and depth.ratio columns.
The depth.ratio column is the GC content normalized ratio. a depth ratio of 1 means it has copy number of 2 (the same as the normal blood control in my case).
UPDATED: 12/17/2017. see a comment below. it is not log2(2^ depth.ratio) -1 rather:
To convert to gistic input, I have to do log2(2 * depth.ratio) - 1

UPDATED 01/03/2018
I have a bunch of sgement files in the same folder.
add the sample name in the final column and do the log2 math in R.
library(tidyverse)
library(readr)
seg_files<- list.files(".", pattern = "*segments.txt", full.names = F) 

seg_dat_list <- lapply(seg_files, function(f) {
        dat<- read_tsv(f, col_names = T, col_types = cols(.default = col_character()))
        sample<- gsub("_vs_.*segments.txt", "", f)
        dat$sample<- sample
        return(dat)
})

seg_dat <- do.call(rbind, seg_dat_list)

gistic_input<- seg_dat %>% select(sample, chromosome, start.pos, end.pos, N.BAF, depth.ratio) %>% mutate(depth.ratio = as.numeric(depth.ratio)) %>% mutate(depth.ratio = log2(2 * depth.ratio) -1)

write_tsv(gistic_input, "all_segments.txt")
Back to bash:


## marker file:

cat all_segments.txt | sed '1d' | cut -f2,3 > markers.txt
cat all_segments.txt | sed '1d' | cut -f2,4 >> markers.txt

## sort the files by chromosome, take the unique ones and number the markers.

cat markers.txt | sort -V -k1,1 -k2,2nr | uniq | nl > markers_gistic.txt

Run gistic
modify the gistic2 script a bit. e.g. change MCR_ROOT folder path
#!/bin/sh
## set MCR environment and launch GISTIC executable

## NOTE: change the line below if you have installed the Matlab MCR in an alternative location
MCR_ROOT=/scratch/genomic_med/apps/Matlab_Complier_runTime
MCR_VER=v83

echo Setting Matlab MCR root to $MCR_ROOT

## set up environment variables
LD_LIBRARY_PATH=$MCR_ROOT/$MCR_VER/runtime/glnxa64:$LD_LIBRARY_PATH
LD_LIBRARY_PATH=$MCR_ROOT/$MCR_VER/bin/glnxa64:$LD_LIBRARY_PATH
LD_LIBRARY_PATH=$MCR_ROOT/$MCR_VER/sys/os/glnxa64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH
XAPPLRESDIR=$MCR_ROOT/$MCR_VER/MATLAB_Component_Runtime/v83/X11/app-defaults
export XAPPLRESDIR

## launch GISTIC executable
./gp_gistic2_from_seg $@

I removed ./ from the last line since I have put all executables in my path.
mkdir gistic_out
gistic2 -b gistic_out -seg all_segments.txt -refgene /scratch/genomic_med/apps/gistic/refgenefiles/hg19.mat -mk markers_gistic.txt  -genegistic 1 -smallmem 1 -broad 1 -brlen 0.5 -conf 0.90 -armpeel 1 -savegene 1 -gcm extreme

10 comments:

  1. Hi Tommy,

    Are you sure about log2(2^depth.ratio) - 1 = depth.ratio -1 ?
    I would take log2(2*depth.ratio) - 1 instead, because depth.ratio is not log-transformed in sequenza output.

    ReplyDelete
    Replies
    1. Hi Olga, many thanks for your comment, I think you are right. will update the post!

      Delete
    2. Hi Olga, I just updated the post. thanks again.

      Delete
  2. Super nice, thank you!
    I am running GISTIC standalone 2.02.3.

    I got this error
    Shortened 12 segments in '/vol5/home/wangdo/mybin/GISTIC/myfiles/segmentationfile.txt' that overlap by one marker.

    I wanna know which lines are causing this error. Do you know how can we get detail error message such as GISTIC on genepattern?

    ReplyDelete
    Replies
    1. I am not sure what is this error, ask the GISTIC google group, they should help. I never used GISTIC on genepattern, sorry can not help that.

      Delete
  3. I was able to successfully run GISTIC 2.0 using Sequenza outputs following your guidelines above. When I look at the significantly recurrent CNV regions identified by GISTIC on IGV, however, I cannot seem to agree with the GISTIC results by visual inspection. According to the Sequenza vignette, the field 'CNt' denotes the estimated copy number of tumor. Could you please describe the difference between depth.ratio and CNt? When I calculate tumor copy number using depth.ratio based on your formula above (log2(2 * depth.ratio) - 1) it does not match the corresponding CNt value in each row in the Sequenza output.

    ReplyDelete
    Replies
    1. As an example, I have a row where depth.ratio is 0.096 and CNt is 2. Therefore log2(2 * 0.096) - 1 is -3.38, and not 2.

      Delete
    2. Hi Andy, see a post here https://groups.google.com/forum/#!topic/sequenza-user-group/LAeG7BPeHEo I had the same question as you have.

      >Generally you are right, but it depends on many things: for instance if your ploidy estimate is more then 2 (more then diploid), then your assumptions is wrong.
      For instance if you have a ploidy estimate of 5 or more, then your CNt for a segment with ~0.9 depth.ratio would be around 4.
      Everything depends also on the purity estimate, a step of 0.1 (1-0.9) can be a lot or not too much… there are not fix ratio values that are convertible to specific copy number values, everything is inferred by a model that depend on few parameters. Depending on the parameters the results can change quite drastically.

      Delete
    3. Thank you for your prompt response and explanation. I see that you cited the guidelines in this link (https://github.com/mbourgey/EBI_cancer_workshop_CNV) in your post. When you ran the Sequenza+GISTIC pipeline, were you able to locate/verify the recurrent CNVs on IGV? What is the best way to visually confirm this? By the way, I'm using whole exome sequencing data.

      Delete
    4. remember GISTIC is designed for a lot of samples, you need to have at least 20 samples to get some results. loading the GISTIC peaks to IGV together with sample wise segment file should give you some idea of how reliable the result is.

      Delete