Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Wednesday, November 22, 2017

run gistic2 with sequenza segmentation output from whole exome sequencing

Convert sequenza output to gistic input

Gistic was designed for SNP6 array data. I saw many papers use it for whole exome sequencing data as well.
I have the segment files from sequenza and want to convert them to the gistic input.

Input format for gistic:

segment file:
(1) Sample (sample name)
(2) Chromosome (chromosome number)
(3) Start Position (segment start position, in bases)
(4) End Position (segment end position, in bases)
(5) Num markers (number of markers in segment)
(6) Seg.CN (log2() -1 of copy number)
  1. The conversion should be log2 (logarithm base 2) - 1, so that copy number 2 is 0.
  2. Every segment start and end in the segments file should appear in the markers file, not the other way around.
when the copy number is 0 (a homozygous deletion of both copies). You can’t do a log2(0)-1, just put a small number e.g. -5
(1) Marker Name
(2) Chromosome
(3) Marker Position (in bases)
Note gistic2 does not require a marker file anymore.

output of sequenza

sequenza gives a segment file. Segmentation was done by copynumberbioconductor package.
13 columns of the *segments.txt file
"chromosome" "start.pos" "end.pos" "Bf" "N.BAF" "sd.BAF" "depth.ratio" "N.ratio" "sd.ratio" "CNt" "A" "B" "LPP"
We only need the chromosomestart.posend.posN.BAF and depth.ratio columns.
The depth.ratio column is the GC content normalized ratio. a depth ratio of 1 means it has copy number of 2 (the same as the normal blood control in my case).
UPDATED: 12/17/2017. see a comment below. it is not log2(2^ depth.ratio) -1 rather:
To convert to gistic input, I have to do log2(2 * depth.ratio) - 1

UPDATED 01/03/2018
I have a bunch of sgement files in the same folder.
add the sample name in the final column and do the log2 math in R.
seg_files<- list.files(".", pattern = "*segments.txt", full.names = F) 

seg_dat_list <- lapply(seg_files, function(f) {
        dat<- read_tsv(f, col_names = T, col_types = cols(.default = col_character()))
        sample<- gsub("_vs_.*segments.txt", "", f)
        dat$sample<- sample

seg_dat <-, seg_dat_list)

gistic_input<- seg_dat %>% select(sample, chromosome, start.pos, end.pos, N.BAF, depth.ratio) %>% mutate(depth.ratio = as.numeric(depth.ratio)) %>% mutate(depth.ratio = log2(2 * depth.ratio) -1)

write_tsv(gistic_input, "all_segments.txt")
Back to bash:

## marker file:

cat all_segments.txt | sed '1d' | cut -f2,3 > markers.txt
cat all_segments.txt | sed '1d' | cut -f2,4 >> markers.txt

## sort the files by chromosome, take the unique ones and number the markers.

cat markers.txt | sort -V -k1,1 -k2,2nr | uniq | nl > markers_gistic.txt

Run gistic
modify the gistic2 script a bit. e.g. change MCR_ROOT folder path
## set MCR environment and launch GISTIC executable

## NOTE: change the line below if you have installed the Matlab MCR in an alternative location

echo Setting Matlab MCR root to $MCR_ROOT

## set up environment variables

## launch GISTIC executable
./gp_gistic2_from_seg $@

I removed ./ from the last line since I have put all executables in my path.
mkdir gistic_out
gistic2 -b gistic_out -seg all_segments.txt -refgene /scratch/genomic_med/apps/gistic/refgenefiles/hg19.mat -mk markers_gistic.txt  -genegistic 1 -smallmem 1 -broad 1 -brlen 0.5 -conf 0.90 -armpeel 1 -savegene 1 -gcm extreme


  1. Hi Tommy,

    Are you sure about log2(2^depth.ratio) - 1 = depth.ratio -1 ?
    I would take log2(2*depth.ratio) - 1 instead, because depth.ratio is not log-transformed in sequenza output.

    1. Hi Olga, many thanks for your comment, I think you are right. will update the post!

    2. Hi Olga, I just updated the post. thanks again.

    3. Hi Tommy and Olga,
      Can you please explain why we use log2(2 * depth.ratio) instead of log2(2 * CNt) from the sequenza segment file?

    4. Hi,

      log2(2*depth.ratio) - 1=log2(2)+log2(depth.ratio)-1=log2(depth.ratio).

      Isn't it?

    5. Hi,
      An article of nautre published in 2019 also use log2(2 * depth.ratio)-1 instead of log2(2 * CNt)-1 to transform the results from sequenza.Llink is here:
      I also don't know the specific reason, but it is a solid support for this transform method.

    6. Hi cnv,
      That's right. TCGA pipeline also use log2(depth.ratio), actually it equals to log2(CNV)-1.

    7. Can you share the reference paper for TCGA pipeline?

  2. Super nice, thank you!
    I am running GISTIC standalone 2.02.3.

    I got this error
    Shortened 12 segments in '/vol5/home/wangdo/mybin/GISTIC/myfiles/segmentationfile.txt' that overlap by one marker.

    I wanna know which lines are causing this error. Do you know how can we get detail error message such as GISTIC on genepattern?

    1. I am not sure what is this error, ask the GISTIC google group, they should help. I never used GISTIC on genepattern, sorry can not help that.

  3. I was able to successfully run GISTIC 2.0 using Sequenza outputs following your guidelines above. When I look at the significantly recurrent CNV regions identified by GISTIC on IGV, however, I cannot seem to agree with the GISTIC results by visual inspection. According to the Sequenza vignette, the field 'CNt' denotes the estimated copy number of tumor. Could you please describe the difference between depth.ratio and CNt? When I calculate tumor copy number using depth.ratio based on your formula above (log2(2 * depth.ratio) - 1) it does not match the corresponding CNt value in each row in the Sequenza output.

    1. As an example, I have a row where depth.ratio is 0.096 and CNt is 2. Therefore log2(2 * 0.096) - 1 is -3.38, and not 2.

    2. Hi Andy, see a post here!topic/sequenza-user-group/LAeG7BPeHEo I had the same question as you have.

      >Generally you are right, but it depends on many things: for instance if your ploidy estimate is more then 2 (more then diploid), then your assumptions is wrong.
      For instance if you have a ploidy estimate of 5 or more, then your CNt for a segment with ~0.9 depth.ratio would be around 4.
      Everything depends also on the purity estimate, a step of 0.1 (1-0.9) can be a lot or not too much… there are not fix ratio values that are convertible to specific copy number values, everything is inferred by a model that depend on few parameters. Depending on the parameters the results can change quite drastically.

    3. Thank you for your prompt response and explanation. I see that you cited the guidelines in this link ( in your post. When you ran the Sequenza+GISTIC pipeline, were you able to locate/verify the recurrent CNVs on IGV? What is the best way to visually confirm this? By the way, I'm using whole exome sequencing data.

    4. remember GISTIC is designed for a lot of samples, you need to have at least 20 samples to get some results. loading the GISTIC peaks to IGV together with sample wise segment file should give you some idea of how reliable the result is.

  4. pas cher moncler vestes, combinant un style élégant et une technologie de pointe, une variété de styles de pas cher moncler enfants chemise, le pointeur se place entre votre style gustatif exclusif.

  5. Hi Dear,
    Thanks for sharing such a useful blog. Really! This Blog is very informative for us which contains a lot of information about the Counselling For Men. I like this post. Please visit at "Bioinformatics genomics",We help you find information about Bioinformatics, Proteomics Genomics and Bioinformatics on We bring world class talent and deep experience to your projects as evidenced by our extensive list of achievements.

    Visit Here - <a href='> </a>Thanks

  6. Hi All, Thanks for tutorial.

    Do we need to depth.ratio by ploidy? It is mentioned in following paper with similar steps they said "To avoid bias from different sample ploidy, depth.ratios were further divided by the ploidy of each sample to correct for ploidy."

  7. tanie damskie gucci, łącząc elegancki styl i najnowocześniejszą technologię, różnorodność stylów tanie gucci etui na karty, wskaźnik przechodzi między Twoim ekskluzywnym stylem smaku.

  8. Hi Dear,

    I Like Your Blog Very Much. I see Daily Your Blog, is A Very Useful For me.

    You can also Find Data incites Data Incites is a Seattle based consulting group providing world class data science expertise and data science for the life sciences. Our Services are included bioinformatics, biostatistics, machine learning, clinical research, data analysis, data visualization

    Visit Now:-

  9. Hi,

    Thanks for sharing your blog. I have a question. If I use "log2(CNt/ploidy)" as GISTIC input, then the output of GISTIC is (CNt-ploidy)/ploidy. Suppose the sample ploidy=4, if the value of chr1 is -1, it meaning that the CN of chr1 is 2. Is this calculation way reasonable?

    1. Sorry. If I use "log2(CNt/ploidy)" as GISTIC input, then the output of GISTIC is 2*(CNt-ploidy)/ploidy