Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Wednesday, November 22, 2017

run gistic2 with sequenza segmentation output from whole exome sequencing

Convert sequenza output to gistic input

Gistic was designed for SNP6 array data. I saw many papers use it for whole exome sequencing data as well.
I have the segment files from sequenza and want to convert them to the gistic input.

Input format for gistic:

segment file:
(1) Sample (sample name)
(2) Chromosome (chromosome number)
(3) Start Position (segment start position, in bases)
(4) End Position (segment end position, in bases)
(5) Num markers (number of markers in segment)
(6) Seg.CN (log2() -1 of copy number)
  1. The conversion should be log2 (logarithm base 2) - 1, so that copy number 2 is 0.
  2. Every segment start and end in the segments file should appear in the markers file, not the other way around.
when the copy number is 0 (a homozygous deletion of both copies). You can’t do a log2(0)-1, just put a small number e.g. -5
(1) Marker Name
(2) Chromosome
(3) Marker Position (in bases)
Note gistic2 does not require a marker file anymore.

output of sequenza

sequenza gives a segment file. Segmentation was done by copynumberbioconductor package.
13 columns of the *segments.txt file
"chromosome" "start.pos" "end.pos" "Bf" "N.BAF" "sd.BAF" "depth.ratio" "N.ratio" "sd.ratio" "CNt" "A" "B" "LPP"
We only need the chromosomestart.posend.posN.BAF and depth.ratio columns.
The depth.ratio column is the GC content normalized ratio. a depth ratio of 1 means it has copy number of 2 (the same as the normal blood control in my case).
UPDATED: 12/17/2017. see a comment below. it is not log2(2^ depth.ratio) -1 rather:
To convert to gistic input, I have to do log2(2 * depth.ratio) - 1

UPDATED 01/03/2018
I have a bunch of sgement files in the same folder.
add the sample name in the final column and do the log2 math in R.
library(tidyverse)
library(readr)
seg_files<- list.files(".", pattern = "*segments.txt", full.names = F) 

seg_dat_list <- lapply(seg_files, function(f) {
        dat<- read_tsv(f, col_names = T, col_types = cols(.default = col_character()))
        sample<- gsub("_vs_.*segments.txt", "", f)
        dat$sample<- sample
        return(dat)
})

seg_dat <- do.call(rbind, seg_dat_list)

gistic_input<- seg_dat %>% select(sample, chromosome, start.pos, end.pos, N.BAF, depth.ratio) %>% mutate(depth.ratio = as.numeric(depth.ratio)) %>% mutate(depth.ratio = log2(2 * depth.ratio) -1)

write_tsv(gistic_input, "all_segments.txt")
Back to bash:


## marker file:

cat all_segments.txt | sed '1d' | cut -f2,3 > markers.txt
cat all_segments.txt | sed '1d' | cut -f2,4 >> markers.txt

## sort the files by chromosome, take the unique ones and number the markers.

cat markers.txt | sort -V -k1,1 -k2,2nr | uniq | nl > markers_gistic.txt

Run gistic
modify the gistic2 script a bit. e.g. change MCR_ROOT folder path
#!/bin/sh
## set MCR environment and launch GISTIC executable

## NOTE: change the line below if you have installed the Matlab MCR in an alternative location
MCR_ROOT=/scratch/genomic_med/apps/Matlab_Complier_runTime
MCR_VER=v83

echo Setting Matlab MCR root to $MCR_ROOT

## set up environment variables
LD_LIBRARY_PATH=$MCR_ROOT/$MCR_VER/runtime/glnxa64:$LD_LIBRARY_PATH
LD_LIBRARY_PATH=$MCR_ROOT/$MCR_VER/bin/glnxa64:$LD_LIBRARY_PATH
LD_LIBRARY_PATH=$MCR_ROOT/$MCR_VER/sys/os/glnxa64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH
XAPPLRESDIR=$MCR_ROOT/$MCR_VER/MATLAB_Component_Runtime/v83/X11/app-defaults
export XAPPLRESDIR

## launch GISTIC executable
./gp_gistic2_from_seg $@

I removed ./ from the last line since I have put all executables in my path.
mkdir gistic_out
gistic2 -b gistic_out -seg all_segments.txt -refgene /scratch/genomic_med/apps/gistic/refgenefiles/hg19.mat -mk markers_gistic.txt  -genegistic 1 -smallmem 1 -broad 1 -brlen 0.5 -conf 0.90 -armpeel 1 -savegene 1 -gcm extreme

25 comments:

  1. Hi Tommy,

    Are you sure about log2(2^depth.ratio) - 1 = depth.ratio -1 ?
    I would take log2(2*depth.ratio) - 1 instead, because depth.ratio is not log-transformed in sequenza output.

    ReplyDelete
    Replies
    1. Hi Olga, many thanks for your comment, I think you are right. will update the post!

      Delete
    2. Hi Olga, I just updated the post. thanks again.

      Delete
    3. Hi Tommy and Olga,
      Can you please explain why we use log2(2 * depth.ratio) instead of log2(2 * CNt) from the sequenza segment file?

      Delete
    4. Hi,

      log2(2*depth.ratio) - 1=log2(2)+log2(depth.ratio)-1=log2(depth.ratio).

      Isn't it?

      Delete
    5. Hi,
      An article of nautre published in 2019 also use log2(2 * depth.ratio)-1 instead of log2(2 * CNt)-1 to transform the results from sequenza.Llink is here: https://www.nature.com/articles/s41467-018-08081-1
      I also don't know the specific reason, but it is a solid support for this transform method.

      Delete
    6. Hi cnv,
      That's right. TCGA pipeline also use log2(depth.ratio), actually it equals to log2(CNV)-1.

      Delete
    7. Can you share the reference paper for TCGA pipeline?

      Delete
  2. Super nice, thank you!
    I am running GISTIC standalone 2.02.3.

    I got this error
    Shortened 12 segments in '/vol5/home/wangdo/mybin/GISTIC/myfiles/segmentationfile.txt' that overlap by one marker.

    I wanna know which lines are causing this error. Do you know how can we get detail error message such as GISTIC on genepattern?

    ReplyDelete
    Replies
    1. I am not sure what is this error, ask the GISTIC google group, they should help. I never used GISTIC on genepattern, sorry can not help that.

      Delete
  3. I was able to successfully run GISTIC 2.0 using Sequenza outputs following your guidelines above. When I look at the significantly recurrent CNV regions identified by GISTIC on IGV, however, I cannot seem to agree with the GISTIC results by visual inspection. According to the Sequenza vignette, the field 'CNt' denotes the estimated copy number of tumor. Could you please describe the difference between depth.ratio and CNt? When I calculate tumor copy number using depth.ratio based on your formula above (log2(2 * depth.ratio) - 1) it does not match the corresponding CNt value in each row in the Sequenza output.

    ReplyDelete
    Replies
    1. As an example, I have a row where depth.ratio is 0.096 and CNt is 2. Therefore log2(2 * 0.096) - 1 is -3.38, and not 2.

      Delete
    2. Hi Andy, see a post here https://groups.google.com/forum/#!topic/sequenza-user-group/LAeG7BPeHEo I had the same question as you have.

      >Generally you are right, but it depends on many things: for instance if your ploidy estimate is more then 2 (more then diploid), then your assumptions is wrong.
      For instance if you have a ploidy estimate of 5 or more, then your CNt for a segment with ~0.9 depth.ratio would be around 4.
      Everything depends also on the purity estimate, a step of 0.1 (1-0.9) can be a lot or not too much… there are not fix ratio values that are convertible to specific copy number values, everything is inferred by a model that depend on few parameters. Depending on the parameters the results can change quite drastically.

      Delete
    3. Thank you for your prompt response and explanation. I see that you cited the guidelines in this link (https://github.com/mbourgey/EBI_cancer_workshop_CNV) in your post. When you ran the Sequenza+GISTIC pipeline, were you able to locate/verify the recurrent CNVs on IGV? What is the best way to visually confirm this? By the way, I'm using whole exome sequencing data.

      Delete
    4. remember GISTIC is designed for a lot of samples, you need to have at least 20 samples to get some results. loading the GISTIC peaks to IGV together with sample wise segment file should give you some idea of how reliable the result is.

      Delete
  4. Hi Dear,
    Thanks for sharing such a useful blog. Really! This Blog is very informative for us which contains a lot of information about the Counselling For Men. I like this post. Please visit at "Bioinformatics genomics",We help you find information about Bioinformatics, Proteomics Genomics and Bioinformatics on Data-incites.com. We bring world class talent and deep experience to your projects as evidenced by our extensive list of achievements.

    Visit Here - <a href='https://data-incites.com/current-projects>https://data-incites.com/current-projects </a>Thanks

    ReplyDelete
  5. Hi All, Thanks for tutorial.

    Do we need to depth.ratio by ploidy? It is mentioned in following paper with similar steps they said "To avoid bias from different sample ploidy, depth.ratios were further divided by the ploidy of each sample to correct for ploidy." https://www.nature.com/articles/s41588-019-0569-6#Sec12

    ReplyDelete
  6. Hi Dear,

    I Like Your Blog Very Much. I see Daily Your Blog, is A Very Useful For me.

    You can also Find Data incites Data Incites is a Seattle based consulting group providing world class data science expertise and data science for the life sciences. Our Services are included bioinformatics, biostatistics, machine learning, clinical research, data analysis, data visualization

    Visit Now:- https://data-incites.com/

    ReplyDelete
  7. Hi,

    Thanks for sharing your blog. I have a question. If I use "log2(CNt/ploidy)" as GISTIC input, then the output of GISTIC is (CNt-ploidy)/ploidy. Suppose the sample ploidy=4, if the value of chr1 is -1, it meaning that the CN of chr1 is 2. Is this calculation way reasonable?

    ReplyDelete
    Replies
    1. Sorry. If I use "log2(CNt/ploidy)" as GISTIC input, then the output of GISTIC is 2*(CNt-ploidy)/ploidy

      Delete
  8. How to Play the Satta King Game
    The Satta King Game is a lottery game in which players bet on collections of numbers ranging from 0 to 99. This game has a lot of similarities with lottery games. It is a popular type of lottery that can be played both online and offline. The Satta King is a kind of betting lottery. The players choose a number from the hundred available and bet on it. If they win, they are awarded a sum of 90 multiple rupees. This game is popular throughout India.
    To start playing this game, you need to first log in to the satta matka King website. Click on the play now button and then create a username and password. Remember to choose a username that is not public and is not your family member's name. Once you've created a username and password, you can choose a game and click on the register button. This way, you'll be able to log in and start playing right away.
    After logging into the Satta King website, you can begin playing the matka result King game. To register, click the play now button. To create your account, enter your username and password. Then, select the game you want to play and click the register button. This will open your profile for other players to see. satta result You should always be careful when choosing your user name and password. This is important because you don't want your friends and family to know who you are and what you're doing.
    The Satta King Game has two main elements - the Satta kalyan matka result and the Satta king record chart. You can use the record chart to predict the Satta result that's about to open. There is also a fixed timing for each Satta kin game. Once you've played your Satta game, you'll be able to see the Satta khana and record chart in advance.
    After you've chosen your numbers, you'll need to enter them. In the satka matka King game, players choose one number from one hundred. If they're right, they'll receive ninety multiple rupees. If they don't win, they'll be given eighty times the original amount. As a result, this game is a form of lottery. While the Satta King game may seem to be illegal in some countries, it is legal in India.
    When you play the Satta King game online, you'll be satta matka result for winning. However, you'll be required to invest at least 30% of your money to win. This is an essential aspect of this game because you don't want to lose your money. You need to be able to quit when satka matka you're near the maximum amount of money. The more you play, the higher your chance of winning. The Satta King Game is a fun way to spend your free time, and it is a fun way to pass the time.
    The Satta King Game is a great way to keep your satta result active. You can spend hours playing the game with friends and family, and the Satta King game is a great way to do so. The online version of the game is free to play and there are many ways to play. The only real disadvantage is that you have to be patient and have a lot of patience. If you want to win, you should invest some time to learn the basics of the game.
    You can play the

    ReplyDelete
  9. เว็บ pg สล็อต คือเกมสล็อตออนไลน์ที่เป็นไข้ใหญ่ค่ายหนึ่งที่ได้รับความนิยมอย่างมากในหมู่นักพนันการเล่นพนันและก็นักเล่นเกมสล็อตโดยเฉพาะเกมประเภทสล็อตเท่านั้นที่จะเป็นเกมที่ได้รับความนิยมสูงสุดที่ PG SLOT

    ReplyDelete
  10. This comment has been removed by the author.

    ReplyDelete
  11. สล็อตเว็บตรง เป็นสล็อตออนไลน์อันดับหนึ่งที่มีคนให้ความสนใจและให้ความชื่นชมกับตัวเกมส์ PG slot กันเป็นอย่างมากและเป็นเกมส์สล็อตออนไลน์ที่มีระบบพัฒนาดีอย่างต่อเนื่องและเป็น PG slotที่ดึงดูดคนมาเล่นและมาเป็นครอบครัวเดียวกับ PG slot

    ReplyDelete
  12. why not use N.ratio but N.BAF?

    ReplyDelete