Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Showing posts with label liftover. Show all posts
Showing posts with label liftover. Show all posts

Friday, April 4, 2014

liftover wig file

I was looking at a public ChIP-seq data set, but the author did not put the raw sequence data in the GEO database. Instead, they uploaded a wig file of the ChIP-seq signal file and it was mapped to human hg18 reference genome.

All the other data sets I am analyzing are mapped to hg19, so I have to liftover this wig file to hg19 also.

Solution: CrossMap! http://crossmap.sourceforge.net/

Convert Wiggle/BigWig format files

Wiggle (WIG) format is useful for displaying continuous data such as GC content and reads intensity of high-throughput sequencing data. BigWig is a self-indexed binary-format Wiggle file, and has the advantage of supporting random access. This means only regions that need to be displayed are retrieved by genome browser, and it dramatically reduces the time needed for data transferring (Kent et al., 2010). Input wiggle data can be in variableStep (for data with irregular intervals) or fixedStep (for data with regular intervals). Regardless of the input, the output will always in bedGraph format. bedGraph format is similar to wiggle format and can be converted into BigWig format using UCSC wigToBigWig tool. We export files in bedGraph because it is usually much smaller than file in wiggle format, and more importantly, CrossMap internally transforms wiggle into bedGraph to increase running speed.
If an input file is in BigWig format, the output is BigWig format if UCSC’s ‘wigToBigWig‘ executable can be found; otherwise, the output file will be in bedGraph format.
Typing command without any arguments will print help message:
$ python2.7 CrossMap.py  wig
Screen output:
Usage:
  CrossMap.py wig input_chain_file input_wig_file output_prefix

Description:
  "input_chain_file" can be regular or compressed (*.gz, *.Z, *.z, *.bz, *.bz2,
  *.bzip2) file, local file or URL (http://, https://, ftp://) pointing to remote
  file.  Both "variableStep" and "fixedStep" wiggle lines are supported. Wiggle
  format: http://genome.ucsc.edu/goldenPath/help/wiggle.html

Example:
  CrossMapy.py wig hg18ToHg19.over.chain.gz test.hg18.wig test.hg19

It is very easy to use, after install it following the instruction I did:
CrossMap.py wig hg18ToHg19.over.chain.gz my.wig my_hg19

it generates a bigwig file, a sorted bedgraph and a unsorted bedgraph file.

it took me 36mins to convert a 1.4Gb wig file on my desktop with 4Gb ram.






Saturday, April 20, 2013

Liftover between different Genome assemblies (hg18 to hg19)

see a post here:

http://manuelcorpas.com/2011/02/02/838/


Given the huge response I have at work about remapping features into another assembly, I present here an adapted version for how to remap a feature from NCBI36/hg18 to GRCh37/hg19 using UCSC’s liftOver tool.

Important:

Please make sure you know in advance the assembly to which your aberration data is currently mapped to. If by mistake you remap an aberration already in GRCh37 to GRCh37 you will get new coordinates for the region mapped to the wrong coordinates.
UCSC’s Genome Browser provides a web facility to convert coordinates from one assembly into another. To convert coordinates using their liftOver tool do the following:
  1. Make sure that your data is in BED format, e.g.  “chr3     100000  999990  myPatientId0000123” –> aberration in NCBI36/hg18
  2. Note that each field is separated by a tab and each line by a character return. Please follow this strictly or the remapping tool may throw an error.
  3. Add as many lines as aberrations you would like to remap.
  4. Go to the liftOver page
  5. Select “Original Assembly” Mar. 2006 (NCBI36/hg18) and “New Assembly” Feb. 2009 (GRCh37/hg19)
  6. Leave all other parameters (Minimum ratio of bases that must remap, etc) with default values
  7. Paste your aberration in the input box where it says “Paste in data” and hit submit
  8. To get results, scroll down the page and click on the “View Conversions” link.
  9. Here is the result I get:
chr3  125000      1024990     myPatientId0000123
Please note that your feature may not remap because the region is partially or entirely deleted in the new assembly or split in GRCh37. In this case I recommend that you use another start or end point position, maybe use the start/end of alternative probes until you find a region where it maps. Another possibility would be to look at the genes for the region in the old assembly and select a region in GRCh37 that includes the same genes as in NCBI36. Each of these solutions require careful deliberation and may not be applicable to your particular case (e.g. genes in different chromosomes would not allow remapping based on genes).
I hope this is helpful.

Tips for Remapping from NCBI36 to GRCh37 Genome Assembly

July 28, 2009 § 5 Comments
It might seem for some people straight forward but I had to spend quite some time trying to understand how to remap my array probes from ncbi36 to CGRCh37. If you use the Ensembl genome browser, you might have noticed that from July 2009 the ncbi37 assembly is now in use. For DECIPHER (the database I help develop), this is a little bit of a headache, because it means that all of the probes from array CGH that we used have to be remapped to the new assembly. If this does not interest you I recommend that you stop reading here.
First I learned that there is a program called liftOver by UCSC that is able to do this remapping. Since the amount of probes I have to map (around 6 million) is a number that I would not wish to through to anyone’s server, I decided to do this in-house. You can download this program from here. I did not know which was the right binary for me to download, as they had linux32 and linux64 versions. I decided to go for the former, since I am using debian and it sounds like a conservative option.
Once I downloaded the program, I needed to make it executable:
chmod u+x liftOver
OK, so I was in a position to run it:
./liftOver
In the usage information it appears that I need several arguments and files to be able to run this program correctly:
liftOver oldFile map.chain newFile unMapped
Now I learned that I need also to get a file called the map.chain. I was not sure what it meant. I learned that this map.chain file has parameters that are used by liftOver and that there are map.chain files depending on the remapping one wants to do. In my case, I want to remap from ncbi36 to GRCh37 in human. However, when I look at the different remappings, I do not see ncbi formats anywhere. I learned here that what I am looking for is map chain file that is called this:
hg18toHg19.over.chain
Apparently hg18 refers to ncbi36 and hg19 to ncbi37. Doing a google search I could find that file here.
Now I get quite a few options and learn that I need to have my probes in bed format to run liftOver. Apparently there are quite a few formats I can use according to UCSC FAQs formats. Here an example of what my bed file looks like (chromosome-tab-start_position-tab-end_position):
chrY       12308579        12468100
chrY       12468101        12581699
chrY       12581700        12759636
chrY       12759637        12838587
Now I am in a position to run liftOver. I notice now that in the usage one has the following description:
liftOver oldFile map.chain newFile unMapped
‘newFile’ and ‘unMapped’ are the names of the files where the output goes into and therefore are empty. This can be confusing as the user might think that these are some other kind of files one has to get hold of.
OK, so now I am ready to transform our old array probe mapping ncbi36 to the new ncbi37 one:
./liftOver probes.ncbi36 hg18toHg19.over.chain probes.grch37 unmapped-to-grch37
I got the following output to console:
Reading liftover chains
Mapping coordinates
ERROR: start coordinate is after end coordinate (chromStart > chromEnd) on line 5171240 of bed file probes.decipher.ncbi36
ERROR: 4 2515512 2515453
…which is a bit worrying.
I’ve gone through my probes and found that some of them (just 44757!) had start point coordinates greater than their ends. I guess that if you encounter those you’ll have to decide what to do. For the time being I just took them out and re run liftOver again.
This time it worked.
--------------------------------------------------------------------------------------------------------------------------------------------------------
You can find the liftover files here:
web based:
Galaxy https://main.g2.bx.psu.edu/ also has a function tool for that purpose.