Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Showing posts with label mysql. Show all posts
Showing posts with label mysql. Show all posts

Thursday, November 21, 2013

mysql to get all the disease associated SNPs

I want to find SNPs occur at the transcription binding sites which may potentially affect the binding of the transcription factor.

I need  to have a list of SNPs that are associated with diseases first.
people usually go to dbSNP http://www.ncbi.nlm.nih.gov/SNP/
but the UCSC snp138 table hg19 is much better accessible in terms of parsing.

See posts here: http://www.biostars.org/p/1288/
http://www.biostars.org/p/7073/
http://www.biostars.org/p/11701/

find SNPs associated with OMIM gene  (by Pierre )
" Inspired by Khader's comment. The following mysql query for the mysql anonymous server at UCSC answers the SNPs in the OMIM genes:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A  -D hg18 -e '
select
   concat(left(title1,30),"..."),
   omimId,
   S.name,
   S.func,
   G.chrom,
   S.chromStart,
   S.chromEnd
from
   omimGene as G,
   omimGeneMap as M,
   snp130 as S
where
  G.name=M.omimId and
  G.chrom=S.chrom and
  S.chromStart>=G.chromStart and
  S.chromEnd <= G.chromEnd
limit 10;'
Result:
+-----------------------------------+--------+------------+--------------------+-------+------------+----------+
| concat(left(title1,30),"...")     | omimId | name       | func               | chrom | chromStart | chromEnd |
+-----------------------------------+--------+------------+--------------------+-------+------------+----------+
| Nucleolar complex-associated p... | 610770 | rs72904505 | untranslated-3     | chr1  |     869480 |   869481 |
| Nucleolar complex-associated p... | 610770 | rs6605067  | untranslated-3     | chr1  |     869538 |   869539 |
| Nucleolar complex-associated p... | 610770 | rs2839     | untranslated-3     | chr1  |     869549 |   869550 |
| Nucleolar complex-associated p... | 610770 | rs3196153  | untranslated-3     | chr1  |     869586 |   869587 |
| Nucleolar complex-associated p... | 610770 | rs1133980  | untranslated-3     | chr1  |     869614 |   869615 |
| Nucleolar complex-associated p... | 610770 | rs28453979 | untranslated-3     | chr1  |     869781 |   869782 |
| Nucleolar complex-associated p... | 610770 | rs61551591 | intron,near-gene-3 | chr1  |     870079 |   870080 |
| Nucleolar complex-associated p... | 610770 | rs3748592  | intron,near-gene-3 | chr1  |     870100 |   870101 |
| Nucleolar complex-associated p... | 610770 | rs3748593  | intron,near-gene-3 | chr1  |     870252 |   870253 |
| Nucleolar complex-associated p... | 610770 | rs74047418 | missense           | chr1  |     870364 |   870365 |
+-----------------------------------+--------+------------+--------------------+-------+------------+----------+
however the table is not available any more in the UCSC databases.
instead you should do:
also  by Pierre 
1) Register an access to the FTP site of omim: http://omim.org/downloads and download mim2gene:
$ curl -s  "ftp://anonymous:xxxxxxx@xxxxx.edu/OMIM/mim2gene.txt" | head
# Mim Number    Type    Gene IDs    Approved Gene Symbols
100050  phenotype   -   -
100070  phenotype   100329167   -
100100  phenotype   -   -
100200  phenotype   -   -
100300  phenotype   100188340   -
100500  moved/removed   -   -
100600  phenotype   -   -
100640  gene    216 ALDH1A1
100650  gene/phenotype  217 ALDH2
get a list of the gene symbols:
~$ curl -s  "ftp://anonymous:xxxxx@xxxxxx.edu/OMIM/mim2gene.txt" |\
   egrep -v "#" | cut -d '  ' -f 4 | egrep -v '^\-$' |\
   sort | uniq > list1.txt
2) get your list of SNP associiated to the gene symbol. Something like:
mysql -N --user=genome --host=genome-mysql.cse.ucsc.edu -A  -D hg19 -e 'select  distinct
  G.geneSymbol,
  S.name
from snp132 as S,
kgXref as G,
knownGene as K where
    S.chrom=K.chrom and
    S.chromStart>=K.txStart and
    S.chromEnd<=K.txEnd and
    K.name=G.kgId 
    /* AND something to restrict the result to YOUR list of SNPs or gene */
' | sort -t '    ' -k1,1 > list2.txt
3) use unix join to join the two lists:
join -1 1 -2 1 list1.txt list2.txt
you should get a list with two columns: the OMIM gene and your SNP.


I want to get a list of SNPs that are clinical associated. description of the snp138 table
http://genome.ucsc.edu/cgi-bin/hgTables
at terminal:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e ' SELECT * FROM snp138 AS s WHERE s.bitfields LIKE  "clin%" ' > clinic_associated_SNPs_hg19.txt

remember you are on the remote server of UCSC, you can not connect the database first, and do something like:

SELECT *
FROM snp138 AS s

WHERE s.bitfields LIKE "clin%"
INTO OUTFILE '/home/tommy/clinic_associated_SNPs_hg.txt'
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';


Because we do not have the write privilege
after I got the txt file, I  changed it to a bed file:

tommy@tommy-ThinkPad-T420:~$ cat clinic_associated_SNPs_hg19.txt | cut -f2-7 | head
chrom chromStart chromEnd name score strand
chr1 985954 985955 rs199476396 0 +
chr1 1199488 1199489 rs207460006 0 +
chr1 1245103 1245104 rs144003672 0 +
chr1 1265153 1265154 rs307355 0 +
chr1 1265459 1265460 rs35744813 0 -
chr1 1469330 1469331 rs145324009 0 +
chr1 1635334 1635335 rs201004006 0 +
chr1 1689555 1689556 rs207460007 0 +
chr1 1959074 1959075 rs121434580 0 +

tommy@tommy-ThinkPad-T420:~$ wc -l clinic_associated_SNPs_hg19.txt 
103174 clinic_associated_SNPs_hg19.txt

There are total 103174 SNPs in the file.
Now, I can just use bedtools to intersect the SNPs bed file with the ChIP-seq bed file generated by MACS.

tommy@tommy-ThinkPad-T420:~$ bedtools intersect -a TF.bed -b SNPs_hg19.bed -wo 

Just keep in mind the difference between the 0 based and 1 based coordinates system. See a post here:

Alternatively, you can create a database to do this kind of intersection by using Join command.

other tools

Finally a python package for Database https://dataset.readthedocs.org/en/latest/quickstart.html

Tuesday, November 19, 2013

mysql rocks

I just finished the first chapter of MySQL by Paul DuBois http://www.amazon.com/MySQL-5th-Edition-Developers-Library-ebook/dp/B00C2SFK2Q
I was really excited to explore the power of mysql. Just by following the command examples of  the two sample databases in the book, I learned a lot.

I am studying biology, so, I want to make practical use of mysql. One of the databases on top of my mind is the UCSC genome browser database http://genome.ucsc.edu/goldenPath/help/mysql.html

connect to the database:
tommy@tommy-ThinkPad-T420:~$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| ailMel1            |
| allMis1            |
| anoCar1            |
| anoCar2            |
| anoGam1            |
| apiMel1            |
............................
............................
mysql> use hg19
Database changed

find the hg19 refGene annotation table

mysql> show tables like "%ref%";
+------------------------+
| Tables_in_hg19 (%ref%) |
+------------------------+
| kgXref                 |
| kgXrefOld5             |
| kgXrefOld6             |
| refFlat                |
| refGene                |
| refLink                |
| refSeqAli              |
| refSeqStatus           |
| refSeqSummary          |
+------------------------+
9 rows in set (0.12 sec)

mysql> describe refGene;  
+--------------+------------------------------------+------+-----+---------+-------+
| Field        | Type                               | Null | Key | Default | Extra |
+--------------+------------------------------------+------+-----+---------+-------+
| bin          | smallint(5) unsigned               | NO   |     | NULL    |       |
| name         | varchar(255)                       | NO   | MUL | NULL    |       |
| chrom        | varchar(255)                       | NO   | MUL | NULL    |       |
| strand       | char(1)                            | NO   |     | NULL    |       |
| txStart      | int(10) unsigned                   | NO   |     | NULL    |       |
| txEnd        | int(10) unsigned                   | NO   |     | NULL    |       |
| cdsStart     | int(10) unsigned                   | NO   |     | NULL    |       |
| cdsEnd       | int(10) unsigned                   | NO   |     | NULL    |       |
| exonCount    | int(10) unsigned                   | NO   |     | NULL    |       |
| exonStarts   | longblob                           | NO   |     | NULL    |       |
| exonEnds     | longblob                           | NO   |     | NULL    |       |
| score        | int(11)                            | YES  |     | NULL    |       |
| name2        | varchar(255)                       | NO   | MUL | NULL    |       |
| cdsStartStat | enum('none','unk','incmpl','cmpl') | NO   |     | NULL    |       |
| cdsEndStat   | enum('none','unk','incmpl','cmpl') | NO   |     | NULL    |       |
| exonFrames   | longblob                           | NO   |     | NULL    |       |
+--------------+------------------------------------+------+-----+---------+-------+
16 rows in set (0.54 sec)

which gene has the most exons?

mysql> select chrom, name, name2, exonCount from refGene order by exonCount DESC limit 10;
+-------+--------------+-------+-----------+
| chrom | name         | name2 | exonCount |
+-------+--------------+-------+-----------+
| chr2  | NM_001267550 | TTN   |       363 |
| chr2  | NM_001256850 | TTN   |       313 |
| chr2  | NM_133378    | TTN   |       312 |
| chr2  | NM_133437    | TTN   |       192 |
| chr2  | NM_133432    | TTN   |       192 |
| chr2  | NM_003319    | TTN   |       191 |
| chr2  | NM_001271208 | NEB   |       183 |
| chr2  | NM_001164507 | NEB   |       182 |
| chr2  | NM_001164508 | NEB   |       182 |
| chr12 | NM_173600    | MUC19 |       173 |
+-------+--------------+-------+-----------+

I want the genes rather than the transcripts, so I group them by gene name (name2)


mysql> select chrom, name, name2, max(exonCount) AS maximum from refGene group by name2 order by maximum DESC limit 10;
+-------+--------------+--------------+---------+
| chrom | name         | name2        | maximum |
+-------+--------------+--------------+---------+
| chr2  | NM_001256850 | TTN          |     363 |
| chr2  | NM_001271208 | NEB          |     183 |
| chr12 | NM_173600    | MUC19        |     173 |
| chr6  | NM_033071    | SYNE1        |     146 |
| chr1  | NM_001278267 | LOC100288142 |     131 |
| chr3  | NM_000094    | COL7A1       |     118 |
| chr14 | NM_182914    | SYNE2        |     116 |
| chr1  | NM_001098623 | OBSCN        |     116 |
| chr7  | NM_198455    | SSPO         |     110 |
| chr1  | NM_031935    | HMCN1        |     107 |
+-------+--------------+--------------+---------+
10 rows in set (2.53 sec)
sanity check on UCSC genome browser:


Just curious, which gene has the longest transcript?

mysql> select chrom, name, name2, txEnd-txStart AS span  from refGene  order by span  DESC limit 10;
+-------+--------------+--------------+---------+
| chrom | name         | name2        | span    |
+-------+--------------+--------------+---------+
| chr1  | NM_001278267 | LOC100288142 | 2320934 |
| chr7  | NM_014141    | CNTNAP2      | 2304636 |
| chr9  | NM_002839    | PTPRD        | 2298478 |
| chrX  | NM_000109    | DMD          | 2220382 |
| chr11 | NM_001142699 | DLG2         | 2172259 |
| chrX  | NM_004006    | DMD          | 2092329 |
| chr8  | NM_033225    | CSMD1        | 2059454 |
| chr20 | NM_080676    | MACROD2      | 2057696 |
| chrX  | NM_004009    | DMD          | 2009201 |
| chrX  | NM_004010    | DMD          | 2009200 |
+-------+--------------+--------------+---------+
10 rows in set (0.40 sec)



These two genes span more than 2Mb!

of course, one can download the refGene table to local computer and use awk  etc to do this kind of work, but I just see how powerful and convenient mysql is especially when you have multiple tables and want to do some summarization across them.

Saturday, March 30, 2013

download data from UCSC database to local drive

I wanted to download UCSC refGene table to my local computer.  http://genome.ucsc.edu/goldenPath/help/mysql.html
However, it looks like it is not possible to direct pull data from UCSC to a local host Database.
on linux command line, mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A
SHOW Databases;
USE hg19;
SHOW tables like "%ref%";


+------------------------+

| Tables_in_hg19 (%ref%) |
+------------------------+
| kgXref                 |
| kgXrefOld5             |
| refFlat                |
| refGene                |
| refLink                |
| refSeqAli              |
| refSeqStatus           |
| refSeqSummary          |
+------------------------+
8 rows in set (1.31 sec)

select * from refGene into outfile "/export.txt";
ERROR 1045 (28000): Access denied for user 'genome'@'%' (using password: NO)

I got this error, and I googled it found that
"The SELECT ... INTO OUTFILE 'file_name' form of SELECT writes the selected 

rows to a file. The file is created on the server host, so you must have 
the FILE privilege to use this syntax.

If you want to create the resulting file on some client host other than 
the server host, you cannot use SELECT ... INTO OUTFILE. In that case, you 
should instead use a command such as

mysql -e "SELECT ..." >  file_name 

to generate the file on the client host." https://lists.soe.ucsc.edu/pipermail/genome/2009-August/019892.html


So, instead on linux command line, I typed :
mysql --user=genome --host=genome-mysql.cse.ucsc.edu hg19 -A -sre 'SELECT * from refGene' > ~/Desktop/refGene.hg19



and now it works!