Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Thursday, November 21, 2013

mysql to get all the disease associated SNPs

I want to find SNPs occur at the transcription binding sites which may potentially affect the binding of the transcription factor.

I need  to have a list of SNPs that are associated with diseases first.
people usually go to dbSNP http://www.ncbi.nlm.nih.gov/SNP/
but the UCSC snp138 table hg19 is much better accessible in terms of parsing.

See posts here: http://www.biostars.org/p/1288/
http://www.biostars.org/p/7073/
http://www.biostars.org/p/11701/

find SNPs associated with OMIM gene  (by Pierre )
" Inspired by Khader's comment. The following mysql query for the mysql anonymous server at UCSC answers the SNPs in the OMIM genes:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A  -D hg18 -e '
select
   concat(left(title1,30),"..."),
   omimId,
   S.name,
   S.func,
   G.chrom,
   S.chromStart,
   S.chromEnd
from
   omimGene as G,
   omimGeneMap as M,
   snp130 as S
where
  G.name=M.omimId and
  G.chrom=S.chrom and
  S.chromStart>=G.chromStart and
  S.chromEnd <= G.chromEnd
limit 10;'
Result:
+-----------------------------------+--------+------------+--------------------+-------+------------+----------+
| concat(left(title1,30),"...")     | omimId | name       | func               | chrom | chromStart | chromEnd |
+-----------------------------------+--------+------------+--------------------+-------+------------+----------+
| Nucleolar complex-associated p... | 610770 | rs72904505 | untranslated-3     | chr1  |     869480 |   869481 |
| Nucleolar complex-associated p... | 610770 | rs6605067  | untranslated-3     | chr1  |     869538 |   869539 |
| Nucleolar complex-associated p... | 610770 | rs2839     | untranslated-3     | chr1  |     869549 |   869550 |
| Nucleolar complex-associated p... | 610770 | rs3196153  | untranslated-3     | chr1  |     869586 |   869587 |
| Nucleolar complex-associated p... | 610770 | rs1133980  | untranslated-3     | chr1  |     869614 |   869615 |
| Nucleolar complex-associated p... | 610770 | rs28453979 | untranslated-3     | chr1  |     869781 |   869782 |
| Nucleolar complex-associated p... | 610770 | rs61551591 | intron,near-gene-3 | chr1  |     870079 |   870080 |
| Nucleolar complex-associated p... | 610770 | rs3748592  | intron,near-gene-3 | chr1  |     870100 |   870101 |
| Nucleolar complex-associated p... | 610770 | rs3748593  | intron,near-gene-3 | chr1  |     870252 |   870253 |
| Nucleolar complex-associated p... | 610770 | rs74047418 | missense           | chr1  |     870364 |   870365 |
+-----------------------------------+--------+------------+--------------------+-------+------------+----------+
however the table is not available any more in the UCSC databases.
instead you should do:
also  by Pierre 
1) Register an access to the FTP site of omim: http://omim.org/downloads and download mim2gene:
$ curl -s  "ftp://anonymous:xxxxxxx@xxxxx.edu/OMIM/mim2gene.txt" | head
# Mim Number    Type    Gene IDs    Approved Gene Symbols
100050  phenotype   -   -
100070  phenotype   100329167   -
100100  phenotype   -   -
100200  phenotype   -   -
100300  phenotype   100188340   -
100500  moved/removed   -   -
100600  phenotype   -   -
100640  gene    216 ALDH1A1
100650  gene/phenotype  217 ALDH2
get a list of the gene symbols:
~$ curl -s  "ftp://anonymous:xxxxx@xxxxxx.edu/OMIM/mim2gene.txt" |\
   egrep -v "#" | cut -d '  ' -f 4 | egrep -v '^\-$' |\
   sort | uniq > list1.txt
2) get your list of SNP associiated to the gene symbol. Something like:
mysql -N --user=genome --host=genome-mysql.cse.ucsc.edu -A  -D hg19 -e 'select  distinct
  G.geneSymbol,
  S.name
from snp132 as S,
kgXref as G,
knownGene as K where
    S.chrom=K.chrom and
    S.chromStart>=K.txStart and
    S.chromEnd<=K.txEnd and
    K.name=G.kgId 
    /* AND something to restrict the result to YOUR list of SNPs or gene */
' | sort -t '    ' -k1,1 > list2.txt
3) use unix join to join the two lists:
join -1 1 -2 1 list1.txt list2.txt
you should get a list with two columns: the OMIM gene and your SNP.


I want to get a list of SNPs that are clinical associated. description of the snp138 table
http://genome.ucsc.edu/cgi-bin/hgTables
at terminal:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e ' SELECT * FROM snp138 AS s WHERE s.bitfields LIKE  "clin%" ' > clinic_associated_SNPs_hg19.txt

remember you are on the remote server of UCSC, you can not connect the database first, and do something like:

SELECT *
FROM snp138 AS s

WHERE s.bitfields LIKE "clin%"
INTO OUTFILE '/home/tommy/clinic_associated_SNPs_hg.txt'
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';


Because we do not have the write privilege
after I got the txt file, I  changed it to a bed file:

tommy@tommy-ThinkPad-T420:~$ cat clinic_associated_SNPs_hg19.txt | cut -f2-7 | head
chrom chromStart chromEnd name score strand
chr1 985954 985955 rs199476396 0 +
chr1 1199488 1199489 rs207460006 0 +
chr1 1245103 1245104 rs144003672 0 +
chr1 1265153 1265154 rs307355 0 +
chr1 1265459 1265460 rs35744813 0 -
chr1 1469330 1469331 rs145324009 0 +
chr1 1635334 1635335 rs201004006 0 +
chr1 1689555 1689556 rs207460007 0 +
chr1 1959074 1959075 rs121434580 0 +

tommy@tommy-ThinkPad-T420:~$ wc -l clinic_associated_SNPs_hg19.txt 
103174 clinic_associated_SNPs_hg19.txt

There are total 103174 SNPs in the file.
Now, I can just use bedtools to intersect the SNPs bed file with the ChIP-seq bed file generated by MACS.

tommy@tommy-ThinkPad-T420:~$ bedtools intersect -a TF.bed -b SNPs_hg19.bed -wo 

Just keep in mind the difference between the 0 based and 1 based coordinates system. See a post here:

Alternatively, you can create a database to do this kind of intersection by using Join command.

other tools

Finally a python package for Database https://dataset.readthedocs.org/en/latest/quickstart.html

Tuesday, November 19, 2013

mysql rocks

I just finished the first chapter of MySQL by Paul DuBois http://www.amazon.com/MySQL-5th-Edition-Developers-Library-ebook/dp/B00C2SFK2Q
I was really excited to explore the power of mysql. Just by following the command examples of  the two sample databases in the book, I learned a lot.

I am studying biology, so, I want to make practical use of mysql. One of the databases on top of my mind is the UCSC genome browser database http://genome.ucsc.edu/goldenPath/help/mysql.html

connect to the database:
tommy@tommy-ThinkPad-T420:~$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| ailMel1            |
| allMis1            |
| anoCar1            |
| anoCar2            |
| anoGam1            |
| apiMel1            |
............................
............................
mysql> use hg19
Database changed

find the hg19 refGene annotation table

mysql> show tables like "%ref%";
+------------------------+
| Tables_in_hg19 (%ref%) |
+------------------------+
| kgXref                 |
| kgXrefOld5             |
| kgXrefOld6             |
| refFlat                |
| refGene                |
| refLink                |
| refSeqAli              |
| refSeqStatus           |
| refSeqSummary          |
+------------------------+
9 rows in set (0.12 sec)

mysql> describe refGene;  
+--------------+------------------------------------+------+-----+---------+-------+
| Field        | Type                               | Null | Key | Default | Extra |
+--------------+------------------------------------+------+-----+---------+-------+
| bin          | smallint(5) unsigned               | NO   |     | NULL    |       |
| name         | varchar(255)                       | NO   | MUL | NULL    |       |
| chrom        | varchar(255)                       | NO   | MUL | NULL    |       |
| strand       | char(1)                            | NO   |     | NULL    |       |
| txStart      | int(10) unsigned                   | NO   |     | NULL    |       |
| txEnd        | int(10) unsigned                   | NO   |     | NULL    |       |
| cdsStart     | int(10) unsigned                   | NO   |     | NULL    |       |
| cdsEnd       | int(10) unsigned                   | NO   |     | NULL    |       |
| exonCount    | int(10) unsigned                   | NO   |     | NULL    |       |
| exonStarts   | longblob                           | NO   |     | NULL    |       |
| exonEnds     | longblob                           | NO   |     | NULL    |       |
| score        | int(11)                            | YES  |     | NULL    |       |
| name2        | varchar(255)                       | NO   | MUL | NULL    |       |
| cdsStartStat | enum('none','unk','incmpl','cmpl') | NO   |     | NULL    |       |
| cdsEndStat   | enum('none','unk','incmpl','cmpl') | NO   |     | NULL    |       |
| exonFrames   | longblob                           | NO   |     | NULL    |       |
+--------------+------------------------------------+------+-----+---------+-------+
16 rows in set (0.54 sec)

which gene has the most exons?

mysql> select chrom, name, name2, exonCount from refGene order by exonCount DESC limit 10;
+-------+--------------+-------+-----------+
| chrom | name         | name2 | exonCount |
+-------+--------------+-------+-----------+
| chr2  | NM_001267550 | TTN   |       363 |
| chr2  | NM_001256850 | TTN   |       313 |
| chr2  | NM_133378    | TTN   |       312 |
| chr2  | NM_133437    | TTN   |       192 |
| chr2  | NM_133432    | TTN   |       192 |
| chr2  | NM_003319    | TTN   |       191 |
| chr2  | NM_001271208 | NEB   |       183 |
| chr2  | NM_001164507 | NEB   |       182 |
| chr2  | NM_001164508 | NEB   |       182 |
| chr12 | NM_173600    | MUC19 |       173 |
+-------+--------------+-------+-----------+

I want the genes rather than the transcripts, so I group them by gene name (name2)


mysql> select chrom, name, name2, max(exonCount) AS maximum from refGene group by name2 order by maximum DESC limit 10;
+-------+--------------+--------------+---------+
| chrom | name         | name2        | maximum |
+-------+--------------+--------------+---------+
| chr2  | NM_001256850 | TTN          |     363 |
| chr2  | NM_001271208 | NEB          |     183 |
| chr12 | NM_173600    | MUC19        |     173 |
| chr6  | NM_033071    | SYNE1        |     146 |
| chr1  | NM_001278267 | LOC100288142 |     131 |
| chr3  | NM_000094    | COL7A1       |     118 |
| chr14 | NM_182914    | SYNE2        |     116 |
| chr1  | NM_001098623 | OBSCN        |     116 |
| chr7  | NM_198455    | SSPO         |     110 |
| chr1  | NM_031935    | HMCN1        |     107 |
+-------+--------------+--------------+---------+
10 rows in set (2.53 sec)
sanity check on UCSC genome browser:


Just curious, which gene has the longest transcript?

mysql> select chrom, name, name2, txEnd-txStart AS span  from refGene  order by span  DESC limit 10;
+-------+--------------+--------------+---------+
| chrom | name         | name2        | span    |
+-------+--------------+--------------+---------+
| chr1  | NM_001278267 | LOC100288142 | 2320934 |
| chr7  | NM_014141    | CNTNAP2      | 2304636 |
| chr9  | NM_002839    | PTPRD        | 2298478 |
| chrX  | NM_000109    | DMD          | 2220382 |
| chr11 | NM_001142699 | DLG2         | 2172259 |
| chrX  | NM_004006    | DMD          | 2092329 |
| chr8  | NM_033225    | CSMD1        | 2059454 |
| chr20 | NM_080676    | MACROD2      | 2057696 |
| chrX  | NM_004009    | DMD          | 2009201 |
| chrX  | NM_004010    | DMD          | 2009200 |
+-------+--------------+--------------+---------+
10 rows in set (0.40 sec)



These two genes span more than 2Mb!

of course, one can download the refGene table to local computer and use awk  etc to do this kind of work, but I just see how powerful and convenient mysql is especially when you have multiple tables and want to do some summarization across them.

Saturday, November 16, 2013

ChIP-exo data analysis

Our neighboring lab just generated some ChIP-exo data, (if you do not know the technique, look at this paper http://www.ncbi.nlm.nih.gov/pubmed/23026909) and the company did some analysis but not what they want. The bedgraph files generated by the company were separated for plus strand and minus strand. They want just peak files like in the ChIP-seq experiment.

I was aware of several software can be used to deal with this type of data:
Tao Liu's MACS https://github.com/taoliu/MACS/
this is probably the most widely used method for ChIP-seq peak calling, and it is also suitable for ChIP-exo data
https://github.com/taoliu/MACS/issues/15

Peakzilla https://github.com/steinmann/peakzilla
A new tool for ChIP-exo

"Peakzilla identifies sites of enrichment and transcription factor binding sites from transcription factor ChIP-seq and ChIP-exo experiments at hight accuracy and resolution. It is designed to perform equally well for data from any species. All necessary parameters are estimated from the data. Peakzilla is suitable for both single and paired end data from any sequencing platform."

A quick google I found several others:
MACE http://chipexo.sourceforge.net/
and GEM http://www.psrg.csail.mit.edu/gem/

Since MACS (version 1.4) was pre-installed in the HPC at UFL, I decided to give it a try

[mtang@dev1 mm10]$ module load macs

macs -t ChIP.bam -c control.bam -f BAM -g mm -n ChIP-exo -B

it took some time to finish the process (~30mins).  with -B flag, I want to output the bedgraph files for each chromosomes. If you specify -S, it will give you a single bedgraph file for the whole genome.

Look at the model built by MACS


You can have a quick look at the data by loading the bedgraph files into IGV. I just pick VEGFa to check




The peaks look very specific and sharp.

Then, I can find all the genes that contain a peak nearby.
Homer annotatePeaks http://biowhat.ucsd.edu/homer/ngs/annotation.html
if you use R, ChIPpeakAnno http://www.ncbi.nlm.nih.gov/pubmed/20459804
cistrome can do it very easily http://cistrome.org

BETA-minus: Targets prediction with binding only Predict the factors (TFs or CRs) direct target genes by only binding data
CEAS http://liulab.dfci.harvard.edu/CEAS/  also from Liu's lab
the easiest way PAVIS http://manticore.niehs.nih.gov:8080/pavis/

Monday, November 11, 2013

So you want to be a computational biologist?

See a commentary here : http://www.nature.com/nbt/journal/v31/n11/abs/nbt.2740.html

Quotes from the article:
“Laboratory scientists wouldn’t
dream of running experiments
without the necessary positive
and negative controls... tests
are the computational biology
equivalent.”

“Knowledge of biology is
vital in the interpretation of
computational results.”


Learning resources:




Friday, November 1, 2013

DESeq work flow

I was working on a human RNA-seq data set, and I already got the count matrix by HTSeq.
see my previous post here http://crazyhottommy.blogspot.com/2013/10/rna-seq-analysis-samtools-sort-and.html

Now, I am ready to use DESeq to detect the differentially expressed genes.
I followed the nature protocol from Simon Anders and here http://dwheelerau.com/2013/04/15/how-to-use-deseq-to-analyse-rnaseq-data/
for more details read the DESeq manual.



several out put
> head(countsTable)
          a1  b1  a2 b2  a3 b3
0R7H2P    21   9   3  6   1  4
5S_rRNA   55  87  56 56  34 40
7SK      171  89  61 84  57 47
A1BG      15  34  38 10  12 23
A1BG-AS1  65  49 140 41  40 44
A1CF     293 109 417 90 181 84

> head(counts(cds,normalized=TRUE))
                a1        b1         a2        b2         a3
0R7H2P    14.44370  7.348284   2.360913  5.935807   1.583712
5S_rRNA   37.82875 71.033416  44.070367 55.400869  53.846220
7SK      117.61301 72.666368  48.005221 83.101303  90.271605
A1BG      10.31693 27.760186  29.904892  9.893012  19.004548
A1BG-AS1  44.70670 40.007326 110.175917 40.561350  63.348494
A1CF     201.52404 88.995889 328.166838 89.037110 286.651937
                 b3
0R7H2P     5.492968
5S_rRNA   54.929684
7SK       64.542378
A1BG      31.584568
A1BG-AS1  60.422652
A1CF     115.352336
dispersion plot


MAplot: there are not many genes that are differentially expressed. Usually you should see red dots around the edge of the cloud. 








pvalue histogram, there are not many genes with a low p-value, usually you see a spike near x=0




PCA plot. PCA does not look great, one replicate for each cell type varies too much with the other two replicates, and it looks the same situation from the original PCA figure in the paper below.


figure2 from the original paper http://www.jci.org/articles/view/66514/figure/2




differentially expressed genes  according to figure 2 ( I read the supplement material, those genes are with more than 2 fold changes)

alpha cell specific: IRX2, PCSK2, DPP4,IRX1,GCG, PTPRD, ARX, HNF1A
beta cell specific: MAFA, GLP1R, PCSK1, PDX1, NKX6-1, CDKN1C, KCNQ2, INS, SLC30A8, HDAC9, IAPP, KCNJ11
My DESeq results:
> resSig[resSig$id=="GCG",]
       id baseMean baseMeanA baseMeanB foldChange log2FoldChange        pval padj
15096 GCG 127322.8  245014.6  9630.952 0.03930766      -4.669046 0.008959299    1
similarly I got others:
25252 PCSK2 1605.296  2448.875  761.7172  0.3110478      -1.684792 0.01701203    1
17931 IRX2 437.8337  764.6114  111.0561  0.1452451      -2.783438  0.00364283    1       
12803 DPP4  418.095  755.6109  80.57906  0.1066409      -3.229167  0.01849249    1
6374 ARX  104.137  166.6132  41.66077  0.2500449        -1.999741  0.01581663    1
        id baseMean baseMeanA baseMeanB foldChange log2FoldChange        pval padj
20260 MAFA 802.9027  156.4095  1449.396   9.266673       3.212052  0.04860831    1
15303 GLP1R 265.7284  172.2726  359.1841   2.084975        1.06003 0.05890757    1
25250 PCSK1 6597.262  2279.658  10914.87   4.787941       2.259405 0.09637812    1
8682 CDKN1C 581.1144  198.1623  964.0666   4.865035       2.28245 0.009237992    1
18299 KCNQ2 141.9849   86.3116  197.6583   2.290055       1.195382  0.0422248    1
17827 INS 7444.435  1668.456  13220.41    7.92374       2.986181  0.002803966    1
16172 HDAC9 527.0126   260.651  793.3742   3.043818      1.605882 0.004025324    1
17099 IAPP 6943.366  2072.604  11814.13   5.700136       2.510996  0.06844496    1
I only missed IRX1,PTPRD and HNF1A in alpha cells
and missed PDX1, SLC30A8 and KCNJ11 in beta cells.
> res[res$id=="IRX1",]
        id baseMean baseMeanA baseMeanB foldChange log2FoldChange  pval padj

17929 IRX1 115.7047  134.3566  97.05282  0.7223525      -0.469225 0.4556522    1

I checked the other missed ones, they are all with big p-values.  I do note that the P-adj values are all very big, that may due to the big variation among the replicates. The original paper used the FPKM rather than raw counts to detect the differentially expressed genes. However, considering the large overlapping gene list, I am pretty satisfied with my analysis. further reading counts VS FPKM http://www.cureffi.org/2013/09/12/counts-vs-fpkms-in-rna-seq/.I may try to use cuffdiff to do the same analysis some time later.
Alternatively, you can upload your raw count matrix to DEG-Vis http://www.vicbioinformatics.com/dge-vis/index.html to perform similar analysis without the knowledge of R. DEG-Vis uses EdgeR and limma internally.