Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My GitHub page

Monday, January 26, 2015

Use Entrez Direct to access NCBI databases

I was reading lecture 3 of Applied Bioinformatics 2014 and learned that Entrez Direct can be used to access the NCBI databases (PubMed, nucleotide, protein, etc.) from the command line.

After installing Entrez Direct, I played around with it:
# search PubMed for records matching "glioblastoma enhancer"
$ esearch -db pubmed -query "glioblastoma enhancer"
<ENTREZ_DIRECT>
<Db>pubmed</Db>
<WebEnv>NCID_1_539964707_130.14.18.34_9001_1422280320_2091337226_0MetA0_S_MegaStore_F_1</WebEnv>
<QueryKey>1</QueryKey>
<Count>97</Count>
<Step>1</Step>
</ENTREZ_DIRECT>
# restricting the same phrase to the title field returns a count of 0
$ esearch -db pubmed -query "glioblastoma enhancer [TITL]"
<ENTREZ_DIRECT>
<Db>pubmed</Db>
<WebEnv>NCID_1_23683635_130.14.22.215_9001_1422280849_1465220088_0MetA0_S_MegaStore_F_1</WebEnv>
<QueryKey>1</QueryKey>
<Count>0</Count>
<Step>1</Step>
</ENTREZ_DIRECT>
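
The interesting part of this XML is usually `<Count>`. As a rough sketch, it can be pulled out with plain `sed`. The name `esearch_count` is made up for this example (it is not an Entrez Direct command), and the sketch assumes `<Count>` sits on its own line as in the output above. The sample XML is copied from the first search, so the snippet runs without a network connection.

```shell
# esearch_count is a made-up helper name, not an Entrez Direct command.
# It reads esearch XML on stdin and prints the number inside <Count>.
esearch_count() {
    sed -n 's:.*<Count>\([0-9][0-9]*\)</Count>.*:\1:p'
}

# Sample XML copied from the first search above (WebEnv line left out).
sample_xml='<ENTREZ_DIRECT>
<Db>pubmed</Db>
<QueryKey>1</QueryKey>
<Count>97</Count>
<Step>1</Step>
</ENTREZ_DIRECT>'

count=$(printf '%s\n' "$sample_xml" | esearch_count)
echo "hits: $count"
```

With a live connection the same function can sit at the end of the pipe, e.g. `esearch -db pubmed -query "glioblastoma enhancer" | esearch_count`. Entrez Direct also ships `xtract`, and `xtract -pattern ENTREZ_DIRECT -element Count` should give the same number.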
# fetch the abstracts
$ esearch -db pubmed -query "glioblastoma enhancer" | efetch -format abstract > glioblastoma.txt
# check the abstracts
$ less -S glioblastoma.txt
# how many papers?
$ grep -c PMID glioblastoma.txt
97
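
Counting every line that contains "PMID" can overcount if an abstract happens to mention the word in its text. A slightly stricter sketch anchors the pattern at the start of the line, assuming each record in the abstract format ends with a line that begins with `PMID: `. The two records below are made-up placeholders, not real PubMed entries.

```shell
# Made-up stand-in for glioblastoma.txt with two placeholder records.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
1. Placeholder Journal. 2015.

Placeholder title one.

This abstract mentions PMID in its text but is still one record.

PMID: 11111111  [PubMed - indexed for MEDLINE]


2. Placeholder Journal. 2015.

Placeholder title two.

PMID: 22222222  [PubMed - indexed for MEDLINE]
EOF

loose=$(grep -c 'PMID' "$tmp")       # any line containing PMID
strict=$(grep -c '^PMID: ' "$tmp")   # only lines that start with "PMID: "
echo "loose: $loose, strict: $strict"
rm -f "$tmp"
```

On the real file the stricter form would be `grep -c '^PMID: ' glioblastoma.txt`.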
# fetch the protein sequences of human CTCF
$ esearch -db protein -query "Homo sapiens [ORGN] AND CTCF[GENE]" | efetch -format fasta > CTCF_protein.fa
# fetch the nucleotide sequences of human CTCF
$ esearch -db nucleotide -query "Homo sapiens [ORGN] AND CTCF[GENE]" | efetch -format fasta > CTCF_nucleotide.fa
# the same records in GenBank format
$ esearch -db nucleotide -query "Homo sapiens [ORGN] AND CTCF[GENE]" | efetch -format gb > CTCF_nucleotide.gb
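
After a download like this it is worth checking what actually landed in the FASTA file. The sketch below counts the records and prints the length of each sequence. `fasta_lengths` is a hypothetical helper name, and the two records are made-up placeholders standing in for `CTCF_protein.fa`.

```shell
# fasta_lengths is a hypothetical helper: prints "ID<TAB>length" per record.
fasta_lengths() {
    awk '
        /^>/ {                                  # header line starts a new record
            if (id != "") print id "\t" len
            id = substr($1, 2); len = 0
            next
        }
        { len += length($0) }                   # sequence line: add its length
        END { if (id != "") print id "\t" len }
    ' "$1"
}

# Made-up stand-in for CTCF_protein.fa.
fa=$(mktemp)
cat > "$fa" <<'EOF'
>seqA made-up record
MEGD
AAV
>seqB made-up record
MK
EOF

n_records=$(grep -c '^>' "$fa")
lengths=$(fasta_lengths "$fa")
echo "records: $n_records"
echo "$lengths"
rm -f "$fa"
```

On the real download this would be `grep -c '^>' CTCF_protein.fa` and `fasta_lengths CTCF_protein.fa`.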
# From a Biostars post: https://www.biostars.org/p/92671/
# Given a Gene ID, download the amino acid sequences of the corresponding proteins, keeping only the reviewed entries (i.e. no putative or predicted sequences):
$ esearch -db gene -query "1234[id]" | elink -target protein | efilter -query "REVIEWED[FILTER]" | efetch -format fasta
# Given a file containing a list of Gene IDs (one per line), download all the entries in tabular format:
$ esearch -db gene -query "$(paste -s -d ',' mygenes.ids)" | efetch -format tabular > mygenes.details.txt
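
The `paste -s -d ','` step in the last command joins the lines of the ID file into one comma-separated string. Here is a small sketch of just that step, so the query can be inspected before it is sent. `join_ids` is a hypothetical helper name. Gene ID 1234 is the one used above, and the other IDs are placeholders.

```shell
# join_ids is a hypothetical helper: one ID per line in, comma-separated out.
# Blank lines are dropped so they do not produce empty query terms.
join_ids() {
    grep -v '^[[:space:]]*$' "$1" | paste -s -d ',' -
}

ids=$(mktemp)
printf '%s\n' 1234 '' 5678 91011 > "$ids"   # 1234 is from the post; the rest are placeholders
query=$(join_ids "$ids")
echo "$query"
rm -f "$ids"
```

Quoting the result, as in `esearch -db gene -query "$(join_ids mygenes.ids)"`, keeps the whole list as a single argument.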

Commonly-used fields for PubMed queries include:
  [AFFL]  Affiliation       [FILT]  Filter              [MESH]  MeSH Terms
  [ALL]   All Fields        [JOUR]  Journal             [PTYP]  Publication Type
  [AUTH]  Author            [LANG]  Language            [WORD]  Text Word
  [FAUT]  Author - First    [MAJR]  MeSH Major Topic    [TITL]  Title
  [LAUT]  Author - Last     [SUBH]  MeSH Subheading     [TIAB]  Title/Abstract
  [PDAT]  Date - Publication                            [UID]   UID

Filters that limit search results to subsets of PubMed include:
  humans [MESH]                has abstract [FILT]
  pharmacokinetics [MESH]      historical article [FILT]
  chemically induced [SUBH]    loprovflybase [FILT]
  all child [FILT]             randomized controlled trial [FILT]
  english [FILT]               clinical trial, phase ii [PTYP]
  free full text [FILT]        review [PTYP]
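
Field tags and filters can be combined with AND, OR and NOT. One possible explanation for the 0 hits from the `[TITL]` search (an assumption, not checked against the NCBI documentation) is that the tag was applied to the two-word phrase as a whole. The sketch below tags each word separately and adds a publication-type filter from the list above. `and_terms` is a made-up helper name.

```shell
# and_terms FIELD WORD... is a made-up helper:
# it prints "WORD1 [FIELD] AND WORD2 [FIELD] ...".
and_terms() {
    field=$1
    shift
    out=''
    for word in "$@"; do
        term="$word [$field]"
        if [ -z "$out" ]; then
            out=$term
        else
            out="$out AND $term"
        fi
    done
    printf '%s\n' "$out"
}

query="$(and_terms TITL glioblastoma enhancer) AND review [PTYP]"
echo "$query"
```

The string can then be passed on as `esearch -db pubmed -query "$query"`.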
Sequence databases are indexed with a different set of search fields, including:
  [ACCN]  Accession       [GENE]  Gene Name            [PROT]  Protein Name
  [ALL]   All Fields      [JOUR]  Journal              [SQID]  SeqID String
  [AUTH]  Author          [KYWD]  Keyword              [SLEN]  Sequence Length
  [GPRJ]  BioProject      [MLWT]  Molecular Weight     [SUBS]  Substance Name
  [ECNO]  EC/RN Number    [ORGN]  Organism             [WORD]  Text Word
  [FKEY]  Feature Key     [PACC]  Primary Accession    [TITL]  Title
  [FILT]  Filter          [PROP]  Properties           [UID]   UID
and a sample query in the protein database is:
  "alcohol dehydrogenase [PROT] NOT (bacteria [ORGN] OR fungi [ORGN])"

Please refer to the documentation for more examples: http://www.ncbi.nlm.nih.gov/books/NBK179288/
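
The sample protein query above has a NOT clause with a parenthesized OR group. When the list of organisms to exclude changes often, the group can be assembled in the shell. This is only a sketch, and `not_organisms` is a hypothetical helper name.

```shell
# not_organisms is a hypothetical helper: builds "NOT (a [ORGN] OR b [ORGN])".
not_organisms() {
    group=''
    for org in "$@"; do
        if [ -z "$group" ]; then
            group="$org [ORGN]"
        else
            group="$group OR $org [ORGN]"
        fi
    done
    printf 'NOT (%s)\n' "$group"
}

query="alcohol dehydrogenase [PROT] $(not_organisms bacteria fungi)"
echo "$query"
```

This reproduces the sample query, which can be run with `esearch -db protein -query "$query" | efetch -format fasta`.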

Friday, January 23, 2015

Install Inkscape on Mac

It has been a while since I wrote my last post. I am now in China and the Internet connection is very bad... I will start my postdoc in Dr. Roel Verhaak's lab at MD Anderson Cancer Center. Dr. Verhaak's lab studies genomic alterations in brain tumors by analyzing whole-exome sequencing, RNA-seq, whole-genome sequencing, methylation, and copy-number data. Yes, I am going to do a postdoc in computational biology. My computational skills will certainly be strengthened in Dr. Verhaak's lab. Moreover, I will not give up my bench skills: I will use experiments to validate the computational predictions.

I am writing a proposal and want to make a figure with Inkscape. It is very good drawing software for vector-based figures; for bitmap-based images, GIMP is the right tool. On a Linux machine (I have an Ubuntu machine), Inkscape can be installed with:
$ sudo apt-get install inkscape


I only have a Mac at hand now, so I have to install Inkscape on it. Installation on a Mac is different from that on Linux: Mac OS needs XQuartz to be installed first. I followed the instructions here: http://stackoverflow.com/questions/21049815/mavericks-trying-to-install-xquartz-x11-for-inkscape-image-not-found and here: https://www.youtube.com/watch?v=7kvSc_PYokM
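
Since the route differs by operating system, a small sketch can pick the right hint from `uname -s`. `inkscape_hint` is a made-up helper name, and the `apt-get` line only applies to Debian-style systems such as Ubuntu.

```shell
# inkscape_hint is a made-up helper: maps an OS name (as printed by
# `uname -s`) to the install route described in this post.
inkscape_hint() {
    case "$1" in
        Linux)  echo 'sudo apt-get install inkscape' ;;   # Debian/Ubuntu only
        Darwin) echo 'install XQuartz first, then Inkscape' ;;
        *)      echo "no hint for $1" ;;
    esac
}

# Only print a hint when inkscape is not already on PATH.
if command -v inkscape >/dev/null 2>&1; then
    echo 'inkscape is already on PATH'
else
    inkscape_hint "$(uname -s)"
fi
```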

And now it is ready to use! I am excited to start my new job at MD Anderson, and I expect to have an exciting 2015. I will update this blog more frequently as I learn new things in my new lab.