Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Monday, January 26, 2015

Use Entrez Direct to access NCBI databases

I was reading lecture 3 of Applied Bioinformatics 2014 and learned that one can use Entrez Direct to access NCBI databases (PubMed, nucleotide, protein sequences, etc.) from the command line.

After installing Entrez Direct, I played around with it:
# search PubMed for records matching "glioblastoma enhancer"
$ esearch -db pubmed -query "glioblastoma enhancer"
<ENTREZ_DIRECT>
<Db>pubmed</Db>
<WebEnv>NCID_1_539964707_130.14.18.34_9001_1422280320_2091337226_0MetA0_S_MegaStore_F_1</WebEnv>
<QueryKey>1</QueryKey>
<Count>97</Count>
<Step>1</Step>
</ENTREZ_DIRECT>
# search PubMed for "glioblastoma enhancer" in the title only: the count is 0
$ esearch -db pubmed -query "glioblastoma enhancer [TITL]"
<ENTREZ_DIRECT>
<Db>pubmed</Db>
<WebEnv>NCID_1_23683635_130.14.22.215_9001_1422280849_1465220088_0MetA0_S_MegaStore_F_1</WebEnv>
<QueryKey>1</QueryKey>
<Count>0</Count>
<Step>1</Step>
</ENTREZ_DIRECT>
# fetch the abstracts
$ esearch -db pubmed -query "glioblastoma enhancer" | efetch -format abstract > glioblastoma.txt
# check the abstracts
$ less -S glioblastoma.txt
# how many papers? count the lines that contain PMID
$ grep -c PMID glioblastoma.txt
97
# fetch the protein sequences of human CTCF
$ esearch -db protein -query "Homo sapiens [ORGN] AND CTCF[GENE]" | efetch -format fasta > CTCF_protein.fa
# fetch the nucleotide sequences of human CTCF
$ esearch -db nucleotide -query "Homo sapiens [ORGN] AND CTCF[GENE]" | efetch -format fasta > CTCF_nucleotide.fa
# the same records in GenBank format
$ esearch -db nucleotide -query "Homo sapiens [ORGN] AND CTCF[GENE]" | efetch -format gb > CTCF_nucleotide.gb
# From a Biostars post: https://www.biostars.org/p/92671/
# Given a Gene ID, download the amino acid sequences of the corresponding proteins, keeping only the reviewed entries (e.g. no putative or predicted sequences):
$ esearch -db gene -query "1234[id]" | elink -target protein | efilter -query "REVIEWED[FILTER]" | efetch -format fasta
# Given a file containing a list of Gene IDs (one per line), download all the entries in tabular format.
# The command substitution is quoted so the shell passes the ID list as a single argument:
$ esearch -db gene -query "$(paste -s -d ',' mygenes.ids)" | efetch -format tabular > mygenes.details.txt
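
The gene ID list command depends on mygenes.ids being clean: a blank line or a Windows line ending ends up inside the query string. Here is a minimal sketch of a more defensive way to build that string. The helper name build_id_query and the demo IDs are my own and not part of Entrez Direct, and the query keeps the comma-separated form from the Biostars post.

```shell
#!/bin/sh
# build_id_query FILE: join the gene IDs in FILE (one per line) into a
# comma-separated string, dropping carriage returns and blank lines.
# (Hypothetical helper, not part of Entrez Direct.)
build_id_query() {
    tr -d '\r' < "$1" | grep -v '^[[:space:]]*$' | paste -s -d ',' -
}

# Demo input in a temporary file, with a blank line and a CRLF ending on purpose.
ids_file=$(mktemp)
printf '1234\n\n5678\r\n672\n' > "$ids_file"

query=$(build_id_query "$ids_file")
echo "$query"
rm -f "$ids_file"

# Usage with a real ID file, once Entrez Direct is installed:
#   esearch -db gene -query "$(build_id_query mygenes.ids)" | efetch -format tabular > mygenes.details.txt
```

Cleaning the list in one small function keeps the esearch line itself unchanged from the original.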

Commonly-used fields for PubMed queries include:
  [AFFL]  Affiliation       [FILT]  Filter              [MESH]  MeSH Terms
  [ALL]   All Fields        [JOUR]  Journal             [PTYP]  Publication Type
  [AUTH]  Author            [LANG]  Language            [WORD]  Text Word
  [FAUT]  Author - First    [MAJR]  MeSH Major Topic    [TITL]  Title
  [LAUT]  Author - Last     [SUBH]  MeSH Subheading     [TIAB]  Title/Abstract 
                            [PDAT]  Date - Publication  [UID]   UID

Filters that limit search results to subsets of PubMed include:
  humans [MESH]                has abstract [FILT]
  pharmacokinetics [MESH]      historical article [FILT]
  chemically induced [SUBH]    loprovflybase [FILT]
  all child [FILT]             randomized controlled trial [FILT]
  english [FILT]               clinical trial, phase ii [PTYP]
  free full text [FILT]        review [PTYP]
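
One possible reason the title search earlier in this post returned 0 is that the [TITL] tag was applied to "glioblastoma enhancer" as a whole rather than to each word. I have not verified this. Tagging every word separately and joining the pieces with AND is unambiguous: both words must appear in the title, in any position. Below is a small sketch that builds such a query. tag_terms is a hypothetical helper of mine, not an Entrez Direct command, and I have not re-run the counts.

```shell
#!/bin/sh
# tag_terms FIELD TERM...: attach the same field tag to every term and
# join the tagged terms with AND.
# (Hypothetical helper, not part of Entrez Direct.)
tag_terms() {
    field=$1
    shift
    out=""
    for term in "$@"; do
        if [ -n "$out" ]; then
            out="$out AND "
        fi
        out="$out$term [$field]"
    done
    printf '%s\n' "$out"
}

title_query=$(tag_terms TITL glioblastoma enhancer)
echo "$title_query"

# Usage, once Entrez Direct is installed; filters from the list above can be
# appended in the same way:
#   esearch -db pubmed -query "$title_query AND english [FILT]"
```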

Sequence databases are indexed with a different set of search fields, including:
  [ACCN]  Accession       [GENE]  Gene Name            [PROT]  Protein Name
  [ALL]   All Fields      [JOUR]  Journal              [SQID]  SeqID String
  [AUTH]  Author          [KYWD]  Keyword              [SLEN]  Sequence Length
  [GPRJ]  BioProject      [MLWT]  Molecular Weight     [SUBS]  Substance Name
  [ECNO]  EC/RN Number    [ORGN]  Organism             [WORD]  Text Word
  [FKEY]  Feature Key     [PACC]  Primary Accession    [TITL]  Title
  [FILT]  Filter          [PROP]  Properties           [UID]   UID
and a sample query in the protein database is:
  "alcohol dehydrogenase [PROT] NOT (bacteria [ORGN] OR fungi [ORGN])"
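
The sample query can be narrowed further with the other sequence fields. As a rough sketch, the snippet below adds a sequence length range using [SLEN]. The MIN:MAX range form is standard Entrez syntax, but the 300 to 400 residue bounds and the output file name are values I made up for illustration.

```shell
#!/bin/sh
# Base query copied from the sample above.
base='alcohol dehydrogenase [PROT] NOT (bacteria [ORGN] OR fungi [ORGN])'

# Assumed bounds, for illustration only: proteins of 300 to 400 residues.
min_len=300
max_len=400

# Entrez range searches are written MIN:MAX followed by the field tag.
# The base query is wrapped in parentheses so the NOT clause stays grouped.
protein_query="($base) AND $min_len:$max_len [SLEN]"
echo "$protein_query"

# Usage, once Entrez Direct is installed (adh.fa is a hypothetical file name):
#   esearch -db protein -query "$protein_query" | efetch -format fasta > adh.fa
```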

Please refer to the Entrez Direct documentation for more examples: http://www.ncbi.nlm.nih.gov/books/NBK179288/

4 comments:

  1. Hello, I am new to using Entrez Direct myself. Is there a filter/restriction I can set, to find all documents listed that have been published in the last 5 days?

    1. I am also new to it. According to the manual:

      Results can also be filtered by time. For example, the following statements:

      efilter -days 60 -datetype PDAT
      efilter -mindate 1990 -maxdate 1999 -datetype PDAT

      restrict results to articles published in the previous two months or in the 1990s, respectively.

    2. thanks, I realized that 8 hours ago or so xD I overread it and nobody on the web seems to have written about it

    3. No problem! Good luck with your research.