Diving into Genetics and Genomics: 2019

This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Friday, September 13, 2019

My opinionated selection of books/urls for bioinformatics/data science curriculum

There was a paper on this topic: A New Online Computational Biology Curriculum.
I am going to provide a biased list below (I have read most, if not all, of the books). I say it is biased because many of the R books are by Hadley Wickham. I now use the tidyverse most of the time.

Unix

I suggest that people who want to learn bioinformatics start by learning Unix commands. Unix is powerful and omnipresent in high-performance computing settings (cloud computing, etc.). You cannot survive without knowing it.

The Linux Command Line
How Linux Works
Data Science at the Command Line. It was a fun read, and I learned many tricks from this book.
command line bootcamp, an interactive online session to learn Unix. Unfortunately, it is no longer working.

Computational biology

A Primer for Computational Biology by Shawn T. O’Neil
Practical Computing for Biologists by Steven H. D. Haddock and Casey W. Dunn. This was the first book that I used to learn Unix, regex and Python.
Bioinformatics Data Skills by Vince Buffalo. This is a must-have once you have some experience in bioinformatics.

R programming

R for data science by Garrett Grolemund and Hadley Wickham.
Advanced R by Hadley Wickham.
R packages by Hadley Wickham. If you want to transition from an R user to a developer, writing an R package will get you started.

Stats (R focused)

Data Analysis for the Life Sciences with R by Michael Love and Rafael A. Irizarry. I took the course on edX three times and learned a ton! You can buy a paper copy at https://www.crcpress.com/Data-Analysis-for-the-Life-Sciences-with-R/Irizarry-Love/p/book/9781498775670
Computational Genomics with R by Altuna Akalin.
Modern Statistics for Modern Biology by Susan Holmes and Wolfgang Huber.

Python programming

Machine learning

Visualization

Fundamentals of Data Visualization by Claus O. Wilke
The Visual Display of Quantitative Information by Edward R. Tufte

These two books do not teach you how to make figures programmatically (although the book by Claus was generated with R Markdown, and the code for all the figures can be found at https://github.com/clauswilke/dataviz). They teach you what kinds of figures are informative and pleasing to the eye. From data to viz is a website that guides you to the right graph for your data.

I am still using R/ggplot2 for visualization.

Data Visualization by Kieran Healy.
R Graphics Cookbook by Winston Chang.
ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham.

Finally, I have compiled many useful links at https://github.com/crazyhottommy/getting-started-with-genomics-tools-and-resources .

What’s your favorite book that I have missed? Comment below!

Cross-posted on my new blog at https://divingintogeneticsandgenomics.rbind.io/post/my-opinionated-selection-of-books-for-bioinformatics-data-science-curriculum/

Tuesday, September 3, 2019

How to upload files to GEO

readings

links: http://yeolab.github.io/onboarding/geo.html
http://www.hildeschjerven.net/Protocols/Submission_of_HighSeq_data_to_GEO.pdf
https://www.ncbi.nlm.nih.gov/geo/info/submissionftp.html

1. create account

Go to NCBI GEO (http://www.ncbi.nlm.nih.gov/geo/) and create a user ID and password. My username is research_guru.

I used my Google account.

2. fill in the xls sheet

Download the metadata xls sheet from https://www.ncbi.nlm.nih.gov/geo/info/seq.html

## bgzip the fastqs

cd 01seq
# compress every fastq file in parallel
find *fastq | parallel bgzip
md5sum *fastq.gz > fastq_md5.txt
# file names (column 2), copy to excel
cat fastq_md5.txt | awk '{print $2}'

# md5 checksums (column 1), copy to excel
cat fastq_md5.txt | awk '{print $1}'


cd ../07bigwig
#get the md5sum

md5sum *bw > bigwig_md5.txt

# sample name, copy to excel
cat bigwig_md5.txt | awk '{print $2}'

# md5, copy to excel
cat bigwig_md5.txt | awk '{print $1}'

cd ../08peak_macs1

md5sum *macs1_peaks.bed > peaks_md5.txt
# copy to excel
cat peaks_md5.txt | awk '{print $2}'

cd ..
mkdir research_guru_KMT2D_ChIPseq

cd research_guru_KMT2D_ChIPseq

## fill in the xls sheet and save in this folder

This is the most time-consuming/tedious step.

3. hard link peak and bigwig files to the folder

Soft links did not work for me…

ln  /rsrch2/genomic_med/krai/hunain_histone_reseq/snakemake_ChIPseq_pipeline_downsample/07bigwig/*bw .


ln /rsrch2/genomic_med/krai/hunain_histone_reseq/snakemake_ChIPseq_pipeline_downsample/08peak_macs1/*macs1_peaks.bed .
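
One caveat with hard links is that they only work when the source and the destination are on the same filesystem. The sketch below tries a hard link first and falls back to a plain copy if that fails; `link_or_copy` is a hypothetical helper name used only for this example.

```shell
#!/usr/bin/env bash
# link_or_copy DEST_DIR FILE...
# Hard-link each file into DEST_DIR; if ln fails (for example because the
# source is on a different filesystem), fall back to copying the file.
# link_or_copy is a hypothetical helper name for this sketch.
link_or_copy() {
    local dest=$1 f
    shift
    for f in "$@"; do
        ln "$f" "$dest/" 2>/dev/null || cp "$f" "$dest/"
    done
}

# Example (run inside the submission folder; the path is a placeholder):
#   link_or_copy . /path/to/07bigwig/*bw
```

The copy fallback costs extra disk space, so it is worth checking the folder size afterwards if the files are large.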

4. upload to GEO


# inside the folder: research_guru_KMT2D_ChIPseq
ftp ftp-private.ncbi.nlm.nih.gov

## type in the user name and the password
## this is not your GEO account user name.
## everyone uses the same `geo` and the same password below.

# https://www.ncbi.nlm.nih.gov/geo/info/submissionftp.html

#name: geo
#password: 33%9uyj_fCh?M16H

ftp> prompt n
Interactive mode off.

ftp> cd fasp

# make a folder in the ftp site
ftp> mkdir research_guru_ChIPseq

ftp> cd research_guru_ChIPseq

#upload all the files
ftp> mput *
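
The interactive session above could also be scripted. Here is a rough sketch that prints the same ftp commands so they can be piped into the ftp client; `geo_ftp_script` and the `GEO_FTP_PASSWORD` environment variable are hypothetical names for illustration. It assumes a classic command-line `ftp` client that supports `-n` (no auto-login) and `-i` (no per-file prompting during `mput`), and I have not tested it against the GEO server.

```shell
#!/usr/bin/env bash
# geo_ftp_script USER PASSWORD REMOTE_DIR
# Print the ftp commands for one upload session.
# geo_ftp_script is a hypothetical helper name for this sketch.
geo_ftp_script() {
    local user=$1 password=$2 remote_dir=$3
    cat <<EOF
user $user $password
binary
cd fasp
mkdir $remote_dir
cd $remote_dir
mput *
bye
EOF
}

# Example, run inside the local submission folder. GEO_FTP_PASSWORD is an
# assumed environment variable holding the shared upload password:
#   geo_ftp_script geo "$GEO_FTP_PASSWORD" research_guru_ChIPseq |
#       ftp -n -i ftp-private.ncbi.nlm.nih.gov
```

Keeping the password in an environment variable rather than in the script avoids saving it in your shell history or in a file you might share.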

5. tell NCBI you uploaded the files

After the file transfer is complete, e-mail GEO with the following information:

- GEO account username (tangming2005@gmail.com)
- Names of the directory and files deposited
- Public release date (required, up to 3 years from now; see FAQ)

Side notes

For paired-end sequencing data, the xls sheet requires the average insert size and its standard deviation.

picard CollectInsertSizeMetrics can do this job.

time java -jar /scratch/genomic_med/apps/picard/picard-tools-2.13.2/picard.jar CollectInsertSizeMetrics I=4-Mll4-RasG12D-1646-2-cd45_S40_L006.sorted.bam  H=4-Mll4-RasG12D-1646-2-cd45_S40_L006_insert.pdf  O=4-Mll4-RasG12D-1646-2-cd45_S40_L006_insert.txt

# finish in ~5mins
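
The two numbers needed for the sheet can be pulled out of the Picard text output (the `O=` file above) without opening it by hand. This is a minimal sketch, assuming the metrics table is tab-separated with a header row containing the `MEAN_INSERT_SIZE` and `STANDARD_DEVIATION` columns, followed by the values on the next line; `insert_size_stats` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# insert_size_stats METRICS_FILE
# Print "mean<TAB>standard deviation" from a CollectInsertSizeMetrics file.
# insert_size_stats is a hypothetical helper name for this sketch.
insert_size_stats() {
    awk -F'\t' '
        # the header row is the first non-comment line naming MEAN_INSERT_SIZE
        !/^#/ && index($0, "MEAN_INSERT_SIZE") {
            # remember the position of every column by its name
            for (i = 1; i <= NF; i++) col[$i] = i
            # the values sit on the line right after the header
            if ((getline) > 0)
                printf "%s\t%s\n", $(col["MEAN_INSERT_SIZE"]), $(col["STANDARD_DEVIATION"])
            exit
        }
    ' "$1"
}

# Example:
#   insert_size_stats 4-Mll4-RasG12D-1646-2-cd45_S40_L006_insert.txt
```

Looking the columns up by name rather than by position should keep the sketch working if the column order differs between Picard versions. Only the first row of values is printed.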

Read http://thegenomefactory.blogspot.com/2013/08/paired-end-read-confusion-library.html for the definition of insert size.
