Welcome to the NGS master class!

We'll use this Etherpad to promote discussion throughout the day. We'll post links and code, and also answer any questions you may have. Please sign in with your name at the top right.


Class website: http://swcarpentry.github.io/2014-04-14-pycon-ngs/

digital normalization paper:
http://arxiv.org/abs/1203.4802


See answers to your questions below:

The tutorial link is
http://swcarpentry.github.io/2014-04-14-pycon-ngs/links.html

Illumina sequencing video: https://www.youtube.com/watch?v=HMyCqWhwB8E

Noisy splicing: http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1001236

If you have questions in the future, there are forums available:
SeqAnswers: http://seqanswers.com/
BioStars: https://www.biostars.org/

Homology: https://en.wikipedia.org/wiki/Homology_%28biology%29
Homologs are genes that share a common evolutionary ancestor.
Orthologs are homologous genes in different species that diverged when the species did (speciation).
Paralogs are homologous genes within the same species that arose by gene duplication.

Blogs:
http://www.homolog.us/
http://gettinggeneticsdone.blogspot.ca/
http://ivory.idyll.org/blog/
https://bcbio.wordpress.com/
http://simplystatistics.org/

Other resources:
https://en.wikibooks.org/wiki/Next_Generation_Sequencing_%28NGS%29
http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html
http://bedtools.readthedocs.org/en/latest/
http://samtools.sourceforge.net/
http://www.digitalbiologist.com/2013/06/python-next-gen-sequencing.html
Compute Canada: https://computecanada.ca/en/

User problems:

I encountered the following problem

The program 'curl' is currently not installed.  You can install it by typing:
sudo apt-get install curl

I encountered a problem trying to figure out where within SSH to load the key --> It's under "Auth", not "X11". I am updating the original doc accordingly.

Also, at least 2 of us got the super creepy Matrix screen after running curl so I think it's normal?

I had the super creepy Matrix screen too, but I restarted PuTTY and used a capital "O" instead of a zero, and then it worked.
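The "Matrix screen" is almost certainly curl writing the binary tarball straight to your terminal: `-0` (the digit zero) just tells curl to use HTTP 1.0, so the download still streams to the screen, while `-O` (the capital letter) saves it to a file instead. A sketch of the difference (the URL is a placeholder):

    curl -0 http://example.com/data.tar   # -0 (zero) = use HTTP 1.0; binary floods the screen
    curl -O http://example.com/data.tar   # -O (letter) = save as data.tar in the current directory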


Titus: This may be a solution to the problem with the last command:


# create a data directory on the instance's storage volume
mkdir -p /mnt/data
# symlink it as /data (-f replaces an existing link, -s makes it symbolic)
ln -fs /mnt/data /data
cd /data
# -O saves the file under its remote name instead of dumping it to the terminal
curl -O http://public.ged.msu.edu.s3.amazonaws.com/mrnaseq-subset.tar
# x = extract, v = verbose, f = read from the named file
tar xvf mrnaseq-subset.tar

echo "You are done"

Dropbox alternative:
Port forwarding is necessary because the AWS machine is locked down (run this ssh command from your local machine):
    $ ssh -i ... -L 8000:127.0.0.1:8000 ubuntu@...

(The ssh command above is for Mac/Linux; PuTTY users can set up the same forwarding as described at http://www.cs.uu.nl/technical/services/ssh/putty/puttyfw.html )

Run the FastQC analysis in /mnt/fastqc instead of /Dropbox/fastqc.

Then on remote machine:
    $ cd /mnt
    $ python -m SimpleHTTPServer 8000

You can then open http://localhost:8000 on local machine to view files.
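(On machines with Python 3 instead of Python 2, the equivalent command is `python3 -m http.server 8000`.)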

Questions and Answers:

-Why are we downloading all these programs?

There are many bioinformatics tools available to solve common problems like the analyses we are running today. In order to be DRY (https://en.wikipedia.org/wiki/Don%27t_repeat_yourself), we want to use the mature software created by others so that we can focus on the custom aspects of our data analysis. We downloaded the following software:

khmer - http://khmer.readthedocs.org/en/v1.0/
          - pre-processes reads for de novo sequence assembly (e.g. digital normalization) and other support tasks
FastQC - http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
            - quality control visualizations
Trimmomatic - http://www.usadellab.org/cms/?page=trimmomatic
                    - removes adapter sequences
FASTX - http://hannonlab.cshl.edu/fastx_toolkit/
           - trims reads to remove low-quality bases

So then you can think about a pipeline like this:

1. Visually inspect data to find low quality issues, e.g. low quality base calls and contamination with adapter sequences.
2. Remove adapter sequences with Trimmomatic.
3. Remove low quality bases with FASTX.
4. Digitally normalize the reads with khmer, then hand them off to an assembler (see the sketch below).
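Here is a sketch of those four steps as shell commands, assuming single-end reads in a file called reads.fastq. The jar path, adapter file, quality cutoffs, and khmer parameters are illustrative placeholders, not recommendations:

    # 1. Quality-control report (writes a reads_fastqc report, HTML + zip)
    fastqc reads.fastq

    # 2. Clip adapter contamination with Trimmomatic (SE = single-end mode)
    java -jar trimmomatic-0.32.jar SE reads.fastq reads.trim.fastq \
        ILLUMINACLIP:adapters.fa:2:30:10

    # 3. Trim low-quality bases with FASTX (-t quality cutoff, -l minimum
    #    read length to keep, -Q33 for Illumina 1.8+ quality encoding)
    fastq_quality_trimmer -Q33 -t 20 -l 30 -i reads.trim.fastq -o reads.qc.fastq

    # 4. Digitally normalize with khmer before running an assembler
    normalize-by-median.py -k 20 -C 20 -N 4 -x 1e9 reads.qc.fastq

normalize-by-median.py writes reads.qc.fastq.keep, which is what you would then pass to your assembler.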

-What are the commands we are using to download and install the software?

You can learn about most commands by running `man cmd` or `cmd --help`, where cmd is the name of the command.

curl - downloads files
tar - extracts files from a tarball
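For example, a typical download-and-unpack sequence looks like this (the URL is a placeholder):

    curl -O http://example.com/tool.tar.gz   # -O saves the file under its remote name
    tar xzf tool.tar.gz                      # x = extract, z = gunzip, f = from this file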

-What about Galaxy? 

http://galaxyproject.org/

Galaxy is a great bioinformatics project that aims to enable reproducible analyses. The main downside is that it is harder to automate, customize, and perform heavy computing (e.g. parallelization). If you have a smaller project, e.g. 3 cases and 3 controls of RNA-seq data, and you want to perform a traditional differential expression analysis, Galaxy will work well. If you have many samples and are performing more customized analyses, you'll need to learn how to run these tools from the command line.

-What is the best way to slice bams from 1000 genomes to create a subsample?

I would recommend using samtools. For example, if you only want the reads that map to chromosome 1 from a file called data.bam, you could run the following command: `samtools view -b data.bam chr1 > data_chr1.bam`. (The BAM must be indexed first with `samtools index data.bam` for region queries to work.)

http://samtools.sourceforge.net/
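A sketch of the whole procedure (file names are placeholders; note that the 1000 Genomes BAMs label chromosomes "1", "2", ..., not "chr1"):

    samtools index data.bam                       # region queries require a .bai index
    samtools view -b data.bam 1 > data_chr1.bam   # -b writes BAM rather than SAM
    samtools index data_chr1.bam                  # index the subset too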

-What machine would you recommend?

If you are going to analyze large data sets on a regular basis, you will need to find a compute cluster to use. If you are a student, your university will likely have one. The actual machine is less important than having the support staff that can help you obtain a login and direct you to documentation on how to submit jobs.


- What about the AWS bid mechanism?

AWS (Amazon Web Services) consists of a large number of different computing resources. Most of the services provide either some form of storage or of computation. EC2 is their de facto computational service, and it comes in several flavors, notably the On-Demand and Spot options. With EC2 On-Demand, you pay the rate set by Amazon but are guaranteed that the computer will run when you want and for as long as you want. The EC2 Spot option is more flexible: you bid on the price, but there is no guarantee as to when the computer will run. This means that, for very long and intensive tasks, it becomes possible to pay far less, at the cost of (theoretically temporary) interruptions, which occur when you are outbid or when On-Demand instances are launched, as they take precedence over Spot instances.
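For the curious, a hypothetical spot request with the AWS command-line tools might look like this (the AMI ID and bid price are placeholders; assumes the aws CLI is installed and configured):

    aws ec2 request-spot-instances \
        --spot-price "0.05" \
        --instance-count 1 \
        --type "one-time" \
        --launch-specification '{"ImageId": "ami-xxxxxxxx", "InstanceType": "m3.large"}'

The instance launches whenever the going spot price is at or below your bid and can be reclaimed whenever it rises above it.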

John: My recommendation would be to inquire about computing resources at your university/company. They may already have resources available that would save you from having to set up your computing environment on your own.

What is digital normalization? http://ivory.idyll.org/blog/what-is-diginorm.html

Etienne Question: I'm unable to install khmer on my own server. Any idea for the error: 
/usr/lib64/python2.6/site-packages/khmer-1.0_dirty-py2.6-linux-x86_64.egg/khmer/_khmermodule.so: undefined symbol: gzopen?

Etienne, it is probably an issue with zlib. I have trouble all the time with zlib/gzopen on the cluster I use as well. Do you have a sysadmin that you could ask for help?
(I'm the sysadmin :) )
Haha. Well then I can't help you. If it makes you feel better, our sysadmin hasn't been able to solve all our zlib problems either. I can't use the XML package for R because of unresolved zlib issues.
Ok, perfect. Thanks




Is there a good review paper that summarizes the theory and gives a good overview of the current tools out there?
Maybe one of us can write it? ;) I've been looking, and I've only found blog posts on individual sub-topics but no over-arching grand review.

This is difficult because every tool has its strengths and weaknesses, and they often trade off speed against accuracy. Here are some relevant papers:

Comparison of differential expression software: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3608160/?tool=pmcentrez&report=abstract
Comparison of short read mappers: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2712089/
Differential expression overview: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3046478/
Heng Li (developer of BWA) summary: http://lh3lh3.users.sourceforge.net/NGSalign.shtml
RNA sequencing protocol: http://www.ncbi.nlm.nih.gov/pubmed/21863485
ChIP-seq guidelines: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3431496/

-How do you keep track of your analyses (when you don't have a tutorial as a guide)?

This question has many answers depending on how much of your research is spent doing bioinformatics. If you just do a small analysis every once in a while, you should take notes just like you would if you were doing an experiment. Provide enough details such that you (or someone else) could recreate the results following your directions. In computing, it is traditional to have a README file that explains how to run the code.

(As a side note, I found this doc to be a really useful resource for this question: http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000424 ) +1

So let's say you want to document the analysis you ran today. You could copy and paste all the code into a file called, e.g., analysis.sh. You would also want to add a lot of comments that explain each step (bash ignores anything after the pound sign, #). Then in the README, you could write something like, "To recreate this analysis, run `bash analysis.sh`."
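For instance, analysis.sh might begin like this (a sketch; the commands stand in for whatever you actually ran today, and the URL is a placeholder):

    #!/bin/bash
    # analysis.sh - recreates the quality-control analysis from the workshop
    # bash ignores everything after a '#', so comment liberally

    # Step 1: download the raw reads
    curl -O http://example.com/reads.fastq

    # Step 2: generate the quality report; inspect it before trimming
    fastqc reads.fastq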

Personally, I (John) also have an electronic blog. Thus after I look at the FastQC results, for example, I upload some of the figures into a blog post and add my interpretation. Then I can return to it months later and remember what I thought about the data quality. Feel free to talk to me about this.

When you start to get really advanced, you can organize your pipelines into Makefiles that will automatically rerun only the parts of the analysis whose inputs have changed (see the sketch after the links below).

Make: https://en.wikipedia.org/wiki/Make_%28software%29
snakemake: https://bitbucket.org/johanneskoester/snakemake/wiki/Home
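A minimal Makefile sketch along these lines, reusing the placeholder file names from the pipeline above (recipe lines must begin with a tab):

    # Each rule reruns only when a file its target depends on has changed.
    reads_fastqc.zip: reads.fastq
    	fastqc reads.fastq

    reads.trim.fastq: reads.fastq adapters.fa
    	java -jar trimmomatic-0.32.jar SE reads.fastq reads.trim.fastq \
    		ILLUMINACLIP:adapters.fa:2:30:10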

Has anyone ever heard of KNIME (http://www.knime.org/)? It seems to be a good graphical platform for chaining various tools together into data-analysis pipelines.



CLC Workbench?  http://www.clcbio.com/products/clc-main-workbench/
Seal Sea Workbench?

How is curl different from wget? For the simple downloads we have done in this tutorial, the end result is the same; there are probably use cases for preferring one over the other, but here wget would work just as well. For the specific details: http://daniel.haxx.se/docs/curl-vs-wget.html
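For instance, these two commands are interchangeable for our purposes (the URL is a placeholder):

    curl -O http://example.com/data.tar   # curl needs -O to keep the remote file name
    wget http://example.com/data.tar      # wget saves under the remote name by default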

Etienne: Why not write a Vagrantfile (http://www.vagrantup.com/) to set everything up at once (one time)?

If you know Vagrant, that is a great idea! :-) Most biologists are not familiar with it. Also, installing one tool at a time is more realistic, as you will likely install software as you need it. Only after you become experienced will you know the full list of software you will use.
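For anyone who wants to try it, a minimal Vagrantfile sketch might look like the following; the base box and package list are assumptions for illustration, not what we used today:

    # Vagrantfile (Ruby DSL): boots an Ubuntu VM and provisions it with a shell script
    Vagrant.configure("2") do |config|
      config.vm.box = "hashicorp/precise64"   # Ubuntu 12.04 base box (illustrative)
      config.vm.provision "shell", inline: <<-SHELL
        apt-get update
        apt-get install -y curl openjdk-7-jre python-pip   # Java for FastQC/Trimmomatic
      SHELL
    end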