RNASeq analysis for differential expression in GenePattern
last edit, September 13 2016
[ Main_Page | NGS_data_analysis ]
BITS will install GenePattern servers in Leuven and Gent to support VIB scientists with the analysis of their RNA-Seq data. The GenePattern server in Leuven is running on the cluster of the VSC (Vlaams Supercomputer Centrum).
This wiki page therefore contains a list of how-tos describing different actions on the GenePattern server and different steps in the RNA-Seq analysis workflow.
Contents
- 1 How-tos
- 1.1 Obtain a VSC account
- 1.2 Access the GenePattern server
- 1.3 Upload a file from your computer to our GenePattern server
- 1.4 Create a subfolder in the uploads folder
- 1.5 Find a tool in GenePattern
- 1.6 Run a tool in GenePattern
- 1.7 Parameters of the tools
- 1.8 Define the input file of a tool in GenePattern
- 1.9 Processing the output of a tool
- 1.10 Identification of DE genes in R
- 2 The training
How-tos
Obtain a VSC account
Access the GenePattern server
Upload a file from your computer to our GenePattern server
How to upload files from your local hard drive to GenePattern ? |
---|
* Go to the Files tab
|
Create a subfolder in the uploads folder
How to create a subfolder in the uploads folder on the Files tab ? |
---|
* Click uploads. A window will appear where you can choose to create a subfolder, upload files or upload a folder.
|
Find a tool in GenePattern
Tools in GenePattern are called modules.
How to find a tool ? |
---|
You can find a module by typing its name into the search box on the Modules tab:
|
Run a tool in GenePattern
How to run a tool ? |
---|
|
Parameters of the tools
Parameters of Groomer
Groomer cleans messy fastq files and transforms them into standard fastq format.
To obtain a description of its parameters and their default values click the Documentation link at the top of the page.
One of the parameters you need to define is Input format: the encoding of the fastq file you want to clean. The encoding is important because it determines the offset of the quality scores (ASCII offset 33 or ASCII offset 64). If you're not sure you can check the encoding of your file in the FastQC report (take into account that FastQC sporadically makes the wrong guess).
Parameters of FastQC
FastQC generates a report describing the quality of the reads of a fastq file.
To obtain a description of its parameters and their default values click the Documentation link at the top of the page.
Parameters of Trimmomatic
Parameters of TopHat
Parameters of STAR
Parameters of Kallisto
- Strand specificity: used for counting, none counts on both strands, forward runs kallisto in strand specific mode, only fragments where the first read in the pair pseudoaligns to the forward strand of a transcript are counted. If a fragment pseudoaligns to multiple transcripts, only the transcripts that are consistent with the first read are kept, reverse same as forward but the first read maps to the reverse strand of a transcript. The default for strand specificity is none.
- Average fragment length: In the case of single-end reads, this parameter must be used to specify the average fragment length. Typical Illumina libraries produce fragment lengths ranging from 180–200 bp but it’s best to determine this from a library quantification with an instrument such as an Agilent Bioanalyzer. For paired-end reads, the average fragment length can be directly estimated from the reads and the program will do so.
Kallisto produces the following output file:
abundances.tsv is a plaintext file of the abundance estimates. It is not in the right format to be read by sleuth. It is therefore better to analyze it further with EdgeR or DESeq2. The first line contains a header for each column, including estimated counts (to be used by DESeq2), TPM (Transcripts per million: do not use this for DE analysis with DESeq2), effective length.
Parameters of RSEQC
Parameters of Picard
Parameters of HTSeq-count
- Input format: you have to specify the format of your input file: bam or sam.
- Strandedness: none counts on both strands (= total count), forward only counts reads that map to in the same strand as the gene, reverse only counts reads that map to the opposite strand as the gene. The default for strandedness is yes. If your RNA-Seq data has not been made with a strand-specific protocol, this causes half of the reads to be lost. Hence, make sure to set this to none unless you have strand-specific data!
- Min qual: minimum mapping quality (phred score) to count a read.
- Mode: specifies how stringent the counting should be with intersec_strict as the most stringent option and union as the least stringent. See slides and documentation for a detailed explanation.
Interpretation of the output:
Output is a table with counts for each gene. At the bottom you find the special counters: which count reads that were not counted for various reasons. The names of the special counters all start with a double underscore. The special counters are:
- __no_feature: reads (or read pairs) which could not be assigned to any gene.
- __ambiguous: reads (or read pairs) which could have been assigned to more than one gene and hence were not counted for any of these.
- __too_low_aQual: reads (or read pairs) which were skipped due to the min qual setting.
- __not_aligned: reads (or read pairs) in the bam file without alignment.
- __alignment_not_unique: reads (or read pairs) with more than one reported alignment.
Define the input file of a tool in GenePattern
Load a file from your Files tab
How to use a file from your Files tab as input of a tool ? |
---|
Just drag and drop the file from the Files tab in the area that is labeled Drag Files Here |
We have created a shared folder on the server where we can put files that can be accessed by everyone. This folder appears as a subfolder of the Uploads folder.
Upload the input file from your compüter
Click the Upload File button. However, since we work with large input files it's better to upload the files before you start the analysis (see section on uploading files for details) and load them from the Files tab. That's way faster than having to upload the file each time you want to use it in a tool.
Upload a file from the internet
How to use a file on the internet as input of a tool ? |
---|
Click the Add Path or URL button and provide the URL of the file |
Processing the output of a tool
View the content of a results file
You can view the results of a tool in your browser
How to view the results of a tool ? |
---|
When a tool has finished output files are generated at the bottom of the page. Click the name of the output file.
Output files that are not saved in the uploads folder are stored 7 days on the server and are visible via the Jobs tab. So if you want to store the output of a tool that ran a few days ago, go to the Jobs tab and click the name of the output file.
Select Open Link
|
Store the output of a tool on the Files tab
You can store the output of a tool in your uploads folder so that you can use as input for other tools later on.
How to store the output of a file in the uploads folder on the Files tab ? |
---|
When a tool has finished output files are generated at the bottom of the page. Click the name of the output file.
Output files that are not saved in the uploads folder are stored 7 days on the server and are visible via the Jobs tab. So if you want to store the output of a tool that ran a few days ago, go to the Jobs tab and click the name of the output file.
Select Copy to Files Tab
|
Save the results file to your computer
How to store the output of a file on your computer ? |
---|
When a tool has finished output files are generated at the bottom of the page. Click the name of the output file.
Output files that are not saved in the uploads folder are stored 7 days on the server and are visible via the Jobs tab. So if you want to store the output of a tool that ran a few days ago, go to the Jobs tab and click the name of the output file.
Select Save File
|
Identification of DE genes in R
See this page for the workflow in R.
The training
Aims of the RNA-Seq analysis training
Use publicly available Illumina RNA-Seq data to:
- Walk through a complete analysis workflow including read QC, read mapping to the human hg19 genome, and counting the reads mapping to each human transcript.
- Perform differential expression analysis using popular Bioconductor tools and obtain data ready for functional analysis and biological interpretation.
The skills acquired during this session should allow reproduction of the workflow.
Not covered during this training
This RNA-Seq training will fit the needs of scientists who wish to compare experimental conditions to identify differentially expressed genes. Other RNA-Seq applications including the analysis of gene regulation at the chromatin level, by ncRNA or miRNA are not covered here.
- This protocol is not valid to analyze time series experiments
- This protocol is not valid for dose response analysis
- This protocol is not valid to analyze survival effects associated with gene expression
- This protocol does not de-novo assemble reads but instead maps reads to a known reference (it is therefore not applicable to organisms for which no transcriptome is available)
- Although this workflow uses transcript information during the mapping, this protocol does not consider splice variants and maps all reads to gene models
- The training stops with the definition of differentially expressed genes and does not include downstream processing of the results at the functional level. For this, please refer to follow-up sessions covering the use of Ingenuity Pathway Analysis or related tools.
Prerequisites
Skills required to follow this training:
- Basic knowledge of human genome and transcriptome structure
- Basic knowledge of Illumina NGS read structure
All attendees who lack knowledge about Illumina reads should follow the preliminary one day 'Introduction to the analysis of NGS data' training ([1]).