Brian J. Knaus

Short read toolbox

Here I hope to provide the sockets, extenders and universal joints to help you put together your genomics project.

Operating systems

There are some obvious differences between operating systems, and some not so obvious. For example, text files in Windows and Linux have different carriage returns. This makes a seemingly simple task - moving text files from one computer to another - into a frustratingly difficult task to troubleshoot. I usually work in a Linux environment. It never hurts to use dos2unix or mac2unix if you're moving files among computers. If they're not installed on your machine, google them to find a download.

Cassava v1.8

We run v1.8+ of the Illumina pipeline (as of July 2011). This upgrade marked a number of important changes. I currently make no attempt at backward compatibility. But I do provide a link to my older scripts (see below).

Changes in v1.8:

- Fastq files are now gzipped.

- The format of the header line for each record in the fastq.gz file has changed.

- Records are no longer filtered for chastity and purity. Instead a character in the header line identifies whether it has been filtered out ('Y') or not ('N').

- Fastq.gz files contain a maximum of 4 million records (an arbitrary and parameterized decision). This results in multiple files per lane.

Scripts from v1.3 are still available here. shortread_v13

File formats

Before one can work on data, they need to know what file formats they have and thus, what sort of data they have. Here are some examples.

Fastq file format

Fastq files contain sequence data and quality data. The quality data is usually not in a human readable format, but maintains a character to character association with the sequence data. Quality data can be readily translated to a human readable probability of correct nucleotide assignment, but this data becomes more than one character in length.

Cock et al. 2009 FASTQ format definition.
http://en.wikipedia.org/wiki/FASTQ_format Fastq definition at Wikipedia.
http://maq.sourceforge.net/fastq.shtml Fastq definition at MAQ.

Example fastq file (pre Cassava v1.8): s_4_1_sequence80.txt.

Fastq.gz file format

The community has done a commendable job at standardizing the fastq format! Unfortunately, Illumina has seen fit to change the definition in Cassava v1.8. In general I feel these are actually really good changes. The downside is that these changes require us to change yet once again.

Example fastq.gz file (Cassava v1.8; gunzipped): lane6_NoIndex_L006_R1_002.fastq.gz.

Fasta file format

Fasta files contain sequence data and represent a subset of fastq files (i.e., fasta files can be made from fastq files, but not the reverse).
http://en.wikipedia.org/wiki/FASTA_format Fasta definition at Wikipedia.
http://www.ncbi.nlm.nih.gov/blast/fasta.shtml Fasta definition at NCBI.

Example fasta output file: ACGT_seq40.fa.

File format conversion

Utilities to get your data into the format you need.

fastq2fasta: fastq2fasta.pl

Assumes each fastq record consists of four lines: a header, the sequence, a header and the sequence qualities.

Quality filtering

The Illumina pipeline currently provides all reads regardless of whether they have passed quality filtering (chastity and purity). Whether a read has passed filtering is instead embedded in the identifier line (begins with '@') of the fastq file. This means that if you want to only use the reads which have passed quality control you must filter the reads. These scripts filter the reads into a file where only reads which pass quality are present.

Current version of casava_filter for single-end data: casava_filter_se.pl.
Current version of casava_filter for paired-end data: casava_filter_pe.pl.

Bcsort

Bcsort sorts Illumina short reads by a prefixed barcode (each read begins with a defined barcode). Input is fastq format, where the only allowed nucleotides are A, C, G, T and N. This software removes the barcode and outputs to fastq and fasta format. Execution without options returns information on usage.

This software is agnostic to memory (i.e., better have GBs of RAM), but runs relatively fast (i.e., minutes per millions of reads). However, Casava v1.8 limits file sizes to 4 million records, which controls memory requirements. It is estimated that each 4 million record file will require 2 GB of RAM. Plan accordingly. The script will return an error if it runs out of memory... user beware.

Current version of bcsort for single-end data: bcsort_se.pl
Current version of bcsort for paired-end data: bcsort_pe.pl

Example barcodes file (tab delimited): barcodes.txt Note that the user is responsible for adding the quality control nucleotide. In the past this has meant a 'T' immediately 3' of the barcode (as in the example).

Got SGE?

The Sun Grid Engine (SGE) works between the user and the hardware (i.e., the computer). Instead of sending jobs to a node, after hopefully making sure those resources are available, in SGE a user submits jobs to a queue. It is then the job of the queue to find available resources to run these jobs. This script recursively finds files associated with one lane of Illumina data. It then generates outfile names to account for whether jobs have already been completed . Lastly, it automates submision of uncomplete jobs as an array job. Load can be managed by submitting the jobs with a hold status and/or setting the maximum number of concurrent jobs. Execute this script iteratively until it reports to the log that all jobs are done.

Current version of srlane.pl: srlane.pl

Nuccomp

Calculates the per cycle nucleotide composition.

Current version of nuccomp.pl: nuccomp.pl
Example output: s_1_sequence_miss_nuccomp.png

Ssr_sorter

Sorts Illumina paired-end sequences that contain expected microsatellites (ssrs) in the internal portion of each pair. Requires a tab-delimited text file containing the expected motifs.

Original citation:
Jennings, T.N., B.J. Knaus, T.D. Mullins, S.M. Haig and R.C. Cronn. 2011. Multiplexed microsatellite recovery using massively parallel sequencing. Molecular Ecology Resources 11:1060-1067. Link to article at Wiley

Current citation:
The above citation has been deprecated by a more current project:
Miller, Mark P., Knaus, Brian J., Mullins, Thomas D., Haig, Susan M. 2013. SSR_pipeline: A Bioinformatic Infrastructure for Identifying Microsatellites From Paired-End Illumina High-Throughput DNA Sequencing Data. Journal of Heredity. Link

Suggested citation

Knaus, B.J. 2014. Short read toolbox. http://brianknaus.com.