Brian J. Knaus


Short read toolbox

Here I hope to provide the sockets, extenders and universal joints to help you put together your genomics project.


Operating systems

There are some obvious differences between operating systems, and some not so obvious. For example, text files in Windows, Linux and older Mac systems mark the end of a line differently (carriage return plus line feed, line feed alone, and carriage return alone, respectively). This turns a seemingly simple task, moving text files from one computer to another, into a frustrating troubleshooting exercise. I usually work in a Linux environment. It never hurts to run dos2unix or mac2unix when you move files among computers. If they're not installed on your machine, search online for a download.
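If dos2unix and mac2unix are not at hand, the same conversion can be sketched in a few lines of Python. This is only an illustration of the idea, not how those utilities are implemented; the function names normalize_newlines and convert_file are made up for this example.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert Windows (CRLF) and old Mac (CR) line endings to Unix (LF)."""
    # Replace CRLF first so that its CR is not converted a second time.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(src: str, dst: str) -> None:
    """Write a copy of src to dst with Unix line endings (hypothetical helper)."""
    # Work on raw bytes so Python does not translate newlines for us.
    with open(src, "rb") as handle:
        data = handle.read()
    with open(dst, "wb") as handle:
        handle.write(normalize_newlines(data))
```

Reading the whole file into memory is fine for small files; for multi-gigabyte sequence files a line-by-line approach (or dos2unix itself) is likely the better choice.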

File formats

Before you can work on data, you need to know what file formats you have and thus what sort of data you have. Here are some examples.

Qseq file format

Qseq files are the output of the Bustard process in the Illumina pipeline (now Casava). Each file contains the sequences, their corresponding qualities, and the lane, tile and X/Y position of each cluster.

Example qseq file: http://brianknaus.com/software/s_1_1_qseq_40.txt.
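To make the layout concrete, here is a minimal Python sketch that parses one qseq line and writes it as a fastq record. It assumes the commonly described layout of 11 tab-delimited columns (machine, run, lane, tile, x, y, index, read number, sequence, quality, filter flag) with '.' marking an uncalled base; check this against your own files. The names QseqRecord, parse_qseq_line and qseq_to_fastq are hypothetical, and this is not the code in qseq2fastq.pl.

```python
from typing import NamedTuple


class QseqRecord(NamedTuple):
    machine: str
    run: str
    lane: str
    tile: str
    x: str
    y: str
    index: str
    read: str
    sequence: str
    quality: str
    passed_filter: bool


def parse_qseq_line(line: str) -> QseqRecord:
    """Split one qseq line into named fields (assumes 11 tab-delimited columns)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 columns, found {len(fields)}")
    *head, sequence, quality, flag = fields
    # Assumption: '.' marks an uncalled base; fastq conventionally uses 'N'.
    return QseqRecord(*head, sequence.replace(".", "N"), quality, flag == "1")


def qseq_to_fastq(rec: QseqRecord) -> str:
    """Format a record as four fastq lines; the read name layout is an assumption."""
    name = (f"{rec.machine}_{rec.run}:{rec.lane}:{rec.tile}:"
            f"{rec.x}:{rec.y}#{rec.index}/{rec.read}")
    return f"@{name}\n{rec.sequence}\n+\n{rec.quality}\n"
```

The filter flag is kept on the record so that reads failing the pipeline's filter can be dropped or kept as the project requires.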

Scarf (Solexa compact ASCII read format) file format

Scarf files can also be output from the pipeline and are very similar to qseq format.

Example scarf file: http://brianknaus.com/software/s_1_40_sequence.txt.

Fastq file format

Fastq files contain sequence data and quality data. The quality data is usually not human readable: each quality is encoded as a single character, so the quality string lines up character for character with the sequence. Quality characters can be readily translated to a human-readable numeric score, or to the probability that the nucleotide was called incorrectly, but each value then takes more than one character.
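The translation from quality characters to numbers can be sketched as follows. This assumes Phred-scaled qualities, where a score Q corresponds to an error probability of 10^(-Q/10). The ASCII offset depends on where the file came from: 64 is assumed here as the default for Illumina pipeline output of this era, while Sanger-style fastq uses 33. The function names are made up for this example.

```python
from typing import List


def decode_qualities(quality: str, offset: int = 64) -> List[int]:
    """Turn a quality string into integer scores by subtracting the ASCII offset.

    Assumption: offset 64 for Illumina pipeline output, 33 for Sanger-style fastq.
    """
    return [ord(char) - offset for char in quality]


def error_probability(score: int) -> float:
    """Phred relationship: probability that the base call is wrong."""
    return 10 ** (-score / 10)


def probability_correct(score: int) -> float:
    """Probability that the base call is correct."""
    return 1.0 - error_probability(score)
```

For example, with an offset of 64 the character 'h' decodes to a score of 40, which is an error probability of 0.0001. Older Solexa files used a different (non-Phred) scale, which this sketch does not handle.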

Cock et al. 2009 FASTQ format definition.
http://en.wikipedia.org/wiki/FASTQ_format Fastq definition at Wikipedia.
http://maq.sourceforge.net/fastq.shtml Fastq definition at MAQ.

Example fastq file: s_4_1_sequence80.txt.

Fasta file format

Fasta files contain sequence data only. They hold a subset of the information in fastq files (i.e., fasta files can be made from fastq files).
http://en.wikipedia.org/wiki/FASTA_format Fasta definition at Wikipedia.
http://www.ncbi.nlm.nih.gov/blast/fasta.shtml Fasta definition at NCBI.

Example fasta output file: ACGT_seq40.fa.


File format conversion

Utilities to get your data into the format you need.

qseq2fastq: qseq2fastq.pl.
scarf2fastq: scarf2fastq.pl.
fastq2fasta: fastq2fasta.pl (assumes sequence is on one line).
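
As an illustration of what such a conversion involves, here is a minimal Python sketch of fastq to fasta. Like fastq2fasta.pl, it assumes the sequence is on one line, so every record is exactly four lines. It is not a port of the Perl script, and the names read_fastq and fastq_to_fasta are hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from four-line fastq records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at start of record, found: {header!r}")
        seq = next(it, None)
        plus = next(it, None)
        qual = next(it, None)
        if seq is None or plus is None or qual is None:
            raise ValueError(f"truncated record: {header!r}")
        yield header[1:], seq.rstrip("\r\n"), qual.rstrip("\r\n")


def fastq_to_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Drop the qualities and emit one two-line fasta record per read."""
    for name, seq, _qual in read_fastq(lines):
        yield f">{name}\n{seq}\n"
```

Because both functions are generators, an open file handle can be passed in directly and reads are converted one at a time rather than held in memory.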

Sort fastq

Sorts the reads in a fastq file into two files based on whether they hit or miss a reference (i.e., the adapter or other sequence of interest). Currently uses blat (http://genome.ucsc.edu/index.html?org=Human&db=hg19&hgsid=149424502) to determine hits or misses.

Current version of sort_fastq: sort_fastq.pl
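
The sorting step itself can be sketched in Python as below. This sketch assumes you already have the set of read names that the aligner reported as hits (how those names are pulled from blat output is left out), and that reads are held as (name, sequence, quality) tuples. The function name split_by_hits is made up for this example and this is not the logic of sort_fastq.pl.

```python
from typing import Iterable, List, Set, Tuple

Read = Tuple[str, str, str]  # (name, sequence, quality)


def split_by_hits(records: Iterable[Read],
                  hit_names: Set[str]) -> Tuple[List[Read], List[Read]]:
    """Partition reads into (hits, misses) by whether their name is in hit_names."""
    hits: List[Read] = []
    misses: List[Read] = []
    for record in records:
        # Compare on the first whitespace-delimited token of the read name,
        # since aligners typically report only that part (an assumption).
        tokens = record[0].split()
        key = tokens[0] if tokens else ""
        if key in hit_names:
            hits.append(record)
        else:
            misses.append(record)
    return hits, misses
```

Using a set for the hit names keeps each lookup fast, which matters when a lane holds millions of reads.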

fastq_btrim

The quality scores in the Illumina fastq files are an ASCII representation of the quality of the base call. Recently (as of 08-01-2010) the character 'B' (upper case B) has been used to represent unreadable nucleotides. Once an unreadable nucleotide has been encountered, all nucleotides downstream of it are also unreadable. In our experience, around 33% of our reads are affected by this phenomenon. Yet these nucleotides still appear as called nucleotides in the sequence portion of the fastq file. This script helps mitigate the problem of unreadable nucleotides in an Illumina read. The reads can either be trimmed so that only the readable portion of the read remains (resulting in a file of heterogeneous read lengths), or the read can be filled with 'N's beginning with the first 'B' quality while the qualities are preserved (this facilitates subsequent trimming).

Current version of fastq_btrim: fastq_btrim.pl
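
The two options described above can be sketched in Python as follows. This is a minimal illustration that works on one read's sequence and quality strings, taking the first 'B' quality as the start of the unreadable region as described above. It is not the code in fastq_btrim.pl, and the function names are hypothetical.

```python
from typing import Tuple


def trim_at_b(seq: str, qual: str, flag: str = "B") -> Tuple[str, str]:
    """Keep only the part of the read upstream of the first 'B' quality."""
    cut = qual.find(flag)
    if cut == -1:
        return seq, qual  # no 'B' found, read is left untouched
    return seq[:cut], qual[:cut]


def mask_from_b(seq: str, qual: str, flag: str = "B") -> Tuple[str, str]:
    """Replace bases from the first 'B' quality onward with 'N'; keep qualities."""
    cut = qual.find(flag)
    if cut == -1:
        return seq, qual
    return seq[:cut] + "N" * (len(seq) - cut), qual
```

Trimming gives reads of differing lengths (possibly length zero if the first quality is 'B'), while masking keeps every read at its original length.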

Fastq2qbarplot

Creates barplots of average quality per cycle.

Current version of fastq2qbarplot.pl: fastq2qbarplot.pl
Example output: s_1_sequence_miss_qualities.png

Requires: Statistics::R
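
The numbers behind such a barplot can be computed with a short Python sketch like the one below. It takes the quality strings of the reads and returns the mean score at each cycle; plotting is left out. An ASCII offset of 64 is assumed (Illumina pipeline output of this era; Sanger-style files use 33), and the function name is made up for this example rather than taken from fastq2qbarplot.pl.

```python
from typing import Iterable, List


def mean_quality_per_cycle(qualities: Iterable[str], offset: int = 64) -> List[float]:
    """Average the decoded quality score at each read position (cycle)."""
    totals: List[int] = []
    counts: List[int] = []
    for qual in qualities:
        for cycle, char in enumerate(qual):
            if cycle == len(totals):
                # First read long enough to reach this cycle.
                totals.append(0)
                counts.append(0)
            totals[cycle] += ord(char) - offset
            counts[cycle] += 1
    return [total / count for total, count in zip(totals, counts)]
```

Each cycle is averaged over the reads that actually reach it, so trimmed reads of differing lengths are handled without padding.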

Nuccomp

Calculates the per cycle nucleotide composition.

Current version of nuccomp.pl: nuccomp.pl
Example output: s_1_sequence_miss_nuccomp.png

Requires: Statistics::R
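
A per cycle composition table can be sketched in Python as below. The function takes the read sequences and returns, for each cycle, the fraction of reads showing each nucleotide (including 'N'). Plotting is left out, and the function name is hypothetical rather than taken from nuccomp.pl.

```python
from collections import Counter
from typing import Dict, Iterable, List


def nucleotide_composition(sequences: Iterable[str]) -> List[Dict[str, float]]:
    """Return one dict per cycle mapping each observed base to its frequency."""
    cycles: List[Counter] = []
    for seq in sequences:
        for cycle, base in enumerate(seq.upper()):
            if cycle == len(cycles):
                cycles.append(Counter())
            cycles[cycle][base] += 1
    result: List[Dict[str, float]] = []
    for counts in cycles:
        total = sum(counts.values())
        result.append({base: n / total for base, n in counts.items()})
    return result
```

A strong skew in the first few cycles is often a sign of barcodes or adapter sequence at the start of the reads.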

Bcsort

Bcsort sorts Illumina short reads by barcode. Input is fastq format, as output by the 'Gerald' process ('N' is the only allowed ambiguous nucleotide). This software removes the barcode and outputs to fastq and fasta format. Execution without options returns information on usage.

This software makes no attempt to limit its memory use (i.e., better have GBs of RAM), but it runs relatively fast (i.e., minutes per millions of reads). It will return an error if it runs out of memory, so user beware.

Due to the rapid rate of change in the Illumina chemistry and software these scripts change frequently (currently, every time I sort barcodes I change something). Feel free to contact me for the latest information. And if you use this software, please send me an e-mail to let me know how my software performed (or mis-performed).

Current version of bcsort for single-end data: bcsort_se.pl
Current version of bcsort for paired-end data: bcsort_pe.pl

Example barcodes file (tab delimited): barcodes.txt. Note that the user is responsible for adding the quality control nucleotide. In the past this has meant a 'T' immediately 3' of the barcode (as in the example). This appears to be changing.
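
The core of barcode sorting can be sketched in Python as follows. This is a simplified illustration, not the logic of bcsort_se.pl or bcsort_pe.pl: it assumes barcodes sit at the 5' end of the read, requires an exact match (no mismatches or 'N's tolerated), and takes the barcodes as a dictionary from barcode sequence (with the quality control nucleotide already appended) to a sample name. The function name and the "unmatched" bin are made up for this example.

```python
from typing import Dict, Iterable, List, Tuple

Read = Tuple[str, str, str]  # (name, sequence, quality)


def sort_by_barcode(records: Iterable[Read],
                    barcodes: Dict[str, str]) -> Dict[str, List[Read]]:
    """Bin reads by leading barcode and strip the barcode from sequence and quality."""
    bins: Dict[str, List[Read]] = {sample: [] for sample in barcodes.values()}
    bins["unmatched"] = []
    for name, seq, qual in records:
        for barcode, sample in barcodes.items():
            if seq.startswith(barcode):
                n = len(barcode)
                # Trim the same number of characters from both strings so
                # sequence and quality stay aligned.
                bins[sample].append((name, seq[n:], qual[n:]))
                break
        else:
            bins["unmatched"].append((name, seq, qual))
    return bins
```

This version holds every read in memory, which mirrors the memory warning above; writing each read to its sample's file as it is matched would avoid that.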


Suggested citation

Knaus, B.J. 2010. Short read toolbox. http://brianknaus.com.



Copyright (c) 2010 Brian J. Knaus, all rights reserved.