Here I hope to provide the sockets, extenders and universal joints to help you put together your genomics project.
There are some obvious differences between operating systems, and some not so obvious. For example, text files in Windows and Linux use different line endings: Windows ends each line with a carriage return plus a line feed, while Linux uses a line feed alone. This turns a seemingly simple task, moving text files from one computer to another, into a frustratingly difficult problem to troubleshoot. I usually work in a Linux environment. It never hurts to run dos2unix or mac2unix if you're moving files among computers. If they're not installed on your machine, search for them online to find a download.
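If neither tool is at hand, the conversion itself is small enough to sketch. The Python below is only an illustration of what a line-ending conversion does, not a replacement for dos2unix or mac2unix; the function names are my own, and it reads the whole file into memory.

```python
def to_unix_newlines(data: bytes) -> bytes:
    """Convert Windows (CRLF) and classic Mac (CR) line endings to Unix (LF)."""
    # Handle CRLF first so the lone-CR pass does not turn it into two newlines.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(src: str, dst: str) -> None:
    """Write a copy of src to dst with Unix line endings."""
    # Binary mode keeps Python from translating line endings on its own.
    with open(src, "rb") as handle:
        data = handle.read()
    with open(dst, "wb") as handle:
        handle.write(to_unix_newlines(data))
```

Working on bytes rather than text avoids guessing the file's encoding.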
Before you can work on data, you need to know what file formats you have and thus what sort of data you have. Here are some examples.
Qseq files are the output of the Bustard process in the Illumina pipeline (now Casava). These files contain the sequence, corresponding qualities, as well as lane, tile and X/Y position of clusters.
Example qseq file: http://brianknaus.com/software/s_1_1_qseq_40.txt.
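To make the qseq layout concrete, here is a rough Python sketch that turns one qseq line into a fastq record. It assumes the commonly seen layout of 11 tab-delimited fields (machine, run, lane, tile, x, y, index, read number, sequence, quality, filter) and that uncalled bases are written as '.'; check this against your own files. The function name and the read-name format are my own choices for illustration, not necessarily what qseq2fastq.pl does.

```python
def qseq_to_fastq(line: str) -> str:
    """Convert one tab-delimited qseq line into a four-line fastq record."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 fields, found {len(fields)}")
    machine, run, lane, tile, x, y, index, read, seq, qual, _filter = fields
    # Assumption: qseq marks uncalled bases with '.', written here as 'N'.
    seq = seq.replace(".", "N")
    name = f"{machine}_{run}:{lane}:{tile}:{x}:{y}#{index}/{read}"
    return f"@{name}\n{seq}\n+\n{qual}\n"
```

The filter field is read but ignored here; a real converter might drop reads that failed filtering.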
Scarf files can also be output from the pipeline; the format is very similar to qseq.
Example scarf file: http://brianknaus.com/software/s_1_40_sequence.txt.
Fastq files contain sequence data and quality data. The quality data is usually not in a human readable format, but maintains a character to character association with the sequence data. Quality data can be readily translated to a human readable probability of correct nucleotide assignment, but this data becomes more than one character in length.
Cock et al. 2009 FASTQ format definition.
http://en.wikipedia.org/wiki/FASTQ_format Fastq definition at Wikipedia.
http://maq.sourceforge.net/fastq.shtml Fastq definition at MAQ.
Example fastq file: s_4_1_sequence80.txt.
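The translation from quality characters to probabilities can be sketched in a few lines of Python. This assumes Phred-scaled qualities, where Q = ord(character) - offset and the error probability is 10^(-Q/10). The offset of 64 is my assumption for Illumina files of this era; Sanger-style files use 33, and older Solexa scores use a different formula (see Cock et al. 2009). The function name is hypothetical.

```python
def quality_to_prob_correct(qual: str, offset: int = 64) -> list:
    """Translate a fastq quality string into per-base probabilities of a correct call."""
    probs = []
    for char in qual:
        q = ord(char) - offset           # Phred score for this base
        p_error = 10 ** (-q / 10)        # probability the base call is wrong
        probs.append(1.0 - p_error)
    return probs
```

For example, with an offset of 64 the character 'h' is Q40, which is a 0.9999 probability of a correct call. This also shows why the translated data no longer fits in one character per base.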
Fasta files contain sequence data and represent a subset of fastq files (i.e., fasta files can be made from fastq files).
http://en.wikipedia.org/wiki/FASTA_format Fasta definition at Wikipedia.
http://www.ncbi.nlm.nih.gov/blast/fasta.shtml Fasta definition at NCBI.
Example fasta output file: ACGT_seq40.fa.
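Because fasta is a subset of fastq, the conversion amounts to dropping the quality lines. The Python sketch below assumes the simple four-lines-per-record fastq layout (no wrapped sequences), which is what the Illumina pipeline typically writes; the function name is my own.

```python
from typing import Iterable, Iterator


def fastq_to_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield fasta lines from fastq lines, assuming four lines per record."""
    for i, line in enumerate(lines):
        line = line.rstrip("\r\n")
        if i % 4 == 0:
            if not line.startswith("@"):
                raise ValueError(f"line {i + 1} is not a fastq header: {line!r}")
            yield ">" + line[1:]   # fastq '@name' becomes fasta '>name'
        elif i % 4 == 1:
            yield line             # the sequence line is kept as-is
        # lines 3 and 4 of each record ('+' and qualities) are dropped
```

Passing an open file handle as `lines` streams the file, so memory use stays small.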
Utilities to get your data into the format you need.
qseq2fastq: qseq2fastq.pl.
Sorts fastq files into two files based on whether they hit or miss a reference (i.e., the adapter or other sequence of interest). Currently uses blat (http://genome.ucsc.edu/index.html?org=Human&db=hg19&hgsid=149424502) to determine hits or misses.
Current version of sort_fastq: sort_fastq.pl
The quality scores in the Illumina fastq files are an ASCII representation of the quality of the base call. Recently (as of 08-01-2010) the character 'B' (upper case B) has been used to represent unreadable nucleotides. Once an unreadable nucleotide has been encountered, all nucleotides downstream of it are also unreadable. Experience indicates that around 33% of our reads are affected by this phenomenon. Yet these nucleotides appear as called nucleotides in the sequence portion of the fastq file. This script helps mitigate the problem of unreadable nucleotides in an Illumina read. The reads can either be trimmed so that only the readable portion of the read remains (resulting in a file of heterogeneous read lengths) or the read can be filled with 'N's beginning with the first 'B' quality while the qualities are preserved (this facilitates subsequent trimming).
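The two behaviours described above can be sketched in Python as follows. This is an illustration of the idea only, not the Perl script itself; the function name and the `mode` argument are hypothetical.

```python
def btrim(seq: str, qual: str, mode: str = "trim"):
    """Handle the unreadable ('B' quality) tail of one read.

    mode="trim": keep only the bases before the first 'B' quality.
    mode="mask": replace bases from the first 'B' onward with 'N', keep qualities.
    """
    first_b = qual.find("B")
    if first_b == -1:
        return seq, qual                      # no unreadable bases
    if mode == "trim":
        return seq[:first_b], qual[:first_b]
    if mode == "mask":
        masked = seq[:first_b] + "N" * (len(seq) - first_b)
        return masked, qual
    raise ValueError(f"unknown mode: {mode!r}")
```

Only the first 'B' needs to be located, since everything downstream of it is unreadable as well.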
Current version of fastq_btrim: fastq_btrim.pl
Creates barplots of average quality per cycle.
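The numbers behind such a barplot can be sketched in Python as below (the plotting itself is left out). The offset of 64 is an assumption about the quality encoding, and the function name is my own.

```python
def mean_quality_per_cycle(quals, offset: int = 64) -> list:
    """Return the mean quality score at each cycle (read position)."""
    totals = []   # summed scores per cycle
    counts = []   # number of reads that reach each cycle
    for qual in quals:
        for cycle, char in enumerate(qual):
            if cycle == len(totals):
                totals.append(0)
                counts.append(0)
            totals[cycle] += ord(char) - offset
            counts[cycle] += 1
    return [total / count for total, count in zip(totals, counts)]
```

Keeping a separate count per cycle lets this cope with reads of different lengths, such as trimmed reads.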
Current version of fastq2qbarplot.pl: fastq2qbarplot.pl
Calculates the per cycle nucleotide composition.
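As a rough illustration of what a per-cycle composition tally involves, here is a Python sketch. It is not the Perl script, and the function name is hypothetical.

```python
from collections import Counter


def composition_per_cycle(seqs) -> list:
    """Return one Counter of nucleotide counts per cycle (read position)."""
    counts = []
    for seq in seqs:
        for cycle, base in enumerate(seq.upper()):
            if cycle == len(counts):
                counts.append(Counter())
            counts[cycle][base] += 1
    return counts
```

Dividing each count by the total for its cycle gives proportions. A strong skew at a given cycle can point to adapter sequence or barcodes.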
Current version of nuccomp.pl: nuccomp.pl
Bcsort sorts Illumina short reads by barcode. Input is fastq format, as output by the 'Gerald' process ('N' is the only allowed ambiguous nucleotide). This software removes the barcode and outputs to fastq and fasta format. Execution without options returns information on usage.
This software makes no attempt to limit its memory use (i.e., better have GBs of RAM), but runs relatively fast (i.e., minutes per millions of reads). It will return an error if it runs out of memory, so user beware.
Due to the rapid rate of change in the Illumina chemistry and software these scripts change frequently (currently, every time I sort barcodes I change something). Feel free to contact me for the latest information. And if you use this software, please send me an e-mail to let me know how my software performed (or mis-performed).
Current version of bcsort for single-end data: bcsort_se.pl
Example barcodes file (tab delimited): barcodes.txt
Note that the user is responsible for adding the quality control nucleotide. In the past this has meant a 'T' immediately 3' of the barcode (as in the example). This appears to be changing.
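The core of barcode sorting can be sketched in Python as below. This is only an illustration under my own assumptions: barcodes are matched exactly at the 5' end of the read, each barcode already includes the quality control nucleotide, and the barcodes are supplied as a dictionary from barcode to sample name (the column layout of barcodes.txt is not assumed here). The function and sample names are hypothetical.

```python
def assign_barcode(seq: str, qual: str, barcodes: dict):
    """Return (sample, trimmed_seq, trimmed_qual), or None if no barcode matches."""
    for barcode, sample in barcodes.items():
        if seq.startswith(barcode):
            n = len(barcode)
            # Remove the barcode from both the sequence and its qualities.
            return sample, seq[n:], qual[n:]
    return None
```

For example:

```python
barcodes = {"ACGTT": "sampleA", "TGCAT": "sampleB"}  # barcode + 'T' QC nucleotide
assign_barcode("ACGTTGGGCCC", "hhhhhhhhhhh", barcodes)
```

If one barcode is a prefix of another, the first match in the dictionary wins, so barcodes of equal length are the safer choice for this sketch.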
Knaus, B.J. 2010. Short read toolbox. http://brianknaus.com.
Copyright (c) 2010 Brian J. Knaus, all rights reserved.