
Standardized Fastq format aka "fastq2"

Description Of Format

A fastq file is an ASCII-encoded text file that stores DNA or RNA sequences together with their IDs and quality scores. It uses Unix newlines and consists of 4 lines per sequence, unless a sequence is long enough to require wrapping. The first line begins with an "@" followed by an identifier (ID), which acts as a label for the read/sequence. The second line holds the DNA or RNA sequence and should consist only of standard bases and IUPAC ambiguity codes (ACTGNURYSWKMBDHV). This line must be wrapped with newlines if the read is longer than 80nt. The third line must be a single "+", which signifies the end of the sequence. The fourth line is a quality score string giving the quality of each base in the preceding sequence, with each score represented as the ASCII character whose code is the Phred quality score + 33. Phred scores must be between 0 and 60 (ASCII chars 33 aka "!" to 93 aka "]"). The quality string must also be wrapped onto multiple lines if it is longer than 80 characters, and its total length must exactly equal that of its corresponding sequence.
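As an illustration of the layout described above, here is a minimal Python sketch of a reader that copes with wrapped records. It is not the validator script mentioned later on this page, and the name read_fastq is made up for this example. It assumes the "+" line carries no repeated ID. Quality lines are collected by length rather than by looking for the next "@", because "@" (ASCII 64, Phred 31) is itself a legal quality character.

```python
import io


def read_fastq(handle):
    """Yield (id, sequence, quality) tuples from a possibly wrapped fastq stream."""
    lines = (line.rstrip("\n") for line in handle)
    for header in lines:
        if not header:
            continue  # tolerate blank lines, e.g. at the end of the file
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got {header!r}")

        # Sequence lines run until the single "+" separator line.
        seq_parts = []
        for line in lines:
            if line == "+":
                break
            seq_parts.append(line)
        else:
            raise ValueError(f"record {header!r} has no '+' line")
        sequence = "".join(seq_parts)

        # Quality lines run until as many characters as the sequence are read.
        qual_parts = []
        qual_len = 0
        while qual_len < len(sequence):
            line = next(lines, None)
            if line is None:
                raise ValueError(f"record {header!r} has a truncated quality string")
            qual_parts.append(line)
            qual_len += len(line)
        quality = "".join(qual_parts)
        if len(quality) != len(sequence):
            raise ValueError(f"record {header!r}: quality and sequence lengths differ")

        yield header[1:], sequence, quality


if __name__ == "__main__":
    demo = io.StringIO("@1\nACGT\n+\n!##]\n")
    for read_id, seq, qual in read_fastq(demo):
        print(read_id, seq, qual)
```

The sketch only reassembles records; it does not check the character set, the 80 character wrap width, or the Phred range.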

File Format Standardization Requirements

  1. A second copy of the identifier (after the +) must not be included (it's redundant and increases file size for no reason)
  2. Quality score lines should never start with an extra leading "!", and should always be equal in length to the given read
  3. Quality scores should always be encoded as ASCII char = Phred Score + 33 (this seems to be the most common and universal standard)
  4. DNA or RNA sequences should contain only the following ASCII characters: ACTGNURYSWKMBDHV.-
  5. Quality score or sequence lines should be wrapped with a newline after 80 characters
  6. IDs must be 79 characters or less (80 including the leading "@")
  7. All sequences should have quality scores (use fasta otherwise)
  8. Filenames should always have the extension ".fastq" when on a file system that supports long extensions, and ".fq" otherwise
  9. Paired end reads must be represented as a set of two files, with the same number of sequences, and the same IDs in the same order. The first reads from each file correspond to the first pair, the second to the second and so on.
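
The requirements above can be sketched as a per-record check in Python. This is only an illustration and not the validateFastq script. The names check_record, wrapped_ok and pairs_consistent are hypothetical, and the sketch assumes that every wrapped line except the last is exactly 80 characters long, which is one reading of rule 5. Rule 8 (the file extension) is not checked.

```python
import re

MAX_LINE = 80                                          # rules 5 and 6
ALLOWED_BASES = re.compile(r"[ACTGNURYSWKMBDHV.\-]*")  # rule 4
MIN_Q, MAX_Q = 33, 93                                  # "!" to "]", Phred 0 to 60


def wrapped_ok(chunks):
    """Rule 5 (assumed reading): all lines but the last are 80 chars, the last 1 to 80."""
    if not chunks:
        return True
    full = all(len(c) == MAX_LINE for c in chunks[:-1])
    return full and 0 < len(chunks[-1]) <= MAX_LINE


def check_record(lines):
    """Return a list of problems for one record given as lines without newlines."""
    problems = []
    header = lines[0]
    if not header.startswith("@") or len(header) > MAX_LINE:
        problems.append("rule 6: header must start with '@' and be at most 80 chars")
    # Sequence lines cannot start with "+", so the first such line is the separator.
    sep = next((i for i, l in enumerate(lines[1:], start=1) if l.startswith("+")), None)
    if sep is None:
        return problems + ["missing '+' line"]
    if lines[sep] != "+":
        problems.append("rule 1: '+' line must not repeat the ID")
    seq_lines, qual_lines = lines[1:sep], lines[sep + 1:]
    seq, qual = "".join(seq_lines), "".join(qual_lines)
    if not ALLOWED_BASES.fullmatch(seq):
        problems.append("rule 4: sequence contains a disallowed character")
    if not (wrapped_ok(seq_lines) and wrapped_ok(qual_lines)):
        problems.append("rule 5: lines must be wrapped at 80 characters")
    if not qual:
        problems.append("rule 7: record has no quality scores")
    elif len(qual) != len(seq):
        problems.append("rule 2: quality and sequence lengths differ")
    if any(not MIN_Q <= ord(c) <= MAX_Q for c in qual):
        problems.append("rule 3: quality character outside '!' to ']'")
    return problems


def pairs_consistent(ids_1, ids_2):
    """Rule 9: both files of a pair list the same IDs in the same order."""
    return list(ids_1) == list(ids_2)


if __name__ == "__main__":
    print(check_record(["@1", "ACGT", "+", "!##]"]))   # a valid record
    print(check_record(["@1", "ACGT", "+1", "!##"]))   # breaks rules 1 and 2
```

Returning a list of messages instead of raising on the first problem lets a caller report every violation in a record at once.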

Example

@1
ACGAGGTACTAAAGGCAAGCGTAAAGGCGCTCGGCT
+
!###########]#######################
@2
GCGAGGTACTAAAGGCAAGCGTAAAGGCGCTCGGCT
+
####################################
@3
GGCAAAGCTGATGGAATTGGGTCTAATTTTTGTAGG
+
####################################
@4
GTTTAAGAGCCTCGATACGCTCAAAGTCAAAATAAT
+
####################################

This example contains four sequences labeled "1", "2", "3", and "4" respectively. The predominant quality character here is "#", which has an ASCII value of 35 and therefore a Phred score of 35-33=2, which is very low.
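
The arithmetic above can be reproduced with a few lines of Python. This is a small sketch in which the function names decode_quality and encode_quality are chosen only for illustration. It decodes the quality string of read "1" from the example, which also contains "!" (Phred 0) and "]" (Phred 60).

```python
OFFSET = 33  # ASCII char = Phred score + 33


def decode_quality(quality):
    """Convert a quality string into a list of Phred scores."""
    return [ord(ch) - OFFSET for ch in quality]


def encode_quality(scores):
    """Convert a list of Phred scores back into a quality string."""
    return "".join(chr(q + OFFSET) for q in scores)


# Quality string of read "1" in the example above.
scores = decode_quality("!###########]#######################")
print(sorted(set(scores)))  # [0, 2, 60]
```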

Validator Script

The following Perl application (tested on Linux) will validate your fastq file against the standard defined herein: validateFastq

Usage: ./validateFastq <filename.fastq>

Performance: It takes about 10 minutes to validate 1 million 40nt sequences on a 3.40GHz Intel(R) Xeon(TM) CPU.



