Nucleotide_Essentials.jl
Data Types
Nucleotide_Essentials.FastqRecord — Type
Components
- ID: The unique sequence identifier associated with that entry
- sequence: The nucleotide sequence of that entry
- quality: The quality scores of that entry
- filename: The original file name
Nucleotide_Essentials.FastaRecord — Type
Components
- ID: The unique sequence identifier associated with that entry
- sequence: The nucleotide sequence of that entry
- filename: The original file name
Functions
Nucleotide_Essentials.readFastq — Function
Imports a .fastq file into Julia
readFastq(Path::String)
.fastq file => readFastq(Path) => FastqRecord(ID, sequence, quality, filename)
Supported keyword arguments include:
- Path::String: The full or relative path to a .fastq file
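For readers unfamiliar with the file layout, the following is a minimal sketch of how four-line .fastq records map onto the ID, sequence, and quality components listed above. It is not the package's implementation: `SketchFastqEntry` and `parse_fastq_sketch` are hypothetical names, and the sketch assumes uncompressed input in which each record spans exactly four lines.

```julia
# Hypothetical stand-in for one parsed entry; the package's FastqRecord may be laid out differently.
struct SketchFastqEntry
    id::String
    sequence::String
    quality::String
end

# Read four-line FASTQ records (header, sequence, '+' separator, quality) from any IO stream.
function parse_fastq_sketch(io::IO)
    entries = SketchFastqEntry[]
    while !eof(io)
        header = readline(io)
        isempty(header) && continue            # tolerate blank lines between records
        startswith(header, "@") || error("expected '@' header, got: $header")
        sequence = readline(io)
        readline(io)                           # separator line beginning with '+'
        quality = readline(io)
        length(sequence) == length(quality) ||
            error("sequence/quality length mismatch in $header")
        push!(entries, SketchFastqEntry(header[2:end], sequence, quality))
    end
    return entries
end

# In-memory demo so the sketch runs without a file on disk.
demo = IOBuffer("@read1\nACGT\n+\nIIII\n@read2\nGG\n+\n!!\n")
entries = parse_fastq_sketch(demo)
```

Passing an `IOBuffer` keeps the sketch self-contained; the same function would accept a stream obtained from `open`.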
Example:
# Supply the path to a .fastq file that you would like to import
myfastq = readFastq("myfastq.fastq")
Nucleotide_Essentials.readFasta — Function
Imports a .fasta file into Julia
readFasta(Path::String)
.fasta file => readFasta(Path) => FastaRecord(ID, sequence, filename)
Supported keyword arguments include:
- Path::String: The full or relative path to a .fasta file
Example:
# Supply the path to a .fasta file that you would like to import - it is recommended to include `;` in your command to prevent printing potentially large .fasta files in the REPL
myfasta = readFasta("myfasta.fasta");
Nucleotide_Essentials.writeFasta — Function
writeFasta(input_fasta::FastaRecord, out::String, compressed::Bool)
FastaRecord => writeFasta(input_fasta, out, compressed) => .fasta file/.fasta.gz file
Writes a single- or multiple-entry FastaRecord to either a .fasta or a compressed .fasta.gz file in the desired directory
Supported keyword arguments include:
- input_fasta::FastaRecord: A FastaRecord with either a single entry or multiple entries
- out::String: The full or relative path to the directory where files should be written
- compressed::Bool: Whether or not to write the .fasta files as compressed files
  - If true, files will be written as .fasta.gz files
  - If false, files will be written as .fasta files
Example:
# .fasta files can be written from an already imported FastaRecord in Julia
myfasta = readFasta("myfasta.fasta");
writeFasta(myfasta, "example/output/directory", false)
# An already imported FastaRecord can also be written as a compressed .fasta.gz file
myfasta = readFasta("myfasta.fasta");
writeFasta(myfasta, "example/output/directory", true)
# .fasta files with multiple sequences can be read and written as individual .fasta or .fasta.gz files in the same step
writeFasta(readFasta("myfasta.fasta"), "example/output/directory", true);
Nucleotide_Essentials.FastqtoFasta — Function
Converts a FastqRecord to a FastaRecord. Can also import a .fastq file and convert it to a FastaRecord in the same call.
FastqtoFasta(Fastq::Union{String, FastqRecord})
.fastq file => FastqtoFasta(Fastq) => FastaRecord(ID, sequence, filename)
FastqRecord(ID, sequence, quality, filename) => FastqtoFasta(Fastq) => FastaRecord(ID, sequence, filename)
Supported keyword arguments include:
- Fastq::Union{String, FastqRecord}: either
  - The full or relative path to a .fastq file
  - A FastqRecord
Example:
# Supply the path to a .fastq file that you would like to convert to a FastaRecord
myfasta = FastqtoFasta("myfastq.fastq");
# Alternatively, a FastqRecord can be used as the input
myfasta = FastqtoFasta(myFastqRecord);
Nucleotide_Essentials.FilterQuality_se — Function
Filters an input .fastq file based upon the encoded Phred+33 or Phred+64 quality scores. The encoding of the reads is automatically determined by looking for characters that are unique to one of the two encodings. Phred+64 encoding is identified by searching for ^, a, ], and f.
Reads are filtered on their number of expected errors ($\mathrm{E}$), which is the sum of the per-base error probabilities $p_i$ implied by the quality scores $Q_i$:
$\mathrm{E} = \sum_{i} p_i = \sum_{i} 10^{\frac{-Q_i}{10}}$
Stringent filtering (maxEE = 1) is used by default but can be adjusted by the user.
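The expected-error calculation and the encoding check can be sketched in a few lines. This is an illustration under stated assumptions, not the package's code: the function names are hypothetical, the sketch treats a read as passing when its expected errors are less than or equal to maxEE, and it treats the presence of any one of the four listed characters as evidence of Phred+64.

```julia
# Guess the quality offset from a quality string.
# Assumption: seeing ANY of the characters ^, a, ], f implies Phred+64.
guess_offset_sketch(quality::AbstractString) =
    any(c -> c in ('^', 'a', ']', 'f'), quality) ? 64 : 33

# Expected errors: the sum over positions of 10^(-Q/10), where Q = character code - offset.
function expected_errors_sketch(quality::AbstractString; offset::Int = 33)
    total = 0.0
    for c in quality
        q = Int(c) - offset
        total += 10.0^(-q / 10)
    end
    return total
end

# Assumption: a read passes when its expected errors do not exceed maxEE.
passes_filter_sketch(quality; maxEE = 1, offset = 33) =
    expected_errors_sketch(quality; offset = offset) <= maxEE

high_quality = "IIII"   # Q = 40 at every position under Phred+33
low_quality  = "!!"     # Q = 0 at every position under Phred+33
ee_high = expected_errors_sketch(high_quality)   # 4 * 10^-4
ee_low  = expected_errors_sketch(low_quality)    # 2 * 1.0
```

With the default maxEE = 1, the first read passes and the second does not, since two bases with Q = 0 already contribute two expected errors.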
Reads that pass the filtering parameters are output to a file ending in _FilteredReads.fastq in the user-determined directory, as indicated by out.
Supported keyword arguments include:
- read1::String: Path to the reads to undergo quality filtering
- out::String: Path to the directory where reads that pass the quality filtering should be written
- maxEE::Int64 (optional): The maximum number of expected errors a read can include as the filtering parameter (default: maxEE = 1)
- verbose::Bool (optional): Whether or not to show some intermediary feedback on the progress of the function (default = false)
Example:
FilterQuality_se("forward_R1.fastq", "/outdirectory")
Nucleotide_Essentials.FilterQuality_pe — Function
Filters paired-end input .fastq files based upon the encoded Phred+33 or Phred+64 quality scores. The encoding of the reads is automatically determined by looking for characters that are unique to one of the two encodings. Phred+64 encoding is identified by searching for ^, a, ], and f.
Reads are filtered on their number of expected errors ($\mathrm{E}$), which is the sum of the per-base error probabilities $p_i$ implied by the quality scores $Q_i$:
$\mathrm{E} = \sum_{i} p_i = \sum_{i} 10^{\frac{-Q_i}{10}}$
Stringent filtering (maxEE = 1) is used by default but can be adjusted by the user.
Output Files:
- If both the forward and reverse reads pass the filtering parameters:
  - Forward reads are output to a file ending in R1_Paired_filtered.fastq in the user-determined directory, as indicated by out
  - Reverse reads are output to a file ending in R2_Paired_filtered.fastq in the user-determined directory, as indicated by out
- If only the forward read passes the filtering parameters:
  - Forward reads are output to a file ending in R1_Unpaired_filtered.fastq in the user-determined directory, as indicated by out
  - Reverse reads are not written to a file
- If only the reverse read passes the filtering parameters:
  - Reverse reads are output to a file ending in R2_Unpaired_filtered.fastq in the user-determined directory, as indicated by out
  - Forward reads are not written to a file
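The routing rules above can be summarised as a small decision function. This is only a sketch: `route_pair_sketch` is a hypothetical name, it takes already-computed expected-error values as input, it assumes a read passes when its expected errors do not exceed maxEE, and it assumes that nothing is written when both reads fail (the list above does not cover that case).

```julia
# Decide which output file suffix, if any, each read of a pair is written to.
# `nothing` means the read is not written anywhere.
function route_pair_sketch(ee_forward::Real, ee_reverse::Real; maxEE::Real = 1)
    forward_ok = ee_forward <= maxEE
    reverse_ok = ee_reverse <= maxEE
    if forward_ok && reverse_ok
        return (forward = "R1_Paired_filtered.fastq", reverse = "R2_Paired_filtered.fastq")
    elseif forward_ok
        return (forward = "R1_Unpaired_filtered.fastq", reverse = nothing)
    elseif reverse_ok
        return (forward = nothing, reverse = "R2_Unpaired_filtered.fastq")
    else
        # Assumption: pairs in which both reads fail are discarded entirely.
        return (forward = nothing, reverse = nothing)
    end
end

both_pass    = route_pair_sketch(0.2, 0.5)
forward_only = route_pair_sketch(0.2, 3.0)
reverse_only = route_pair_sketch(3.0, 0.5)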
Supported keyword arguments include:
- read1::String: Path to the forward reads to undergo quality filtering
- read2::String: Path to the reverse reads to undergo quality filtering
- out::String: Path to the directory where reads that pass the quality filtering should be written
- maxEE::Int64 (optional): The maximum number of expected errors a read can include as the filtering parameter (default: maxEE = 1)
- verbose::Bool (optional): Whether or not to show some intermediary feedback on the progress of the function (default = false)
Example:
FilterQuality_pe("forward_R1.fastq", "reverse_R2.fastq", "/outdirectory")
# Changing the filtering parameters
FilterQuality_pe("forward_R1.fastq", "reverse_R2.fastq", "/outdirectory", 2, true)
Nucleotide_Essentials.PlotQuality — Function
Returns a plot of the quality profile of a .fastq or .fastq.gz file
This function plots a visual summary of the distribution of quality scores (automatically detects Phred+33 or Phred+64 encoding) as a function of sequence position for the input fastq file(s).
The plotted lines show summary statistics at each sequence position:
- green is the mean
- dashed red lines are the 25th and 75th quantiles
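The per-position summary statistics behind such a plot can be sketched as follows. This covers only the statistics, not the plotting, and it is not the package's implementation: `quality_profile_sketch` is a hypothetical name, the quality strings are assumed to be ASCII, and the offset is assumed to be known (33 by default).

```julia
using Statistics

# For each sequence position, compute the mean and the 25th/75th quantiles of the
# quality scores across all reads that are long enough to cover that position.
function quality_profile_sketch(qualities::Vector{String}; offset::Int = 33)
    max_len = maximum(length, qualities)
    means = Float64[]
    lower = Float64[]
    upper = Float64[]
    for pos in 1:max_len
        scores = [Int(q[pos]) - offset for q in qualities if length(q) >= pos]
        push!(means, mean(scores))
        push!(lower, quantile(scores, 0.25))
        push!(upper, quantile(scores, 0.75))
    end
    return (means = means, q25 = lower, q75 = upper)
end

# Two reads of length 2: one with Q = 40 everywhere, one with Q = 0 everywhere.
profile = quality_profile_sketch(["II", "!!"])
```

The three returned vectors correspond to the green mean line and the two dashed quantile lines described above.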
Supported keyword arguments include:
- Input::FastqRecord: The name of a FastqRecord for plotting
- verbose::Bool (optional): Whether or not to show some intermediary feedback on the progress of the function (default = false)
- outputfigure::Bool (optional): Whether or not to output a .png file with the created QualityPlot (default = false)
- figurepath::String (optional): If outputting a .png figure to file, specify the path to the directory where the file should be written (default = pwd())
Example:
# A quality profile can be created by supplying the path to a .fastq or .fastq.gz file
PlotQuality("path/to/my/file.fastq")
Nucleotide_Essentials.potential_mismatches — Function
Returns a Vector{Any} of potential barcodes with a single nucleotide change, including both deletions and substitutions
Supported keyword arguments include:
- Path::String: The full or relative path to a .fastq file
- mismatch::Int64: The number of altered nucleotides to include (only 1 is supported at this time)
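The single-substitution and single-deletion variants described above can be enumerated with a short function. This is a sketch rather than the package's implementation: `single_edit_variants_sketch` is a hypothetical name, the G, C, A, T ordering of the alphabet is an assumption, and the original barcode is placed first in the result.

```julia
# Enumerate the barcode itself, every single-base substitution, and every single-base deletion.
function single_edit_variants_sketch(barcode::AbstractString; alphabet = ('G', 'C', 'A', 'T'))
    bases = collect(barcode)
    variants = String[barcode]
    # Substitutions: replace each position with every other base in the alphabet.
    for i in eachindex(bases), b in alphabet
        b == bases[i] && continue
        edited = copy(bases)
        edited[i] = b
        push!(variants, String(edited))
    end
    # Deletions: drop each position in turn.
    for i in eachindex(bases)
        push!(variants, String(vcat(bases[1:i-1], bases[i+1:end])))
    end
    return variants
end

variants = single_edit_variants_sketch("GCGT")
```

For a barcode of length n this yields 1 + 3n + n entries, so 17 for a four-base barcode. The sketch does not remove duplicates, which can arise when a barcode contains runs of the same base.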
Example:
potential_mismatches("GCGT", 1)
17-element Vector{Any}:
"GCGT"
"CCGT"
"ACGT"
"TCGT"
"GGGT"
"GAGT"
"GTGT"
"GCCT"
"GCAT"
"GCTT"
"GCGG"
"GCGC"
"GCGA"
"CGT"
"GGT"
"GCT"
"GCG"
Nucleotide_Essentials.reverse_complement — Function
Takes a string of nucleotide bases and returns the reverse complement of that string. Accepts inputs of String and SubString{String} (input from a FastqRecord)
Supported keyword arguments include:
- sequence::Union{String, SubString{String}}: A string sequence of nucleotide bases or sequence entry from a FastqRecord
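For reference, a reverse complement can be computed by reversing the string and mapping each base to its partner. The sketch below is not the package's code: `reverse_complement_sketch` and `COMPLEMENT_SKETCH` are hypothetical names, and only uppercase A, C, G, T, and N are handled (any other character raises a `KeyError`).

```julia
# Base-pairing table; N is mapped to itself (an assumption made for this sketch).
const COMPLEMENT_SKETCH = Dict('A' => 'T', 'T' => 'A', 'C' => 'G', 'G' => 'C', 'N' => 'N')

# Reverse the sequence, then complement each base.
function reverse_complement_sketch(sequence::AbstractString)
    return join(COMPLEMENT_SKETCH[base] for base in reverse(sequence))
end

rc_string    = reverse_complement_sketch("ATCGT")                    # "ACGAT"
rc_substring = reverse_complement_sketch(SubString("ATCGT", 1, 3))   # "GAT"
```

Typing the argument as `AbstractString` lets the same method accept both `String` and `SubString{String}` inputs.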
Example:
reverse_complement("ATCGT")
"ACGAT"
Nucleotide_Essentials.demultiplex_se — Function
Compares a list of provided barcodes with the provided multiplexed reads and separates the reads into individual .fastq files. If a barcode is found within the read, the barcode is removed from the sequence. The quality data of the reads is preserved and written to the output .fastq file. If a barcode is not found, the sequence and quality are written to the unassigned .fastq file unchanged.
The mapping file must be either a .csv or .txt file with two columns. The first column heading must be SampleID and the second column heading must be BarcodeSequence.
EXAMPLE MAPPING FILE:
| SampleID | BarcodeSequence |
|---|---|
| Sample1 | Barcode1 |
| Sample2 | Barcode2 |
| Sample3 | Barcode3 |
| Sample4 | Barcode4 |
| Sample5 | Barcode5 |
| Sample6 | Barcode6 |
| Sample7 | Barcode7 |
| Sample8 | Barcode8 |
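A mapping file in this layout can be read into a barcode-to-sample lookup with a few lines of Julia. This is a sketch under assumptions, not the package's parser: `read_mapping_sketch` is a hypothetical name, and the file is assumed to be comma-delimited (the delimiter expected for .txt files is not stated here, so it is exposed as a parameter).

```julia
# Parse a two-column mapping file into a Dict of barcode => sample ID.
function read_mapping_sketch(io::IO; delim::Char = ',')
    header = split(strip(readline(io)), delim)
    header == ["SampleID", "BarcodeSequence"] ||
        error("unexpected column headings: $header")
    mapping = Dict{String, String}()
    for line in eachline(io)
        isempty(strip(line)) && continue       # skip blank lines
        sample, barcode = split(strip(line), delim)
        mapping[barcode] = sample
    end
    return mapping
end

# In-memory demo with made-up barcodes so the sketch runs without a file on disk.
demo_map = read_mapping_sketch(IOBuffer("SampleID,BarcodeSequence\nSample1,ACGT\nSample2,TTGA\n"))
```

Keying the dictionary by barcode makes the later lookup from an observed barcode to its sample a single indexing operation.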
Supported keyword arguments include:
- R1::String: Path to multiplexed reads
- Map::String: Path to the mapping file
- mismatch::Int64=0 (optional): Number of allowed mismatches in the barcode. Potential options include 0 or 1. If 1 mismatch is allowed, computation time will significantly increase. Default is to allow for 0 mismatches (exact matches only).
- debug::Bool=false (optional): If true, a log file will be created and debugging data will be printed while the function is running (default is false).
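The core of exact-match demultiplexing (mismatch = 0) can be sketched as a lookup followed by trimming. This is not the package's algorithm: `assign_read_sketch` is a hypothetical name, the barcode is assumed to sit at the very start of the read, the inputs are assumed to be ASCII, and the quality string is trimmed by the same length as the sequence so that the two stay aligned.

```julia
# Assign one read to a sample by exact barcode match at the start of the read.
# `barcodes` maps barcode => sample ID.
function assign_read_sketch(sequence::AbstractString, quality::AbstractString,
                            barcodes::Dict{String, String})
    for (barcode, sample) in barcodes
        if startswith(sequence, barcode)
            n = length(barcode)
            # Remove the barcode and the matching quality characters.
            return (sample = sample, sequence = sequence[n+1:end], quality = quality[n+1:end])
        end
    end
    # No barcode found: the read is returned unchanged and marked as unassigned.
    return (sample = "unassigned", sequence = sequence, quality = quality)
end

barcode_lookup = Dict("ACGT" => "Sample1", "TTGA" => "Sample2")
assigned   = assign_read_sketch("ACGTTTGA", "IIIIHHHH", barcode_lookup)
unassigned = assign_read_sketch("GGGGTTGA", "IIIIHHHH", barcode_lookup)
```

Allowing one mismatch would mean testing every single-edit variant of every barcode instead of a single exact prefix, which is consistent with the longer computation time noted above.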
Example:
demultiplex_se("multiplexreads.fastq", "mapping_file.csv")
Nucleotide_Essentials.demultiplex_pe — Function
Compares a list of provided barcodes with the provided paired-end multiplexed reads and separates the reads into individual .fastq files. If a barcode is found within R1 reads, the barcode is removed from the sequence. The quality data of the reads is preserved and written to the output .fastq file. If a barcode is not found, the sequence and quality are written to the R1 unassigned .fastq file unchanged. R2 reads are handled in the same way: if a barcode is found, it is removed from the sequence and the read is written with its quality data to the output .fastq file; if no barcode is found, the sequence and quality are written to the R2 unassigned .fastq file unchanged.
Dual-indexed reads are not yet supported
The mapping file must be either a .csv or .txt file with two columns. The first column heading must be SampleID and the second column heading must be BarcodeSequence.
EXAMPLE MAPPING FILE:
| SampleID | BarcodeSequence |
|---|---|
| Sample1 | Barcode1 |
| Sample2 | Barcode2 |
| Sample3 | Barcode3 |
| Sample4 | Barcode4 |
| Sample5 | Barcode5 |
| Sample6 | Barcode6 |
| Sample7 | Barcode7 |
| Sample8 | Barcode8 |
Supported keyword arguments include:
- R1::String: Path to forward multiplexed reads
- R2::String: Path to reverse multiplexed reads
- Map::String: Path to the mapping file
- mismatch::Int64=0 (optional): Number of allowed mismatches in the barcode. Potential options include 0 or 1. If 1 mismatch is allowed, computation time will significantly increase. Default is to allow for 0 mismatches (exact matches only).
- debug::Bool=false (optional): If true, a log file will be created and debugging data will be printed while the function is running (default is false).
Example:
demultiplex_pe("forward_multiplexreads.fastq", "reverse_multiplexreads.fastq", "mapping_file.csv")
Index
- Nucleotide_Essentials.FastaRecord
- Nucleotide_Essentials.FastqRecord
- Nucleotide_Essentials.FastqtoFasta
- Nucleotide_Essentials.FilterQuality_pe
- Nucleotide_Essentials.FilterQuality_se
- Nucleotide_Essentials.PlotQuality
- Nucleotide_Essentials.demultiplex_pe
- Nucleotide_Essentials.demultiplex_se
- Nucleotide_Essentials.potential_mismatches
- Nucleotide_Essentials.readFasta
- Nucleotide_Essentials.readFastq
- Nucleotide_Essentials.reverse_complement
- Nucleotide_Essentials.writeFasta
Change Log
Nucleotide_Essentials v0.2.0
- Added support for quality filtering of .fastq reads
- Added support for Gzip compressed files
- Performance improvements in PlotQuality() and added support for exporting quality plots
- Added support for automatic quality profile encoding detection (Phred+64 and Phred+33 encoding)
- Minor documentation updates