Nucleotide_Essentials.jl
Data Types
Nucleotide_Essentials.FastqRecord (Type)
Components
- ID: The unique sequence identifier associated with that entry
- sequence: The nucleotide sequence of that entry
- quality: The quality scores of that entry
- filename: The original file name
Nucleotide_Essentials.FastaRecord (Type)
Components
- ID: The unique sequence identifier associated with that entry
- sequence: The nucleotide sequence of that entry
- filename: The original file name
Functions
Nucleotide_Essentials.readFastq (Function)
readFastq(Path::String)
.fastq file => readFastq(Path) => FastqRecord(ID, sequence, quality, filename)
Supported arguments include:
- `Path::String`: The full or relative path to a .fastq file
Example:
# Supply the path to a .fastq file that you would like to import
myfastq = readFastq("myfastq.fastq")
Nucleotide_Essentials.readFasta (Function)
Imports a .fasta file into Julia
readFasta(Path::String)
.fasta file => readFasta(Path) => FastaRecord(ID, sequence, filename)
Supported arguments include:
- `Path::String`: The full or relative path to a .fasta file
Example:
# Supply the path to a .fasta file that you would like to import - it is recommended to include `;` in your command to prevent printing potentially large .fasta files in the REPL
myfasta = readFasta("myfasta.fasta");
Nucleotide_Essentials.writeFasta (Function)
writeFasta(input_fasta::FastaRecord, out::String, compressed::Bool)
FastaRecord => writeFasta(input_fasta, out, compressed) => .fasta file/.fasta.gz file
Takes a single- or multiple-entry FastaRecord and writes either a .fasta or a compressed .fasta.gz file to the desired directory
Supported arguments include:
- `input_fasta::FastaRecord`: A FastaRecord with either a single entry or multiple entries
- `out::String`: The full or relative path to the directory where files should be written
- `compressed::Bool`: Whether to write the .fasta files as compressed files
  - If `true`, files will be written as .fasta.gz files
  - If `false`, files will be written as .fasta files
Example:
# .fasta files can be written from an already imported FastaRecord in Julia
myfasta = readFasta("myfasta.fasta");
writeFasta(myfasta, "example/output/directory", false)
# .fasta files can be written as .fasta.gz from an already imported FastaRecord in Julia
myfasta = readFasta("myfasta.fasta");
writeFasta(myfasta, "example/output/directory", true)
# .fasta files with multiple sequences can be read and written as individual .fasta or .fasta.gz files in the same step
writeFasta(readFasta("myfasta.fasta"), "example/output/directory", true);
Nucleotide_Essentials.FastqtoFasta (Function)
Converts a FastqRecord to a FastaRecord. Can also input and convert a .fastq file to a FastaRecord in the same function.
FastqtoFasta(Fastq::Union{String, FastqRecord})
.fastq file => FastqtoFasta(Fastq) => FastaRecord(ID, sequence, filename)
FastqRecord(ID, sequence, quality, filename) => FastqtoFasta(Fastq) => FastaRecord(ID, sequence, filename)
Supported arguments include:
- `Fastq::Union{String, FastqRecord}`: Either of the following:
  - The full or relative path to a .fastq file
  - A FastqRecord
Example:
# Supply the path to a .fastq file that you would like to convert to a FastaRecord
myfasta = FastqtoFasta("myfastq.fastq");
# Alternatively, a FastqRecord can be used as the input
myfasta = FastqtoFasta(myFastqRecord);
Nucleotide_Essentials.FilterQuality_se (Function)
Filters an input .fastq file based upon the encoded Phred+33 or Phred+64 quality scores. The encoding of the reads is automatically determined by looking for characters that are unique to one of the two encodings. Phred+64 encoding is identified by searching for `^`, `a`, `]`, and `f`.
Reads are filtered by their number of expected errors ($\mathrm{E}$), which is the sum of the per-base error probabilities ($p_i$) derived from the quality scores ($Q_i$):
$\mathrm{E} = \sum_i p_i = \sum_i 10^{\frac{-Q_i}{10}}$
Stringent filtering (`maxEE = 1`) is used by default but can be adjusted by the user.
Reads that pass the filtering parameters are output to a file ending in `_FilteredReads.fastq` in the user-determined directory, as indicated by `out`.
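The expected-error calculation above can be sketched in a few lines of Julia. This is a minimal illustration and not the package's implementation: the names `guess_phred_offset`, `expected_errors`, and `passes_filter` are hypothetical, and treating a read as passing when its expected errors do not exceed `maxEE` is an assumption.

```julia
# Hypothetical helpers illustrating the expected-error filter; not the package's code.

# Guess the Phred offset from quality strings: any of the marker characters
# named in the text suggests Phred+64, otherwise Phred+33 is assumed.
function guess_phred_offset(qualities)
    markers = ('^', 'a', ']', 'f')
    for quality in qualities
        any(c -> c in markers, quality) && return 64
    end
    return 33
end

# Expected errors: the sum over bases of 10^(-Q/10), where Q = character code - offset.
function expected_errors(quality::AbstractString; offset::Int = 33)
    return sum(10.0^(-(Int(c) - offset) / 10) for c in quality)
end

# Assumption: a read passes when its expected errors do not exceed maxEE.
passes_filter(quality::AbstractString; maxEE::Real = 1, offset::Int = 33) =
    expected_errors(quality; offset = offset) <= maxEE

# Eight bases at Q40 and twelve bases at Q10 under Phred+33.
qualities = ["IIIIIIII", "++++++++++++"]
offset = guess_phred_offset(qualities)
for quality in qualities
    println(quality, " => E = ", expected_errors(quality; offset = offset),
            ", passes = ", passes_filter(quality; offset = offset))
end
```

With the default `maxEE = 1`, the high-quality read is kept and the twelve-base Q10 read (roughly 1.2 expected errors) is dropped.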
Supported arguments include:
- `read1::String`: Path to the reads to undergo quality filtering
- `out::String`: Path to the directory where reads that pass the quality filtering should be written
- `maxEE::Int64` (optional): The maximum number of expected errors a read may contain and still pass filtering (default: `maxEE = 1`)
- `verbose::Bool` (optional): Whether to show intermediate feedback on the progress of the function (default = false)
Example:
FilterQuality_se("forward_R1.fastq", "/outdirectory")
Nucleotide_Essentials.FilterQuality_pe (Function)
Filters paired-end .fastq files (forward and reverse reads) based upon the encoded Phred+33 or Phred+64 quality scores. The encoding of the reads is automatically determined by looking for characters that are unique to one of the two encodings. Phred+64 encoding is identified by searching for `^`, `a`, `]`, and `f`.
Reads are filtered by their number of expected errors ($\mathrm{E}$), which is the sum of the per-base error probabilities ($p_i$) derived from the quality scores ($Q_i$):
$\mathrm{E} = \sum_i p_i = \sum_i 10^{\frac{-Q_i}{10}}$
Stringent filtering (`maxEE = 1`) is used by default but can be adjusted by the user.
Output Files:
- If both the forward and reverse reads pass the filtering parameters:
  - Forward reads are output to a file ending in `R1_Paired_filtered.fastq` in the user-determined directory, as indicated by `out`
  - Reverse reads are output to a file ending in `R2_Paired_filtered.fastq` in the user-determined directory, as indicated by `out`
- If only the forward read passes the filtering parameters:
  - Forward reads are output to a file ending in `R1_Unpaired_filtered.fastq` in the user-determined directory, as indicated by `out`
  - Reverse reads are not written to a file
- If only the reverse read passes the filtering parameters:
  - Reverse reads are output to a file ending in `R2_Unpaired_filtered.fastq` in the user-determined directory, as indicated by `out`
  - Forward reads are not written to a file
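The output rules above amount to a small decision table. The sketch below encodes it as a hypothetical helper, `paired_outputs`, which takes the expected errors of a forward and a reverse read and returns the file suffixes the pair would be written to. It is an illustration only: the function name is invented here, and treating a read as passing when its expected errors do not exceed `maxEE` is an assumption.

```julia
# Hypothetical helper: which output files a read pair is routed to.
# Assumption: a read passes when its expected errors are <= maxEE.
function paired_outputs(ee_forward::Real, ee_reverse::Real; maxEE::Real = 1)
    pass_forward = ee_forward <= maxEE
    pass_reverse = ee_reverse <= maxEE
    if pass_forward && pass_reverse
        return ["R1_Paired_filtered.fastq", "R2_Paired_filtered.fastq"]
    elseif pass_forward
        return ["R1_Unpaired_filtered.fastq"]   # reverse read is not written
    elseif pass_reverse
        return ["R2_Unpaired_filtered.fastq"]   # forward read is not written
    else
        return String[]                         # neither read is written
    end
end

println(paired_outputs(0.4, 0.7))   # both reads pass
println(paired_outputs(0.4, 2.5))   # only the forward read passes
println(paired_outputs(3.0, 0.2))   # only the reverse read passes
```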
Supported arguments include:
- `read1::String`: Path to the forward reads to undergo quality filtering
- `read2::String`: Path to the reverse reads to undergo quality filtering
- `out::String`: Path to the directory where reads that pass the quality filtering should be written
- `maxEE::Int64` (optional): The maximum number of expected errors a read may contain and still pass filtering (default: `maxEE = 1`)
- `verbose::Bool` (optional): Whether to show intermediate feedback on the progress of the function (default = false)
Example:
FilterQuality_pe("forward_R1.fastq", "reverse_R2.fastq", "/outdirectory")
# changing the filtering parameters (maxEE = 2, verbose = true)
FilterQuality_pe("forward_R1.fastq", "reverse_R2.fastq", "/outdirectory", 2, true)
Nucleotide_Essentials.PlotQuality (Function)
Returns a plot of the quality profile of a .fastq or .fastq.gz file
This function plots a visual summary of the distribution of quality scores (automatically detects Phred+33 or Phred+64 encoding) as a function of sequence position for the input fastq file(s).
The plotted lines show summary statistics at each sequence position:
- green is the mean
- dashed red lines are the 25th and 75th quantiles
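The per-position statistics behind such a plot can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's code: `quality_profile` is a hypothetical name, the Phred offset is passed in rather than detected, and the sketch only computes the numbers without drawing anything. It relies on `mean` and `quantile` from Julia's `Statistics` standard library.

```julia
using Statistics

# Hypothetical helper: mean and 25th/75th quantiles of the quality score
# at each sequence position, over reads that are long enough to cover it.
function quality_profile(qualities::Vector{String}; offset::Int = 33)
    maxlen = maximum(length, qualities)
    means = Float64[]
    q25 = Float64[]
    q75 = Float64[]
    for pos in 1:maxlen
        # Quality strings are ASCII, so character indexing by position is safe.
        scores = [Int(q[pos]) - offset for q in qualities if length(q) >= pos]
        push!(means, mean(scores))
        push!(q25, quantile(scores, 0.25))
        push!(q75, quantile(scores, 0.75))
    end
    return (mean = means, q25 = q25, q75 = q75)
end

profile = quality_profile(["II+", "I+"])
println(profile.mean)   # [40.0, 25.0, 10.0]
```

Reads of different lengths are handled by simply skipping positions a read does not reach, which is one reasonable choice among several.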
Supported arguments include:
- `Input::FastqRecord`: The name of a FastqRecord for plotting
- `verbose::Bool` (optional): Whether to show intermediate feedback on the progress of the function (default = false)
- `outputfigure::Bool` (optional): Whether to output a .png file with the created QualityPlot (default = false)
- `figurepath::String` (optional): If outputting a .png figure to file, the path to the directory where the file should be written (default = `pwd()`)
Example:
# A quality profile can be created by supplying the path to a .fastq or .fastq.gz file
PlotQuality("path/to/my/file.fastq")
Nucleotide_Essentials.potential_mismatches (Function)
Returns a `Vector{Any}` of potential barcodes with a single nucleotide change, including both deletions and substitutions. The unaltered barcode is included as the first element.
Supported arguments include:
- `Path::String`: The barcode sequence to generate variants of (a nucleotide string, as in the example below)
- `mismatch::Int64`: The number of altered nucleotides to include (only 1 is supported at this time)
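Generating single-change variants of a barcode can be sketched as below. This is an illustration, not the package's implementation: `single_mismatches` is a hypothetical name, the base order `G, C, A, T` is an assumption chosen so that the result for `"GCGT"` lines up with the example listing for this function, and the sketch returns a `Vector{String}` rather than a `Vector{Any}`.

```julia
# Hypothetical helper: the barcode itself, every single-base substitution,
# and every single-base deletion.
function single_mismatches(barcode::AbstractString)
    bases = ['G', 'C', 'A', 'T']          # assumed substitution order
    chars = collect(barcode)
    variants = String[barcode]
    # Substitutions: replace each position with each of the other three bases.
    for i in eachindex(chars), b in bases
        b == chars[i] && continue
        push!(variants, String(vcat(chars[1:i-1], b, chars[i+1:end])))
    end
    # Deletions: drop each position in turn.
    for i in eachindex(chars)
        push!(variants, String(vcat(chars[1:i-1], chars[i+1:end])))
    end
    return variants
end

variants = single_mismatches("GCGT")
println(length(variants))   # 17
```

A barcode of length n yields 1 + 3n + n entries, and duplicates are not removed (deleting either base of a repeated pair gives the same string).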
Example:
potential_mismatches("GCGT", 1)
17-element Vector{Any}:
"GCGT"
"CCGT"
"ACGT"
"TCGT"
"GGGT"
"GAGT"
"GTGT"
"GCCT"
"GCAT"
"GCTT"
"GCGG"
"GCGC"
"GCGA"
"CGT"
"GGT"
"GCT"
"GCG"
Nucleotide_Essentials.reverse_complement (Function)
Takes a string of nucleotide bases and returns the reverse complement of that string. Accepts inputs of `String` and `SubString{String}` (input from a FastqRecord).
Supported arguments include:
- `sequence::Union{String, SubString{String}}`: A string sequence of nucleotide bases or a sequence entry from a FastqRecord
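The operation itself is small enough to sketch directly. The helper below, `revcomp_sketch`, is a hypothetical stand-in and not the package's implementation; leaving characters other than A, C, G, and T unchanged is an assumption about how ambiguous bases might be handled.

```julia
# Hypothetical helper: reverse the sequence, then complement each base.
# Characters without a listed complement (for example N) are left unchanged.
function revcomp_sketch(sequence::AbstractString)
    complement = Dict('A' => 'T', 'T' => 'A', 'C' => 'G', 'G' => 'C')
    return join(get(complement, base, base) for base in reverse(sequence))
end

println(revcomp_sketch("ATCGT"))   # ACGAT
```

Because the argument is typed as `AbstractString`, the same sketch accepts both a `String` and a `SubString{String}`.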
Example:
reverse_complement("ATCGT")
"ACGAT"
Nucleotide_Essentials.demultiplex_se (Function)
Compares a list of provided barcodes with the provided multiplexed reads and separates the reads into individual .fastq files. If a barcode is found within the read, the barcode is removed from the sequence. The quality data of the reads is preserved and written to the output .fastq file. If a barcode is not found, the sequence and quality are written unchanged to the unassigned .fastq file.
The mapping file must be either a .csv or .txt file with two columns. The first column heading must be `SampleID` and the second column heading must be `BarcodeSequence`.
EXAMPLE MAPPING FILE:
| SampleID | BarcodeSequence |
|---|---|
| Sample1 | Barcode1 |
| Sample2 | Barcode2 |
| Sample3 | Barcode3 |
| Sample4 | Barcode4 |
| Sample5 | Barcode5 |
| Sample6 | Barcode6 |
| Sample7 | Barcode7 |
| Sample8 | Barcode8 |
Supported arguments include:
- `R1::String`: Path to multiplexed reads
- `Map::String`: Path to the mapping file
- `mismatch::Int64=0` (optional): Number of allowed mismatches in the barcode, either 0 or 1. Allowing 1 mismatch significantly increases computation time. The default is 0 mismatches (exact matches only).
- `debug::Bool=false` (optional): If true, a log file is created and debugging data is printed while the function is running (default is false).
Example:
demultiplex_se("multiplexreads.fastq", "mapping_file.csv")
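The exact-match case of the barcode assignment described for this function can be sketched as follows. Everything here is illustrative: `assign_read` is a hypothetical helper, the barcodes are assumed to have already been loaded from the mapping file into a `Dict` of barcode to SampleID, and only the sequence is handled (how the package treats the matching quality characters is not shown).

```julia
# Hypothetical helper: find the first barcode that occurs in the read,
# remove that occurrence from the sequence, and report the sample.
function assign_read(sequence::AbstractString, barcodes::Dict{String,String})
    for (barcode, sample) in barcodes
        hit = findfirst(barcode, sequence)      # range of the match, or nothing
        if hit !== nothing
            trimmed = sequence[1:first(hit)-1] * sequence[last(hit)+1:end]
            return (sample = sample, sequence = trimmed)
        end
    end
    # No barcode found: the read is left unchanged and goes to "unassigned".
    return (sample = "unassigned", sequence = String(sequence))
end

barcodes = Dict("ACGT" => "Sample1", "TTGA" => "Sample2")
println(assign_read("ACGTGGGCCC", barcodes))   # assigned to Sample1, barcode removed
println(assign_read("GGGGGGCCCC", barcodes))   # unassigned, sequence unchanged
```

Allowing one mismatch could be layered on top by adding each barcode's single-change variants to the dictionary, at the cost of many more comparisons per read.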
Nucleotide_Essentials.demultiplex_pe (Function)
Compares a list of provided barcodes with the provided paired-end multiplexed reads and separates the reads into individual .fastq files. R1 and R2 reads are handled in the same way: if a barcode is found within a read, the barcode is removed from the sequence, and the quality data of the read is preserved and written to the output .fastq file. If a barcode is not found, the sequence and quality are written unchanged to the unassigned .fastq file for that read direction (R1 or R2).
Dual-indexed reads are not yet supported.
The mapping file must be either a .csv or .txt file with two columns. The first column heading must be `SampleID` and the second column heading must be `BarcodeSequence`.
EXAMPLE MAPPING FILE:
| SampleID | BarcodeSequence |
|---|---|
| Sample1 | Barcode1 |
| Sample2 | Barcode2 |
| Sample3 | Barcode3 |
| Sample4 | Barcode4 |
| Sample5 | Barcode5 |
| Sample6 | Barcode6 |
| Sample7 | Barcode7 |
| Sample8 | Barcode8 |
Supported arguments include:
- `R1::String`: Path to forward multiplexed reads
- `R2::String`: Path to reverse multiplexed reads
- `Map::String`: Path to the mapping file
- `mismatch::Int64=0` (optional): Number of allowed mismatches in the barcode, either 0 or 1. Allowing 1 mismatch significantly increases computation time. The default is 0 mismatches (exact matches only).
- `debug::Bool=false` (optional): If true, a log file is created and debugging data is printed while the function is running (default is false).
Example:
demultiplex_pe("forward_multiplexreads.fastq", "reverse_multiplexreads.fastq", "mapping_file.csv")
Index
Nucleotide_Essentials.FastaRecord
Nucleotide_Essentials.FastqRecord
Nucleotide_Essentials.FastqtoFasta
Nucleotide_Essentials.FilterQuality_pe
Nucleotide_Essentials.FilterQuality_se
Nucleotide_Essentials.PlotQuality
Nucleotide_Essentials.demultiplex_pe
Nucleotide_Essentials.demultiplex_se
Nucleotide_Essentials.potential_mismatches
Nucleotide_Essentials.readFasta
Nucleotide_Essentials.readFastq
Nucleotide_Essentials.reverse_complement
Nucleotide_Essentials.writeFasta
Change Log
Nucleotide_Essentials v0.2.0
- Added support for quality filtering of .fastq reads
- Added support for Gzip compressed files
- Performance improvements in `PlotQuality()` and added support for exporting quality plots
- Added support for automatic quality profile encoding detection (Phred+64 and Phred+33 encoding)
- Minor documentation updates