Nucleotide_Essentials.jl

Data Types

Nucleotide_Essentials.FastqRecord (Type)

Components

  • ID: The unique sequence identifier associated with that entry
  • sequence: The nucleotide sequence of that entry
  • quality: The quality scores of that entry
  • filename: The original file name

Nucleotide_Essentials.FastaRecord (Type)

Components

  • ID: The unique sequence identifier associated with that entry
  • sequence: The nucleotide sequence of that entry
  • filename: The original file name

Functions

Nucleotide_Essentials.readFastq (Function)
Imports a .fastq file into Julia

readFastq(Path::String)

.fastq file => readFastq(Path) => FastqRecord(ID, sequence, quality, filename)

Supported keyword arguments include:

  • Path::String: The full or relative path to a .fastq file

Example:

# Supply the path to a .fastq file that you would like to import
myfastq = readFastq("myfastq.fastq")
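
To illustrate what importing a .fastq file involves, here is a minimal sketch of a four-line FASTQ parser. The function name parse_fastq and the returned named tuple are assumptions made for illustration; this is not the package's implementation, and it assumes each sequence and quality string occupies a single line.

```julia
# Hypothetical sketch, not part of Nucleotide_Essentials:
# parse 4-line FASTQ records (header, sequence, '+', quality) from any IO stream.
function parse_fastq(io::IO)
    ids = String[]
    seqs = String[]
    quals = String[]
    while !eof(io)
        header = readline(io)
        isempty(header) && continue          # skip blank lines between records
        startswith(header, "@") || error("expected '@' header, got: $header")
        push!(ids, header[2:end])            # drop the leading '@'
        push!(seqs, readline(io))            # nucleotide sequence
        readline(io)                         # '+' separator line, ignored
        push!(quals, readline(io))           # quality string, same length as sequence
    end
    return (ID = ids, sequence = seqs, quality = quals)
end

# Usage with an in-memory stream instead of a file on disk
records = parse_fastq(IOBuffer("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nHHHH\n"))
```

The same function works on a file handle, for example inside an `open("myfastq.fastq") do io ... end` block.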

Nucleotide_Essentials.readFasta (Function)

Imports a .fasta file into Julia

readFasta(Path::String)

.fasta file => readFasta(Path) => FastaRecord(ID, sequence, filename)

Supported keyword arguments include:

  • Path::String: The full or relative path to a .fasta file

Example:

# Supply the path to a .fasta file that you would like to import - it is recommended to include `;` in your command to prevent printing potentially large .fasta files in the REPL
myfasta = readFasta("myfasta.fasta");

Nucleotide_Essentials.writeFasta (Function)

writeFasta(input_fasta::FastaRecord, out::String, compressed::Bool)

FastaRecord => writeFasta(input_fasta, out, compressed) => .fasta file/.fasta.gz file

Takes a single- or multiple-entry FastaRecord and writes either a .fasta file or a compressed .fasta.gz file to the desired directory

Supported keyword arguments include:

  • input_fasta::FastaRecord: A FastaRecord with either a single entry or multiple entries
  • out::String: The full or relative path to the directory where files should be written to
  • compressed::Bool: Whether to write the .fasta files as compressed files.
    • If true, files will be written as .fasta.gz files
    • If false, files will be written as .fasta files

Example:

# .fasta files can be written from an already imported FastaRecord in Julia
myfasta = readFasta("myfasta.fasta");
writeFasta(myfasta, "example/output/directory", false)

# .fasta files can be written as .fasta.gz from an already imported FastaRecord in Julia
myfasta = readFasta("myfasta.fasta");
writeFasta(myfasta, "example/output/directory", true)

# .fasta files with multiple sequences can be read and written as individual .fasta or .fasta.gz files in the same step
writeFasta(readFasta("myfasta.fasta"), "example/output/directory", true);

Nucleotide_Essentials.FastqtoFasta (Function)

Converts a FastqRecord to a FastaRecord. Can also input and convert a .fastq file to a FastaRecord in the same function.

FastqtoFasta(Fastq::Union{String, FastqRecord})

.fastq file => FastqtoFasta(Fastq) => FastaRecord(ID, sequence, filename)

FastqRecord(ID, sequence, quality, filename) => FastqtoFasta(Fastq) => FastaRecord(ID, sequence, filename)

Supported keyword arguments include:

  • Fastq::Union{String, FastqRecord}:
    • The full or relative path to a .fastq file
    • A FastqRecord

Example:

# Supply the path to a .fastq file that you would like to convert to a FastaRecord
myfasta = FastqtoFasta("myfastq.fastq");

# Alternatively, a FastqRecord can be used as the input 
myfasta = FastqtoFasta(myFastqRecord);

Nucleotide_Essentials.FilterQuality_se (Function)

Filters an input .fastq file based on its Phred+33 or Phred+64 encoded quality scores. The encoding of the reads is determined automatically by looking for characters unique to each encoding; Phred+64 encoding is identified by searching for ^, a, ], and f.

Reads are filtered by their number of expected errors ($\mathrm{E}$), which is the sum of the per-base error probabilities $p_i$ implied by the quality scores $Q_i$:

$\mathrm{E} = \sum_i p_i = \sum_i 10^{-\frac{Q_i}{10}}$

Stringent filtering (maxEE = 1) is used by default but can be adjusted by the user.
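
The expected-error calculation can be sketched as follows. The helper names expected_errors and passes_filter are hypothetical, the encoding offset is assumed to be known already (33 or 64), and whether the package treats a read with E exactly equal to maxEE as passing is an assumption here.

```julia
# Hypothetical helpers, not part of Nucleotide_Essentials.
# `offset` is 33 for Phred+33 and 64 for Phred+64.
function expected_errors(quality::AbstractString; offset::Int = 33)
    ee = 0.0
    for c in quality
        q = Int(c) - offset        # decode the Phred score Q_i
        ee += 10.0^(-q / 10)       # error probability p_i = 10^(-Q_i / 10)
    end
    return ee
end

# Assumption: a read passes when E <= maxEE.
passes_filter(quality; maxEE = 1, offset = 33) =
    expected_errors(quality; offset = offset) <= maxEE

high_quality = expected_errors("IIII")   # 'I' is Q40 in Phred+33, so p_i = 0.0001 per base
keep_good = passes_filter("IIII")
keep_bad = passes_filter("!!!!")         # '!' is Q0 in Phred+33, so p_i = 1 per base
```

Because E is a sum over bases, longer reads accumulate more expected errors at the same per-base quality.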

Reads that pass the filtering parameters are output to a file ending in _FilteredReads.fastq in the user-determined directory, as indicated by out.

Supported keyword arguments include:

  • read1::String: Path to the reads to undergo quality filtering
  • out::String: Path to the directory where reads that pass the quality filtering should be written
  • maxEE::Int64 (optional): The max number of expected errors a read can include as the filtering parameter (default: maxEE = 1)
  • verbose::Bool (optional): Whether or not to show some intermediary feedback on the progress of the function (default = false)

Example:

FilterQuality_se("forward_R1.fastq", "/outdirectory")

Nucleotide_Essentials.FilterQuality_pe (Function)

Filters paired-end .fastq files based on their Phred+33 or Phred+64 encoded quality scores. The encoding of the reads is determined automatically by looking for characters unique to each encoding; Phred+64 encoding is identified by searching for ^, a, ], and f.

Reads are filtered by their number of expected errors ($\mathrm{E}$), which is the sum of the per-base error probabilities $p_i$ implied by the quality scores $Q_i$:

$\mathrm{E} = \sum_i p_i = \sum_i 10^{-\frac{Q_i}{10}}$

Stringent filtering (maxEE = 1) is used by default but can be adjusted by the user.

Output Files:

  • If both the forward and reverse reads pass the filtering parameters:
    • Forward reads are output to a file ending in R1_Paired_filtered.fastq in the user-determined directory, as indicated by out
    • Reverse reads are output to a file ending in R2_Paired_filtered.fastq in the user-determined directory, as indicated by out
  • If only the forward read passes the filtering parameters:
    • Forward reads are output to a file ending in R1_Unpaired_filtered.fastq in the user-determined directory, as indicated by out
    • Reverse reads are not written to a file
  • If only the reverse read passes the filtering parameters:
    • Reverse reads are output to a file ending in R2_Unpaired_filtered.fastq in the user-determined directory, as indicated by out
    • Forward reads are not written to a file
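
The routing rules above can be summarized in a small sketch. The function output_targets is hypothetical and only mirrors the list; the file suffixes are taken from the list, and the use of <= for the threshold comparison is an assumption.

```julia
# Hypothetical sketch, not part of Nucleotide_Essentials:
# decide which output file (if any) each read of a pair is written to,
# given the expected errors of the forward (R1) and reverse (R2) read.
function output_targets(ee_r1::Real, ee_r2::Real; maxEE::Real = 1)
    r1_ok = ee_r1 <= maxEE
    r2_ok = ee_r2 <= maxEE
    if r1_ok && r2_ok
        return ("R1_Paired_filtered.fastq", "R2_Paired_filtered.fastq")
    elseif r1_ok
        return ("R1_Unpaired_filtered.fastq", nothing)   # reverse read discarded
    elseif r2_ok
        return (nothing, "R2_Unpaired_filtered.fastq")   # forward read discarded
    else
        return (nothing, nothing)                        # neither read is written
    end
end

both_pass = output_targets(0.2, 0.5)
only_forward = output_targets(0.2, 3.0)
only_reverse = output_targets(2.5, 0.1)
```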

Supported keyword arguments include:

  • read1::String: Path to the forward reads to undergo quality filtering
  • read2::String: Path to the reverse reads to undergo quality filtering
  • out::String: Path to the directory where reads that pass the quality filtering should be written
  • maxEE::Int64 (optional): The max number of expected errors a read can include as the filtering parameter (default: maxEE = 1)
  • verbose::Bool (optional): Whether or not to show some intermediary feedback on the progress of the function (default = false)

Example:

FilterQuality_pe("forward_R1.fastq", "reverse_R2.fastq", "/outdirectory")

# changing the filtering parameters (maxEE = 2, verbose = true)
FilterQuality_pe("forward_R1.fastq", "reverse_R2.fastq", "/outdirectory", 2, true)

Nucleotide_Essentials.PlotQuality (Function)

Returns a plot of the quality profile of a .fastq or .fastq.gz file

This function plots a visual summary of the distribution of quality scores (automatically detects Phred+33 or Phred+64 encoding) as a function of sequence position for the input fastq file(s).

The plotted lines show summary statistics at each sequence position:

  • green is the mean
  • dashed red lines are the 25th and 75th quantiles
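
The per-position statistics behind those lines can be computed roughly as follows. This is a sketch only: quality_profile is a hypothetical name, the encoding offset is assumed to be known, and the quality strings are assumed to be ASCII so that character positions and byte indices coincide.

```julia
using Statistics

# Hypothetical sketch, not part of Nucleotide_Essentials:
# mean and 25th/75th quantiles of the Phred scores at each read position.
function quality_profile(qualities::Vector{String}; offset::Int = 33)
    maxlen = maximum(length, qualities)
    means = Float64[]
    q25 = Float64[]
    q75 = Float64[]
    for pos in 1:maxlen
        # Only reads that are long enough contribute to this position
        scores = [Int(q[pos]) - offset for q in qualities if length(q) >= pos]
        push!(means, mean(scores))
        push!(q25, quantile(scores, 0.25))
        push!(q75, quantile(scores, 0.75))
    end
    return (mean = means, q25 = q25, q75 = q75)
end

# Three reads of unequal length in Phred+33: 'I' = Q40, '5' = Q20, '+' = Q10
profile = quality_profile(["II", "5I", "+"])
```

Each returned vector has one entry per sequence position, which is what would be drawn as a line in the plot.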

Supported keyword arguments include:

  • Input::FastqRecord: The name of a FastqRecord for plotting
  • verbose::Bool (optional): Whether or not to show some intermediary feedback on the progress of the function (default = false)
  • outputfigure::Bool (optional): Whether or not to output a .png file with the created QualityPlot (default = false)
  • figurepath::String (optional): If outputting a .png figure to file, specify the path to a directory where the file should be written to (default = pwd())

Example:

# A quality profile can be created by supplying the path to a .fastq or .fastq.gz file
PlotQuality("path/to/my/file.fastq")

Nucleotide_Essentials.potential_mismatches (Function)

Returns a Vector{Any} of potential barcodes that differ from the input by a single nucleotide change, either a substitution or a deletion. The unaltered input barcode is included as the first element.

Supported keyword arguments include:

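  • Path::String: The barcode sequence to generate variants of, as a string of nucleotide bases (despite the argument name, the example below passes a barcode, not a file path)
  • mismatch::Int64: The number of altered nucleotides to include (only 1 is supported at this time)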

Example:

potential_mismatches("GCGT", 1)
17-element Vector{Any}:
"GCGT"
"CCGT" 
"ACGT" 
"TCGT" 
"GGGT" 
"GAGT" 
"GTGT" 
"GCCT" 
"GCAT" 
"GCTT" 
"GCGG" 
"GCGC" 
"GCGA" 
"CGT"
"GGT"
"GCT" 
"GCG"
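
The listing above is consistent with a simple enumeration: the original barcode, then every single-base substitution from left to right, then every single-base deletion. A minimal sketch of that enumeration follows; single_edit_variants and the base order "GCAT" are assumptions chosen to reproduce the output shown, not the package's source. Unlike the documented function, this sketch returns a Vector{String}.

```julia
# Hypothetical re-implementation for illustration; not the package's source.
function single_edit_variants(barcode::AbstractString; alphabet = "GCAT")
    variants = String[String(barcode)]       # the unaltered barcode comes first
    chars = collect(barcode)
    # Substitutions: replace each position with every other base
    for i in eachindex(chars), base in alphabet
        base == chars[i] && continue          # skip the base already present
        edited = copy(chars)
        edited[i] = base
        push!(variants, String(edited))
    end
    # Deletions: drop each position in turn
    for i in eachindex(chars)
        push!(variants, String(vcat(chars[1:i-1], chars[i+1:end])))
    end
    return variants
end

variants = single_edit_variants("GCGT")
```

For a barcode of length n this yields 1 + 3n + n entries. Barcodes with repeated adjacent bases produce duplicate deletion variants, which this sketch does not remove.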

Nucleotide_Essentials.reverse_complement (Function)

Takes a string of nucleotide bases and returns the reverse complement of that string. Accepts inputs of String and SubString{String} (input from a FastqRecord)

Supported keyword arguments include:

  • sequence::Union{String, SubString{String}}: A string sequence of nucleotide bases or sequence entry from a FastqRecord

Example:

reverse_complement("ATCGT")
"ACGAT"
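
For readers unfamiliar with the operation, a reverse complement can be sketched in a few lines. The names COMPLEMENT and revcomp_sketch are hypothetical, and handling of 'N' and lowercase input is an assumption of this sketch rather than documented behavior of reverse_complement.

```julia
# Hypothetical sketch, not the package's implementation.
const COMPLEMENT = Dict('A' => 'T', 'T' => 'A', 'C' => 'G', 'G' => 'C', 'N' => 'N')

function revcomp_sketch(sequence::AbstractString)
    # Reverse the sequence, then swap each base for its complement
    return join(COMPLEMENT[base] for base in reverse(uppercase(sequence)))
end

rc = revcomp_sketch("ATCGT")   # same input as the example above
```

Applying the function twice returns the original sequence, which is a convenient sanity check.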

Nucleotide_Essentials.demultiplex_se (Function)

Compares a list of provided barcodes with the provided multiplexed reads and separates the reads into individual .fastq files. If a barcode is found within the read, the barcode is removed from the sequence. The quality data of the reads is preserved and written to the output .fastq file. If a barcode is not found, the sequence and quality are written to the unassigned .fastq file unchanged.

The mapping file must be either a .csv or .txt file with two columns. The first column heading must be SampleID and the second column heading must be BarcodeSequence.

EXAMPLE MAPPING FILE:

SampleID    BarcodeSequence
Sample1     Barcode1
Sample2     Barcode2
Sample3     Barcode3
Sample4     Barcode4
Sample5     Barcode5
Sample6     Barcode6
Sample7     Barcode7
Sample8     Barcode8

Supported keyword arguments include:

  • R1::String: Path to multiplexed reads
  • Map::String: Path to the mapping file
  • mismatch::Int64=0 (optional): Number of allowed mismatches in the barcode, either 0 or 1. Allowing 1 mismatch significantly increases computation time. The default is 0 (exact matches only).
  • debug::Bool=false (optional): If true, a log file will be created and debugging data will be printed while the function is running (default is false).

Example:

demultiplex_se("multiplexreads.fastq", "mapping_file.csv")
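
The core assignment step can be sketched for a single read as follows. This is an illustration under stated assumptions: assign_read is a hypothetical name, the barcode is assumed to sit at the start of the read (the documentation only says "within the read"), only exact matches are handled, and the quality string is trimmed by the same number of characters so it stays aligned with the sequence.

```julia
# Hypothetical sketch, not part of Nucleotide_Essentials:
# assign one read to a sample by exact barcode match at the start of the read.
function assign_read(sequence::AbstractString, quality::AbstractString,
                     barcodes::Dict{String,String})
    for (sample, barcode) in barcodes
        if startswith(sequence, barcode)
            n = length(barcode)
            # Remove the barcode and the matching quality characters
            return (sample = sample, sequence = sequence[n+1:end], quality = quality[n+1:end])
        end
    end
    # No barcode found: the read is left unchanged
    return (sample = "unassigned", sequence = sequence, quality = quality)
end

# SampleID => BarcodeSequence, as would be read from the mapping file
barcodes = Dict("Sample1" => "ACGT", "Sample2" => "TTAG")
matched = assign_read("ACGTGGGCCC", "IIIIHHHGGG", barcodes)
unmatched = assign_read("CCCCGGGGAA", "IIIIIIIIII", barcodes)
```

Allowing one mismatch would mean testing each read against every single-edit variant of every barcode, which is why that option is slower.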

Nucleotide_Essentials.demultiplex_pe (Function)

Compares a list of provided barcodes with the provided paired-end multiplexed reads and separates the reads into individual .fastq files. R1 and R2 reads are handled in the same way: if a barcode is found within a read, the barcode is removed from the sequence, and the quality data of the read is preserved and written to the output .fastq file. If a barcode is not found, the sequence and quality are written unchanged to the unassigned .fastq file for that direction (R1 or R2).

Dual-indexed reads are not yet supported.

The mapping file must be either a .csv or .txt file with two columns. The first column heading must be SampleID and the second column heading must be BarcodeSequence.

EXAMPLE MAPPING FILE:

SampleID    BarcodeSequence
Sample1     Barcode1
Sample2     Barcode2
Sample3     Barcode3
Sample4     Barcode4
Sample5     Barcode5
Sample6     Barcode6
Sample7     Barcode7
Sample8     Barcode8

Supported keyword arguments include:

  • R1::String: Path to forward multiplexed reads
  • R2::String: Path to reverse multiplexed reads
  • Map::String: Path to the mapping file
  • mismatch::Int64=0 (optional): Number of allowed mismatches in the barcode, either 0 or 1. Allowing 1 mismatch significantly increases computation time. The default is 0 (exact matches only).
  • debug::Bool=false (optional): If true, a log file will be created and debugging data will be printed while the function is running (default is false).

Example:

demultiplex_pe("forward_multiplexreads.fastq", "reverse_multiplexreads.fastq", "mapping_file.csv")

Change Log

Nucleotide_Essentials v0.2.0

  • Added support for quality filtering of .fastq reads
  • Added support for Gzip compressed files
  • Performance improvements in PlotQuality() and added support for exporting quality plots
  • Added support for automatic quality profile encoding detection (Phred+64 and Phred+33 encoding)
  • Minor documentation updates