Processing and Analyzing FAST5 files from a minION Run
- Author
Michael Hall
- Date
4/25/2022
FAST5 Files
FAST5 files are raw data produced from a MinION Run. In this tutorial I show two major data processing steps after DNA has been extracted from a model organism. An experiment was done on a Coffee Plant in this case and two fastq files were produced, one failed, the other succeeded. First I would like to determine the quality of the data using Qualimap. Second I would like to run FAST5 files through an R Script pipleine that also assess its quality.
Installation
It is necessary to clone a github repository to do the analysis from my account. First clone it https://github.com/PBGLMichaelHall/FAST5.git.
Data Tree
# I attempt to showcase how to run a FASTQ file through a bioinformatics pipleine
# Downstream Analysis using Basecalling with Guppy
# Invoke guppy_basecaller
# Provide all relevant flags and input parameters
guppy_basecaller --compress_fastq -i fast5/guppy_out/ -f FLO-FLG001 -k SQK-LSK109 --cpu_threads_per_caller 4 --num_callers 1
guppy_basecaller --compress_fastq -i fast5/ -s fast5/guppy_out/ --flowcell FLO-FLG001 --kit SQK-LSK109 --cpu_threads_per_caller 4 --num_callers 1
--compress_fastq Compress fastq output files with gzip.
-i Path to input fast5 files.
-s Path to save fastq files.
--flowcell Flowcell to find a configuration for
--kit Kit to find a configuration for
--cpu_threads_per_caller Number of CPU worker threads per
--num_callers Number of parallel basecallers to create.
# Clone github graphmap repository and make modules
git clone https://github.com/isovic/graphmap.git
cd graphmap
make modules
make
# Specify System file path to graphmap executable program
~/graphmap/bin/Linux-x64/
# Locate fast5 and reference genome fasta files and invoke graphmap program to align
../../../graphmap/bin/Linux-x64/graphmap align -r CoffeeArabica_Cara/ncbi_dataset/data/GCF_003713225.1/Coffee.fna
-t 4 -d fast5/guppy_out/pass/fastq_runid_94dd8d03aa43bb3ac59d61d6b73a0ac5e939526b_0_0.fastq.gz
-o map_to_ref/nanopore.graphmap.sam > map_to_ref/nanopore.graphmap.sam.log 2>&1
-r reference genonme
-t Threads
-d Directory of FastQ file from FAST5 raw data
-o output directory to file ending with .sam
> create log of run
# Invoke samtools stats on sam file
samtools stats -d -@ 4 nanopore.graphmap.sam > nanopore.graphmap.sam.stats
# Inpect first 40 lines of header
head -n 40 nanopore.graphmap.sam.stats
# Invoke samtools view and condition on samtools sort to make a sorted bam file
samtools view -@ 4 -bS nanopore.graphmap.sam | samtools sort - -@ 4
-o nanopore.graphmap.sorted.bam
# Create a Qualimap Environment
conda create --name Qualimap
conda activate Qualimap
conda install -c bioconda qualimap
# Invoke qualimap on sorted bam file an HTML report should be generated automatically
qualimap bamqc -bam nanopore.graphmap.sorted.bam -nw 5000 -nt 4 -outdir nanopore.graphmap
# Index Reference Genome
bwa index Coffee.fna
# make a Directory for vcf file
mkdir vcfplots
# Invoke samtools pileup to align indexed fasta to sorted bam
samtools mpileup -g -f CoffeeArabica_Cara/ncbi_dataset/data/GCF_003713225.1/Coffee.fna map_to_ref/nanopore.graphmap.sorted.bam
| bcftools call -mv -o vcfplots/all.vcf
# Filter for SNPs type and Biallelic sites only
bcftools view -m2 -M2 -v snps -o BIALLELIC~ONLY.vcf all.vcf
# How many variants were called
bcftools view BIALLELIC~ONLY.vcf | wc -l
1468
# run IGV program on Reference Genome, Sorted Bam, and VCF
~/IGV_Linux_2.11.9
bash igv.sh
Load Coffee Arabica Genome
# Use Rsubread package to find genes with mapped read
setwd("/home/michael/FAST5/CoffeeMinIon/20220414_1039_MN19654_AJF976_ed35bf91/map_to_ref")
# Sorted BAM file
# gene annotation file
FeatureCounts<-
Rsubread::featureCounts(files = "nanopore.graphmap.sorted.bam",
annot.ext ="../GCF_003713225.1_Cara_1.0_genomic.gff.gz",isGTFAnnotationFile = TRUE,GTF.featureType = "gene", GTF.attrType = "ID")
annotation <- FeatureCounts$annotation
stat <- FeatureCounts$stat
counts <- FeatureCounts$counts
ASC <- cbind(annotation,counts)
threshold <- 0
ASC <- ASC[(apply(counts,1,min)) > threshold,]
print(ASC)
GeneID Chr Start End Strand Length nanopore.graphmap.sorted.bam
gene-LOC113688632 gene-LOC113688632 NC_039898.1 1392954 1395287 + 2334 81
gene-LOC113697586 gene-LOC113697586 NC_039899.1 884021 886305 + 2285 118
gene-LOC113697595 gene-LOC113697595 NC_039899.1 893020 895356 + 2337 148
gene-LOC113731239 gene-LOC113731239 NC_039901.1 14523710 14529861 - 6152 1
gene-LOC113734903 gene-LOC113734903 NC_039902.1 26262544 26264446 + 1903 1
gene-LOC113697766 gene-LOC113697766 NC_039910.1 22728667 22733865 + 5199 1
gene-LOC113708575 gene-LOC113708575 NC_039914.1 2245618 2250386 - 4769 1
gene-LOC113708187 gene-LOC113708187 NC_039914.1 7384813 7386075 - 1263 3
gene-LOC113707907 gene-LOC113707907 NC_039914.1 7430405 7432582 - 2178 367
gene-LOC113708510 gene-LOC113708510 NC_039914.1 7473077 7475443 - 2367 122
gene-LOC113709599 gene-LOC113709599 NC_039915.1 6555704 6558053 - 2350 5
gene-LOC113710584 gene-LOC113710584 NC_039915.1 6598426 6623704 - 25279 1
gene-LOC113710465 gene-LOC113710465 NC_039915.1 6674915 6677084 - 2170 114
gene-LOC113710464 gene-LOC113710464 NC_039915.1 6714444 6716737 - 2294 90
gene-CoarCp011 gene-CoarCp011 NC_008535.1 16850 21025 - 4176 1
# Look at IGV For the first entry printed ASC object
GeneID: gene-LOC113688632
Chromosome: NC_039898.1
Start: 1392954
End: 1395287
Mapped Reads: 81
We want to make a GRanges object that provides variant calls in CODING regions/locations only
setwd("/home/michael/FAST5/CoffeeMinIon/20220414_1039_MN19654_AJF976_ed35bf91")
library(vcfR)
library(VariantAnnotation)
library(GenomicFeatures)
vcf <- readVcf(file = "vcfplots/BIALLELIC~ONLY.vcf")
rd <- rowRanges(vcf)
# convert annotations to TxDb object
GFFTXB<-makeTxDbFromGFF(file="GCF_003713225.1_Cara_1.0_genomic.gff.gz")
Locate Variants
loc <- locateVariants(rd, GFFTXB, CodingVariants())
How many variants were called in coding locations
length(loc)
38 ranges which means 38 total variants were located wihin a coding region with a particular geneID
#Inspect the head
head(loc,10)
GRanges object with 10 ranges and 9 metadata columns:
seqnames ranges strand | LOCATION LOCSTART LOCEND QUERYID TXID
Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer> <character>
NC_039898.1:1393106_A/G NC_039898.1 1393106 + | coding 23 23 23 186
NC_039898.1:1393391_T/A NC_039898.1 1393391 + | coding 160 160 24 186
NC_039898.1:1395076_G/A NC_039898.1 1395076 + | coding 1134 1134 26 186
NC_039899.1:893176_A/G NC_039899.1 893176 + | coding 23 23 288 3946
NC_039899.1:893660_T/C NC_039899.1 893660 + | coding 360 360 289 3946
NC_039899.1:894131_G/A NC_039899.1 894131 + | coding 574 574 293 3946
NC_039899.1:894232_C/G NC_039899.1 894232 + | coding 675 675 294 3946
NC_039899.1:894271_G/A NC_039899.1 894271 + | coding 714 714 295 3946
NC_039901.1:14526465_T/A NC_039901.1 14526465 - | coding 1144 1144 369 20947
NC_039901.1:14526465_T/A NC_039901.1 14526465 - | coding 1144 1144 369 20948
CDSID GENEID PRECEDEID FOLLOWID
IntegerList> <character> <CharacterList> <CharacterList>
NC_039898.1:1393106_A/G 308 LOC113688632
NC_039898.1:1393391_T/A 309 LOC113688632
NC_039898.1:1395076_G/A 312 LOC113688632
NC_039899.1:893176_A/G 14627 LOC113697595
NC_039899.1:893660_T/C 14628 LOC113697595
NC_039899.1:894131_G/A 14629 LOC113697595
NC_039899.1:894232_C/G 14629 LOC113697595
NC_039899.1:894271_G/A 14629 LOC113697595
NC_039901.1:14526465_T/A 100255,100256,100257,... LOC113731239
NC_039901.1:14526465_T/A 100255,100256,100257,... LOC113731239
So the head shows there are three variants called on the LOC113688632 gene
Failed Minion Run QC Report
setwd("~/FAST5/CoffeeMinIon/20220414_1039_MN19654_AJF976_ed35bf91/fast5/guppy_out")
library(fastqcr)
fastqc_install()
fastqc(fq.dir = "/home/michael/FAST5/CoffeeMinIon/20220414_1039_MN19654_AJF976_ed35bf91/fast5/guppy_out/fail",threads = 4)
FASTQC Directory and HTML OUTPUT
Summary
Basic Statistics
Per base sequence quality
Per sequence quality scores
Per base sequence content
Per sequence GC content
Per base N content
Sequence Length Distribution
Sequence Duplication Levels
Overrepresented sequences
Adpater Content
===============================
Successful Minion Run QC Report
# Set the working directory
setwd("~/FAST5/CoffeeMinIon/20220414_1039_MN19654_AJF976_ed35bf91/fast5/guppy_out")
# Confirm you have fastqcr loaded in your namespace and attached
library(fastqcr)
# If not install it
fastqc_install()
# Call the fastqc function from the package provided passed fastq directory as input argument and specify thread count
fastqc(fq.dir = "/home/michael/FAST5/CoffeeMinIon/20220414_1039_MN19654_AJF976_ed35bf91/fast5/guppy_out/pass",threads = 4)
FASTQC Directory and HTML OUTPUT
Summary
Basic Statistics
Per base sequence quality
Per sequence quality scores
Per base sequence content
Per sequence GC content
Per base N content
Sequence Length Distribution
Sequence Duplication Levels
Overrepresented sequences
Adpater Content
==========
MinIONQC.R
First we want to reorgainze the data structure so the RScript from https://github.com/roblanf/minion_qc can plot,interpret and analyze the MinION reads. Create a summary directory and place final_summary.txt and sequencing_summary.txt. Copy the MinIONQC.R Script into main directory and run the following command line. It will produce a standard output if successful
Channel Summary
Flowcell Overview
GB Per Channel Overview
Length Per Hour
Length Histogram
Length vs. Quality
Quality by Hour
Quality Histogram
Reads Per Hour
Yield by Length
Yield Over Time