Sequencing Read QC

Complexity of the Cannabis Genome:

  • Estimated at 400Mb by Next Gen Sequencing; Kew Gardens estimates 700-800Mb
  • Diploid
  • 67% AT
  • Highly Polymorphic (SNP every 10-20bp in Synthase genes)
  • Highly repetitive
  • Selective breeding for 30 years for chemotypic QTLs
    • 5-10 week breeding cycle
  • Sativa, Indica, Ruderalis likely interbred

Due to this complexity we have incorporated a diverse set of tools to decode the Cannabis Genome.

  • Breeding collaborations with DNA Genetics in Amsterdam to obtain triple backcrossed Pure Indica DNA from Cannabis Cup winner LA Confidential
  • Roche has kindly provided early access to their 750bp run module on the GS-FLX+ platform. Over 15M reads of roughly 700bp (630bp average, 750bp mode) were obtained through their service center. Preliminary assemblies are being done with CLC bio.
  • 131 Gb of Illumina HiSeq 2 x 100 reads from 230bp inserts, sequenced from the Sativa/Hybrid cultivar ChemDawg.
  • Breeding collaboration with Greenhouse Seeds to advise on high CBD inbred landraces and Ruderalis DNA.
  • We are investigating Long Mate Pair SOLiD sequencing for super contig generation and Ion Torrent for validation.
  • 92Mb of RNA-Seq data became partially available this summer via a web portal BLAST server at the Medicinal Plant Genomics Resource (unrelated, and from different cultivars). http://medicinalplantgenomics.msu.edu/
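
The read totals listed above can be turned into rough fold-coverage figures by dividing sequenced bases by genome size. The sketch below works that arithmetic against both the 400Mb estimate and the 700-800Mb Kew range. It assumes "131 Gb" means gigabases, and it treats the Roche total as a lower bound of 15M reads at the 630bp average; the function and variable names are illustrative only.

```python
def fold_coverage(total_bases, genome_size):
    """Average depth: sequenced bases divided by genome size in bases."""
    return total_bases / genome_size


# Figures quoted in the list above (assumption: "Gb" = gigabases).
ILLUMINA_BASES = 131e9            # 131 Gb of HiSeq 2 x 100 reads
ROCHE_READS = 15e6                # "over 15M" reads, so a lower bound
ROCHE_MEAN_LENGTH = 630           # average read length in bp
roche_bases = ROCHE_READS * ROCHE_MEAN_LENGTH

GENOME_SIZES = {
    "400Mb (NGS estimate)": 400e6,
    "700Mb (Kew, low)": 700e6,
    "800Mb (Kew, high)": 800e6,
}

for label, size in GENOME_SIZES.items():
    illumina = fold_coverage(ILLUMINA_BASES, size)
    roche = fold_coverage(roche_bases, size)
    print(f"{label}: Illumina ~{illumina:.0f}x, Roche ~{roche:.1f}x")
```

Under these assumptions the Illumina data alone is roughly 330x against the 400Mb estimate and about 160x to 190x against the Kew range, while the Roche reads add at least 12x to 24x of long-read depth.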

Read Length from 454 750bp Run Module

FastQC Report on ILMN HiSeq Data

Fri 1 Jul 2011
lane2_49.5m.2.sequence.txt

[FAIL] Basic Statistics

Measure            Value
Filename           lane2_49.5m.2.sequence.txt
File type          Conventional base calls
Encoding           Illumina 1.5
Total Sequences    49500000
Sequence length    101
%GC                33
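
As a cross-check on the table above, the same three measures (total sequences, sequence length, %GC) can be recomputed directly from a FASTQ file. The following is a minimal sketch, not FastQC's own code: it assumes plain four-line FASTQ records, and it computes %GC over A, C, G and T only, ignoring N calls, which is an assumption about how the figure is defined. The helper names and the two toy reads are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        sequence = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip()
        yield header, sequence, quality


def basic_stats(handle):
    """Return read count, min/max read length and %GC for a FASTQ stream."""
    total = gc = acgt = 0
    lengths = []
    for _, sequence, _ in read_fastq(handle):
        sequence = sequence.upper()
        total += 1
        lengths.append(len(sequence))
        gc += sequence.count("G") + sequence.count("C")
        acgt += sum(sequence.count(base) for base in "ACGT")
    return {
        "total_sequences": total,
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
        "percent_gc": 100.0 * gc / acgt if acgt else 0.0,
    }


# Toy input standing in for a real file such as the lane shown above.
demo = io.StringIO("@r1\nATATGC\n+\nhhhhhh\n@r2\nAATTNG\n+\nhhhhBB\n")
stats = basic_stats(demo)
print(stats)
```

For a real run the `io.StringIO` object would be replaced by an open file handle. A %GC of 33 is consistent with the 67% AT content quoted for the genome.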

[FAIL] Per base sequence quality

Per base quality graph
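
The per base quality plot is built from the quality strings in the FASTQ file. The "Illumina 1.5" encoding reported above stores each Phred score as an ASCII character offset by 64, so decoding is a subtraction. The sketch below shows that decoding and a per-position mean; the function names are illustrative, and the mean is only one of the summaries such a plot can show.

```python
PHRED64_OFFSET = 64  # Illumina 1.5 quality strings use a Phred+64 ASCII offset


def decode_phred64(quality_string):
    """Convert one Phred+64 quality string into a list of integer scores."""
    return [ord(ch) - PHRED64_OFFSET for ch in quality_string]


def mean_quality_per_position(quality_strings):
    """Average the decoded score at each read position across all reads."""
    sums, counts = [], []
    for quality_string in quality_strings:
        for i, score in enumerate(decode_phred64(quality_string)):
            if i == len(sums):       # first read long enough to reach position i
                sums.append(0)
                counts.append(0)
            sums[i] += score
            counts[i] += 1
    return [s / c for s, c in zip(sums, counts)]


# 'h' decodes to Q40; 'B' decodes to Q2, which Illumina 1.5 uses to flag
# low-quality read ends rather than as a literal error probability.
example = ["hh", "dB"]
print(mean_quality_per_position(example))
```

Tools that expect the Sanger (Phred+33) convention will misread these scores unless told about the offset, so it is worth confirming the encoding before trimming or filtering.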

[FAIL] Per sequence quality scores

Per Sequence quality graph

[FAIL] Per base sequence content

Per base sequence content

[FAIL] Per base GC content

Per base GC content graph

[FAIL] Per sequence GC content

Per sequence GC content graph

[FAIL] Per base N content

N content graph

[FAIL] Sequence Length Distribution

Sequence length distribution

[FAIL] Sequence Duplication Levels

Duplication level graph

[FAIL] Overrepresented sequences

No overrepresented sequences

[FAIL] Kmer Content

Kmer graph

Sequence  Count    Obs/Exp Overall  Obs/Exp Max  Max Obs/Exp Position
GGGGG     3417040  5.193743         9.2707405    95-96
CCCCC     2201195  3.5637           4.859973     95-96
GGGCC     2064415  3.2180467        3.7392786    3
GGCCC     1983870  3.131778         3.508401     95-96
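
An observed/expected ratio like those in the table compares how often a k-mer is seen with how often it would be seen if bases were drawn independently at the library's overall base frequencies. The sketch below illustrates that idea for 5-mers. It is a simplified model under that independence assumption and is not claimed to reproduce FastQC's exact calculation (for example, it does no read sampling and no per-position breakdown); the function name and the toy reads are hypothetical.

```python
from collections import Counter


def kmer_obs_exp(reads, k=5):
    """Return {kmer: observed / expected} using overall base frequencies."""
    base_counts = Counter()
    kmer_counts = Counter()
    for read in reads:
        base_counts.update(read)
        for i in range(len(read) - k + 1):
            kmer_counts[read[i:i + k]] += 1

    total_bases = sum(base_counts.values())
    total_windows = sum(kmer_counts.values())
    base_freq = {base: n / total_bases for base, n in base_counts.items()}

    ratios = {}
    for kmer, observed in kmer_counts.items():
        probability = 1.0
        for base in kmer:            # independence assumption: multiply base frequencies
            probability *= base_freq[base]
        expected = probability * total_windows
        ratios[kmer] = observed / expected
    return ratios


# Toy reads: a G homopolymer and an AT repeat.
ratios = kmer_obs_exp(["GGGGGG", "ATATAT"], k=5)
for kmer, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
    print(kmer, ratio)
```

In an AT-rich library the expected count of GGGGG or CCCCC is low, so a modest number of G or C homopolymer runs near the read ends (positions 95-96 in the table) is enough to produce a high ratio.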