Medicinal Genomics’ 2011 Cannabis Genome Project

In 2011, Medicinal Genomics sequenced the entire genomes of Cannabis sativa and Cannabis indica, generating over 131 billion bases, and assembled the largest known gene collection for this therapeutic plant. Prior to Medicinal Genomics’ work, only 2 million bases of cannabis sequence had been deposited in GenBank, the sequence database provided by the National Center for Biotechnology Information (NCBI). Medicinal Genomics has published the raw reads from Cannabis sativa on Amazon’s EC2, a public cloud computing service, to encourage further scientific research.

Project CBD Article – “Sequencing the Cannabis Genome” featuring Medicinal Genomics

Key Stats

  • Diploid
  • 67% AT
  • Highly repetitive
  • Genome size estimated at 400Mb via next-generation sequencing; Kew Gardens estimates 700-800Mb
  • Highly Polymorphic (SNP every 10-20bp in Synthase genes)
  • Selective breeding for 30 years for chemotypic QTLs
    • 5-10 week breeding cycle
  • Sativa, Indica, Ruderalis likely interbred

Tools to Decode the Cannabis Genome:

  • Breeding collaborations with DNA Genetics in Amsterdam to obtain triple backcrossed Pure Indica DNA from Cannabis Cup winner LA Confidential
  • Roche has kindly provided early access to their 750bp run module on the GS-FLX+ platform. Over 15M 700bp reads (630bp average, 750bp mode) were obtained through their service center.
  • Preliminary assemblies were generated with CLC bio.
  • 131 Gb of Illumina HiSeq 2 x 100 reads from 230bp inserts applied to a Sativa/Hybrid cultivar ChemDawg.
  • Breeding collaboration with Greenhouse Seeds to advise on high CBD inbred landraces and Ruderalis DNA.
  • 92Mb of RNA-Seq data has become partially available via a web portal BLAST server at Medicinal Plant Genomics Resource.
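
As a rough sanity check on these numbers, the 131 Gb of Illumina data can be turned into an approximate fold coverage under each genome size estimate listed in Key Stats. This is a minimal sketch: the function name and the choice of 400Mb, 700Mb and 800Mb as the three scenarios are illustrative assumptions, and the calculation ignores read filtering, duplicates and organellar DNA.

```python
# Rough fold-coverage estimate: total sequenced bases / assumed genome size.
# The genome sizes below are the estimates quoted in Key Stats, not measured values.

def fold_coverage(total_bases: float, genome_size: float) -> float:
    """Return average sequencing depth for a given yield and genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size

ILLUMINA_BASES = 131e9  # 131 Gb of HiSeq 2 x 100 reads

GENOME_SIZE_ESTIMATES = {
    "NGS estimate, 400Mb": 400e6,
    "Kew Gardens low, 700Mb": 700e6,
    "Kew Gardens high, 800Mb": 800e6,
}

for label, size in GENOME_SIZE_ESTIMATES.items():
    print(f"{label}: ~{fold_coverage(ILLUMINA_BASES, size):.0f}x")
```

Even under the largest estimate the Illumina data alone corresponds to well over 100x nominal depth, so the gap between the two genome size estimates matters more for judging assembly completeness than for coverage.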

Read Length from 454 750bp Run Module:

FastQC Report on ILMN HiSeq Data

  • Fri 1 Jul 2011, lane2_49.5m.2.sequence.txt

[FAIL] Basic Statistics

Measure            Value
Filename           lane2_49.5m.2.sequence.txt
File type          Conventional base calls
Encoding           Illumina 1.5
Total Sequences    49500000
Sequence length    101
%GC                33
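
The figures in this table can be cross-checked with simple arithmetic: 49,500,000 reads of 101 bases give the yield of this one file, and the 33% GC agrees with the 67% AT listed in Key Stats. The sketch below shows both calculations. The helper names are hypothetical, and the GC function ignores N and other ambiguous bases, which may differ from how FastQC itself counts them.

```python
# Cross-check of the Basic Statistics values for one HiSeq file.

def total_bases(n_reads: int, read_length: int) -> int:
    """Total bases in a file of fixed-length reads."""
    return n_reads * read_length

def gc_percent(sequence: str) -> float:
    """Percent G+C among unambiguous bases (A, C, G, T) in a sequence."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return 100.0 * gc / (gc + at)

# Values taken from the table above.
print(total_bases(49_500_000, 101))   # bases in lane2_49.5m.2.sequence.txt
print(100 - 33)                       # %AT implied by the reported %GC

# Toy sequence, for illustration only (not real cannabis data).
print(round(gc_percent("ATGCAT"), 1))
```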

[FAIL] Per base sequence quality

Per base quality graph
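
The report lists the encoding as Illumina 1.5, which stores each quality score as an ASCII character offset by 64 rather than the offset of 33 used by Sanger-style FASTQ. A minimal sketch of decoding such a quality string follows; the function name and the example string are made up for illustration.

```python
# Decode an Illumina 1.5 (Phred+64) quality string into integer Phred scores.

ILLUMINA_15_OFFSET = 64  # Sanger / Illumina 1.8+ files use 33 instead

def decode_phred64(quality: str) -> list[int]:
    """Convert a Phred+64 quality string to a list of Phred scores."""
    scores = [ord(ch) - ILLUMINA_15_OFFSET for ch in quality]
    if any(s < 0 for s in scores):
        raise ValueError("character below '@': probably not Phred+64 encoded")
    return scores

# Hypothetical quality string: high-quality start, low-quality 'B' tail.
example = "hhhgfBBB"
print(decode_phred64(example))
```

Reading such a file with a tool that assumes the offset of 33 would inflate every score by 31, so the encoding line is worth checking before any quality trimming.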

[FAIL] Per sequence quality scores

Per Sequence quality graph

[FAIL] Per base sequence content

Per base sequence content

[FAIL] Per base GC content

Per base GC content graph

[FAIL] Per sequence GC content

Per sequence GC content graph

[FAIL] Per base N content

N content graph

[FAIL] Sequence Length Distribution

Sequence length distribution

[FAIL] Sequence Duplication Levels

Duplication level graph

[FAIL] Overrepresented sequences

No overrepresented sequences

[FAIL] Kmer Content

Kmer graph

Sequence    Count      Obs/Exp Overall    Obs/Exp Max    Max Obs/Exp Position
GGGGG       3417040    5.193743           9.2707405      95-96
CCCCC       2201195    3.5637             4.859973       95-96
GGGCC       2064415    3.2180467          3.7392786      3
GGCCC       1983870    3.131778           3.508401       95-96
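
An observed/expected ratio like those in the table compares how often a k-mer occurs with how often it would occur if bases were drawn independently at their overall frequencies. In a 33% GC library, G-rich and C-rich 5-mers have a low expected count, so even modest enrichment shows up as a high ratio. The code below is a simplified sketch of that idea, not FastQC's actual implementation; the function name and the toy reads are assumptions.

```python
# Simplified k-mer observed/expected ratio, assuming independent bases.
from collections import Counter
from math import prod

def kmer_obs_exp(reads: list[str], k: int = 5) -> dict[str, float]:
    """Return observed/expected ratio for every k-mer seen in the reads."""
    valid = set("ACGT")
    base_counts: Counter = Counter()
    kmer_counts: Counter = Counter()
    for read in reads:
        read = read.upper()
        base_counts.update(b for b in read if b in valid)
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= valid:  # skip k-mers containing N
                kmer_counts[kmer] += 1

    n_bases = sum(base_counts.values())
    n_kmers = sum(kmer_counts.values())
    if n_bases == 0 or n_kmers == 0:
        return {}

    freq = {b: base_counts[b] / n_bases for b in valid}
    # Expected count = total k-mers * product of the k-mer's base frequencies.
    return {
        kmer: count / (n_kmers * prod(freq[b] for b in kmer))
        for kmer, count in kmer_counts.items()
    }

# Toy reads, for illustration only.
ratios = kmer_obs_exp(["AAAAAA", "CCCCCC"], k=5)
print(ratios)
```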


Sequencing DNA Assembly

The observed insert size came back at 238bp, although the expected range had been set at 250-600bp. Stringency was set at 0.9.



Assembly reports for 454 Assemblies with 20 SFF files

PCR and Sanger sequencing show a high degree of polymorphism in the THC synthase genes. The assembly implies that these regions are easily amplified nonspecifically by PCR, which may create artifacts.