Medicinal Genomics’ 2011 Cannabis Genome Project
In 2011, Medicinal Genomics sequenced the entire genome of Cannabis sativa and Cannabis indica, over 131 billion bases, and assembled the largest known gene collection of this therapeutic plant. Prior to Medicinal Genomics’ work, only 2 million bases of cannabis sequence had been deposited in GenBank, a sequence database provided by the National Center for Biotechnology Information (NCBI). Medicinal Genomics has published the raw reads from Cannabis sativa on Amazon’s EC2, a public cloud computing service, to encourage further scientific research.
Project CBD Article – “Sequencing the Cannabis Genome” featuring Medicinal Genomics
Key Stats
- Diploid
- 67% AT
- Highly repetitive
- Genome size: about 400Mb estimated via next-generation sequencing; Kew Gardens estimates 700-800Mb
- Highly Polymorphic (SNP every 10-20bp in Synthase genes)
- Selective breeding for 30 years for chemotypic QTLs
- 5-10 week breeding cycle
- Sativa, Indica, Ruderalis likely interbred
Tools to Decode the Cannabis Genome:
- Breeding collaborations with DNA Genetics in Amsterdam to obtain triple backcrossed Pure Indica DNA from Cannabis Cup winner LA Confidential
- Roche has kindly provided early access to their 750bp run module on the GS-FLX+ platform. Over 15M 700bp reads (630 average, 750 mode) were obtained through their service center.
- Preliminary assemblies were produced with CLC bio tools.
- 131 Gb of Illumina HiSeq 2 x 100 reads from 230bp inserts applied to a Sativa/Hybrid cultivar ChemDawg.
- Breeding collaboration with Greenhouse Seeds to advise on high CBD inbred landraces and Ruderalis DNA.
- 92Mb of RNA-Seq data has become partially available via a BLAST server web portal at the Medicinal Plant Genomics Resource: http://medicinalplantgenomics.msu.edu/
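To put the read totals above in context, average sequencing depth can be approximated as total sequenced bases divided by genome size. The sketch below applies that arithmetic to the figures quoted on this page. The function and constant names are illustrative, the 454 total is a lower bound derived from "over 15M" reads at the 630bp average, and the result is a rough per-data-set depth, not an assembly statistic reported by the project.

```python
def fold_coverage(total_bases: float, genome_size: float) -> float:
    """Approximate average depth: total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# Figures quoted on this page (names are illustrative).
ILLUMINA_BASES = 131e9         # 131 Gb of HiSeq 2 x 100 reads
ROCHE_454_BASES = 15e6 * 630   # "over 15M" reads x 630 bp average (lower bound)

# The two genome size estimates listed under Key Stats.
GENOME_ESTIMATES = {
    "NGS estimate (400 Mb)": 400e6,
    "Kew Gardens low (700 Mb)": 700e6,
    "Kew Gardens high (800 Mb)": 800e6,
}


def coverage_table() -> dict:
    """Map each genome size estimate to (Illumina depth, 454 depth)."""
    return {
        label: (
            fold_coverage(ILLUMINA_BASES, size),
            fold_coverage(ROCHE_454_BASES, size),
        )
        for label, size in GENOME_ESTIMATES.items()
    }


if __name__ == "__main__":
    for label, (ilmn, r454) in coverage_table().items():
        print(f"{label}: Illumina ~{ilmn:.0f}x, 454 ~{r454:.1f}x")
```

Because the genome size estimates differ by a factor of two, the implied depth differs by the same factor, which is worth keeping in mind when reading the assembly reports further down.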
Read Length from 454 750bp Run Module:
FastQC Report on ILMN HiSeq Data
- Fri 1 Jul 2011, lane2_49.5m.2.sequence.txt
Summary
Basic Statistics
Measure | Value |
---|---|
Filename | lane2_49.5m.2.sequence.txt |
File type | Conventional base calls |
Encoding | Illumina 1.5 |
Total Sequences | 49500000 |
Sequence length | 101 |
%GC | 33 |
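The Basic Statistics values above can be approximated directly from a FASTQ file. The following is a minimal sketch, not FastQC's own implementation: it assumes plain four-line FASTQ records, computes %GC over called A/C/G/T bases only (an assumption; FastQC's handling of N may differ), and decodes quality strings with the ASCII offset of 64 used by the Illumina 1.5 encoding. The function names and the demo records are hypothetical.

```python
import io
from collections import Counter


def fastq_basic_stats(handle) -> dict:
    """Summarise a four-line-per-record FASTQ stream."""
    total = 0
    lengths = Counter()
    gc = 0
    called = 0
    for i, line in enumerate(handle):
        if i % 4 != 1:          # only the sequence line of each record
            continue
        seq = line.strip().upper()
        total += 1
        lengths[len(seq)] += 1
        gc += seq.count("G") + seq.count("C")
        called += sum(seq.count(base) for base in "ACGT")
    return {
        "total_sequences": total,
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
        "percent_gc": 100.0 * gc / called if called else 0.0,
    }


def decode_phred64(quality: str) -> list:
    """Convert an Illumina 1.5 (Phred+64) quality string to integer scores."""
    return [ord(ch) - 64 for ch in quality]


# Tiny made-up example, not project data.
DEMO = "@r1\nACGTN\n+\nhhhhB\n@r2\nGGCCA\n+\nhhhhh\n"

if __name__ == "__main__":
    print(fastq_basic_stats(io.StringIO(DEMO)))
    print(decode_phred64("hhhhB"))
```

For a real run, the same function could be given an open file handle such as `open("lane2_49.5m.2.sequence.txt")`, assuming the file is uncompressed FASTQ.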
Plot sections (figures not included): Per base sequence quality, Per sequence quality scores, Per base sequence content, Per base GC content, Per sequence GC content, Per base N content, Sequence Length Distribution, Sequence Duplication Levels
Overrepresented sequences
No overrepresented sequences
Kmer Content
Sequence | Count | Obs/Exp Overall | Obs/Exp Max | Max Obs/Exp Position |
---|---|---|---|---|
GGGGG | 3417040 | 5.193743 | 9.2707405 | 95-96 |
CCCCC | 2201195 | 3.5637 | 4.859973 | 95-96 |
GGGCC | 2064415 | 3.2180467 | 3.7392786 | 3 |
GGCCC | 1983870 | 3.131778 | 3.508401 | 95-96 |
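The Obs/Exp columns compare how often a k-mer was seen with how often it would be expected from base composition alone. The sketch below is a simplified illustration of that idea, not FastQC's exact procedure: it assumes the expected count of a k-mer is the total number of k-mers multiplied by the product of the library-wide frequencies of its bases, and it ignores per-position enrichment (the "Max" columns). The function name and example reads are hypothetical.

```python
from collections import Counter
from math import prod


def kmer_obs_exp(reads, k: int = 5) -> dict:
    """Return observed/expected ratios for every k-mer seen in the reads."""
    base_counts = Counter()
    kmer_counts = Counter()
    valid = set("ACGT")
    for read in reads:
        read = read.upper()
        base_counts.update(b for b in read if b in valid)
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= valid:      # skip k-mers containing N
                kmer_counts[kmer] += 1

    total_bases = sum(base_counts.values())
    total_kmers = sum(kmer_counts.values())
    if total_bases == 0 or total_kmers == 0:
        return {}

    # Library-wide base frequencies drive the expectation.
    freq = {b: base_counts[b] / total_bases for b in "ACGT"}
    ratios = {}
    for kmer, observed in kmer_counts.items():
        expected = total_kmers * prod(freq[b] for b in kmer)
        ratios[kmer] = observed / expected
    return ratios


if __name__ == "__main__":
    demo_reads = ["GGGGGGGGGG", "ACGTACGTAC"]  # made-up reads
    for kmer, ratio in sorted(kmer_obs_exp(demo_reads).items()):
        print(kmer, round(ratio, 3))
```

A ratio above 1, as for GGGGG in the table, means the k-mer occurs more often than base composition alone would predict under this simple model.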
Sequencing DNA Assembly
The measured insert size came back at 238bp, although the expected range had been set to 250-600bp. Stringency was set to 0.9.
- Prep7_1-3_1_sequence (paired) summary report-2
- Prep7_1-1_1_Orig_Assembly_Stats (paired) summary report
- Prep7_1-7_1_Lanes7_8_sequence (paired) de novo assembly report
BLAST Reports of All 174K Contigs Against THC Synthase
- Prep7_1-3_1_sequence(paired)contig18025 BLAST
- Prep7_1-3_1_sequence(paired)contig19853 BLAST
- Prep7_1-3_1_sequence(paired)contig28261 BLAST
- Prep7_1-3_1_sequence(paired)contig31598 BLAST
- Prep7_1-3_1_sequence(paired)contig32061 BLAST
- Prep7_1-3_1_sequence(paired)contig38887 BLAST
- Prep7_1-3_1_sequence(paired)contig54345 BLAST
- Prep7_1-3_1_sequence(paired)contig69418 BLAST
- Prep7_1-3_1_sequence(paired)contig70172 BLAST
- Prep7_1-3_1_sequence(paired)contig73347 BLAST
- Prep7_1-3_1_sequence(paired)contig87013 BLAST
- Prep7_1-3_1_sequence(paired)contig108128 BLAST
- Prep7_1-3_1_sequence(paired)contig137854 BLAST
- Prep7_1-3_1_sequence(paired)contig165876 BLAST
Assembly reports for 454 Assemblies with 20 SFF files
- LA-454_Assem10_0.8_20FSFF_(single) summary report
- Assem_11_20SFF_0.9_Vote_summary report
- LA-454_Assem_12- summary report
- LA-454_Assem_13_32SFF_0.99_remap(single) summary report