Gossypium raimondii (D5) genome JGI assembly v2.0 (annot v2.1)

Genome Overview
Analysis Name: Gossypium raimondii (D5) genome JGI assembly v2.0 (annot v2.1)
Method: Arachne2 (modified)
Source: Sanger based sequence, Roche 454, and Illumina based short reads
Date performed: 2013-02-18

Please Note: This genome assembly is made available through a "Reserved Analysis" restriction. Please see the usage policy below for further details.

The following text comes from phytozome.org:

Overview
This v2.1 annotation release is on genome assembly v2.0, a high quality version of the Cotton D (Gossypium raimondii) genome sequenced from DNA provided by Andrew Paterson at Univ. GA. It was sequenced with a combination of Sanger, Roche 454 pyrosequencing and Illumina read pairs. This release includes additional screening of small repetitive contigs and a new map integration that corrects several orientation issues within scaffolds.

Statistics

 Assembly Summary
 Scaffold total 1,033
 Contig total 19,735
 Scaffold sequence total 761.4 Mb
 Contig sequence total 748.1 Mb ( -> 1.7% gap)
 Scaffold N50 (L50) 6 (62.2 Mb)
 Contig N50 (L50) 1,596 (135.6 kb)
 41 scaffolds are > 50 kb in size, representing approximately 99.0% of the genome
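
The summary figures above can be reproduced from a list of sequence lengths. The following is a minimal sketch, not the JGI pipeline: the function names and the toy lengths are hypothetical. It returns the count and the length in the same order the table reports them (a count of 6 and a length of 62.2 Mb for scaffolds); note that other tools may attach the labels N50 and L50 the other way around. The gap figure is simply the share of scaffold sequence not covered by contigs.

```python
def n50_stats(lengths):
    """Return (count, length): how many sequences, taken longest first, are
    needed to reach half the total length, and the length of the last one added."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return count, length
    return 0, 0  # empty input


def gap_fraction(scaffold_total, contig_total):
    """Fraction of scaffold sequence that is not covered by contigs."""
    return (scaffold_total - contig_total) / scaffold_total


# Hypothetical toy lengths, not the real scaffold sizes.
toy_lengths = [50, 40, 30, 20, 10]
print(n50_stats(toy_lengths))

# Totals in Mb taken from the summary table above.
print(round(100 * gap_fraction(761.4, 748.1), 1))
```

With the totals from the table, the second call gives the 1.7% gap quoted in the summary.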


Assembly Details

This release is a high quality version of the Cotton D genome from DNA provided by Andrew Paterson at Univ. GA. It was sequenced with a combination of Sanger based sequence (1.52x assembled coverage, with 0.95x coverage from BAC end sequence and fosmid end sequence), Roche 454 pyrosequencing (14.95x linear and 3.1x non-redundant pairs assembled coverage), and Illumina based short reads (primarily to correct 454 insertion/deletion errors). The genome was assembled using our modified version of Arachne2.

V2.0 includes the removal of small repetitive contigs within scaffolds and a new detailed map integration with the recently available tetraploid map that was used to correct several scaffold orientation issues. The combination of:
  • BES/markers hybridized to FPC contigs (Lin et al., 2010)
  • Genetic map for the diploid (Rong et al., 2004)
  • Tetraploid map (Byers et al., 2012)
  • Vitis vinifera and Theobroma cacao synteny
was used to identify misjoins in the assembly. Misjoins were characterized by an abrupt change in linkage group (or synteny) within a region of low BAC/fosmid support. A total of 13 misjoins were identified and subsequently broken.

Scaffolds were oriented, ordered, and joined together using the aforementioned resources. A total of 51 joins were assembled to form a final assembly containing 13 chromosomes. This release is of suitably high quality to match our previous fully Sanger sequenced plant genomes.


Gene Predictions

85,746 transcript assemblies were made from about 1B pairs of D5 paired-end Illumina RNAseq reads, 55,294 transcript assemblies from about 0.25B D5 single-end Illumina RNAseq reads, and 62,526 transcript assemblies from 0.15B TET single-end Illumina RNAseq reads. All of these transcript assemblies from RNAseq reads were made using PERTRAN (Shu et al., manuscript in preparation). 120,929 transcript assemblies were constructed using PASA (Haas, 2003) from 56,638 D5 Sanger ESTs, 2.5M D5 454 RNAseq reads, and all of the RNAseq transcript assemblies above. 133,073 transcript assemblies were constructed using PASA from 296,214 TET Sanger ESTs and about 2.9M TET 454 reads. The larger number of transcript assemblies from fewer TET sequences is due to the fragmented nature of the assemblies. Loci were determined by transcript assembly alignments and/or EXONERATE alignments of proteins from Arabidopsis thaliana, cacao, rice, soybean, grape, and poplar to the G. raimondii genome, repeat-soft-masked using RepeatMasker (Smit, 1996-2012), with up to 2 kb extension on both ends unless extending into another locus on the same strand.

Gene models were predicted by homology-based predictors: FGENESH+ (Salamov, 2000), FGENESH_EST (similar to FGENESH+, but with ESTs rather than protein/translated ORFs as splice site and intron input), and GenomeScan (Yeh, 2001). The best-scoring predictions for each locus were selected using multiple positive factors, including EST and protein support, and one negative factor: overlap with repeats. The selected gene predictions were improved by PASA; improvements include adding UTRs, correcting splicing, and adding alternative transcripts. PASA-improved gene model proteins were subjected to protein homology analysis against the above-mentioned proteomes to obtain a Cscore and protein coverage. Cscore is the ratio of a protein's BLASTP score to its MBH (mutual best hit) BLASTP score, and protein coverage is the highest percentage of the protein aligned to the best of its homologs. PASA-improved transcripts were selected based on Cscore, protein coverage, EST coverage, and the overlap of their CDS with repeats. A transcript whose CDS overlaps repeats by less than 20% was selected if its Cscore was at least 0.5 and its protein coverage at least 0.5, or if it had EST coverage. A gene model whose CDS overlaps repeats by more than 20% had to have a Cscore of at least 0.9 and homology coverage of at least 70% to be selected. The selected gene models were subjected to Pfam analysis, and gene models whose protein is more than 30% covered by Pfam TE domains were removed. The final gene set has 37,505 protein-coding genes and 77,267 protein-coding transcripts.
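
The selection thresholds described above can be written as a small predicate. This is a sketch of one reading of the text, not JGI's code: the class and field names are hypothetical, fractions are used in place of percentages, and a CDS overlapping repeats by exactly 20% is assumed to fall under the stricter rule, which the text leaves open.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    # All field names are hypothetical; they do not come from any JGI file format.
    cscore: float            # BLASTP score divided by mutual-best-hit BLASTP score
    protein_coverage: float  # fraction of the protein aligned to its best homolog
    has_est_support: bool    # True if the transcript has EST coverage
    repeat_overlap: float    # fraction of the CDS overlapping repeats


def is_selected(t: Transcript) -> bool:
    """Apply the Cscore / coverage / repeat-overlap thresholds from the text."""
    if t.repeat_overlap < 0.20:
        # Low repeat overlap: homology support or EST support is enough.
        homology_ok = t.cscore >= 0.5 and t.protein_coverage >= 0.5
        return homology_ok or t.has_est_support
    # High repeat overlap (assumed to include exactly 20%): strict homology only.
    return t.cscore >= 0.9 and t.protein_coverage >= 0.70


# Hypothetical transcripts for illustration.
print(is_selected(Transcript(0.6, 0.6, False, 0.10)))
print(is_selected(Transcript(0.6, 0.6, True, 0.30)))
```

Under this reading, EST support alone does not rescue a transcript whose CDS is heavily repeat-overlapped; only strong homology does.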

References:
  • Haas, B.J., Delcher, A.L., Mount, S.M., Wortman, J.R., Smith, R.K., Jr., Hannick, L.I., Maiti, R., Ronning, C.M., Rusch, D.B., Town, C.D. et al. (2003) Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31: 5654-5666. http://nar.oupjournals.org/cgi/content/full/31/19/5654
  • Smit, A.F.A., Hubley, R., and Green, P. RepeatMasker Open-3.0. 1996-2011.
  • Yeh, R.-F., Lim, L.P., and Burge, C.B. (2001) Computational inference of homologous gene structures in the human genome. Genome Res. 11: 803-816.
  • Salamov, A.A. and Solovyev, V.V. (2000) Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10: 516-522.


Reserved Analysis Usage Policy (As described by JGI on Phytozome.org)

For public access, in agreement with Fort Lauderdale, we (JGI) are making the Cotton D genome available from the DOE JGI and our collaborators prior to peer-reviewed publication of the data. We are making this data available with the expectation and desire to publish this data in a reasonable time without preemption by other groups. By accessing these data, you agree not to publish any articles containing analyses of genes or genomic data on a whole genome or chromosome scale prior to publication by the DOE JGI and/or its collaborators of a comprehensive genome analysis ("Reserved Analyses"). "Reserved analyses" include the identification of complete (whole genome) sets of genomic features such as genes, gene families, regulatory elements, repeat structures, GC content, or any other genome feature, and whole-genome- or chromosome- scale comparisons with other species including other cotton species and cultivars. For specific questions about data use please contact Andy Paterson (paterson AT plantbio.uga.edu) and Jeremy Schmutz (jschmutz AT hudsonalpha.org).

Work towards publication of the Cotton D genome is underway, and we plan to submit a manuscript within this calendar year. If you will be employing the data for non-reserved analyses, such as cloning a gene of interest, designing mapping panels, or analyzing a gene family, please reference the "DOE Joint Genome Institute: Cotton D V2.0" as your citation.

Assembly

The chromosomes (pseudomolecules) and scaffolds for the JGI G. raimondii assembly are listed below. These files belong to the G. raimondii JGI v2.0 assembly.

Chromosomes & scaffolds (FASTA format) G.raimondii_JGI_221_v2.0.assembly.fasta.gz
Chromosomes & scaffolds (GFF3 format) G.raimondii_JGI_221_v2.0.assembly.gff3.gz
Assembly GAPs (GFF3 format) G.raimondii_JGI_221_v2.0.gaps.gff3.gz
Chromosomes only (FASTA format) G.raimondii_JGI_221_v2.0.psuedomolecules.fasta.gz
Scaffolds only (FASTA format) G.raimondii_JGI_221_v2.0.scaffolds.fasta.gz
Repeats masked by N's (FASTA format) G.raimondii_JGI_221_v2.0.assembly-hardmasked.fasta.gz
Repeats masked by lower case (FASTA format) G.raimondii_JGI_221_v2.0.assembly-softmasked.fasta.gz
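
As an illustration of how the lower-case masking can be used, the sketch below estimates the fraction of masked sequence in a soft-masked FASTA file. It is a minimal example and not part of the release: the function names are hypothetical, and it assumes the usual FASTA layout (header lines starting with ">", sequence on the remaining lines).

```python
import gzip


def masked_fraction(lines):
    """Return the fraction of sequence characters that are lower case."""
    total = 0
    masked = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA header lines
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    return masked / total if total else 0.0


def masked_fraction_of_file(path):
    """Stream a gzip-compressed FASTA file line by line."""
    with gzip.open(path, "rt") as handle:
        return masked_fraction(handle)


# Toy record standing in for real data.
print(masked_fraction([">toy", "ACGTacgt", "NNNNacgt"]))
```

For the real data, masked_fraction_of_file would be pointed at a local copy of the soft-masked assembly file listed above.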
Genes

The predicted gene models, their alignments, and proteins for the G. raimondii genome. These files belong to the G. raimondii JGI v2.1 annotation set.
 

Predicted gene models with exons (GFF3 format) G.raimondii_JGI_221_v2.1.transcripts_exons.gff3.gz
Predicted gene models no exons (GFF3 format) G.raimondii_JGI_221_v2.1.transcripts.gff3.gz
mRNA sequences (FASTA format) G.raimondii_JGI_221_v2.1.transcripts.fasta.gz
Coding sequences, CDS (FASTA format) G.raimondii_JGI_221_v2.1.CDS.fasta.gz
Protein sequences (FASTA format) G.raimondii_JGI_221_v2.1.proteins.fasta.gz
Repeats
Repeats for the G. raimondii JGI v2.1 annotations are in GFF3 format. To obtain masked versions of the assembled chromosomes and scaffolds, click the 'Assembly' link in the right sidebar.
 
Repeats identified using RepeatMasker G.raimondii_JGI_221_v2.1.repeats.gff3.gz

 

Markers
Marker alignments were performed by the CottonGen Team of Main Bioinformatics Lab at WSU. The alignment tool 'BLAT' was used to map marker sequences from CottonGen to the G. raimondii genome assembly. Markers required 90% identity over 97% of their length. For SSRs and RFLPs, gap size was restricted to 1000 bp or less, with fewer than 2 gaps. For dbSNPs, CottonGen SNPs, and Indels, gap size was restricted to 2 bp, with fewer than 2 gaps. The available files are in GFF3 format. Markers available in CottonGen and CMap are linked to GBrowse.
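
The marker criteria above amount to a filter applied to each alignment. The sketch below is an illustration only, not the CottonGen pipeline and not part of BLAT: the function name, argument names, and marker type labels are hypothetical, and the identity and length thresholds are assumed to be inclusive.

```python
# Maximum allowed size of a single gap in bp, by marker type, as stated in the text.
MAX_GAP_BP = {"SSR": 1000, "RFLP": 1000, "dbSNP": 2, "SNP": 2, "InDel": 2}


def passes_filter(marker_type, identity, aligned_fraction, gap_sizes):
    """identity and aligned_fraction are fractions (0 to 1);
    gap_sizes lists the size in bp of each gap in the alignment."""
    if identity < 0.90 or aligned_fraction < 0.97:
        return False
    if len(gap_sizes) >= 2:  # the text requires fewer than 2 gaps
        return False
    return all(gap <= MAX_GAP_BP[marker_type] for gap in gap_sizes)


# Hypothetical alignments for illustration.
print(passes_filter("SSR", 0.95, 0.98, [500]))
print(passes_filter("SNP", 0.95, 0.98, [500]))
```

The same 500 bp gap is acceptable for an SSR but not for a SNP, which reflects the tighter gap limit for SNP and Indel markers.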
 
All Markers mapped to genome in Excel format G.raimondii_JGI_221_v2.1_markers.xlsx
CottonGen RFLP markers mapped to genome G.raimondii_JGI_221_v2.1_CTG_RFLP.gff3.gz
CottonGen SSR markers mapped to genome G.raimondii_JGI_221_v2.1_CTG_SSRs_update1.gff3.gz
CottonGen SNP markers mapped to genome G.raimondii_JGI_221_v2.1_CTG_SNPs.gff3.gz
CottonGen InDel markers mapped to genome G.raimondii_JGI_221_v2.1_CTG_Indels.gff3.gz

 

Transcript Alignments
With the exception of ESTs mapped by JGI (Phytozome), the transcript alignments available below were performed by the CottonGen Team. The alignment tool 'BLAT' was used to map transcripts to the G. raimondii genome assembly. Alignments covering 97% of the transcript length with 90% identity were retained. The available files are in GFF3 format.

 

dbEST for all Gossypium G.raimondii_JGI_221_v2.1_dbEST.gff3.gz
dbEST for G. arboreum G.raimondii_JGI_221_v2.1_dbEST_G-arboreum.gff3.gz
dbEST for G. barbadense G.raimondii_JGI_221_v2.1_dbEST_G-barbadense.gff3.gz
dbEST for G. herbaceum G.raimondii_JGI_221_v2.1_dbEST_G-herbaceum.gff3.gz
dbEST for G. hirsutum G.raimondii_JGI_221_v2.1_dbEST_G-hirsutum.gff3.gz
dbEST for G. raimondii G.raimondii_JGI_221_v2.1_dbEST_G-raimondii.gff3.gz
mRNA from NCBI for all Gossypium G.raimondii_JGI_221_v2.1_NCBI-mRNA.gff3.gz
CottonGen unigene v1.0 G.raimondii_JGI_221_v2.1_cottongen-unigenes.gff3.gz
J. Udall 2012 Unigene contigs G.raimondii_JGI_221_v2.1_Udall2012-unigenes.gff3.gz
PlantGDB Gossypium unigenes G.raimondii_JGI_221_v2.1_PlantGDB_Gossypium.gff3.gz
PlantGDB G. arboreum unigenes G.raimondii_JGI_221_v2.1_PlantGDB_arboreum.gff3.gz
PlantGDB G. barbadense unigenes G.raimondii_JGI_221_v2.1_PlantGDB_barbadense.gff3.gz
PlantGDB G. hirsutum unigenes G.raimondii_JGI_221_v2.1_PlantGDB_hirsutum.gff3.gz
PlantGDB G. raimondii unigenes G.raimondii_JGI_221_v2.1_PlantGDB_raimondii.gff3.gz
PASA mapped Phytozome ESTs G.raimondii_JGI_221_v2.1.phytozome_ESTs.gff3.gz

 

Protein Alignments
Protein alignments available below were performed by the CottonGen Team. The alignment tool 'exonerate' was used to map protein sequences onto the G. raimondii JGI v2.0 genome. Only alignments with a percent identity of at least 90% were retained.
 

 

SNPs
Alignments for NCBI dbSNPs and CottonGen SNPs/Indels were performed by the CottonGen Team. The alignment tool 'BLAT' was used to map CottonGen marker sequences to the G. raimondii genome assembly. Markers required 90% identity over 97% of their length.

SNPs from Joshua Udall at Brigham Young University are also provided below. These SNPs represent differences between the diploid A and D genomes and differences between At and Dt. Each set is provided in a separate file in GFF3 format. The diploid SNPs were derived from four D genomes, two A1 genomes, and three A2 genomes. The tetraploid SNPs were derived from resequencing data of the cultivar Maxxa, as published with the D-genome (Paterson et al. 2012). The SNPs were identified by 1) mapping the sequence reads to the D-genome reference sequence with GSNAP, 2) categorizing the reads using PolyCat, and 3) discovering the SNPs using InnerSNP. A paper that fully describes the sequences, the SNP identification process, and the differences between the diploid genomes will be published soon.
 

NCBI cotton dbSNPs mapped to genome G.raimondii_JGI_221_v2.1_dbSNP.gff3.gz
CottonGen SNP markers mapped to genome G.raimondii_JGI_221_v2.1_CTG_SNPs.gff3.gz
CottonGen InDel markers mapped to genome G.raimondii_JGI_221_v2.1_CTG_Indels.gff3.gz
Joshua Udall's A/D (diploid) SNPs G.raimondii_JGI_211_v2.1_Udall_di_SNPs.gff3.gz
Joshua Udall's At/Dt (tetraploid) SNPs G.raimondii_JGI_211_v2.1_Udall_tetra_SNPs.gff3.gz

 

Protein Homology
The protein homology analysis available below was performed by the CottonGen Team of Main Bioinformatics Lab at WSU. Transcripts from the G. raimondii assembly were mapped against proteins from other genomes and databases using blastx with an e-value cutoff of 1e-6 (1e-9 for TrEMBL and NCBI nr). Only the best match was kept. The available files are in Excel 2007 format.
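
The "cutoff, then best match" step can be sketched as follows. This is an illustration, not the CottonGen workflow: the function name, tuple layout, and identifiers are hypothetical, and "best" is assumed here to mean the highest bit score, which the text does not specify.

```python
def best_hits(hits, evalue_cutoff=1e-6):
    """hits: iterable of (query, subject, evalue, bitscore) tuples.
    Return {query: (subject, evalue, bitscore)} for the best passing hit."""
    best = {}
    for query, subject, evalue, bitscore in hits:
        if evalue > evalue_cutoff:
            continue  # fails the e-value cutoff
        current = best.get(query)
        if current is None or bitscore > current[2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical hits; the identifiers are placeholders, not real accessions.
toy_hits = [
    ("transcript_1", "protein_A", 1e-20, 100.0),
    ("transcript_1", "protein_B", 1e-30, 150.0),
    ("transcript_2", "protein_C", 1e-3, 40.0),
]
print(best_hits(toy_hits))
print(best_hits(toy_hits, evalue_cutoff=1e-9))  # stricter cutoff, as for TrEMBL and NCBI nr
```

In the toy data, transcript_2 has no hit below either cutoff and therefore gets no match, while transcript_1 keeps only its higher-scoring hit.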
 
Theobroma cacao v0.9 proteins G.raimondii_JGI_221_v2.1_C.theobroma_v0.9_homology.xlsx
Arabidopsis thaliana TAIR10 proteins G.raimondii_JGI_221_v2.1_TAIR10_homology.xlsx
Oryza sativa MSU v7.0 proteins G.raimondii_JGI_221_v2.1_O.sativa_MSUv7_homology.xlsx
Populus trichocarpa v2.0 proteins G.raimondii_JGI_221_v2.1_P.trichocarpa_v2.0_homology.xlsx
Uniprot SwissProt (Oct 2012) proteins G.raimondii_JGI_221_v2.1_SwissProt_homology.xlsx
Uniprot TrEMBL (Oct 2012) proteins G.raimondii_JGI_221_v2.1_TrEMBL_homology.xlsx
NCBI nr protein homology (Oct 2012) G.raimondii_JGI_221_v2.1_NCBInr_homology.xlsx

 

Functional Annotation
G. raimondii proteins were analyzed using InterProScan and the KEGG Automated Annotation Server (KAAS) in order to assign InterPro domains, Gene Ontology (GO) terms, KEGG pathways, and KEGG orthologs. This work was performed by the CottonGen Team of Main Bioinformatics Lab at Washington State University. Term assignments to genes are available in compressed text files, and KEGG hier files and maps are available for browsing with the KeggHier tool.
 

 

Downloads
All assembly and annotation files are available for download by selecting the desired data type in the left-hand "Resources" sidebar. Each data type page provides a description of the available files and links to download them. Alternatively, you can browse all available files on the FTP repository.