Gossypium raimondii (D5) genome JGI_v2_a2.1

Analysis Name: Gossypium raimondii (D5) genome JGI_v2_a2.1
Method: pathwaysArachne2 (modified)
Source: Sanger based sequence, Roche 454, and Illumina based short reads
Date performed: 2013-02-18

The following text comes from phytozome.org:

This v2.1 annotation release is on genome assembly v2.0, a high quality version of the Cotton D (Gossypium raimondii) genome sequenced from DNA provided by Andrew Paterson at Univ. GA. It was sequenced with a combination of Sanger, Roche 454 pyrosequencing and Illumina read pairs. This release includes additional screening of small repetitive contigs and a new map integration that corrects several orientation issues within scaffolds.


 Assembly Summary
 Scaffold total 1,033
 Contig total 19,735
 Scaffold sequence total 761.4 Mb
 Contig sequence total 748.1 Mb (1.7% gap)
 Scaffold N50 (L50) 6 (62.2 Mb)
 Contig N50 (L50) 1,596 (135.6 kb)
 41 scaffolds are > 50 kb in size, representing approximately 99.0% of the genome
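
The gap percentage in the summary follows directly from the two sequence totals. The sketch below reproduces that arithmetic and shows one generic way to compute the count/length pair that this summary reports as "N50 (L50)". The function names and the toy lengths are illustrative assumptions, not the real scaffold lengths or the pipeline's own code.

```python
def gap_percent(scaffold_total_mb, contig_total_mb):
    """Percentage of scaffold sequence that is not covered by contigs (gaps)."""
    return 100.0 * (scaffold_total_mb - contig_total_mb) / scaffold_total_mb


def n50_stats(lengths):
    """Return (count, length): the smallest number of sequences, taken from
    longest to shortest, whose summed length reaches half of the total, and
    the length of the last sequence taken."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2.0
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return count, length
    raise ValueError("lengths must be non-empty")


# Totals taken from the Assembly Summary above.
print(round(gap_percent(761.4, 748.1), 1))  # 1.7

# Hypothetical lengths, only to show the calculation.
toy_lengths = [50, 40, 30, 20, 10]
print(n50_stats(toy_lengths))  # (2, 40)
```

Note that the summary lists the count first and the length in parentheses, which is the order `n50_stats` returns.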

Assembly Details

This release is a high quality version of the Cotton D genome from DNA provided by Andrew Paterson at Univ. GA. It was sequenced with a combination of Sanger based sequence (1.52x assembled coverage, with 0.95x coverage from BAC end sequence and fosmid end sequence), Roche 454 pyrosequencing (14.95x linear and 3.1x non-redundant pairs assembled coverage), and Illumina based short reads (primarily to correct 454 insertion/deletion errors). The genome was assembled using our modified version of Arachne2.

V2.0 includes the removal of small repetitive contigs within scaffolds and a new detailed map integration with the recently available tetraploid map that was used to correct several scaffold orientation issues. The combination of:
  • BES/markers hybridized to FPC contigs (Lin et al., 2010)
  • Genetic map for the diploid (Rong et al., 2004)
  • Tetraploid map (Byers et al., 2012)
  • Vitis vinifera and Theobroma cacao synteny
was used to identify misjoins in the assembly. Misjoins were characterized by an abrupt change in the linkage group (or synteny) within a region of low BAC/fosmid support. A total of 13 misjoins were identified and subsequently broken.

Scaffolds were oriented, ordered, and joined together using the aforementioned resources. A total of 51 joins were assembled to form a final assembly containing 13 chromosomes. This release is of suitably high quality to match our previous fully Sanger sequenced plant genomes.

Gene Predictions

85,746 transcript assemblies were made from about 1B pairs of D5 paired-end Illumina RNAseq reads, 55,294 transcript assemblies from about 0.25B D5 single-end Illumina RNAseq reads, and 62,526 transcript assemblies from 0.15B TET single-end Illumina RNAseq reads. All of these RNAseq transcript assemblies were made using PERTRAN (Shu et al., manuscript in preparation). 120,929 transcript assemblies were constructed using PASA (Haas, 2003) from 56,638 D5 Sanger ESTs, 2.5M D5 454 RNAseq reads and all RNAseq transcript assemblies above. 133,073 transcript assemblies were constructed using PASA from 296,214 TET Sanger ESTs and about 2.9M TET 454 reads. The larger number of transcript assemblies from fewer TET sequences is due to the fragmented nature of the assemblies. Loci were determined by transcript assembly alignments and/or EXONERATE alignments of proteins from Arabidopsis thaliana, cacao, rice, soybean, grape and poplar to the G. raimondii genome, repeat-soft-masked using RepeatMasker (Smit, 1996-2012), with up to 2 kb of extension on both ends unless extending into another locus on the same strand.
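
The locus extension rule at the end of the paragraph above can be sketched as a simple interval operation. The text does not say what happens when the full 2 kb would run into a neighbouring locus, so this sketch assumes the extension is clipped just short of the neighbour's original boundary. The function name, the 1-based inclusive coordinates, and the toy loci are all hypothetical.

```python
def extend_loci(loci, chrom_length, max_ext=2000):
    """Extend each (start, end) locus by up to max_ext bp on both ends.

    loci: non-overlapping 1-based inclusive intervals on one chromosome
    and one strand. Assumption: an extension stops one base before the
    original boundary of the neighbouring locus on the same strand.
    """
    ordered = sorted(loci)
    extended = []
    for i, (start, end) in enumerate(ordered):
        # Leftmost / rightmost positions this locus may extend to.
        left_limit = ordered[i - 1][1] + 1 if i > 0 else 1
        right_limit = ordered[i + 1][0] - 1 if i < len(ordered) - 1 else chrom_length
        new_start = max(start - max_ext, left_limit)
        new_end = min(end + max_ext, right_limit)
        extended.append((new_start, new_end))
    return extended


# Two hypothetical loci 1 kb apart on a 20 kb sequence.
print(extend_loci([(5000, 6000), (7000, 8000)], chrom_length=20000))
# [(3000, 6999), (6001, 10000)]
```

Under this assumption the two extended regions may overlap each other in the intergenic space, but neither reaches into the other's original locus.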

Gene models were predicted by homology-based predictors: FGENESH+ (Salamov, 2000), FGENESH_EST (similar to FGENESH+, but using ESTs as splice site and intron input instead of protein/translated ORF), and GenomeScan (Yeh, 2001). The best-scoring predictions for each locus were selected using multiple positive factors, including EST and protein support, and one negative factor: overlap with repeats. The selected gene predictions were improved by PASA; improvements include adding UTRs, correcting splicing, and adding alternative transcripts. PASA-improved gene model proteins were subjected to protein homology analysis against the above-mentioned proteomes to obtain Cscore and protein coverage. Cscore is the ratio of a protein's BLASTP score to its MBH (mutual best hit) BLASTP score, and protein coverage is the highest percentage of the protein aligned to the best of its homologs. PASA-improved transcripts were selected based on Cscore, protein coverage, EST coverage, and the overlap of their CDS with repeats. A transcript was selected if its Cscore was at least 0.5 and its protein coverage at least 0.5, or if it had EST coverage, provided that less than 20% of its CDS overlapped with repeats. For gene models whose CDS overlapped with repeats by more than 20%, the Cscore had to be at least 0.9 and homology coverage at least 70% for the model to be selected. The selected gene models were subjected to Pfam analysis, and gene models whose protein was more than 30% in Pfam TE domains were removed. The final gene set has 37,505 protein-coding genes and 77,267 protein-coding transcripts.
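
The selection thresholds above can be read as a small decision rule. The following is a minimal sketch of that rule, not the annotation pipeline's code. It assumes that "homology coverage" is the same quantity as protein coverage, that all coverages and overlaps are expressed as fractions between 0 and 1, and that a CDS with exactly 20% repeat overlap falls into the stricter case (the text leaves that boundary open). The function and parameter names are hypothetical.

```python
def cscore(blastp_score, mbh_blastp_score):
    """Cscore as described: BLASTP score divided by the mutual-best-hit BLASTP score."""
    return blastp_score / mbh_blastp_score


def keep_transcript(cscore_value, protein_coverage, has_est_coverage, cds_repeat_overlap):
    """Apply the Cscore / coverage / repeat-overlap thresholds from the text.

    Assumption: exactly 20% repeat overlap is treated as the high-repeat case.
    """
    if cds_repeat_overlap < 0.20:
        # Low repeat content: homology support OR EST support is enough.
        homology_ok = cscore_value >= 0.5 and protein_coverage >= 0.5
        return bool(homology_ok or has_est_coverage)
    # High repeat content: only strong homology support rescues the model.
    return cscore_value >= 0.9 and protein_coverage >= 0.70


print(keep_transcript(0.6, 0.55, False, 0.05))  # True: homology support, few repeats
print(keep_transcript(0.2, 0.10, True, 0.05))   # True: EST support, few repeats
print(keep_transcript(0.6, 0.55, True, 0.35))   # False: repeat-rich, homology too weak
print(keep_transcript(0.95, 0.80, False, 0.35)) # True: repeat-rich but strong homology
```

The later Pfam step (removing models with more than 30% of the protein in TE domains) is a separate filter and is not modelled here.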

  • Paterson AH et al., "Repeated polyploidization of Gossypium genomes and the evolution of spinnable cotton fibres.", Nature, 2012 Dec 20;492(7429):423-7
  • Xu Q et al., "Analysis of complete nucleotide sequences of 12 Gossypium chloroplast genomes: origin and evolution of allotetraploids.", PLoS One, 2012;7(8):e37128
  • Lowe TM et al., "tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence.", Nucleic Acids Res, 1997 Mar 1;25(5):955-64
  • Haas, B.J., Delcher, A.L., Mount, S.M., Wortman, J.R., Smith Jr, R.K., Hannick, L.I., Maiti, R., Ronning, C.M., Rusch, D.B., Town, C.D. et al. (2003) Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res, 31, 5654-5666. http://nar.oupjournals.org/cgi/content/full/31/19/5654
  • Smit, AFA, Hubley, R & Green, P. RepeatMasker Open-3.0. 1996-2011.
  • Yeh, R.-F., Lim, L. P., and Burge, C. B. (2001) Computational inference of homologous gene structures in the human genome. Genome Res. 11: 803-816.
  • Salamov, A. A. and Solovyev, V. V. (2000). Ab initio gene finding in Drosophila genomic DNA. Genome Res 10, 516-22.

The chromosomes (pseudomolecules) and scaffolds for the JGI G. raimondii assembly. These files belong to the G. raimondii JGI v2.0 assembly.

Chromosomes & scaffolds (FASTA format) G.raimondii_JGI_221_v2.0.assembly.fasta.gz
Chromosomes & scaffolds (GFF3 format) G.raimondii_JGI_221_v2.0.assembly.gff3.gz
Assembly GAPs (GFF3 format) G.raimondii_JGI_221_v2.0.gaps.gff3.gz
Chromosomes only (FASTA format) G.raimondii_JGI_221_v2.0.psuedomolecules.fasta.gz
Scaffolds only (FASTA format) G.raimondii_JGI_221_v2.0.scaffolds.fasta.gz
Repeats masked by N's (FASTA format) G.raimondii_JGI_221_v2.0.assembly-hardmasked.fasta.gz
Repeats masked by lower case (FASTA format) G.raimondii_JGI_221_v2.0.assembly-softmasked.fasta.gz
All assembly and annotation files are available for download by selecting the desired data type in the left-hand side bar.  Each data type page will provide a description of the available files and links to download.
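
In the soft-masked file listed above, repeats are written in lower case, so the masked fraction of a sequence can be estimated by counting lower-case characters. The sketch below works on any iterable of FASTA lines; the function name and the toy records are illustrative assumptions.

```python
def softmasked_fraction(fasta_lines):
    """Fraction of sequence characters that are lower case (soft-masked).

    Header lines (starting with '>') and blank lines are skipped.
    """
    masked = 0
    total = 0
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    return masked / total if total else 0.0


# Hypothetical two-line record: 6 of 12 bases are lower case.
toy = [">toy_scaffold", "ACGTacgt", "AAcc"]
print(softmasked_fraction(toy))  # 0.5
```

For a downloaded `.fasta.gz` file, the lines could be supplied by opening it with Python's `gzip.open(path, "rt")` and passing the file object to the function.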
Functional Annotation
G. raimondii proteins were analyzed using InterProScan and the KEGG Automated Annotation Server (KAAS) in order to assign InterPro domains, Gene Ontology (GO) terms, KEGG pathways and KEGG orthologs. This work was performed by the CottonGen Team of Main Bioinformatics Lab at Washington State University. Term assignments to genes are available in compressed text files, and KEGG hier files and maps are available for browsing with the KeggHier tool.



The predicted gene models, their alignments, and proteins for the G. raimondii genome. These files belong to the G. raimondii JGI v2.1 annotation set.

Predicted gene models with exons (GFF3 format) G.raimondii_JGI_221_v2.1.transcripts_exons.gff3.gz
Predicted gene models no exons (GFF3 format) G.raimondii_JGI_221_v2.1.transcripts.gff3.gz
mRNA sequences (FASTA format) G.raimondii_JGI_221_v2.1.transcripts.fasta.gz
Coding sequences, CDS (FASTA format) G.raimondii_JGI_221_v2.1.CDS.fasta.gz
Protein sequences (FASTA format) G.raimondii_JGI_221_v2.1.proteins.fasta.gz
Marker alignments were performed by the CottonGen Team of Main Bioinformatics Lab at WSU. The alignment tool 'BLAT' was used to map marker sequences from CottonGen to the G. raimondii genome assembly. Markers were required to align with at least 90% identity over at least 97% of their length. For SSRs and RFLPs, gap size was restricted to 1000 bp or less, with fewer than 2 gaps. For dbSNPs and Indels, gap size was restricted to 2 bp, with fewer than 2 gaps. The available files are in GFF3 format. Markers available in CottonGen and CMap are linked to JBrowse.
CottonGen SNP markers mapped to genome G.raimondii_JGI_221_v2.1_SNP
CottonGen RFLP markers mapped to genome G.raimondii_JGI_221_v2.1_RFLP
CottonGen SSR markers mapped to genome G.raimondii_JGI_221_v2.1_SSR
CottonGen InDel markers mapped to genome G.raimondii_JGI_221_v2.1_Indels
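
The marker alignment criteria described above amount to a per-hit filter. The sketch below applies them to a simplified, hypothetical hit record. It does not parse BLAT output, and the field names are assumptions. Identity and aligned length are treated as minimum thresholds.

```python
from dataclasses import dataclass

# Maximum allowed gap size in bp for each marker type, taken from the text.
MAX_GAP_BP = {"SSR": 1000, "RFLP": 1000, "dbSNP": 2, "Indel": 2}


@dataclass
class MarkerHit:
    marker_type: str         # one of the keys in MAX_GAP_BP
    identity: float          # fraction of aligned bases that match
    aligned_fraction: float  # fraction of the marker length that aligned
    gap_count: int           # number of gaps in the alignment
    largest_gap_bp: int      # size of the largest gap


def passes_marker_filter(hit):
    """True if the hit meets the identity, length, and gap criteria."""
    return (
        hit.identity >= 0.90
        and hit.aligned_fraction >= 0.97
        and hit.gap_count < 2
        and hit.largest_gap_bp <= MAX_GAP_BP[hit.marker_type]
    )


print(passes_marker_filter(MarkerHit("SSR", 0.95, 0.98, 1, 800)))   # True
print(passes_marker_filter(MarkerHit("dbSNP", 0.99, 1.00, 1, 5)))   # False: gap too large
print(passes_marker_filter(MarkerHit("RFLP", 0.95, 0.98, 2, 100)))  # False: too many gaps
```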
Protein Alignments
Protein alignments available below were performed by the CottonGen Team. The alignment tool 'exonerate' was used to map protein sequences onto the G. raimondii JGI v2.0 genome. Only alignments with a percent identity of at least 90% were retained.


Protein Homology
Homology of the G. raimondii JGI_221_v2.1 transcripts was determined by pairwise sequence comparison using the blastx algorithm against various protein databases. An expectation value cutoff of less than 1e-9 was used for NCBI nr (Release 2018-05) and 1e-6 for the Arabidopsis proteins (Araport11), UniProtKB/SwissProt (Release 2019-01), and UniProtKB/TrEMBL (Release 2019-01) databases. The best hit reports are available for download in Excel format.
G. raimondii JGI_221_v2.1 transcripts with NCBI nr homologs (EXCEL file) G.raimondii_JGI_221_v2.1_vs_nr.xlsx.gz
G. raimondii JGI_221_v2.1 transcripts with NCBI nr (FASTA file) G.raimondii_JGI_221_v2.1_vs_nr_hit.fasta.gz
G. raimondii JGI_221_v2.1 transcripts without NCBI nr (FASTA file) G.raimondii_JGI_221_v2.1_vs_nr_noHit.fasta.gz
G. raimondii JGI_221_v2.1 transcripts with Arabidopsis (Araport11) homologs (EXCEL file) G.raimondii_JGI_221_v2.1_vs_tair.xlsx.gz
G. raimondii JGI_221_v2.1 transcripts with Arabidopsis (Araport11) (FASTA file) G.raimondii_JGI_221_v2.1_vs_tair_hit.fasta.gz
G. raimondii JGI_221_v2.1 transcripts without Arabidopsis (Araport11) (FASTA file) G.raimondii_JGI_221_v2.1_vs_tair_noHit.fasta.gz
G. raimondii JGI_221_v2.1 transcripts with SwissProt homologs (EXCEL file) G.raimondii_JGI_221_v2.1_vs_swissprot.xlsx.gz
G. raimondii JGI_221_v2.1 transcripts with SwissProt (FASTA file) G.raimondii_JGI_221_v2.1_vs_swissprot_hit.fasta.gz
G. raimondii JGI_221_v2.1 transcripts without SwissProt (FASTA file) G.raimondii_JGI_221_v2.1_vs_swissprot_noHit.fasta.gz
G. raimondii JGI_221_v2.1 transcripts with TrEMBL homologs (EXCEL file) G.raimondii_JGI_221_v2.1_vs_trembl.xlsx.gz
G. raimondii JGI_221_v2.1 transcripts with TrEMBL (FASTA file) G.raimondii_JGI_221_v2.1_vs_trembl_hit.fasta.gz
G. raimondii JGI_221_v2.1 transcripts without TrEMBL (FASTA file) G.raimondii_JGI_221_v2.1_vs_trembl_noHit.fasta.gz
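
The database-specific expectation value cutoffs described above can be illustrated with a small best-hit filter. This is a sketch on in-memory tuples, not a parser for any particular BLAST output format. The text does not state how the "best hit" was chosen, so picking the highest bit score among hits that pass the cutoff is an assumption, as are the function name, the database keys, and the toy identifiers.

```python
# E-value cutoffs per database, as stated in the text (hits must be below the cutoff).
E_VALUE_CUTOFF = {
    "nr": 1e-9,
    "araport11": 1e-6,
    "swissprot": 1e-6,
    "trembl": 1e-6,
}


def best_hits(hits, database):
    """Return {query: (subject, evalue, bitscore)} for hits below the cutoff.

    hits: iterable of (query, subject, evalue, bitscore) tuples.
    Assumption: the best hit is the passing hit with the highest bit score.
    """
    cutoff = E_VALUE_CUTOFF[database]
    best = {}
    for query, subject, evalue, bitscore in hits:
        if evalue >= cutoff:
            continue
        current = best.get(query)
        if current is None or bitscore > current[2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical hits for two transcripts.
toy_hits = [
    ("tx1", "protA", 1e-12, 210.0),
    ("tx1", "protB", 1e-30, 350.0),
    ("tx2", "protC", 1e-7, 95.0),
]
print(sorted(best_hits(toy_hits, "nr")))         # ['tx1']
print(sorted(best_hits(toy_hits, "araport11")))  # ['tx1', 'tx2']
```

Transcripts absent from the returned dictionary correspond to the "without" (noHit) files in the list above.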
Repeats for the G. raimondii JGI v2.1 annotations are in GFF3 format. To obtain masked versions of the assembled chromosomes and scaffolds, click the 'Assembly' link in the right sidebar.
Repeats identified using RepeatMasker G.raimondii_JGI_221_v2.1.repeats.gff3.gz


Alignments for NCBI dbSNPs and CottonGen SNPs/Indels were performed by the CottonGen Team. The alignment tool 'BLAT' was used to map CottonGen marker sequences to the G. raimondii genome assembly. Markers were required to align with at least 90% identity over at least 97% of their length.

SNPs from Joshua Udall at Brigham Young University are also provided below. These SNPs represent differences between the diploid A and D genomes and differences between At and Dt. Each set is provided in a separate file in GFF3 format. The diploid SNPs were derived from four D genomes, two A1 genomes and three A2 genomes. The tetraploid SNPs were derived from resequencing data of the cultivar Maxxa, as published with the D-genome (Paterson et al. 2012). The SNPs were identified by 1) mapping the sequence reads with GSNAP to the D-genome reference sequence, 2) categorizing the reads using PolyCat, and 3) discovering the SNPs using InnerSNP. A paper that fully describes the sequences, SNP identification process, and differences between the diploid genomes will be published soon.

NCBI cotton dbSNPs mapped to genome G.raimondii_JGI_221_v2.1_dbSNP.gff3.gz
CottonGen SNP markers mapped to genome G.raimondii_JGI_221_v2.1_CTG_SNPs.gff3.gz
CottonGen InDel markers mapped to genome G.raimondii_JGI_221_v2.1_CTG_Indels.gff3.gz
Joshua Udall's A/D (diploid) SNPs G.raimondii_JGI_211_v2.1_Udall_di_SNPs.gff3.gz
Joshua Udall's At/Dt (tetraploid) SNPs G.raimondii_JGI_211_v2.1_Udall_tetra_SNPs.gff3.gz


Transcript Alignments
Transcript alignments were performed by the CottonGen Team of Main Bioinformatics Lab at WSU. The alignment tool 'BLAT' was used to map transcripts to the G. raimondii genome assembly. Alignments covering at least 97% of the transcript length with at least 98% identity were retained. The available files are in GFF3 format.


G. arboreum CottonGen RefTrans v1 G.raimondii_JGI_221_v2.1_g.arboreum_cottongen_reftransV1
G. barbadense CottonGen RefTrans v1 G.raimondii_JGI_221_v2.1_g.barbadense_cottongen_reftransV1
G. hirsutum CottonGen RefTrans v1 G.raimondii_JGI_221_v2.1_g.hirsutum_cottongen_reftransV1
G. raimondii CottonGen RefTrans v1 G.raimondii_JGI_221_v2.1_g.raimondii_cottongen_reftransV1