Gossypium raimondii (D5) genome JGI assembly v2.0 (annot v2.1)
Please Note: This genome assembly is made available through a "Reserved Analysis" restriction. Please see the usage policy below for further details.
This release is a high-quality version of the Cotton D genome, assembled from DNA provided by Andrew Paterson at Univ. GA. It was sequenced with a combination of Sanger-based sequence (1.52x assembled coverage, with 0.95x coverage from BAC-end and fosmid-end sequence), Roche 454 pyrosequencing (14.95x linear and 3.1x non-redundant pairs assembled coverage), and Illumina-based short reads (primarily to correct 454 insertion/deletion errors). The genome was assembled using our modified version of Arachne2.
V2.0 includes the removal of small repetitive contigs within scaffolds and a new detailed map integration with the recently available tetraploid map that was used to correct several scaffold orientation issues. The combination of:
was used to identify misjoins in the assembly. Misjoins were characterized by an abrupt change in the linkage group (or synteny) within a region of low BAC/fosmid support. A total of 13 misjoins were identified and subsequently broken.
Scaffolds were oriented, ordered, and joined together using the aforementioned resources. A total of 51 joins were assembled to form a final assembly containing 13 chromosomes. This release is of suitably high quality to match our previous fully Sanger sequenced plant genomes.
85,746 transcript assemblies were made from about 1B pairs of D5 paired-end Illumina RNAseq reads, 55,294 transcript assemblies from about 0.25B D5 single-end Illumina RNAseq reads, and 62,526 transcript assemblies from 0.15B TET single-end Illumina RNAseq reads. All of these RNAseq transcript assemblies were made using PERTRAN (Shu et al., manuscript in preparation). 120,929 transcript assemblies were constructed using PASA (Haas, 2003) from 56,638 D5 Sanger ESTs, 2.5M D5 454 RNAseq reads and all of the RNAseq transcript assemblies above. 133,073 transcript assemblies were constructed using PASA from 296,214 TET Sanger ESTs and about 2.9M TET 454 reads. The larger number of transcript assemblies from fewer TET sequences is due to the fragmented nature of the assemblies. Loci were determined by transcript assembly alignments and/or EXONERATE alignments of proteins from Arabidopsis thaliana, cacao, rice, soybean, grape and poplar to the G. raimondii genome, which was repeat-soft-masked using RepeatMasker (Smit, 1996-2012). Loci were extended by up to 2 kb on both ends unless the extension ran into another locus on the same strand.
Gene models were predicted by the homology-based predictors FGENESH+ (Salamov, 2000), FGENESH_EST (similar to FGENESH+, but with ESTs as splice site and intron input instead of protein/translated ORF), and GenomeScan (Yeh, 2001). The best-scoring predictions for each locus were selected using multiple positive factors, including EST and protein support, and one negative factor: overlap with repeats. The selected gene predictions were improved by PASA. Improvements include adding UTRs, correcting splicing, and adding alternative transcripts. The proteins of the PASA-improved gene models were subjected to protein homology analysis against the above-mentioned proteomes to obtain a Cscore and protein coverage. Cscore is the ratio of a protein's BLASTP score to its MBH (mutual best hit) BLASTP score, and protein coverage is the highest percentage of the protein aligned to the best of its homologs. PASA-improved transcripts were selected based on Cscore, protein coverage, EST coverage, and the overlap of their CDS with repeats. A transcript was selected if its Cscore was at least 0.5 and its protein coverage at least 0.5, or if it had EST coverage, provided that less than 20% of its CDS overlapped with repeats. For gene models whose CDS overlapped with repeats by more than 20%, the Cscore had to be at least 0.9 and the homology coverage at least 70% for the model to be selected. The selected gene models were subjected to Pfam analysis, and gene models whose protein was more than 30% in Pfam TE domains were removed. The final gene set has 37,505 protein-coding genes and 77,267 protein-coding transcripts.
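The selection thresholds described above can be sketched as a small Python predicate. This is only an illustrative reading of the prose, not JGI's pipeline code: the function and parameter names are hypothetical, all quantities are expressed as fractions between 0 and 1, and the handling of a repeat overlap of exactly 20% (treated here with the stricter thresholds) is an assumption, since the text does not specify it.

```python
def cscore(blastp_score, mbh_blastp_score):
    """Cscore: ratio of a protein's BLASTP score to its mutual-best-hit BLASTP score."""
    return blastp_score / mbh_blastp_score


def keep_transcript(cscore_value, protein_coverage, has_est_coverage, repeat_overlap):
    """Return True if a PASA-improved transcript passes the stated thresholds.

    All numeric arguments are fractions in the range 0..1 (hypothetical convention).
    """
    if repeat_overlap < 0.20:
        # Low repeat overlap: homology support OR EST support is sufficient.
        return (cscore_value >= 0.5 and protein_coverage >= 0.5) or has_est_coverage
    # High repeat overlap (assumption: exactly 20% is handled here too):
    # strong homology support is required.
    return cscore_value >= 0.9 and protein_coverage >= 0.7


def keep_after_pfam(pfam_te_fraction):
    """Drop gene models whose protein is more than 30% covered by Pfam TE domains."""
    return pfam_te_fraction <= 0.30
```

For example, under this reading a transcript with EST coverage but weak homology is kept when only 10% of its CDS overlaps repeats, and rejected when 50% does.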
For public access, in accordance with the Fort Lauderdale agreement, we (JGI) are making the Cotton D genome available from the DOE JGI and our collaborators prior to peer-reviewed publication of the data. We are making this data available with the expectation and desire to publish this data in a reasonable time without preemption by other groups. By accessing these data, you agree not to publish any articles containing analyses of genes or genomic data on a whole genome or chromosome scale prior to publication by the DOE JGI and/or its collaborators of a comprehensive genome analysis ("Reserved Analyses"). "Reserved analyses" include the identification of complete (whole genome) sets of genomic features such as genes, gene families, regulatory elements, repeat structures, GC content, or any other genome feature, and whole-genome- or chromosome-scale comparisons with other species including other cotton species and cultivars. For specific questions about data use please contact Andy Paterson (paterson AT plantbio.uga.edu) and Jeremy Schmutz (jschmutz AT hudsonalpha.org).
The chromosomes (pseudomolecules) and scaffolds for the JGI G. raimondii assembly. These files belong to the G. raimondii JGI v2.0 assembly.
The predicted gene models, their alignments and proteins for the G. raimondii genome. These files belong to the G. raimondii JGI v2.1 annotation set.
Repeats for the G. raimondii JGI v2.1 annotations are in GFF3 format. To obtain a masked version of the assembled chromosomes and scaffolds, click the 'Assembly' link in the right sidebar.
Marker alignments were performed by the CottonGen Team of Main Bioinformatics Lab at WSU. The alignment tool 'BLAT' was used to map marker sequences from CottonGen to the G. raimondii genome assembly. Markers required 90% identity over 97% of their length. For SSRs and RFLPs, gap size was restricted to 1000 bp or less, with fewer than 2 gaps. For dbSNPs, CottonGen SNPs and indels, gap size was restricted to 2 bp, with fewer than 2 gaps. The available files are in GFF3 format. Markers available in CottonGen and CMap are linked to GBrowse.
Alignments for NCBI dbSNPs and CottonGen SNPs/indels were performed by the CottonGen Team. The alignment tool 'BLAT' was used to map CottonGen marker sequences to the G. raimondii genome assembly. Markers required 90% identity over 97% of their length.
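As a rough illustration of the marker filtering criteria above, the following Python sketch applies the identity, length and gap limits to a single alignment. It is not the CottonGen Team's actual script: the function name, the marker-type labels and the input representation (fractions plus a list of gap sizes in bp) are hypothetical, and the thresholds are assumed to be inclusive minimums (identity, aligned length) and inclusive maximums (gap size).

```python
# Maximum allowed size (bp) of a gap, by marker type (labels are hypothetical).
MAX_GAP_BP = {
    "SSR": 1000,
    "RFLP": 1000,
    "dbSNP": 2,
    "SNP": 2,
    "Indel": 2,
}


def keep_marker_alignment(marker_type, identity, aligned_fraction, gap_sizes):
    """Return True if a BLAT alignment of a marker passes the stated filters.

    identity and aligned_fraction are fractions in 0..1;
    gap_sizes is a list with the size in bp of each gap in the alignment.
    """
    # At least 90% identity over at least 97% of the marker length (assumed inclusive).
    if identity < 0.90 or aligned_fraction < 0.97:
        return False
    # Fewer than 2 gaps, i.e. at most one gap.
    if len(gap_sizes) >= 2:
        return False
    # Any gap that is present must not exceed the limit for this marker type.
    return all(size <= MAX_GAP_BP[marker_type] for size in gap_sizes)
```

Under these assumptions, an SSR alignment with a single 800 bp gap passes, while the same alignment for a SNP marker fails the 2 bp gap limit.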
SNPs from Joshua Udall at Brigham Young University are also provided below. These SNPs represent differences between the diploid A and D genomes and differences between At and Dt. Each set is provided in a separate file in GFF3 format. The former collection of diploid SNPs was derived from four D genomes, two A1 genomes and three A2 genomes. The tetraploid SNPs were derived from resequencing data of the cultivar Maxxa, as published with the D-genome (Paterson et al. 2012). The SNPs were identified by 1) mapping the sequence reads with GSNAP to the D-genome reference sequence, 2) categorizing the reads using PolyCat, and 3) discovering the SNPs using InnerSNP. A paper that fully describes the sequences, SNP identification process, and differences between the diploid genomes will be published soon.
With the exception of ESTs mapped by JGI (Phytozome), the transcript alignments available below were performed by the CottonGen Team. The alignment tool 'BLAT' was used to map transcripts to the G. raimondii genome assembly. Alignments with an alignment length of 97% and an identity of 90% were preserved. The available files are in GFF3 format.
Protein alignments available below were performed by the CottonGen Team. The alignment tool 'exonerate' was used to map protein sequences onto the G. raimondii JGI v2.0 genome. Only alignments with a percent identity of 90% were retained.
Protein homology available below was performed by the CottonGen Team of Main Bioinformatics Lab at WSU. Transcripts from the G. raimondii assembly were mapped against proteins from other genomes and databases using blastx with an e-value cutoff of 1e-6 (1e-9 for TrEMBL and NCBI nr). Only the best match was kept. The available files are in Excel 2007 format.
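The "e-value cutoff, best match only" filtering described above can be sketched in Python as follows. This is a minimal illustration, not the CottonGen Team's pipeline: it assumes the blastx results are in the standard 12-column BLAST tabular layout (query ID in column 1, subject ID in column 2, e-value in column 11, bit score in column 12), that "best" means highest bit score, and that the cutoff is inclusive. The function name is hypothetical.

```python
def best_hits(lines, max_evalue=1e-6):
    """Keep, for each query, the single highest-scoring hit at or below max_evalue.

    lines: iterable of tab-separated BLAST tabular records (12 standard columns).
    Returns a dict: query ID -> (subject ID, e-value, bit score).
    Use max_evalue=1e-9 for the stricter TrEMBL / NCBI nr searches.
    """
    best = {}
    for line in lines:
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        # Assumption: the "best match" is the hit with the highest bit score.
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best
```

The input could be supplied by iterating over an open results file, with one call per target database so that the appropriate cutoff is applied to each.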
All assembly and annotation files are available for download by selecting the desired data type in the left-hand "Resources" sidebar. Each data type page provides a description of the available files and links to download them. Alternatively, you can browse all available files on the FTP repository.
G. raimondii proteins were analyzed using InterProScan and the KEGG Automated Annotation Server (KAAS) in order to assign InterPro domains, Gene Ontology (GO) terms, KEGG pathways and KEGG orthologs. This work was performed by the CottonGen Team of Main Bioinformatics Lab at Washington State University. Term assignments to genes are available in compressed text files and KEGG hier files and maps are available for browsing with the KeggHier tool.