Gossypium hirsutum (AD1) Genome CGP-BGI Assembly v1.0 & Annotation v1.0

Genome Overview
Method SOAPdenovo (na)
Source Illumina HiSeq 2000 reads from various insert size libraries (CGP-BGI)
Date performed 2015-04-20

About the assembly
The allotetraploid genome of Upland cotton (G. hirsutum) has been estimated, using various methods, at 2.25–2.43 Gb. Whole-genome shotgun (WGS) libraries of the homozygous cv. 'TM-1', with fragment lengths ranging from 250 bp to 40 kb, were sequenced to generate a total of 445.7 Gb of raw paired-end Illumina reads, or 181-fold haploid genome coverage. Owing to abundant repetitive sequences and homeologous chromosomes, this allotetraploid genome could not be assembled satisfactorily from the WGS data alone. Supplemental use of a bacterial artificial chromosome (BAC-to-BAC) sequencing strategy substantially improved the assembly. A total of 100,187 BACs, corresponding to about fivefold genome coverage, were sequenced and used in the final assembly. Each BAC was assembled individually before genome assembly, which then combined the sequenced BACs with the paired-end data.

A total of 2,173 Mb of the G. hirsutum genome sequence was assembled using SOAPdenovo, with the largest scaffold being 8.4 Mb. This corresponds to 96.7% of the previous estimate of nuclear DNA content [13], or 89.6% according to a more recent report [14]. The N50 (the size above which 50% of the total length of the sequence assembly can be found) of the contigs and scaffolds was 80 kb and 764 kb, respectively, an improvement over the assembly that used WGS data only (N50 of contigs and scaffolds of 20 kb and 107 kb, respectively).
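
The N50 definition above can be illustrated with a short Python sketch. This is not the pipeline used for the assembly; the function name `n50` and the toy lengths are hypothetical, and the sketch assumes the common convention of reporting the length of the sequence at which the running total first reaches half of the assembly length.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Lengths are sorted from longest to shortest and accumulated; the N50 is
    the length of the sequence at which the running total first reaches half
    of the summed length of all sequences.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() needs at least one sequence length")


# Hypothetical toy lengths (kb), not taken from the assembly.
toy_lengths_kb = [100, 80, 60, 40, 20]
print(n50(toy_lengths_kb))
```

In practice the lengths would come from the contig or scaffold sequences in the assembly FASTA file rather than from a hand-written list.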

Category                         Number  N50 (kb)  Longest (kb)  Size (Mb)  Percent of the assembly
Total contigs                    44,816  80        784           2,090      -
Total scaffolds                  8,591   764       8,400         2,173      100.0
Anchored and oriented scaffolds  4,023   853       8,400         1,923      88.5
Genes annotated                  76,943  -         -             220        9.5
miRNAs                           602     -         -             0.07       <0.01
rRNAs                            2,153   -         -             0.6        <0.01
tRNAs                            2,050   -         -             0.2        0.01
snRNAs                           8,325   -         -             0.9        0.04
Repeat sequences                 -       -         -             1,471      67.2

Li et al., Genome sequence of cultivated Upland cotton (Gossypium hirsutum TM-1) provides insights into genome evolution. Nature Biotechnology 33, 524–530 (2015).




The chromosomes (pseudomolecules) and scaffolds for Gossypium hirsutum (AD1) Genome CGP-BGI Assembly v1.0

Assembly pseudomolecules (FASTA format) BGI_Gossypium_hirsutum_v1.0.gz
Assembly pseudomolecules (GFF3 format) BGI_Gossypium_hirsutum_v1.0.gff.gz



The predicted genes and proteins for Gossypium hirsutum (AD1) Genome CGP-BGI Assembly v1.0

Predicted gene coding sequences (CDS) (FASTA format) BGI_Gossypium_hirsutum_v1.0.cds.gz
Predicted gene proteins (FASTA format) BGI_Gossypium_hirsutum_v1.0.pep.gz
Predicted genes and CDS (GFF3 format) BGI_Gossypium_hirsutum_v1.0.cds.gff.gz
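
As a quick sanity check after downloading one of the GFF3 files listed above, the records can be tallied by feature type. The sketch below is only an illustration: it assumes the file follows the standard nine-column, tab-delimited GFF3 layout and has been saved locally, and the function name `count_feature_types` is hypothetical.

```python
import gzip
from collections import Counter


def count_feature_types(gff3_path):
    """Count GFF3 records by feature type (column 3) in a gzipped file."""
    counts = Counter()
    with gzip.open(gff3_path, "rt") as handle:
        for line in handle:
            # An optional ##FASTA directive ends the annotation section.
            if line.startswith("##FASTA"):
                break
            # Skip blank lines, directives, and comments.
            if not line.strip() or line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 9:
                continue  # not a well-formed GFF3 record
            counts[fields[2]] += 1
    return counts


# Example call, assuming the file has been downloaded to the working directory:
# print(count_feature_types("BGI_Gossypium_hirsutum_v1.0.cds.gff.gz"))
```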


Marker Alignments
Marker alignments were performed by the CottonGen Team of Main Bioinformatics Lab at WSU. The alignment tool 'BLAT' was used to map marker sequences from CottonGen to the G. hirsutum genome assembly. Markers were required to align at 90% identity over 97% of their length. For SSRs and RFLPs, gap size was restricted to 1,000 bp or less, with fewer than 2 gaps. For dbSNPs and InDels, gap size was restricted to 2 bp, with fewer than 2 gaps. The available files are in GFF3 format. Markers available in CottonGen and CMap are linked to JBrowse.
CottonGen SNP markers mapped to genome G.hirsutum_CGP-BGI_v1.0_SNP.gff3.gz
CottonGen RFLP markers mapped to genome G.hirsutum_CGP-BGI_v1.0_RFLP.gff3.gz
CottonGen SSR markers mapped to genome G.hirsutum_CGP-BGI_v1.0_SSR.gff3.gz
CottonGen InDel markers mapped to genome G.hirsutum_CGP-BGI_v1.0_CTG_Indels.gff3.gz
NCBI Cotton dbSNPs mapped to genome G.hirsutum_CGP-BGI_v1.0_dbSNP.gff3.gz
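
The marker filtering criteria described above could be applied to BLAT output roughly as follows. This is a sketch under stated assumptions, not the CottonGen team's actual script: it assumes headerless PSL output, computes identity as matching bases over aligned bases, measures coverage on the marker (query) sequence, and counts gaps on the genome (target) side. The names `PslHit`, `parse_psl_line`, and `passes_marker_filter` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PslHit:
    matches: int
    mismatches: int
    rep_matches: int
    t_num_insert: int   # number of gaps (inserts) on the genome side
    t_base_insert: int  # total bases in those gaps
    q_name: str
    q_size: int
    q_start: int
    q_end: int


def parse_psl_line(line):
    """Parse the PSL columns needed for filtering (standard 21-column layout)."""
    f = line.rstrip("\n").split("\t")
    return PslHit(
        matches=int(f[0]), mismatches=int(f[1]), rep_matches=int(f[2]),
        t_num_insert=int(f[6]), t_base_insert=int(f[7]),
        q_name=f[9], q_size=int(f[10]), q_start=int(f[11]), q_end=int(f[12]),
    )


def passes_marker_filter(hit, max_gap_bases, min_identity=0.90,
                         min_coverage=0.97, max_gaps=1):
    """Apply identity, coverage, and gap limits (assumed interpretation)."""
    aligned = hit.matches + hit.rep_matches + hit.mismatches
    if aligned == 0:
        return False
    identity = (hit.matches + hit.rep_matches) / aligned
    coverage = (hit.q_end - hit.q_start) / hit.q_size
    return (identity >= min_identity
            and coverage >= min_coverage
            and hit.t_num_insert <= max_gaps        # "fewer than 2 gaps"
            and hit.t_base_insert <= max_gap_bases)


# Hypothetical PSL record: a 100 bp marker aligned with one 500 bp genome gap.
example = "\t".join([
    "98", "2", "0", "0", "0", "0", "1", "500", "+", "marker1", "100", "0",
    "100", "scaffold1", "10000", "1000", "1600", "2", "50,50,", "0,50,",
    "1000,1550,",
])
hit = parse_psl_line(example)
print(passes_marker_filter(hit, max_gap_bases=1000))  # SSR/RFLP-style limit
print(passes_marker_filter(hit, max_gap_bases=2))     # dbSNP/InDel-style limit
```

With at most one gap allowed, the total gap bases equal the size of that single gap, which is why one `max_gap_bases` threshold is enough here.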


Transcript Alignments
Transcript alignments were performed by the CottonGen Team of Main Bioinformatics Lab at WSU. The alignment tool 'BLAT' was used to map transcripts to the G. hirsutum genome assembly. Only alignments with an alignment length of at least 97% and at least 90% identity were retained. The available files are in GFF3 format.


dbEST for all Gossypium G.hirsutum_CGP-BGI_v1.0_dbEST.gff3.gz
dbEST for G. arboreum G.hirsutum_CGP-BGI_v1.0_dbEST_G-arboreum.gff3.gz
dbEST for G. barbadense G.hirsutum_CGP-BGI_v1.0_dbEST_G-barbadense.gff3.gz
dbEST for G. herbaceum G.hirsutum_CGP-BGI_v1.0_dbEST_G-herbaceum.gff3.gz
dbEST for G. hirsutum G.hirsutum_CGP-BGI_v1.0_dbEST_G-hirsutum.gff3.gz
dbEST for G. raimondii G.hirsutum_CGP-BGI_v1.0_dbEST_G-raimondii.gff3.gz
Unigene from NCBI for all Gossypium G.hirsutum_CGP-BGI_v1.0_NCBI-unigenes.gff3.gz
Unigene from NCBI for G. hirsutum G.hirsutum_CGP-BGI_v1.0_NCBI_G-hirsutum-unigene.gff3.gz
Unigene from NCBI for G. raimondii G.hirsutum_CGP-BGI_v1.0_NCBI_G-raimondii-unigene.gff3.gz
CottonGen unigene v1.0 G.hirsutum_CGP-BGI_v1.0_cottongen-unigenes.gff3.gz
J. Udall 2012 Unigene contigs G.hirsutum_CGP-BGI_v1.0_Udall2012-unigenes.gff3.gz
PlantGDB Gossypium unigenes G.hirsutum_CGP-BGI_v1.0_PlantGDB_Gossypium.gff3.gz
PlantGDB G. arboreum unigenes G.hirsutum_CGP-BGI_v1.0_PlantGDB_arboreum.gff3.gz
PlantGDB G. barbadense unigenes G.hirsutum_CGP-BGI_v1.0_PlantGDB_barbadense.gff3.gz
PlantGDB G. hirsutum unigenes G.hirsutum_CGP-BGI_v1.0_PlantGDB_hirsutum.gff3.gz
PlantGDB G. raimondii unigenes G.hirsutum_CGP-BGI_v1.0_PlantGDB_raimondii.gff3.gz
TIGR CGI unigenes G.hirsutum_CGP-BGI_v1.0_TIGR_CGI.gff3.gz


Protein Alignments
Protein alignments available below were performed by the CottonGen Team of the Main Bioinformatics Lab at WSU. The alignment tool 'exonerate' was used to map protein sequences onto the G. hirsutum CGP-BGI v1.0 genome. Only alignments with a percent identity of at least 90% were retained.


Protein Homology
Protein homology was performed by the CottonGen Team of Main Bioinformatics Lab at WSU. Proteins from the G. hirsutum assembly were mapped against proteins from other genomes and databases using blastp with an e-value cutoff of 1e-6. Only the best match was kept. The available files are in Excel 2007 format.
Theobroma cacao v1.0 proteins G.hirsutum_CGP-BGI_v1.0_vs_Cacao.xlsx
Arabidopsis thaliana TAIR10 proteins G.hirsutum_CGP-BGI_v1.0_vs_TAIR10.xlsx
Oryza sativa MSU v7.0 proteins G.hirsutum_CGP-BGI_v1.0_vs_Rice.xlsx
Populus trichocarpa v2.0 proteins G.hirsutum_CGP-BGI_v1.0_vs_Poplar.xlsx
Vitis vinifera proteins G.hirsutum_CGP-BGI_v1.0_vs_Grape.xlsx
Glycine max v1.0 proteins G.hirsutum_CGP-BGI_v1.0_vs_Soybean.xlsx
Uniprot SwissProt proteins G.hirsutum_CGP-BGI_v1.0_vs_SwissProt.xlsx
Uniprot TrEMBL proteins G.arboreum_CGP-BGI_v1.0_vs_TrEMBL.xlsx
NCBI nr proteins G.arboreum_CGP-BGI_v1.0_vs_nr.xlsx
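
The "best match only" step described above can be sketched in Python. This is an illustration rather than the procedure actually used: it assumes blastp was run with tabular output (`-outfmt 6`, twelve default columns), treats the hit with the highest bit score as the best match, and applies the e-value cutoff inclusively. The function name `best_hits` and the sample identifiers are hypothetical.

```python
def best_hits(tabular_lines, max_evalue=1e-6):
    """Keep the highest-bit-score hit per query from BLAST tabular output."""
    best = {}
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        query, subject = f[0], f[1]
        evalue, bitscore = float(f[10]), float(f[11])
        if evalue > max_evalue:
            continue  # fails the e-value cutoff
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical hits: geneA has two passing hits, geneB has none.
sample = [
    "geneA\tsubj1\t95.0\t200\t10\t0\t1\t200\t1\t200\t1e-50\t300",
    "geneA\tsubj2\t99.0\t200\t2\t0\t1\t200\t1\t200\t1e-80\t410",
    "geneB\tsubj3\t40.0\t50\t30\t0\t1\t50\t1\t50\t0.5\t25",
]
for query, (subject, evalue, bitscore) in best_hits(sample).items():
    print(query, subject, evalue, bitscore)
```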


Authors  Fuguang Li, Guangyi Fan, Cairui Lu, Guanghui Xiao, Changsong Zou, Russell J Kohel, Zhiying Ma, Haihong Shang, Xiongfeng Ma, Jianyong Wu, Xinming Liang, Gai Huang, Richard G Percy, Kun Liu, Weihua Yang, Wenbin Chen, Xiongming Du, Chengcheng Shi, Youlu Yuan, Wuwei Ye, Xin Liu, Xueyan Zhang, Weiqing Liu, Hengling Wei, Shoujun Wei, Guodong Huang, Xianlong Zhang, Shuijin Zhu, He Zhang, Fengming Sun, Xingfen Wang, Jie Liang, Jiahao Wang, Qiang He, Leihuan Huang, Jun Wang, Jinjie Cui, Guoli Song, Kunbo Wang, Xun Xu, John Z Yu, Yuxian Zhu & Shuxun Yu
Title  Genome sequence of cultivated Upland cotton (Gossypium hirsutum TM-1) provides insights into genome evolution
Journal Nature Biotechnology
Volume 33
Pages 524-530
Year 2015


Functional Annotation
Functional annotation for Gossypium hirsutum (AD1) Genome BGI Assembly v1.0 (Performed by BGI)
GO Annotation BGI_Gossypium_hirsutum_gene.GO.result.gz
Interpro Annotation BGI_Gossypium_hirsutum_gene.InterPro.result.gz
KEGG Annotation BGI_Gossypium_hirsutum_gene.KEGG.result.gz
Swissprot Annotation BGI_Gossypium_hirsutum_gene.Swissprot.result.gz
TrEMBL Annotation BGI_Gossypium_hirsutum_gene.TrEMBL.result.gz


Functional annotation for Gossypium hirsutum (AD1) Genome BGI Assembly v1.0 (Performed by the CottonGen Team of the Main Bioinformatics Lab at WSU)


Interpro Analysis G.hirsutum_BGI_v1.0_interpro.txt.gz
GO annotation G.hirsutum_BGI_v1.0_genes2GO.txt.gz
IPR Terms G.hirsutum_BGI_v1.0_genes2IPR.txt.gz
KEGG Orthologs G.hirsutum_BGI_v1.0_KEGG.orthologs.txt.gz
KEGG Pathways G.hirsutum_BGI_v1.0_KEGG.pathways.txt.gz

All assembly and annotation files are available for download by selecting the desired data type in the left-hand "Resources" sidebar. Each data type page provides a description of the available files and links to download them. Alternatively, you can browse all available files on the FTP repository.