Gossypium hirsutum (AD1) Genome - Texas Interim release UTX-JGI v1.1

Genome Overview
Analysis Name: Gossypium hirsutum (AD1) Genome - Texas Interim release UTX-JGI v1.1
Method: genome sequence (v. 1.1). FALCON
Source: Phytozome
Date performed: 2017-09-04

Please Note: This genome assembly is made available through a "Reserved Analyses" restriction. Please see the usage policy below for further details.

About the Assembly
This v1.1 annotation release is based on genome assembly v1.0, a high-quality version of the Gossypium hirsutum genome sequenced from high-quality large-molecule genomic DNA of G. hirsutum L. acc. TM-1, the same genotype that was used to construct the physical map and sequence BAC ends (Saski et al. in review). The genome was sequenced with a combination of PacBio RS II long reads and Illumina read pairs. An additional 110 offspring from a TM-1 x 3-79 (Gossypium barbadense L. acc. 3-79) cross were resequenced with Illumina. The release used synteny with the published G. hirsutum assembly (Zhang et al. 2015 Nature Biotechnology 33:531) and included integration of BAC end sequences and physical maps (Saski et al. 2017, in revision) and additional screening of small repetitive scaffolds.

Summary
Scaffold total 5,355
Contig total 13,583
Scaffold sequence total 2,341.9 Mb
Contig sequence total 2,259.6 Mb (3.5% gap)
Scaffold N50 (L50) 11 (90.4 Mb)
Contig N50 (L50) 1,514 (389.4 kb)

* 636 scaffolds are > 50kb in size, representing approximately 95.9% of the genome
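
The summary figures above can be reproduced from a list of sequence lengths. The following is a minimal sketch, not the pipeline used for this release: the function names and the toy lengths are hypothetical, and only the two totals passed to gap_percent come from the table.

```python
def half_assembly_stats(lengths):
    """Return (count, length): the smallest number of sequences, taken from
    longest to shortest, whose combined length reaches half of the total,
    and the length of the shortest sequence in that set."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return count, length
    raise ValueError("lengths must be non-empty")


def gap_percent(scaffold_total_mb, contig_total_mb):
    """Percentage of scaffold sequence that is not covered by contigs."""
    return 100 * (scaffold_total_mb - contig_total_mb) / scaffold_total_mb


# Toy lengths (hypothetical, not taken from the release).
toy_count, toy_length = half_assembly_stats([100, 80, 60, 40, 20])

# Totals from the summary table: 2,341.9 Mb of scaffolds, 2,259.6 Mb of contigs.
release_gap = gap_percent(2341.9, 2259.6)
print(toy_count, toy_length, round(release_gap, 1))
```

For the real assembly, the lengths would be read from the FASTA file; the table reports the pair as a count followed by a length in parentheses.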

 

Sequencing, Assembly, and Annotation

How was the assembly generated?
This release is a high-quality version of the Gossypium hirsutum genome. It was sequenced to 77.79x coverage with PacBio reads (mean read size 9.6 kb) and to 54x coverage with Illumina reads (2x250, 800 bp insert size), the latter used to polish residual homozygous SNPs and indels. The main assembly was performed using FALCON, and the resulting assembly was polished using Quiver. A total of 148,239 unique, non-repetitive, non-overlapping 1 kb sequences were generated from the published G. hirsutum assembly and aligned to the polished PacBio G. hirsutum assembly. Additionally, 4,920,681 51-mer markers were generated from a TM-1 x 3-79 cross and used to haplotype 110 offspring; these markers were also aligned to the polished G. hirsutum assembly. Scaffold breaks were identified in two ways: (a) as an abrupt change in the G. hirsutum linkage group based on synteny with the published assembly, and (b) as an abrupt change in haplotype for a majority of offspring based on the marker calls. A total of 1,748 breaks were made. Scaffolds were then oriented, ordered, and assembled into 26 chromosomes using synteny and the 51-mer markers. A total of 8,228 joins were made during this process. Finally, homozygous SNPs and indels were corrected in the release sequence using ~54x of Illumina reads (2x250, 800 bp insert).
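
Break criterion (a) can be illustrated with a small sketch. This is an assumption-laden toy, not the procedure used for the release: the function name find_breaks, the min_run noise threshold, and the input layout (marker position paired with the linkage group it maps to in the published assembly) are all hypothetical.

```python
from itertools import groupby


def find_breaks(hits, min_run=3):
    """Locate abrupt linkage-group changes along one scaffold.

    hits: list of (position, linkage_group) tuples sorted by position.
    Runs shorter than min_run markers are treated as noise and skipped.
    Returns a list of (left_position, right_position) intervals, each
    bracketing a candidate break between two different linkage groups.
    """
    runs = []  # each entry: [linkage_group, first_position, last_position]
    for group_name, members in groupby(hits, key=lambda hit: hit[1]):
        positions = [position for position, _ in members]
        if len(positions) < min_run:
            continue  # too few markers to support a change
        if runs and runs[-1][0] == group_name:
            runs[-1][2] = positions[-1]  # same group resumes: extend the run
        else:
            runs.append([group_name, positions[0], positions[-1]])
    return [(left[2], right[1]) for left, right in zip(runs, runs[1:])]


# Hypothetical marker hits: one stray D05 marker, then a sustained switch.
toy_hits = [
    (1000, "A01"), (2000, "A01"), (3000, "A01"),
    (4000, "D05"),
    (5000, "A01"), (6000, "A01"), (7000, "A01"),
    (8000, "D05"), (9000, "D05"), (10000, "D05"),
]
print(find_breaks(toy_hits))
```

The single D05 marker at position 4000 is ignored as noise, while the sustained switch after position 7000 is reported as a candidate break.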

Completeness of the euchromatic portion of the genome assembly was assessed by aligning annotated genes from the G. raimondii v2.0 release. Where a gene had alternative splice forms, the longest was selected. The alignments were screened, and those with less than 90% identity and 85% coverage were excluded. This is a routine test to determine whether significant portions of the genome are missing. The final results are given below:

  • 37,223 total sequences.
  • 36,832 sequences (99.7%) placed at 90% identity and 85% coverage.
  • 352 sequences (0.24%) placed at less than 90% identity and 85% coverage.
  • 39 sequences (0.06%) are not found.
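
The three categories in the list above can be expressed as a simple classification. The sketch below is illustrative only: the function name, the dictionary layout, and the toy gene identifiers are hypothetical, and it assumes that an alignment failing either threshold counts as placed below threshold.

```python
def classify_genes(gene_ids, best_alignment, min_identity=90.0, min_coverage=85.0):
    """Count genes by placement category.

    best_alignment maps a gene ID to (percent_identity, percent_coverage)
    for its best alignment; genes without an alignment are absent.
    """
    counts = {"placed": 0, "below_threshold": 0, "not_found": 0}
    for gene_id in gene_ids:
        hit = best_alignment.get(gene_id)
        if hit is None:
            counts["not_found"] += 1
        elif hit[0] >= min_identity and hit[1] >= min_coverage:
            counts["placed"] += 1
        else:
            counts["below_threshold"] += 1
    return counts


# Hypothetical input: four genes, one of which has no alignment at all.
toy_genes = ["geneA", "geneB", "geneC", "geneD"]
toy_alignments = {
    "geneA": (98.2, 99.0),  # passes both thresholds
    "geneB": (91.0, 60.0),  # coverage too low
    "geneC": (85.0, 95.0),  # identity too low
}
print(classify_genes(toy_genes, toy_alignments))
```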

References:

Haas, B.J., Delcher, A.L., Mount, S.M., Wortman, J.R., Smith, R.K., Jr., Hannick, L.I., Maiti, R., Ronning, C.M., Rusch, D.B., Town, C.D. et al. (2003) Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31: 5654-5666. http://nar.oupjournals.org/cgi/content/full/31/19/5654

Smit, A.F.A., Hubley, R. and Green, P. RepeatMasker Open-3.0. 1996-2011.

Yeh, R.-F., Lim, L.P. and Burge, C.B. (2001) Computational inference of homologous gene structures in the human genome. Genome Res. 11: 803-816.

Salamov, A.A. and Solovyev, V.V. (2000) Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10: 516-522.

 

Restrictions on dataset usage
Gossypium hirsutum genome v1.1 data are made available before scientific publication according to the Ft. Lauderdale Accord. By accessing these data, you agree not to publish any articles containing analyses of genes or genomic data on a whole-genome or chromosome scale prior to publication by the principal investigators of a comprehensive genome analysis without the consent of the project's investigators listed in Contacts below ("Reserved Analyses"). "Reserved Analyses" include the identification of complete (whole-genome) sets of genomic features such as genes, gene families, regulatory elements, repeat structures, GC content, or any other genome feature, and whole-genome- or chromosome-scale comparisons with other species. The embargo on publication of Reserved Analyses by researchers outside of the Gossypium hirsutum Genome Sequencing Project is expected to extend until the publication of the results of the sequencing project is accepted. Studies of any type on the reserved data sets that are not in direct competition with those planned by the principal investigators may also be undertaken after an agreement with the project's principal investigators. The assembly and sequence data should not be redistributed or repackaged without permission from the project's principal investigators.

We request that potential users of this sequence assembly contact the individuals listed under Contacts with their plans to ensure that the proposed usage of sequence data is not considered a Reserved Analysis.
 

Contacts
Principal Investigators:
Z. Jeffrey Chen (University of Texas at Austin) (email: zjchen AT austin DOT utexas DOT edu)
Jane Grimwood (HudsonAlpha Institute for Biotechnology) (email: jgrimwood AT hudsonalpha DOT org)

Funding
The work is supported by funding from the National Science Foundation (IOS1444552) to Z. Jeffrey Chen (PI), Jane Grimwood (co-PI), Chris Saski (co-PI), Brian Scheffler (co-PI), and David Stelly (co-PI), from USDA-ARS (6402-21310-004-11S and 6402-21310-004) to Daniel Peterson and Brian Scheffler, and from Cotton Incorporated (13-694, 13-965 and 14-371) to David Stelly, Jeremy Schmutz, and Z. Jeffrey Chen.

 

Assembly

The assembly (chromosomes and scaffolds) for Gossypium hirsutum (AD1) Genome Tx-JGI v1.1 is also available from Phytozome.

Assembly (chromosomes & scaffolds) (FASTA format) Tx-JGI_G.hirsutum_v1.1.fa.gz
Assembly (hard masked)  (FASTA format) Tx-JGI_G.hirsutum_v1.1.hardmasked.fa.gz
Assembly (soft masked)  (FASTA format) Tx-JGI_G.hirsutum_v1.1.softmasked.fa.gz
Genes
Predicted genes (GFF3 format) Tx-JGI_G.hirsutum_v1.1.gene.gff3.gz
Predicted genes with exons (GFF3 format) Tx-JGI_G.hirsutum_v1.1.gene_exons.gff3.gz
Transcript sequences (FASTA format) Tx-JGI_G.hirsutum_v1.1.transcript.fa.gz
Primary transcript sequences (FASTA format) Tx-JGI_G.hirsutum_v1.1.transcript_primaryTranscriptOnly.fa.gz
Coding sequences (CDS) (FASTA format) Tx-JGI_G.hirsutum_v1.1.cds.fa.gz
Coding sequences (CDS) for primary transcripts (FASTA format) Tx-JGI_G.hirsutum_v1.1.cds_primaryTranscriptOnly.fa.gz
Protein sequences (FASTA format) Tx-JGI_G.hirsutum_v1.1.protein.fa.gz
Protein sequences from primary transcripts only (FASTA format) Tx-JGI_G.hirsutum_v1.1.protein_primaryTranscriptOnly.fa.gz
Gene prediction
Assemblies of 169,999 RNA-seq transcripts were constructed from about 280M pairs of stranded paired-end 150 bp Illumina RNA-seq reads from 13 tissues (leaf, stem, root, fiber, ovule, cotyledon, hypocotyl, petal, meristem, pistil, stamen, exocarp, immature cotton squares) of Gossypium hirsutum using PERTRAN (Shu et al., unpublished, using GSNAP as read aligner). A total of 142,414 transcript assemblies were constructed using PASA (Haas, 2003) from the RNA-seq data and 507,810 ESTs. Loci were determined by transcript assembly alignments and/or EXONERATE alignments of proteins from Arabidopsis, grape, cacao, soybean, poplar, rice and Brachypodium genomes and Swiss-Prot proteomes to the G. hirsutum genome, which had been soft-masked for repeats using RepeatMasker (Smit, 1996-2012). Gene models were predicted by homology-based predictors, mainly FGENESH+ (Salamov, 2000), FGENESH_EST (similar to FGENESH+, but with ESTs as splice site and intron input instead of protein/translated ORF), and GenomeScan (Yeh, 2001). The best-scored predictions for each locus were selected using multiple positive factors, including EST and protein support, and one negative factor: overlap with repeats. The selected gene predictions were improved by PASA; improvements include adding UTRs, correcting splicing, and adding alternative transcripts. PASA-improved gene model proteins were subjected to protein homology analysis against the above-mentioned proteomes to obtain a Cscore and protein coverage. The Cscore is the ratio of a protein's BLASTP score to the mutual best hit (MBH) BLASTP score, and protein coverage is the highest percentage of the protein aligned to the best of the homologs. PASA-improved transcripts were selected based on Cscore, protein coverage, EST coverage, and the overlap of their CDS with repeats. A transcript whose CDS overlapped repeats by less than 20% was selected if its Cscore was at least 0.5 and its protein coverage at least 0.5, or if it had EST coverage. For gene models whose CDS overlaps repeats by more than 20%, the Cscore must be at least 0.9 and the homology coverage at least 70% for the model to be selected. The selected gene models were subjected to Pfam analysis, and gene models whose protein is more than 30% in Pfam TE domains were removed. The final gene set has 66,577 protein-coding genes and 87,800 protein-coding transcripts.
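
The selection thresholds described above can be written as a single predicate. This is a hedged sketch of the stated rules, not the annotation pipeline's code: the Transcript class and its field names are hypothetical, all fractions are expressed on a 0 to 1 scale, and a repeat overlap of exactly 20% (which the text does not assign to either case) is assumed to fall under the stricter rule.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    cscore: float            # BLASTP score divided by the MBH BLASTP score
    protein_coverage: float  # fraction of the protein aligned to its best homolog
    has_est_coverage: bool   # True if the transcript has EST support
    repeat_overlap: float    # fraction of the CDS overlapping repeats
    pfam_te_fraction: float  # fraction of the protein in Pfam TE domains


def is_selected(t):
    """Apply the thresholds quoted in the text to one transcript."""
    if t.pfam_te_fraction > 0.30:
        return False  # mostly transposable-element domains: removed
    if t.repeat_overlap < 0.20:
        homology_ok = t.cscore >= 0.5 and t.protein_coverage >= 0.5
        return homology_ok or t.has_est_coverage
    # Assumption: overlap of exactly 20% is handled by the stricter rule.
    return t.cscore >= 0.9 and t.protein_coverage >= 0.70


# Hypothetical transcripts.
well_supported = Transcript(0.8, 0.9, False, 0.05, 0.0)
est_only = Transcript(0.2, 0.1, True, 0.10, 0.0)
repeat_rich_weak = Transcript(0.7, 0.8, True, 0.40, 0.0)
repeat_rich_strong = Transcript(0.95, 0.75, False, 0.40, 0.0)
te_like = Transcript(0.95, 0.95, True, 0.05, 0.50)
print([is_selected(t) for t in
       (well_supported, est_only, repeat_rich_weak, repeat_rich_strong, te_like)])
```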
Protein Homology
Protein homology analysis was performed by the CottonGen Team of the Main Bioinformatics Lab at WSU. Transcripts from the G. hirsutum Tx-JGI v1.1 assembly were searched against proteins from other genomes and databases using blastx with an e-value cutoff of 1e-6 for TAIR10 and 1e-9 for the SwissProt and TrEMBL databases. Only the best match was kept. The available files are in Excel 2007 format.
 
Arabidopsis thaliana TAIR10 proteins G.hirsutum_Tx-JGI_v1.1_vs_TAIR10.xlsx
Uniprot SwissProt proteins G.hirsutum_Tx-JGI_v1.1_vs_SwissProt.xlsx
Uniprot TrEMBL proteins G.hirsutum_Tx-JGI_v1.1_vs_TrEMBL.xlsx
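
Keeping only the best match per transcript under an e-value cutoff can be sketched as follows. The sketch assumes BLAST+ tabular output (-outfmt 6) with its default twelve columns, treats the cutoff as inclusive, and takes the highest bit score as the "best" match; the source does not state the output format or the ranking criterion, and the identifiers in the toy input are hypothetical.

```python
def best_hits(lines, max_evalue):
    """Return {query: (subject, evalue, bitscore)} with one best hit per query.

    Expects tab-separated BLAST tabular lines in the default column order:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore.
    """
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # fails the e-value cutoff
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical blastx results for two transcripts.
toy_lines = [
    "tx1\tAT1G01010.1\t75.0\t200\t50\t0\t1\t600\t1\t200\t1e-50\t180.0",
    "tx1\tAT2G02020.1\t80.0\t210\t42\t0\t1\t630\t1\t210\t1e-60\t210.0",
    "tx2\tAT3G03030.1\t40.0\t50\t30\t0\t1\t150\t1\t50\t1e-3\t35.0",
]
print(best_hits(toy_lines, max_evalue=1e-6))
```

With the TAIR10 cutoff of 1e-6, tx1 keeps its higher-scoring hit and tx2 is dropped because its only hit fails the cutoff.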
Downloads

Assembly and annotation files are available for download by selecting the desired data type in the left-hand "Resources" sidebar. Each data type page provides a description of the available files and links to download them. Alternatively, you can browse all available files on the FTP repository.

Functional Annotation

Functional annotation performed by JGI

PFAM/Panther/KOG/KEGG/GO and TAIR10 homology G.hirsutum_Tx-JGI_v1.1.annotation_info.txt.gz

 

Functional annotation performed by the CottonGen Team of the Main Bioinformatics Lab at WSU

Interpro Analysis G.hirsutum_Tx-JGI_v1.1_interpro.txt.gz
GO annotation G.hirsutum_Tx-JGI_v1.1_genes2GO.txt.gz
IPR Terms G.hirsutum_Tx-JGI_v1.1_genes2IPR.txt.gz
KEGG Orthologs G.hirsutum_Tx-JGI_v1.1_KEGG-orthologs.txt.gz
KEGG Pathways G.hirsutum_Tx-JGI_v1.1_KEGG-pathways.txt.gz
Markers
Marker alignments were performed by the CottonGen Team of the Main Bioinformatics Lab at WSU. The alignment tool 'BLAT' was used to map marker sequences from CottonGen to the G. hirsutum genome assembly. Markers required 90% identity over 97% of their length. For SSRs and RFLPs, gap size was restricted to 1,000 bp or less, with fewer than 2 gaps. For dbSNPs and indels, gap size was restricted to 2 bp, with fewer than 2 gaps. The available files are in GFF3 format. Markers available in CottonGen are linked to JBrowse.
 
CottonGen SNP-InDel markers mapped to genome G.hirsutum_Tx-JGI_v1.1_SNP-InDel
CottonGen RFLP markers mapped to genome G.hirsutum_Tx-JGI_v1.1_RFLP
CottonGen SSR markers mapped to genome G.hirsutum_Tx-JGI_v1.1_SSR
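
The marker filtering criteria above differ by marker type, which a small lookup table makes explicit. This is a sketch under assumptions: the MarkerAlignment fields, the type labels, and the way gaps would be derived from BLAT output are hypothetical; only the numeric thresholds come from the text.

```python
from dataclasses import dataclass

# Largest permitted gap in bp for each marker type (thresholds from the text;
# the type labels themselves are hypothetical).
MAX_GAP_BP = {"SSR": 1000, "RFLP": 1000, "SNP": 2, "InDel": 2}


@dataclass
class MarkerAlignment:
    marker_type: str  # one of the keys in MAX_GAP_BP
    identity: float   # percent identity of the alignment
    coverage: float   # percent of the marker length that is aligned
    gap_count: int    # number of gaps in the alignment
    largest_gap: int  # size of the largest gap in bp (0 if there are none)


def passes_filter(alignment):
    """Return True if the alignment meets the identity, coverage and gap rules."""
    return (
        alignment.identity >= 90.0
        and alignment.coverage >= 97.0
        and alignment.gap_count < 2
        and alignment.largest_gap <= MAX_GAP_BP[alignment.marker_type]
    )


# Hypothetical alignments: the same 500 bp gap is acceptable for an SSR
# but not for a SNP.
ssr_hit = MarkerAlignment("SSR", 95.0, 99.0, 1, 500)
snp_hit = MarkerAlignment("SNP", 95.0, 99.0, 1, 500)
print(passes_filter(ssr_hit), passes_filter(snp_hit))
```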
Transcript Alignments
Transcript alignments were performed by the CottonGen Team of the Main Bioinformatics Lab at WSU. The alignment tool 'BLAT' was used to map transcripts to the G. hirsutum genome assembly. Alignments with an alignment length of 97% and 90% identity were retained. The available files are in GFF3 format.

 

G. arboreum CottonGen RefTrans v1 G.hirsutum_Tx-JGI_v1.1_g.arboreum_cottongen_reftransV1
G. barbadense CottonGen RefTrans v1 G.hirsutum_Tx-JGI_v1.1_g.barbadense_cottongen_reftransV1
G. hirsutum CottonGen RefTrans v1 G.hirsutum_Tx-JGI_v1.1_g.hirsutum_cottongen_reftransV1
G. raimondii CottonGen RefTrans v1 G.hirsutum_Tx-JGI_v1.1_g.raimondii_cottongen_reftransV1
Repeats
Repeats for the G. hirsutum Tx-JGI v1.1 annotations are in GFF3 format. To obtain masked versions of the assembled chromosomes and scaffolds, click the 'Assembly' link in the left sidebar.
 
Repeats identified using RepeatMasker Tx-JGI_G.hirsutum_v1.1.repeats.gff3.gz