Gossypium raimondii (D5) genome NSF Assembly v1 & Annotation a1

Overview
Analysis Name: Gossypium raimondii (D5) genome NSF Assembly v1 & Annotation a1
Method: PacBio Sequel (Canu v. 1.6)
Source: D5.v1.pred.fa (=raimondii.juiced_V1.0.fasta)
Date performed: 2019-07-03

Material and Methods

DNA was extracted from G. raimondii (accession D5-4) plants using CTAB techniques (Kidwell and Osborn 1992). DNA concentration was measured with a Qubit Fluorometer (ThermoFisher, Inc.). The sequencing library was constructed according to PacBio recommendations at the BYU DNA Sequencing Center (DNASC). Fragments >18 kb were selected for sequencing via BluePippin (Sage Science, LLC). Prior to sequencing, the size distribution of fragments in the libraries was evaluated using a Fragment Analyzer (Advanced Analytical Technologies, Inc). Eight and eleven PacBio cells were sequenced from a single library each for G. raimondii and G. turneri, respectively, on the Pacific Biosciences Sequel system. For both genomes, the raw PacBio sequencing reads were assembled using Canu v1.6 with default parameters (Koren et al. 2017).

HiC libraries were constructed from G. raimondii leaf tissue at Northeast Normal University, China. Sequencing was performed at Annoroad Gene Technology Co., Ltd (Beijing, China). The HiC data of G. raimondii were mapped to the previous genome sequence of G. raimondii using HiC-Pro (Servant et al. 2015), and to the newly assembled CANU contigs of G. raimondii PacBio reads by PhaseGenomics. The HiC interactions were used as evidence of contig proximity and for scaffolding contig sequences. An initial draft genome sequence of pseudochromosomes (PGA assembly) was created using a custom Python script from PhaseGenomics.
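To illustrate what assembling a draft pseudochromosome from ordered contigs involves, here is a minimal sketch. It is not the PhaseGenomics script: the contig names, the (name, strand) ordering format, and the function names are hypothetical, and the fixed run of N bases between contigs is an assumption (100 bp is the typical gap size listed in Table 1).

```python
# Hypothetical sketch: join ordered, oriented contigs into one pseudochromosome.
# Not the PhaseGenomics script; the input format and names are assumptions.

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def build_pseudochromosome(contigs, ordering, gap_length=100):
    """Concatenate contigs in the given order, separated by runs of N.

    contigs  -- dict mapping contig name to sequence
    ordering -- list of (contig name, strand) tuples, strand is "+" or "-"
    """
    pieces = []
    for name, strand in ordering:
        seq = contigs[name]
        if strand == "-":
            seq = reverse_complement(seq)
        elif strand != "+":
            raise ValueError(f"unknown strand {strand!r} for contig {name}")
        pieces.append(seq)
    return ("N" * gap_length).join(pieces)


# Toy input: two short "contigs", the second placed first and flipped.
toy_contigs = {"tig1": "AACC", "tig2": "GGGT"}
toy_ordering = [("tig2", "-"), ("tig1", "+")]
toy_chromosome = build_pseudochromosome(toy_contigs, toy_ordering, gap_length=3)
print(toy_chromosome)
```

Repositioning or reorienting a contig then amounts to editing the ordering list and rebuilding the sequence.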

DNA was also extracted from young G. raimondii leaves following the Bionano Plant protocol for high-molecular-weight DNA. DNA was purified, nicked, labeled, and repaired according to Bionano standard operating procedures for the Irys platform. Two optical maps of different enzymes (BspQI and BssSI) were assembled using the IrysSolve pipeline on the BYU Fulton SuperComputing cluster. The optical maps were combined into a two-enzyme composite optical map, which was aligned to the PGA assembly using an in silico labeled reference sequence. Conflicts between the Bionano maps and the PGA assembly were manually identified in the Bionano Access software by comparing the mapped Bionano contigs to the CANU contigs along the draft genome sequence. Conflicts between datasets were resolved by repositioning and reorienting CANU contigs in PGA ordering files followed by reconstruction of the fasta sequence, provided there was supporting or no-conflict evidence from the optical map (Durand et al. 2016). Multiple iterations of mapping, conflict resolution, and draft sequence construction resulted in the final, new genome sequence of G. raimondii.
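The idea behind an in silico labeled reference can be sketched as a search for each enzyme's recognition motif on both strands. This is an illustration only, not the Bionano software: the motif strings below are the recognition sequences commonly documented for BspQI and BssSI and should be checked against the supplier's specification, and reporting the 0-based motif start as the label position is a simplifying assumption.

```python
import re

# Assumed recognition sequences (verify against enzyme documentation).
MOTIFS = {"BspQI": "GCTCTTC", "BssSI": "CACGAG"}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def label_positions(sequence: str, motif: str) -> list:
    """Return sorted 0-based start positions of the motif on either strand."""
    sequence = sequence.upper()
    positions = set()
    for target in {motif, reverse_complement(motif)}:
        # A lookahead pattern also reports overlapping occurrences.
        for match in re.finditer(f"(?={target})", sequence):
            positions.add(match.start())
    return sorted(positions)


# Toy reference: one forward BspQI site, one reverse-strand BspQI site,
# and one forward BssSI site.
toy_reference = "AAGCTCTTCAAAGAAGAGCTTCACGAGTT"
bspqi_labels = label_positions(toy_reference, MOTIFS["BspQI"])
bsssi_labels = label_positions(toy_reference, MOTIFS["BssSI"])
print(bspqi_labels, bsssi_labels)
```

Comparing the spacing of such predicted labels with the spacing observed in the optical map is what lets a map support or contradict the placement of a contig.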

About the assembly

The G. raimondii genome was assembled from 43.7x PacBio coverage of raw sequence reads. The assembly consisted of 187 contigs with an N50 of 6.3 Mb (Table 1). The contigs were scaffolded using HiC by PhaseGenomics and the pseudomolecules were manually adjusted using JuiceBox (Durand et al. 2016). The final scaffolded assembly was independently verified using a composite optical map of two different enzymes. A comparison of assembly metrics between the previous genome sequence and our new genome sequence of G. raimondii illustrates a 45x improvement in contig length and a 97x reduction in the number of gaps. The cumulative gap length of the new assembly (17.6 kb) was reduced by 647x compared to the assembled gaps of the previous genome sequence (11,391 kb). The final genome assembly size was 14.9 Mb smaller than the previous assembly, representing 98% of previously assembled genome sequence in length.
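For readers unfamiliar with the N50 and N90 statistics used here, the following sketch shows the usual definition: the length of the contig at which the running total of contigs, sorted from longest to shortest, first reaches 50% (or 90%) of the total assembly length. The function name and the toy lengths are hypothetical and are not taken from the assembly.

```python
def nx(lengths, percent=50):
    """Return the Nx statistic (N50 by default) for a list of lengths."""
    if not lengths:
        raise ValueError("no lengths given")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length


# Toy contig lengths (total 300).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = nx(toy_lengths, 50)
toy_n90 = nx(toy_lengths, 90)
print(toy_n50, toy_n90)
```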

Table 1. Assembly metrics of the G. turneri genome, the current G. raimondii assembly (D5), and the previous G. raimondii assembly (Paterson et al. 2012). Lengths are in base pairs.

Summary of Improved Assembly  G. turneri (D10-BYU)  G. raimondii (D5-BYU current)  G. raimondii (D5-JGI 2012)
Contigs                       220                   187                            16,924
Max Contig                    23,475,487            24,216,129                     1,162,971
Mean Contig                   3,432,648             3,929,767                      43,597
Contig N50                    7,909,293             6,291,832                      136,998
Contig N90                    1,624,019             2,044,991                      32,166
Total Contig Length           755,182,540           734,866,495                    737,837,083
Assembly GC (%)               33.21                 33.19                          33.19
Max Scaffold                  67,704,245            65,701,939                     70,713,020
Mean Scaffold                 58,092,557            56,529,546                     57,632,930
Scaffold N50                  60,464,062            58,819,159                     62,175,169
Scaffold N90                  50,570,303            46,322,098                     45,765,648
Total Scaffold Length         755,203,240           734,884,094                    749,228,090
Captured Gaps                 207                   174                            16,911
Max Gap                       100                   200                            63,138
Mean Gap                      100                   101                            674
Gap N50                       100                   100                            2,607
Total Gap Length              20,700                17,599                         11,391,007
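
The fold changes quoted in the assembly description can be recomputed from Table 1. The sketch below assumes that the contig-length improvement refers to the ratio of contig N50 values and that the 98% figure compares total scaffold lengths; the variable names are illustrative.

```python
# Values copied from Table 1: current (D5-BYU) vs. previous (D5-JGI 2012).
current = {
    "contig_n50": 6_291_832,
    "gaps": 174,
    "gap_length": 17_599,
    "scaffold_length": 734_884_094,
}
previous = {
    "contig_n50": 136_998,
    "gaps": 16_911,
    "gap_length": 11_391_007,
    "scaffold_length": 749_228_090,
}

# Improvement in contig N50 (larger is better in the new assembly).
contig_n50_gain = current["contig_n50"] / previous["contig_n50"]

# Reductions in gap count and cumulative gap length (smaller is better).
gap_count_reduction = previous["gaps"] / current["gaps"]
gap_length_reduction = previous["gap_length"] / current["gap_length"]

# New scaffold length as a percentage of the previous scaffold length.
length_percent = 100 * current["scaffold_length"] / previous["scaffold_length"]

print(f"contig N50 gain:      {contig_n50_gain:.1f}x")
print(f"gap count reduction:  {gap_count_reduction:.1f}x")
print(f"gap length reduction: {gap_length_reduction:.1f}x")
print(f"length vs. previous:  {length_percent:.1f}%")
```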

Publication

Udall et al. 2019. De novo genome sequence assemblies of Gossypium raimondii and G. turneri.

Assembly

The de novo chromosome sequences for the G. raimondii 'D5-4' genome. These files belong to the NSF-D5 (also called BYU-D5) Assembly v1.0.

Chromosomes (FASTA format) G.raimondii_NSF-D5_assembly_v1.fasta.gz
Chromosomes (GFF3 format) G.raimondii_NSF-D5_assembly_v1.gff3

Genes

The predicted gene models, their alignments, and proteins for the G. raimondii 'D5-4' genome. These files belong to the NSF-D5 Assembly v1 & Annotation a1.

Transcript sequences (FASTA format) G.raimondii_NSF-D5_v1_a1_transcripts.gz
Protein sequences (FASTA format) G.raimondii_NSF-D5_v1_a1_proteins.gz

Downloads

All annotation files are available for download by selecting the desired data type in the left-hand "Resources" sidebar. Each data type page provides a description of the available files and links to download them. Alternatively, you can use the FTP repository for bulk download.

Functional Analysis

Functional annotation files for the Gossypium raimondii NSF Genome v1.0 are available for download below. The Gossypium raimondii NSF Genome proteins were analyzed using InterProScan in order to assign InterPro domains and Gene Ontology (GO) terms. Pathway analysis was performed using the KEGG Automatic Annotation Server (KAAS).

Downloads

GO assignments from InterProScan D5_NSF-v1.0_genes2GO.xlsx.gz
IPR assignments from InterProScan D5_NSF-v1.0_genes2IPR.xlsx.gz
Proteins mapped to KEGG Orthologs D5_NSF-v1.0_KEGG-orthologis.xlsx.gz
Proteins mapped to KEGG Pathways D5_NSF-v1.0_KEGG-pathways.xlsx.gz

Homology

Homology of the Gossypium raimondii NSF Genome v1.0 proteins was determined by pairwise sequence comparison using the blastp algorithm against various protein databases. An expectation value cutoff of less than 1e-9 was used for the NCBI nr database (Release 2018-05) and less than 1e-6 for the Arabidopsis proteins (Araport11), UniProtKB/SwissProt (Release 2019-01), and UniProtKB/TrEMBL (Release 2019-01) databases. The best hit reports are available for download in Excel format.
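
As a rough illustration of how best-hit reports can be derived from an E-value cutoff, the sketch below filters BLAST tabular output (the standard 12-column format) and keeps one hit per query. Whether the original pipeline used this format is not stated; choosing the best hit by highest bit score is an assumption, and the gene and subject identifiers are invented.

```python
def best_hits(tabular_lines, max_evalue):
    """Return {query: (subject, evalue, bitscore)} for hits below the cutoff.

    Expects the 12 standard tabular columns:
    qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore
    """
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue >= max_evalue:
            continue  # the cutoff is "less than", so equal values are dropped
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Invented example rows (identifiers are hypothetical).
example_rows = [
    "geneA\tsp|P1\t98.0\t200\t4\t0\t1\t200\t1\t200\t1e-50\t350",
    "geneA\tsp|P2\t60.0\t180\t72\t0\t1\t180\t5\t184\t1e-12\t90",
    "geneB\tsp|P3\t35.0\t80\t52\t0\t1\t80\t1\t80\t1e-7\t45",
]

strict_hits = best_hits(example_rows, max_evalue=1e-9)   # nr-style cutoff
relaxed_hits = best_hits(example_rows, max_evalue=1e-6)  # cutoff for the others
print(sorted(strict_hits), sorted(relaxed_hits))
```

In this toy data, geneB has a hit only under the more permissive 1e-6 cutoff, so it would land in a "no hit" file under the 1e-9 cutoff.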

 

Protein Homologs

G. raimondii NSF Genome v1.0 proteins with NCBI nr homologs (EXCEL file) D5_NSF_vs_nr.xlsx.gz
G. raimondii NSF Genome v1.0 proteins with NCBI nr homologs (FASTA file) D5_NSF_vs_nr_hit.fasta.gz
G. raimondii NSF Genome v1.0 proteins without NCBI nr homologs (FASTA file) D5_NSF_vs_nr_noHit.fasta.gz
G. raimondii NSF Genome v1.0 proteins with Arabidopsis (Araport11) homologs (EXCEL file) D5_NSF_vs_tair.xlsx.gz
G. raimondii NSF Genome v1.0 proteins with Arabidopsis (Araport11) homologs (FASTA file) D5_NSF_vs_tair_hit.fasta.gz
G. raimondii NSF Genome v1.0 proteins without Arabidopsis (Araport11) homologs (FASTA file) D5_NSF_vs_tair_noHit.fasta.gz
G. raimondii NSF Genome v1.0 proteins with SwissProt homologs (EXCEL file) D5_NSF_vs_swissprot.xlsx.gz
G. raimondii NSF Genome v1.0 proteins with SwissProt homologs (FASTA file) D5_NSF_vs_swissprot_hit.fasta.gz
G. raimondii NSF Genome v1.0 proteins without SwissProt homologs (FASTA file) D5_NSF_vs_swissprot_noHit.fasta.gz
G. raimondii NSF Genome v1.0 proteins with TrEMBL homologs (EXCEL file) D5_NSF_vs_trembl.xlsx.gz
G. raimondii NSF Genome v1.0 proteins with TrEMBL homologs (FASTA file) D5_NSF_vs_trembl_hit.fasta.gz
G. raimondii NSF Genome v1.0 proteins without TrEMBL homologs (FASTA file) D5_NSF_vs_trembl_noHit.fasta.gz

Transcript Alignments

Transcript alignments were performed by the CottonGen Team of Main Bioinformatics Lab at WSU. The alignment tool 'BLAT' was used to map transcripts to the G. raimondii genome assembly. Alignments covering at least 97% of the transcript length with at least 90% identity were retained. The available files are in GFF3 format.
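
A filter of this kind can be sketched against BLAT's PSL output, which has 21 tab-separated columns. The page does not say how identity and coverage were computed, so the definitions below are assumptions: identity is matches over aligned bases, and coverage is the aligned query span over the query length. The helper that builds toy PSL lines and all names in it are hypothetical.

```python
def psl_identity_coverage(psl_line):
    """Return (percent identity, percent query coverage) for one PSL line."""
    fields = psl_line.rstrip("\n").split("\t")
    if len(fields) < 21:
        raise ValueError("expected 21 tab-separated PSL columns")
    matches, mismatches, rep_matches = (int(x) for x in fields[0:3])
    q_size, q_start, q_end = (int(x) for x in fields[10:13])
    aligned = matches + mismatches + rep_matches
    identity = 100.0 * (matches + rep_matches) / aligned if aligned else 0.0
    coverage = 100.0 * (q_end - q_start) / q_size if q_size else 0.0
    return identity, coverage


def keep_alignment(psl_line, min_identity=90.0, min_coverage=97.0):
    """Apply the identity and coverage thresholds described above."""
    identity, coverage = psl_identity_coverage(psl_line)
    return identity >= min_identity and coverage >= min_coverage


def make_toy_psl(matches, mismatches, q_size, q_start, q_end):
    """Build a single-block PSL line with invented names, for demonstration."""
    span = q_end - q_start
    columns = [
        matches, mismatches, 0, 0,      # matches, misMatches, repMatches, nCount
        0, 0, 0, 0,                     # query and target insert counts/bases
        "+", "toy_transcript", q_size, q_start, q_end,
        "toy_chr", 100000, 1000, 1000 + span,
        1, f"{span},", f"{q_start},", "1000,",
    ]
    return "\t".join(str(c) for c in columns)


good_alignment = make_toy_psl(980, 10, 1000, 5, 995)   # 99% coverage
short_alignment = make_toy_psl(600, 0, 1000, 0, 600)   # 60% coverage
print(keep_alignment(good_alignment), keep_alignment(short_alignment))
```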

 

G. arboreum CottonGen RefTrans v1 G.raimondii_NSF-v1.0_g.arboreum_cottongen_reftransV1
G. barbadense CottonGen RefTrans v1 G.raimondii_NSF-v1.0_g.barbadense_cottongen_reftransV1
G. hirsutum CottonGen RefTrans v1 G.raimondii_NSF-v1.0_g.hirsutum_cottongen_reftransV1
G. raimondii CottonGen RefTrans v1 G.raimondii_NSF-v1.0_g.raimondii_cottongen_reftransV1

Markers

Marker alignments were performed by the CottonGen Team of Main Bioinformatics Lab at WSU. The alignment tool 'BLAT' was used to map marker sequences from CottonGen to the Gossypium raimondii genome assembly. Markers required 90% identity over 97% of their length. For SSRs and RFLPs, gap size was restricted to 1000 bp or less with fewer than 2 gaps. For dbSNPs and InDels, gap size was restricted to 2 bp with fewer than 2 gaps. The available files are in GFF3 format. Markers available in CottonGen are linked to JBrowse.
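
The marker rules above can be written as a small decision function. This is a sketch under stated assumptions, not the CottonGen pipeline: the `MarkerAlignment` record and its fields are hypothetical, the thresholds are read as "at least" 90% identity and 97% coverage, and the 2 bp limit for SNPs and InDels is read as "2 bp or less".

```python
from dataclasses import dataclass, field

# Maximum allowed size (bp) of any single alignment gap, per marker type.
MAX_GAP_BP = {"SSR": 1000, "RFLP": 1000, "SNP": 2, "InDel": 2}


@dataclass
class MarkerAlignment:
    """Hypothetical summary of one marker-to-genome alignment."""
    marker_type: str                 # "SSR", "RFLP", "SNP", or "InDel"
    identity: float                  # percent identity
    coverage: float                  # percent of marker length aligned
    gap_sizes: list = field(default_factory=list)  # bp of each gap


def keep_marker(aln: MarkerAlignment) -> bool:
    """Return True if the alignment satisfies the rules described above."""
    if aln.identity < 90.0 or aln.coverage < 97.0:
        return False
    if len(aln.gap_sizes) >= 2:      # fewer than 2 gaps are allowed
        return False
    limit = MAX_GAP_BP[aln.marker_type]
    return all(gap <= limit for gap in aln.gap_sizes)


ssr_ok = keep_marker(MarkerAlignment("SSR", 95.0, 99.0, [800]))
snp_big_gap = keep_marker(MarkerAlignment("SNP", 99.0, 100.0, [5]))
ssr_two_gaps = keep_marker(MarkerAlignment("SSR", 95.0, 99.0, [10, 20]))
print(ssr_ok, snp_big_gap, ssr_two_gaps)
```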
 
CottonGen SNP markers mapped to genome G.raimondii_D5_NSF-v1.0_SNP
CottonGen InDel markers mapped to genome G.raimondii_D5_NSF-v1.0_InDel
CottonGen RFLP markers mapped to genome G.raimondii_D5_NSF-v1.0_RFLP
CottonGen SSR markers mapped to genome G.raimondii_D5_NSF-v1.0_SSR