Gossypium raimondii (D5) genome NSF Assembly v1 & Annotation a1
Material and Methods
DNA was extracted from G. raimondii (accession D5-4) plants using CTAB techniques (Kidwell and Osborn 1992). DNA concentration was measured with a Qubit Fluorometer (ThermoFisher, Inc.). The sequencing library was constructed according to PacBio recommendations at the BYU DNA Sequencing Center (DNASC). Fragments >18 kb were selected for sequencing via BluePippin (Sage Science, LLC). Prior to sequencing, the size distribution of fragments in the libraries was evaluated using a Fragment Analyzer (Advanced Analytical Technologies, Inc.). Eight and eleven PacBio cells were sequenced from a single library for G. raimondii, respectively, on the Pacific Biosciences Sequel system. For both genomes, the raw PacBio sequencing reads were assembled with Canu v1.6 using default parameters (Koren et al. 2017).
HiC libraries were constructed from G. raimondii leaf tissue at Northeast Normal University, China. Sequencing was performed at Annoroad Gene Technology Co., Ltd (Beijing, China). The HiC data of G. raimondii were mapped to the previous genome sequence of G. raimondii using HiC-Pro (Servant et al. 2015), and to the newly assembled CANU contigs of G. raimondii PacBio reads by PhaseGenomics. The HiC interactions were used as evidence for contig proximity and for scaffolding contig sequences. An initial draft genome sequence of pseudochromosomes (PGA assembly) was created using a custom Python script from PhaseGenomics.
DNA was also extracted from young G. raimondii leaves following the Bionano Plant protocol for high-molecular-weight DNA. DNA was purified, nicked, labeled, and repaired according to Bionano standard operating procedures for the Irys platform. Two optical maps, one per enzyme (BspQI and BssSI), were assembled using the IrysSolve pipeline on the BYU Fulton SuperComputing cluster. The optical maps were combined into a two-enzyme composite optical map, which was aligned to the PGA assembly using an in silico labeled reference sequence. Conflicts between the Bionano maps and the PGA assembly were manually identified in the Bionano Access software by comparing the mapped Bionano contigs to the CANU contigs along the draft genome sequence. Conflicts between datasets were resolved by repositioning and reorienting CANU contigs in PGA ordering files, followed by reconstruction of the fasta sequence, provided there was supporting or no-conflict evidence from the optical map (Durand et al. 2016). Multiple iterations of mapping, conflict resolution, and draft sequence construction resulted in the final, new genome sequence of G. raimondii.
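The repositioning and reorienting step described above can be sketched in a few lines of Python. This is only an illustration, not the PhaseGenomics script: the ordering format (a list of contig name and orientation pairs), the function names, and the 100 bp run of N between contigs are all assumptions made for the example.

```python
# Hypothetical ordering format: a list of (contig_name, orientation) pairs,
# where orientation is "+" (as assembled) or "-" (reverse complement).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_pseudochromosome(contigs, ordering, gap_length=100):
    """Join contigs in the given order and orientation.

    contigs    -- dict mapping contig name to sequence
    ordering   -- list of (contig_name, orientation) pairs
    gap_length -- number of N characters placed between contigs
                  (an assumed value, not taken from the source)
    """
    gap = "N" * gap_length
    pieces = []
    for name, orientation in ordering:
        seq = contigs[name]
        if orientation == "-":
            seq = reverse_complement(seq)
        elif orientation != "+":
            raise ValueError(f"unknown orientation {orientation!r} for {name}")
        pieces.append(seq)
    return gap.join(pieces)
```

Resolving a conflict then amounts to editing the ordering list (moving an entry or flipping its orientation) and rebuilding the sequence.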
About the assembly
The G. raimondii genome was assembled from 43.7x PacBio coverage of raw sequence reads. The assembly consisted of 187 contigs with an N50 of 6.3 Mb (Table 1). The contigs were
Table 1. Assembly metrics of the G. turneri genome, the current G. raimondii assembly (D5), and the previous G. raimondii assembly (Paterson et al. 2012).
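For readers unfamiliar with the N50 metric reported for the assembly, the following Python sketch shows the usual definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The function name is illustrative, and this is not the tool used to produce Table 1.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
```

For example, contigs of lengths 2, 3, 4, 5, and 6 sum to 20; the two longest (6 and 5) already cover 11, so the N50 is 5.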
The chromosomes (de novo) for the G. raimondii 'D5-4' genome. These files belong to the NSF-D5 (also called BYU-D5) Assembly v1.0
The predicted gene models, their alignments, and proteins for the G. raimondii 'D5-4' genome. These files belong to the NSF-D5 Assembly v1 & Annotation a1
All annotation files are available for download by selecting the desired data type in the left-hand "Resources" side bar. Each data type page provides a description of the available files and links to download them. Alternatively, you can use the FTP repository for bulk download.
Functional annotation files for the Gossypium raimondii NSF Genome v1.0 are available for download below. The Gossypium raimondii NSF Genome proteins were analyzed using InterProScan to assign InterPro domains and Gene Ontology (GO) terms. Pathway analysis was performed using the KEGG Automatic Annotation Server (KAAS).
Homology of the Gossypium raimondii NSF Genome v1.0 proteins was determined by pairwise sequence comparison using the blastp algorithm against various protein databases. An expectation value (E-value) cutoff of less than 1e-9 was used for the NCBI nr database (Release 2018-05), and less than 1e-6 for the Arabidopsis proteins (Araport11), UniProtKB/SwissProt (Release 2019-01), and UniProtKB/TrEMBL (Release 2019-01) databases. The best hit reports are available for download in Excel format.
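A best-hit report of this kind could be derived along the following lines. This is a minimal sketch, assuming the blastp results are in the standard 12-column tabular format (E-value in column 11, bit score in column 12) and that "best" means the highest bit score among hits below the cutoff; the function name and both assumptions are illustrative, and the actual CottonGen procedure is not described in the source.

```python
import csv


def best_hits(tabular_lines, evalue_cutoff):
    """Return {query: (subject, evalue, bitscore)} for the best hit per query.

    tabular_lines -- iterable of tab-separated BLAST lines (12-column layout)
    evalue_cutoff -- hits are kept only if their E-value is below this value
    """
    best = {}
    for row in csv.reader(tabular_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue >= evalue_cutoff:
            continue
        current = best.get(query)
        if current is None or bitscore > current[2]:
            best[query] = (subject, evalue, bitscore)
    return best
```

With the cutoffs stated above, the function would be called with `1e-9` for the NCBI nr results and `1e-6` for the Araport11, SwissProt, and TrEMBL results.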
Transcript alignments were performed by the CottonGen Team of Main Bioinformatics Lab at WSU. The alignment tool 'BLAT' was used to map transcripts to the G. raimondii genome assembly. Alignments covering 97% of the transcript length with 90% identity were retained. The available files are in GFF3 format.
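The identity and length thresholds can be applied to BLAT's PSL output roughly as follows. This is a sketch under stated assumptions: identity is taken as matching bases over aligned bases, coverage as the aligned query span over the query size, and both thresholds are treated as inclusive. The source does not say how CottonGen computed these values, and the function names are hypothetical.

```python
def psl_identity_and_coverage(psl_line):
    """Compute (identity, coverage) from one tab-separated PSL record."""
    f = psl_line.rstrip("\n").split("\t")
    matches, mismatches, rep_matches = int(f[0]), int(f[1]), int(f[2])
    q_size, q_start, q_end = int(f[10]), int(f[11]), int(f[12])
    aligned = matches + mismatches + rep_matches
    # Assumed definitions; other conventions for percent identity exist.
    identity = (matches + rep_matches) / aligned if aligned else 0.0
    coverage = (q_end - q_start) / q_size if q_size else 0.0
    return identity, coverage


def keep_alignment(psl_line, min_identity=0.90, min_coverage=0.97):
    """Return True if the alignment meets both thresholds."""
    identity, coverage = psl_identity_and_coverage(psl_line)
    return identity >= min_identity and coverage >= min_coverage
```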
Marker alignments were performed by the CottonGen Team of Main Bioinformatics Lab at WSU. The alignment tool 'BLAT' was used to map marker sequences from CottonGen to the Gossypium raimondii genome assembly. Markers required 90% identity over 97% of their length. For SSRs and RFLPs, gap size was restricted to 1000 bp or less, with fewer than 2 gaps. For dbSNPs and Indels, gap size was restricted to 2 bp, with fewer than 2 gaps. The available files are in GFF3 format. Markers available in CottonGen are linked to JBrowse.
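The per-marker-type rules above can be expressed as a small lookup table plus a predicate. The sketch below assumes that identity, coverage, and the list of gap sizes have already been computed for each alignment, that the thresholds are inclusive, and that "restricted to 2 bp" means 2 bp or less; the names `GAP_RULES` and `marker_passes` are illustrative and not part of any CottonGen tool.

```python
# marker type -> (maximum gap size in bp, gap count must be below this value)
GAP_RULES = {
    "SSR": (1000, 2),
    "RFLP": (1000, 2),
    "dbSNP": (2, 2),
    "Indel": (2, 2),
}


def marker_passes(marker_type, identity, coverage, gap_sizes):
    """Return True if a marker alignment satisfies the stated filters.

    identity, coverage -- fractions between 0 and 1
    gap_sizes          -- list of gap lengths (bp) in the alignment
    """
    max_gap_bp, gap_count_limit = GAP_RULES[marker_type]
    if identity < 0.90 or coverage < 0.97:
        return False
    if len(gap_sizes) >= gap_count_limit:
        return False
    return all(gap <= max_gap_bp for gap in gap_sizes)
```

Under these rules an SSR alignment with a single 500 bp gap passes, while a dbSNP alignment with a single 3 bp gap does not.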