Gossypium raimondii (D5) 'D5-4' genome NSF_v1

Overview


Analysis Name	Gossypium raimondii (D5) 'D5-4' genome NSF_v1
Method	PacBio Sequel (Canu v. 1.6)
Source	D5.v1.pred.fa (=raimondii.juiced_V1.0.fasta)
Date performed	2019-07-03

Material and Methods

DNA was extracted from G. raimondii (accession D5-4) plants using CTAB techniques (Kidwell and Osborn 1992). DNA concentration was measured with a Qubit Fluorometer (ThermoFisher, Inc.). The sequencing library was constructed according to PacBio recommendations at the BYU DNA Sequencing Center (DNASC). Fragments >18 kb were selected for sequencing via BluePippin (Sage Science, LLC). Prior to sequencing, the size distribution of fragments in the libraries was evaluated using a Fragment Analyzer (Advanced Analytical Technologies, Inc). Eight and eleven PacBio cells were sequenced from a single library each for G. raimondii and G. turneri, respectively, on the Pacific Biosciences Sequel system. For both genomes, the raw PacBio sequencing reads were assembled using Canu v1.6 with default parameters (Koren et al. 2017).

HiC libraries were constructed from G. raimondii leaf tissue at Northeast Normal University, China. Sequencing was performed at Annoroad Gene Technology Co., Ltd (Beijing, China). The HiC data of G. raimondii were mapped to the previous genome sequence of G. raimondii using HiC-Pro (Servant et al. 2015), and to the newly assembled Canu contigs of G. raimondii PacBio reads by PhaseGenomics. The HiC interactions were used as evidence for contig proximity and for scaffolding contig sequences. An initial draft genome sequence of pseudochromosomes (PGA assembly) was created using a custom Python script from PhaseGenomics.
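
The idea of using HiC interactions as evidence for contig proximity can be illustrated with a minimal Python sketch. It is not the PhaseGenomics method; it only counts read pairs whose two mates map to different contigs, on the assumption that contig pairs sharing many links are more likely to be neighbors. The function name and the input format (an iterable of contig-name pairs) are hypothetical.

```python
from collections import Counter


def count_contig_links(read_pairs):
    """Count HiC read pairs whose two mates map to different contigs.

    read_pairs: iterable of (contig_of_mate1, contig_of_mate2) tuples
    (a hypothetical, simplified input format).
    Returns a Counter keyed by an order-independent contig pair.
    """
    links = Counter()
    for contig_a, contig_b in read_pairs:
        if contig_a != contig_b:
            # Sort the pair so (A, B) and (B, A) are counted together.
            links[tuple(sorted((contig_a, contig_b)))] += 1
    return links
```

Calling `count_contig_links(pairs).most_common(5)` would list the five contig pairs with the most supporting links.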

DNA was also extracted from young G. raimondii leaves following the Bionano Plant protocol for high-molecular-weight DNA. DNA was purified, nicked, labeled, and repaired according to Bionano standard operating procedures for the Irys platform. Two optical maps of different enzymes (BspQI and BssSI) were assembled using the IrysSolve pipeline on the BYU Fulton SuperComputing cluster. The optical maps were combined into a two-enzyme composite optical map, which was aligned to the PGA assembly using an in silico labeled reference sequence. Conflicts between the Bionano maps and the PGA assembly were manually identified in the Bionano Access software by comparing the mapped Bionano contigs to the Canu contigs along the draft genome sequence. Conflicts between datasets were resolved by repositioning and reorienting Canu contigs in PGA ordering files followed by reconstruction of the fasta sequence, provided there was supporting or no-conflict evidence from the optical map (Durand et al. 2016). Multiple iterations of mapping, conflict resolution, and draft sequence construction resulted in the final, new genome sequence of G. raimondii.
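
In silico labeling of a reference sequence amounts to locating each enzyme's recognition motif on both strands. The Python sketch below illustrates that step only; it is not the Bionano software. The recognition motifs (GCTCTTC for BspQI, CACGAG for BssSI) are assumptions taken from commonly cited enzyme specifications and should be verified before use, and the function names are hypothetical.

```python
# Assumed recognition motifs (verify against the enzyme documentation).
MOTIFS = {"BspQI": "GCTCTTC", "BssSI": "CACGAG"}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def label_positions(seq: str, motif: str) -> list[int]:
    """Return sorted 0-based start positions of the motif on either strand."""
    seq = seq.upper()
    hits = set()
    # Search the forward motif and its reverse complement on the same string.
    for pattern in {motif, reverse_complement(motif)}:
        start = seq.find(pattern)
        while start != -1:
            hits.add(start)
            start = seq.find(pattern, start + 1)
    return sorted(hits)
```

Running `label_positions` once per enzyme over each pseudochromosome would give two label maps comparable to a two-enzyme optical map.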

About the assembly

The G. raimondii genome was assembled from 43.7x PacBio coverage of raw sequence reads. The assembly consisted of 187 contigs with an N50 of 6.3 Mb (Table 1). The contigs were scaffolded using HiC by PhaseGenomics and the pseudomolecules were manually adjusted using JuiceBox (Durand et al. 2016). The final scaffolded assembly was independently verified using a composite optical map of two different enzymes. A comparison of assembly metrics between the previous genome sequence and our new genome sequence of G. raimondii illustrates a 45x improvement in contig length and a 97x reduction in the number of gaps. The cumulative gap length of the new assembly (17.6 kb) was reduced by 647x compared to the assembled gaps of the previous genome sequence (11,391 kb). The final genome assembly size was 14.9 Mb smaller than the previous assembly, representing 98% of previously assembled genome sequence in length.
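
For readers unfamiliar with the N50 and N90 metrics reported here, the following Python sketch shows the usual definition: sort the contig lengths from longest to shortest and report the length at which the running total first reaches the chosen fraction of all bases. The function name `nx` is hypothetical, and this is not necessarily the script used to produce Table 1.

```python
def nx(lengths, fraction=0.5):
    """Return the Nx statistic for a collection of sequence lengths.

    fraction=0.5 gives N50, fraction=0.9 gives N90.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # The first length at which the cumulative sum reaches the target.
        if running >= fraction * total:
            return length
    raise ValueError("lengths must be non-empty")
```

Given a list of contig lengths, `nx(lengths)` returns the N50 and `nx(lengths, 0.9)` the N90.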

Table 1. Assembly metrics of the G. turneri genome, the current G. raimondii assembly (D5), and the previous G. raimondii assembly (Paterson et al. 2012).

Summary of Improved Assembly	G. turneri (D10-NSF)	G. raimondii (D5-NSF current)	G. raimondii (D5-JGI 2012)
Contigs	220	187	16,924
Max Contig	23,475,487	24,216,129	1,162,971
Mean Contig	3,432,648	3,929,767	43,597
Contig N50	7,909,293	6,291,832	136,998
Contig N90	1,624,019	2,044,991	32,166
Total Contig Length	755,182,540	734,866,495	737,837,083
Assembly GC	33.21	33.19	33.19
Max Scaffold	67,704,245	65,701,939	70,713,020
Mean Scaffold	58,092,557	56,529,546	57,632,930
Scaffold N50	60,464,062	58,819,159	62,175,169
Scaffold N90	50,570,303	46,322,098	45,765,648
Total Scaffold Length	755,203,240	734,884,094	749,228,090
Captured Gaps	207	174	16,911
Max Gap	100	200	63,138
Mean Gap	100	101	674
Gap N50	100	100	2,607
Total Gap Length	20,700	17,599	11,391,007
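
The fold changes quoted in the text can be recomputed from Table 1. The short Python sketch below does so for the two G. raimondii columns. Treating contig N50 as the metric behind the "45x improvement in contig length" is an assumption, and the variable names are illustrative.

```python
# Values copied from Table 1 (D5-NSF current vs. D5-JGI 2012).
current = {
    "contig_n50": 6_291_832,
    "gaps": 174,
    "gap_bp": 17_599,
    "scaffold_bp": 734_884_094,
}
previous = {
    "contig_n50": 136_998,
    "gaps": 16_911,
    "gap_bp": 11_391_007,
    "scaffold_bp": 749_228_090,
}

# Fold improvement in contig N50 (assumed basis of the contig-length claim).
n50_fold = current["contig_n50"] / previous["contig_n50"]
# Fold reduction in gap count and cumulative gap length.
gap_count_fold = previous["gaps"] / current["gaps"]
gap_length_fold = previous["gap_bp"] / current["gap_bp"]
# Fraction of the previous total scaffold length represented now.
size_ratio = current["scaffold_bp"] / previous["scaffold_bp"]

print(f"contig N50 fold change: {n50_fold:.1f}")
print(f"gap count fold reduction: {gap_count_fold:.1f}")
print(f"gap length fold reduction: {gap_length_fold:.1f}")
print(f"scaffold length ratio: {size_ratio:.3f}")
```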

Publication

Udall, et al., 2019. De novo genome sequence assemblies of Gossypium raimondii and G. turneri.

Assembly

The de novo chromosome sequences for the G. raimondii 'D5-4' genome. These files belong to the NSF-D5 (also called BYU-D5) Assembly v1.0.

Chromosomes (FASTA format)	G.raimondii_NSF-D5_assembly_v1.fasta.gz
Chromosomes (GFF3 format)	G.raimondii_NSF-D5_assembly_v1.gff3
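
After downloading the chromosome FASTA file, a quick sanity check is to compare its total length and GC content against Table 1. The Python sketch below is one way to do this with the standard library only. The function name is hypothetical, and GC is computed over all characters including any N bases, which may differ slightly from how the table value was derived.

```python
import gzip


def fasta_stats(path):
    """Return (total_length, gc_percent) for a gzipped FASTA file."""
    total = 0
    gc = 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                continue  # skip header lines
            seq = line.strip().upper()
            total += len(seq)
            gc += seq.count("G") + seq.count("C")
    gc_percent = 100.0 * gc / total if total else 0.0
    return total, gc_percent
```

For example, `fasta_stats("G.raimondii_NSF-D5_assembly_v1.fasta.gz")` would return the total sequence length and GC percentage of the downloaded file.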

Downloads

All annotation files are available for download by selecting the desired data type in the left-hand side bar. Each data type page provides a description of the available files and links to download them.

Functional Analysis

Functional annotation files for the Gossypium raimondii NSF Genome v1.0 are available for download below. The Gossypium raimondii NSF Genome proteins were analyzed using InterProScan in order to assign InterPro domains and Gene Ontology (GO) terms. Pathway analysis was performed using the KEGG Automatic Annotation Server (KAAS).

Downloads

GO assignments from InterProScan	D5_NSF-v1.0_genes2GO.xlsx.gz
IPR assignments from InterProScan	D5_NSF-v1.0_genes2IPR.xlsx.gz
Proteins mapped to KEGG Orthologs	D5_NSF-v1.0_KEGG-orthologis.xlsx.gz
Proteins mapped to KEGG Pathways	D5_NSF-v1.0_KEGG-pathways.xlsx.gz

Genes

The predicted gene models, their alignments, and proteins for the G. raimondii 'D5-4' genome. These files belong to the NSF-D5 Assembly v1 & Annotation a1.

Transcript sequences (FASTA format)	G.raimondii_NSF-D5_v1_a1_transcripts.gz
Predicted gene models with exons (GFF3 format)	G.raimondii_NSF-D5_v1_a1_predicted_genes.gff3.gz
Protein sequences (FASTA format)	G.raimondii_NSF-D5_v1_a1_proteins.gz
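
The gene model file is in GFF3 format, which stores one feature per line in nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes). The Python sketch below parses such lines and tallies feature types. It is a minimal reader under those assumptions, not a full GFF3 validator, and the function names are hypothetical.

```python
import gzip
from collections import Counter
from urllib.parse import unquote


def parse_gff3_line(line):
    """Parse one GFF3 feature line; return None for comments and blank lines."""
    if not line.strip() or line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}")
    seqid, _source, ftype, start, end, _score, strand, _phase, attrs = cols
    attributes = {}
    for pair in attrs.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            # GFF3 attribute values are percent-encoded.
            attributes[key.strip()] = unquote(value.strip())
    return {
        "seqid": seqid,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }


def count_feature_types(path):
    """Count features by type in a plain or gzipped GFF3 file."""
    opener = gzip.open if path.endswith(".gz") else open
    counts = Counter()
    with opener(path, "rt") as handle:
        for line in handle:
            feature = parse_gff3_line(line)
            if feature:
                counts[feature["type"]] += 1
    return counts
```

Calling `count_feature_types("G.raimondii_NSF-D5_v1_a1_predicted_genes.gff3.gz")` would report how many gene, mRNA, exon, and other features the file contains.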

Homology

Homology of the Gossypium raimondii NSF Genome v1.0 proteins was determined by pairwise sequence comparison using the blastp algorithm against various protein databases. An expectation value cutoff of less than 1e-9 was used for the NCBI nr (Release 2018-05) database and 1e-6 for the Arabidopsis proteins (Araport11), UniProtKB/SwissProt (Release 2019-01), and UniProtKB/TrEMBL (Release 2019-01) databases. The best hit reports are available for download in Excel format.
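
The cutoff and best-hit selection described above can be sketched in Python as follows. The sketch assumes blastp results in the standard 12-column tabular layout (BLAST+ `-outfmt 6`), where the e-value is the 11th column. That input format and the function name are assumptions, since the page does not state how the reports were generated.

```python
import csv


def best_hits(tabular_lines, max_evalue):
    """Keep the lowest e-value hit per query among hits below max_evalue.

    tabular_lines: iterable of tab-separated BLAST result lines
    (assumed 12-column tabular output; e-value in column 11).
    Returns {query_id: (subject_id, evalue)}.
    """
    best = {}
    for row in csv.reader(tabular_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        query, subject, evalue = row[0], row[1], float(row[10])
        if evalue >= max_evalue:
            continue  # fails the expectation value cutoff
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best
```

With an open results file, `best_hits(handle, 1e-9)` would mirror the nr cutoff and `best_hits(handle, 1e-6)` the cutoff used for the other databases.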

Protein Homologs

G. raimondii BYU Genome v1.0 proteins with NCBI nr homologs (EXCEL file)	D5_NSF_vs_nr.xlsx.gz
G. raimondii BYU Genome v1.0 proteins with NCBI nr (FASTA file)	D5_NSF_vs_nr_hit.fasta.gz
G. raimondii BYU Genome v1.0 proteins without NCBI nr (FASTA file)	D5_NSF_vs_nr_noHit.fasta.gz
G. raimondii BYU Genome v1.0 proteins with Arabidopsis (Araport11) homologs (EXCEL file)	D5_NSF_vs_tair.xlsx.gz
G. raimondii BYU Genome v1.0 proteins with Arabidopsis (Araport11) (FASTA file)	D5_NSF_vs_tair_hit.fasta.gz
G. raimondii BYU Genome v1.0 proteins without Arabidopsis (Araport11) (FASTA file)	D5_NSF_vs_tair_noHit.fasta.gz
G. raimondii BYU Genome v1.0 proteins with SwissProt homologs (EXCEL file)	D5_NSF_vs_swissprot.xlsx.gz
G. raimondii NSF Genome v1.0 proteins with SwissProt (FASTA file)	D5_NSF_vs_swissprot_hit.fasta.gz
G. raimondii BYU Genome v1.0 proteins without SwissProt (FASTA file)	D5_NSF_vs_swissprot_noHit.fasta.gz
G. raimondii BYU Genome v1.0 proteins with TrEMBL homologs (EXCEL file)	D5_NSF_vs_trembl.xlsx.gz
G. raimondii BYU Genome v1.0 proteins with TrEMBL (FASTA file)	D5_NSF_vs_trembl_hit.fasta.gz
G. raimondii BYU Genome v1.0 proteins without TrEMBL (FASTA file)	D5_NSF_vs_trembl_noHit.fasta.gz

Markers

Marker alignments were performed by the CottonGen Team of Main Bioinformatics Lab at WSU. The alignment tool 'BLAT' was used to map marker sequences from CottonGen to the Gossypium raimondii genome assembly. Markers were required to align at 90% identity over 97% of their length. For SSRs and RFLPs, gap size was restricted to 1000 bp or less, with fewer than 2 gaps. For dbSNPs and InDels, gap size was restricted to 2 bp, with fewer than 2 gaps. The available files are in GFF3 format. Markers available in CottonGen are linked to JBrowse.

CottonGen SNP markers mapped to genome	G.raimondii_D5_NSF-v1.0_SNP
CottonGen InDel markers mapped to genome	G.raimondii_D5_NSF-v1.0_InDel
CottonGen RFLP markers mapped to genome	G.raimondii_D5_NSF-v1.0_RFLP
CottonGen SSR markers mapped to genome	G.raimondii_D5_NSF-v1.0_SSR
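
The marker filtering criteria above can be written as a small Python predicate. This is a sketch of one reading of those criteria, not the CottonGen pipeline: the record fields are hypothetical, identity is taken as matches divided by aligned length, and "2bp" is interpreted as a maximum gap size of 2 bp.

```python
from dataclasses import dataclass

# Thresholds taken from the text; their exact interpretation is an assumption.
MIN_IDENTITY = 0.90
MIN_COVERAGE = 0.97
MAX_GAP_BP = {"SSR": 1000, "RFLP": 1000, "SNP": 2, "InDel": 2}
MAX_GAPS = 1  # "fewer than 2 gaps"


@dataclass
class MarkerAlignment:
    """Hypothetical summary of one marker-to-genome alignment."""
    marker_type: str      # "SSR", "RFLP", "SNP", or "InDel"
    marker_length: int    # full marker sequence length (bp)
    aligned_length: int   # marker bases covered by the alignment
    matches: int          # identical bases within the alignment
    gap_sizes: tuple = () # size in bp of each gap in the alignment


def passes_filter(aln: MarkerAlignment) -> bool:
    """Return True if the alignment meets identity, coverage, and gap limits."""
    if aln.marker_length <= 0 or aln.aligned_length <= 0:
        return False
    coverage = aln.aligned_length / aln.marker_length
    identity = aln.matches / aln.aligned_length
    gaps_ok = len(aln.gap_sizes) <= MAX_GAPS and all(
        gap <= MAX_GAP_BP[aln.marker_type] for gap in aln.gap_sizes
    )
    return coverage >= MIN_COVERAGE and identity >= MIN_IDENTITY and gaps_ok
```

The same alignment can pass as an SSR and fail as a SNP, because only the permitted gap size differs between marker classes.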

Transcript Alignments

Transcript alignments were performed by the CottonGen Team of Main Bioinformatics Lab at WSU. The alignment tool 'BLAT' was used to map transcripts to the G. raimondii genome assembly. Alignments with an alignment length of 97% and 90% identity were preserved. The available files are in GFF3 format.

G. arboreum CottonGen RefTrans v1	G.raimondii_NSF-v1.0_g.arboreum_cottongen_reftransV1
G. barbadense CottonGen RefTrans v1	G.raimondii_NSF-v1.0_g.barbadense_cottongen_reftransV1
G. hirsutum CottonGen RefTrans v1	G.raimondii_NSF-v1.0_g.hirsutum_cottongen_reftransV1
G. raimondii CottonGen RefTrans v1	G.raimondii_NSF-v1.0_g.raimondii_cottongen_reftransV1

Links

BLAST

View Alignment in JBrowse

View synteny
