Transcriptome analysis of extant cotton progenitors and identification of genome-specific single nucleotide polymorphisms (GNPs) for allelic discrimination in allotetraploid cotton

Working group session: 
Breeding and Applied Genomics
Presentation type: 
oral
Authors: 
Nah, Gyoungju; Guan, Xueying; Udall, Joshua; Stelly, David; Chen, Z. Jeffrey
Presenter: 
Nah, Gyoungju
Correspondent: 
Chen, Z. Jeffrey
Abstract: 
The most widely cultivated cotton (Gossypium hirsutum L., AADD) is derived from tetraploidization between AA-genome and DD-genome species 1-2 million years ago. Although the exact progenitors are unknown, Gossypium arboreum L. (AA) or G. herbaceum L. (AA) and G. raimondii Ulbr. (DD) are thought to be extant progenitors of allotetraploid cotton. Gene expression studies in allotetraploid cotton are complicated by the homoeologous loci derived from the AA and DD progenitor species. To develop genomic resources for gene expression studies and cotton improvement, we performed RNA-sequencing (RNA-seq) analysis with 454/Roche using normalized cDNA libraries prepared from young leaves, roots, bolls, ovules, and fibers of G. arboreum L. (AA) and G. raimondii (DD). The combined sequencing analysis using 1,699,776 reads from AA and 1,464,815 reads from DD resulted in 89,588 contigs for AA and 65,542 contigs for DD. The majority of contigs (~70%) ranged from 100 to 1,000 bp. These contigs represented ~80% of all cotton genes in Cotton Gene Index 11 (CGI11), the most up-to-date public cotton EST database, which includes sequences from AA, DD, and AADD species. About 20% of the contigs were not present in CGI11. Self-BLASTN analysis reduced the number of unigene contigs by 52% in AA and 57% in DD, suggesting that 50% or more of the contigs are paralogs or isoforms within each species. Between the AA and DD EST collections, the majority of contigs (73-81%) were conserved, whereas 27% and 19% of contigs were specific to the AA and DD species, respectively. Genome-wide BLASTX analysis against a protein database that includes Arabidopsis found matches for 52% (AA) and 63% (DD) of the sequences. Gene ontology (GO) analysis of the matched contigs found over-representation of genes in the metabolic process and cellular process categories in both species.
Moreover, 2,901 miRNAs from miRBase mapped onto 1,145 contigs of AA, while 3,103 miRNAs mapped onto 1,084 contigs of DD, suggesting a good representation of miRNA precursors or targets in these EST collections. Using these ESTs, we generated a total of 34,985 GNPs and 4,822 indels from ~11,000 contigs or genes, indicating that allelic expression of these genes can potentially be distinguished. A set of 105 randomly selected contigs from the two species is being validated for allelic detection in allotetraploid cotton. This large set of AA and DD genome ESTs and GNPs will be a valuable resource for gene expression studies and crop improvement in cultivated allotetraploid cotton.
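
As an illustration of what a GNP is, the following is a minimal sketch of how substitutions and indels might be counted between one AA contig and its DD counterpart. It is not the pipeline used in this study. It assumes the two sequences have already been aligned to equal length with "-" marking gaps, and the function name and example sequences are hypothetical.

```python
def find_gnps_and_indels(aa_aln, dd_aln):
    """Compare two pre-aligned homoeologous sequences.

    Assumption: aa_aln and dd_aln are the same length and use '-' for gaps.
    Returns (gnps, indels), where gnps is a list of (position, AA base, DD base)
    and indels is a list of start positions of each contiguous gap run.
    """
    if len(aa_aln) != len(dd_aln):
        raise ValueError("aligned sequences must have equal length")

    gnps = []
    indels = []
    in_gap = False
    for pos, (a, d) in enumerate(zip(aa_aln.upper(), dd_aln.upper())):
        if a == "-" or d == "-":
            # Record only the first column of a gap run as one indel.
            if not in_gap:
                indels.append(pos)
                in_gap = True
            continue
        in_gap = False
        # Count a GNP only when both bases are unambiguous and differ.
        if a != d and a in "ACGT" and d in "ACGT":
            gnps.append((pos, a, d))
    return gnps, indels


# Hypothetical aligned fragments, for illustration only.
gnps, indels = find_gnps_and_indels("ACGT-ACGTA", "ACCTTAC--A")
print(gnps, indels)
```

A read from allotetraploid cotton that carries the AA base at a GNP position could then be assigned to the A subgenome, and one carrying the DD base to the D subgenome, which is the basis for the allelic discrimination described above.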