Background: High-throughput genotyping platforms play important roles in plant genomic studies. Cotton (Gossypium spp.) is an important natural textile fiber and oil crop worldwide. Upland cotton accounts for more than 90% of the world’s cotton production; however, modern upland cotton cultivars have narrow genetic diversity. The large amount of released genome sequencing and re-sequencing data makes it possible to develop a high-quality single nucleotide polymorphism (SNP) array for intraspecific genotyping in cotton.
Results: Here we report the high-throughput CottonSNP80K array and its use in genotyping different cotton accessions. A total of 82,259 SNP markers were selected from the re-sequencing data of 100 cotton cultivars and used to produce the array on the Illumina Infinium platform. In total, 77,774 SNP loci (94.55%) were successfully synthesized on the array; of these, 77,252 (99.33%) had call rates of >95% in 352 cotton accessions and 59,502 (76.51%) were polymorphic. Application tests using 22 cotton accessions with parent/F1 combinations or with similar genetic backgrounds showed that the CottonSNP80K array had high genotyping accuracy, good repeatability, and wide applicability. Phylogenetic analysis of 312 cotton cultivars and landraces with wide geographical distribution showed that they could be classified into ten groups, irrespective of their origins. We found that the different landraces were clustered in different subgroups, indicating that these landraces were major contributors to the development of different breeding populations of modern G. hirsutum cultivars in China. We used a total of 54,588 SNPs (MAF > 0.05) in 288 G. hirsutum accessions for genome-wide association studies (GWAS) of 10 salt stress traits, and detected eight significant SNPs associated with three salt stress traits.
Conclusions: We developed the CottonSNP80K array, which has high polymorphism for distinguishing upland cotton accessions. Diverse application tests indicated that the CottonSNP80K array can play important roles in germplasm genotyping, variety verification, functional genomics studies, and molecular breeding in cotton.
Compared with the SNP loci in the CottonSNP63K array, which were collected from 13 different discovery sets of G. hirsutum germplasm and five other species, the SNP loci in the CottonSNP80K array benefited from the whole-genome sequencing of G. hirsutum acc. TM-1 and from 1,372,195 intraspecific non-unique SNPs identified by re-sequencing of G. hirsutum accessions; the selected SNPs in CottonSNP80K could therefore be distributed along the entire genome. Secondly, the CottonSNP63K array contains 63,058 markers, including 45,104 intraspecific SNPs and 17,954 interspecific SNPs, whereas the CottonSNP80K array increased the total number of markers to 77,774. Because MAFs > 0.1 were required when analyzing the re-sequencing data of different cotton accessions, the SNPs in CottonSNP80K showed five to six times the upland cotton intraspecific polymorphism of those in CottonSNP63K. In recent reports using the CottonSNP63K array, Huang et al. (2017) detected 11,975 quantified polymorphic SNPs in a diverse and nationwide population containing 503 G. hirsutum accessions, and Sun et al. (2017) detected 10,511 polymorphic SNPs using 719 diverse accessions of upland cotton. In the present study, the number of polymorphic markers for upland cotton intraspecific genotyping was increased to 59,502 using the CottonSNP80K array. Thirdly, in contrast to the CottonSNP63K array, each SNP marker in the CottonSNP80K array is addressable, which avoids interference from homeologous/paralogous genes. During the development of the CottonSNP80K array, we also considered factors affecting array quality, including flanking sequence information, Illumina design scores, heterozygosity rates, and cluster results, which ensures that the array is of high quality for upland cotton genotyping.
Protocols or Array Development
To develop the genome-wide CottonSNP80K chip, we used an Illumina Infinium array together with intraspecific SNP data from sequencing of the allotetraploid cotton G. hirsutum acc. TM-1 and re-sequencing of 100 different G. hirsutum cultivars at 5× coverage on average. In total, 1,372,195 putative intraspecific SNPs with MAF > 0.1 were detected and chosen as candidates for inclusion on the array. When designing the array, subsequent filtering steps included the following: (1) genotype accuracy was required to be >99.12%; (2) SNPs in repeat regions were filtered out; (3) no other SNPs or InDels were permitted in the 50 bp flanking the SNP site; (4) heterozygosity rates were required to be <15%; (5) SNP cluster analysis was carried out. After these filters were applied, 175,192 SNPs remained and were submitted to the Illumina Assay Design Tool to determine array design scores for each marker. SNPs in gene regions with Illumina design scores >0.7 and SNPs in intergenic regions with Illumina design scores >0.9 were retained. Further, the distance between adjacent markers was required to be >2100 bp. The remaining 82,259 SNP markers were used for the manufacture of the CottonSNP80K array by Illumina (Additional file 6: Table S5). The scheme of CottonSNP80K development is shown below:
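In addition to the scheme, the threshold-based filters described in this section can be illustrated with a short Python sketch. It is not the pipeline used in this study: the record layout (`CandidateSNP` and its field names), the helper functions, and the example values are hypothetical, and the cluster analysis of step (5) is not modeled. The greedy left-to-right thinning used for the >2100 bp spacing rule is likewise an assumption, since the text does not state how the spacing was enforced.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class CandidateSNP:
    """Hypothetical record for one candidate SNP (field names are illustrative)."""
    chrom: str
    pos: int
    genotype_accuracy: float   # fraction, e.g. 0.995
    in_repeat: bool            # True if the SNP lies in a repeat region
    flank_variants_50bp: int   # other SNPs/InDels within the 50 bp flanks
    het_rate: float            # heterozygosity rate as a fraction
    design_score: float        # Illumina design score
    in_gene: bool              # True for genic, False for intergenic


def passes_site_filters(s: CandidateSNP) -> bool:
    """Filters (1) to (4) from the text; cluster analysis (5) is not modeled."""
    return (
        s.genotype_accuracy > 0.9912
        and not s.in_repeat
        and s.flank_variants_50bp == 0
        and s.het_rate < 0.15
    )


def passes_design_score(s: CandidateSNP) -> bool:
    """Design score must exceed 0.7 in gene regions and 0.9 in intergenic regions."""
    threshold = 0.7 if s.in_gene else 0.9
    return s.design_score > threshold


def thin_by_distance(snps: Iterable[CandidateSNP], min_gap: int = 2100) -> List[CandidateSNP]:
    """Greedy thinning (an assumption): keep a SNP only if it lies more than
    min_gap bp beyond the previously kept SNP on the same chromosome."""
    kept: List[CandidateSNP] = []
    last_pos: Dict[str, int] = {}
    for s in sorted(snps, key=lambda x: (x.chrom, x.pos)):
        prev = last_pos.get(s.chrom)
        if prev is None or s.pos - prev > min_gap:
            kept.append(s)
            last_pos[s.chrom] = s.pos
    return kept


def select_snps(candidates: Iterable[CandidateSNP]) -> List[CandidateSNP]:
    """Apply the site filters, then the design-score filter, then the spacing rule."""
    passing = [s for s in candidates if passes_site_filters(s) and passes_design_score(s)]
    return thin_by_distance(passing)


# Made-up example records, for illustration only.
EXAMPLE = [
    CandidateSNP("A01", 1000, 0.999, False, 0, 0.02, 0.80, True),
    CandidateSNP("A01", 2500, 0.999, False, 0, 0.02, 0.80, True),   # too close to 1000
    CandidateSNP("A01", 4000, 0.999, False, 0, 0.02, 0.80, True),
    CandidateSNP("A01", 5000, 0.999, False, 0, 0.02, 0.85, False),  # intergenic, score <= 0.9
    CandidateSNP("A01", 6000, 0.999, False, 0, 0.20, 0.75, True),   # heterozygosity >= 15%
    CandidateSNP("D01", 1500, 0.999, False, 0, 0.02, 0.95, False),
]

if __name__ == "__main__":
    for snp in select_snps(EXAMPLE):
        print(snp.chrom, snp.pos)
```

With the made-up records above, the sketch keeps the markers at A01:1000, A01:4000, and D01:1500. A different spacing strategy, such as preferring the marker with the highest design score within each window, would select a different subset.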