TAMU CottonSNP63K Array

Project Information

Project Data

Summary

High-throughput genotyping arrays provide a standardized resource for plant breeding communities and are useful for a breadth of applications, including high-density genetic mapping, genome-wide association studies (GWAS), genomic selection (GS), complex trait dissection, and studying patterns of genomic diversity among cotton cultivars and wild accessions. The CottonSNP63K, an Illumina Infinium array, contains assays for 45,104 putative intra-specific single nucleotide polymorphism (SNP) markers for use within the cultivated cotton species Gossypium hirsutum L. and 17,954 putative inter-specific SNP markers for use with crosses of other cotton species with G. hirsutum. The SNPs on the array were developed from 13 different discovery sets that represent a diverse range of G. hirsutum germplasm and five other species: G. barbadense L., G. tomentosum Nuttall ex Seemann, G. mustelinum Miers ex Watt, G. armourianum Kearney, and G. longicalyx J.B. Hutchinson & Lee. The array was validated with 1,156 samples to generate cluster positions that facilitate automated analysis of 38,822 polymorphic markers. Two high-density genetic maps containing a total of 22,829 SNPs were generated for two F₂ mapping populations, one intra-specific (~7,000 loci) and one inter-specific (~19,000 loci), between which about 3,500 SNP markers are shared. The intra-specific genetic map is the first saturated map for a cross between two G. hirsutum lines that associates into 26 linkage groups, corresponding to the number of cotton chromosomes. The linkage maps were shown to have high levels of collinearity to the JGI G. raimondii Ulbrich reference genome sequence. Together, the array and the genetic maps provide the global cotton research community with a valuable new resource.

Objectives

Objective 1: to utilize recently identified SNPs to develop a standardized large-scale SNP genotyping array for cotton;

Objective 2: to evaluate the performance and reproducibility of the array on a large set of samples;

Objective 3: to develop a cluster file that can be used to help automate genotyping for allotetraploid cotton;

Objective 4: to produce high-density linkage maps based on two biparental F₂ populations in an intraspecific cross (G. hirsutum × G. hirsutum) and an interspecific cross (G. barbadense × G. hirsutum).

Completion of these objectives will enable reliable, standardized, high-throughput SNP genotyping on large cotton populations for many diverse applications, including high-density genetic mapping, genome-wide association studies, genomic selection, complex trait dissection, and studying patterns of genomic diversity among cotton cultivars and wild accessions. Information on the array-based genotyping will be made available online through CottonGen, which serves as a community-supported database.

Participants

Name	Role	Institution
Stelly, David M.		Texas A&M University, College Station, TX, USA
Hulse-Kemp, Amanda M.		Texas A&M University, College Station, TX, USA
Van Deynze, Allen		University of California-Davis, USA
Ashrafi, Hamid		University of California-Davis, USA

Protocols

Plant materials

A total of 1,156 lines or samples of 16 different types (Table 1) were used in the development. Seeds for each line were planted in peat pellets (Jiffy, Canada) in greenhouses at Texas A&M University, CIRAD/EMBRAPA, USDA-ARS-SPARC, USDA-ARS-SRRC, or CSIRO. Plants were grown until the first true leaves were available. Young true leaves were sampled, and DNA was extracted using the Macherey-Nagel Plant Nucleo-spin kit (Pennsylvania) according to the manufacturer’s instructions. All DNA samples were quantified using PicoGreen and then diluted to 50 ng/µl. The samples included reference lines from the various SNP development efforts, duplicated DNA samples, individual plants and plant pools from the same seed source, individual plant samples from different seed sources, parent/F₁ combinations, segregating samples from wild Gossypium species, inbred cultivar lines, wild G. hirsutum lines, and two mapping populations, one intraspecific and one interspecific (Table 1). This material represented samples from most of the cross-compatible range of G. hirsutum. Mapping samples included 93 F₂ lines from a G. hirsutum cv. Phytogen 72 × G. hirsutum cv. Stoneville 474 population, designated “PS”, for intraspecific mapping and 118 F₂ lines from a G. barbadense doubled haploid line 3-79 × G. hirsutum cv. Texas Marker-1 population, designated “T3”, for interspecific mapping. Samples for the mapping populations were chosen randomly from germinated seed within each population.

Leaf tissue or high-quality DNA samples can be provided for each cotton sample for processing on the CottonSNP63K array. Samples are processed through a three-day pipeline to provide raw fluorescent image data for all 63K functional assays on the array. These raw files are processed through GenomeStudio software with the developed cluster files to allow for automated genotyping of the indicated cluster positions. The Cluster file can be found HERE. A second cluster file is being developed for enhanced automated genotype calling of SNPs associated with diploid germplasm introgressed into G. hirsutum (contact Dr. David Stelly for details).

To produce standardized genotype output that can be deposited into the CottonGen database, please use the following procedure in GenomeStudio.

This will ensure that protocols and reporting formats are identical across groups and that genotypes can be directly compared in the database. (In some cases you may wish to manually adjust cluster positions for polymorphisms that are specific to your population and were not represented in the samples used for cluster file development. If so, please alert the database curator, Dr. Jing Yu, when submitting genotype data.)

Standardized Genotype Output

To import Cotton Array Data in GenomeStudio

You will need the sample sheet you intend to use for the project, the iDAT files for the samples included in the sample sheet, the cluster file available HERE, and the SNP Manifest file from Illumina. Genotyping projects that use the developed cluster file should use a GenCall cutoff of 0.05 to allow accurate calling of polyploid-type loci that exhibit closely spaced clusters.
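The effect of a GenCall cutoff can be pictured with a minimal sketch: any genotype whose score falls below the cutoff is reported as a no-call. The input layout (pairs of genotype and score) and the function name are hypothetical, and treating a score exactly equal to the cutoff as a valid call is an assumption. GenomeStudio applies the cutoff internally, so this is only an illustration of the idea.

```python
NO_CALL = "--"  # no-call symbol as it appears in the Final Report shown on this page

def apply_gencall_cutoff(calls, cutoff=0.05):
    """Replace any genotype whose score is below `cutoff` with a no-call.

    `calls` is a list of (genotype, gencall_score) pairs; this layout is
    hypothetical and used only for illustration.
    Assumption: a score equal to the cutoff is kept as a call.
    """
    return [gt if score >= cutoff else NO_CALL for gt, score in calls]

example = [("TT", 0.82), ("AC", 0.04), ("CC", 0.05)]
print(apply_gencall_cutoff(example))  # ['TT', '--', 'CC']
```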

Step-by-step instructions for generating a New Project in GenomeStudio can be found HERE.

To export Cotton Array Data from GenomeStudio

Using the Report Wizard, select Final Report.
In the Samples window, select Include Zeroed & Intensity Only SNPs.
In the Final Report Format window, select Matrix format.
- Matrix Format Options – Forward Strand and un-select GenCall Score.
In the General Options window – use default selection of Tab-Delimited File.
In the Destination window – Type in desired Report Name.

You should get a report that is similar to this (head of file):

[Header]
GSGT Version	1.9.4
Processing Date	5/7/2015 4:34 PM
Content	IntlCottonSNPConsortium_70k_11735168_A.bpm
Num SNPs	63058
Total SNPs	63058
Num Samples	7
Total Samples	7
[Data]
	s1	s2	s3
i00002Gh	TT	TT	TT
i00003Gh	CC	CC	CC
i00004Gh	TT	TT	TT
i00005Gh	--	--	--
i00006Gh	AA	AA	AA
i00007Gh	--	--	--
i00008Gh	TT	TT	TT
i00009Gh	AC	AC	AC
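A report of this shape can be read with a few lines of Python. The sketch below assumes the file is tab-delimited, as selected in the export steps, with a `[Header]` section of key/value lines and a `[Data]` section whose first row holds the sample names. The function name `parse_final_report` is illustrative and not part of GenomeStudio.

```python
import io

def parse_final_report(handle):
    """Parse a matrix-format Final Report into (header, samples, genotypes).

    header    -> dict built from the [Header] key/value lines
    samples   -> list of sample names taken from the first [Data] row
    genotypes -> dict mapping each SNP name to its list of calls
    Assumes tab-delimited fields.
    """
    header, samples, genotypes = {}, None, {}
    section = None
    for raw in handle:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        if line.strip() in ("[Header]", "[Data]"):
            section = line.strip()
            continue
        fields = [f.strip() for f in line.split("\t")]
        if section == "[Header]":
            header[fields[0]] = fields[1] if len(fields) > 1 else ""
        elif section == "[Data]":
            if samples is None:
                samples = fields[1:]  # first data row: sample names
            else:
                genotypes[fields[0]] = fields[1:]
    return header, samples or [], genotypes

# Small in-memory example instead of a real export file.
demo = "[Header]\nNum SNPs\t2\n[Data]\n\ts1\ts2\ni00002Gh\tTT\tTT\ni00009Gh\tAC\t--\n"
header, samples, genotypes = parse_final_report(io.StringIO(demo))
print(header["Num SNPs"], samples, genotypes["i00009Gh"])
```

For a real export, the same function can be called on an open file handle, for example `with open(path) as fh: parse_final_report(fh)`, where `path` points to the report written by the Report Wizard.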

For inclusion in the CottonGen Database, two letter genotypes have been converted to single letter using IUPAC notation.
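The conversion from two-letter genotypes to single-letter IUPAC codes can be sketched as follows. The nucleotide ambiguity codes are the standard IUPAC ones. The handling of no-calls (`--` mapped to `N`) is an assumption, since this page does not state how CottonGen encodes them, and the function name is illustrative.

```python
# Standard IUPAC codes for the allele set found in a diploid-style call.
IUPAC_CODES = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
}

def to_iupac(genotype, no_call="N"):
    """Convert a two-letter genotype such as 'AC' to one IUPAC letter.

    Assumption: the no-call '--' is reported as `no_call` ('N' by default).
    """
    if genotype == "--":
        return no_call
    if len(genotype) != 2:
        raise ValueError(f"expected a two-letter genotype, got {genotype!r}")
    # Allele order does not matter, so 'AC' and 'CA' give the same code.
    return IUPAC_CODES[frozenset(genotype.upper())]

print([to_iupac(g) for g in ["TT", "CC", "--", "AC", "CA"]])
# ['T', 'C', 'N', 'M', 'M']
```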

Developed Resources

Index

Samples included for array validation and cluster file development
Datasets utilized in intraspecific content design on the CottonSNP63K array
Datasets utilized in interspecific content design on the CottonSNP63K array
SNP markers shared across five species included on the CottonSNP63K array from TAMU/UC-Davis Inter RNA-seq discovery set
Genotyped Lines

Table 1. Samples included for array validation and cluster file development (Hulse-Kemp et al. 2015)

Sample Type	No. of Samples
Inbred, G. hirsutum (cultivated)	516
Inbred, G. hirsutum (wild)	59
Intraspecific F₁	53
Intraspecific F₂	157
Intraspecific backcross	21
Intraspecific RIL	24
Inbred, G. barbadense	18
Interspecific F₂	69 (49ᵃ)
Interspecific RIL	14
Interspecific aneuploid	21ᵇ
Wild tetraploid species	4
Synthetic tetraploid	3
Diploid species	8ᵇ
Interspecific backcross	146
Interspecific F₁	20
Haploid	3ᵇ
Total	1,156

a - A total of 49 interspecific F₂ samples were not included in cluster file development but were genotyped using the resulting cluster file for inclusion in linkage mapping.
b - These samples were used in the cluster file development, but the cluster file is not suitable for scoring such samples because it is only optimized for tetraploid samples.

Table 2. Datasets utilized in intraspecific content design on the CottonSNP63K array (Hulse-Kemp et al. 2015)

Data Set Name	Authors/Reference	Lines
Brigham Young University	Byers et al. (2012)	Acala Maxxa, TX2094
CSIR-NBRI	Rai et al. (2013)	JKC703, JKC725, JKC737, JKC770, MCU-5, LRA5166
USDA set 1	Gore et al. (2014)	TM-1, NM24016
UC-Davis/TAMU GH RNA-seq	Ashrafi et al. (2015)	TM-1, FM832, Sealand 542, PD-1, Acala Maxxa
USDA set 2	Islam et al. (2014)	Acala Ultima, Pyramid, Coker 315, STV825, FM966, M-240 RNR, HS26, DP-90, SG747, PSC355, STV474
CSIRO	Zhu et al. (2014)	MCU-5, Delta Opal, Sicot 70, Siokra 1-4, DP-16, Tamcot SP37, Namcala, Riverina Poplar, LuMein 14, Sicala 3-2, Sicala 40, Sicala V-2, Sicot 81, Sicot 71, Sicot 189, Sicot F-1, Deltapine 90, Coker 315
TAMU/UC-Davis Intra Genomic Set 1	Hulse-Kemp et al. (2015)	M-240 RNR, TM-1, HS26, SG747, STV474, FM832, Sealand 542, PD-1, Coker 312, Tamcot Sphinx, TX231, Acala Maxxa
TAMU/UC-Davis Intra Genomic Set 2	Hulse-Kemp et al. (unpublished)	M-240 RNR, TM-1, HS26, SG747, STV474, FM832, Sealand 542, PD-1, Coker 312, Tamcot Sphinx, TX231, Acala Maxxa
DOW AgroSciences	DAS (unpublished)	Unreleased

A total of 50K putative single nucleotide polymorphism markers were used to produce the 45,104 intraspecific assays that remained on the array after production. DAS, DOW AgroSciences.

Table 3. Datasets utilized in interspecific content design on the CottonSNP63K array (Hulse-Kemp et al. 2015)

Data Set Name	Authors/Reference	Lines
UC-Davis Inter Genomic	Van Deynze et al. (2009)	G. barbadense (3-79), G. hirsutum (TM-1)
CIRAD	Lacape et al. (2012)	G. barbadense (VH8-4602), G. hirsutum (Guazuncho II)
TAMU/UC-Davis Inter RNA-seq	Hulse-Kemp et al. (2014)	G. barbadense (3-79), G. tomentosum, G. mustelinum, G. armourianum, G. longicalyx
TAMU/UC-Davis Inter Genomic	Hulse-Kemp et al. (2015)	G. barbadense (3-79), G. hirsutum (TM-1)

Figure 1. Overlap of SNPs among species (Hulse-Kemp et al. 2014)

SNP markers shared across five species included on the CottonSNP63K array from the TAMU/UC-Davis Inter RNA-seq discovery set. The overlap and specificity of the Class I and Class II SNPs for G. barbadense cv. 3-79, G. tomentosum, G. mustelinum, G. armourianum, and G. longicalyx.

Genotyped lines

Selected lines

A total of 395 lines were selected; a brief description of the selection criteria (AD genome, cultivar, etc.) is still needed.

Mapping populations

Interspecific: G. barbadense 3-79 × G. hirsutum TM-1 (118 F₂ lines)
Intraspecific: G. hirsutum Phytogen 72 × G. hirsutum Stoneville 474 (93 F₂ lines)

Illumina Genotyping

Genotyping with the array

DNA standardized at 50 ng/µl for each of the cotton lines described above was processed according to Illumina protocols and hybridized to the CottonSNP63K array at Texas A&M University or CSIRO. Single-base extension was performed and the chips were scanned using the Illumina iScan. Image files were saved for cluster file analysis. All image files were uploaded into a single GenomeStudio project containing 1,156 individual samples. Of the 70,000 SNPs targeted for manufacture on the array, 6,942 markers failed to meet standards for bead representation and decoding metrics during the array construction process at Illumina and were removed from the manifest, leaving 63,058 assays on the array. Data for the remaining markers were clustered using the GenomeStudio Genotyping Module (V 1.9.4, Illumina, Inc.). All markers were viewed and manually curated, taking into account the sample type and known segregation ratios, for construction of the best cluster file for genotyping tetraploid cotton (available HERE).
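One way to compare observed genotype counts with a known segregation ratio is a chi-square goodness-of-fit test. The sketch below is not the curation procedure used for the cluster file, which was manual. It assumes a codominant marker in an F₂ population with the expected 1:2:1 ratio, and the counts in the example are hypothetical.

```python
def chi_square_f2(n_aa, n_ab, n_bb):
    """Chi-square statistic for F2 genotype counts against a 1:2:1 ratio."""
    total = n_aa + n_ab + n_bb
    expected = (total * 0.25, total * 0.50, total * 0.25)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Critical value of the chi-square distribution, 2 degrees of freedom, alpha = 0.05.
CRITICAL_DF2_P05 = 5.991

# Hypothetical counts for one marker scored on 93 F2 plants.
stat = chi_square_f2(24, 46, 23)
print(round(stat, 3), "fits 1:2:1" if stat < CRITICAL_DF2_P05 else "distorted")
```

A statistic below the critical value means the counts are consistent with 1:2:1 segregation at the 5% level; larger values suggest segregation distortion or a clustering problem worth inspecting by eye.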

Illumina Infinium Platform

The Infinium platform detects SNP alleles by adding a fluorescence-labeled allele-specific nucleotide via single-base extension and subsequent detection of the fluorescent color. The Illumina Infinium chips are sophisticated silicon-based array devices.
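The cluster plots on this page use NormTheta and NormR as axes. A commonly described way to obtain such coordinates from two channel intensities is a polar-style transformation, sketched below. The intensity normalization that GenomeStudio applies beforehand is not reproduced here, and the function name and input values are illustrative assumptions.

```python
import math

def theta_r(x, y):
    """Convert two non-negative channel intensities into (theta, r).

    theta = (2 / pi) * atan2(y, x) lies between 0 and 1: near 0 when only
    the X channel has signal, near 1 when only Y does, and near 0.5 when
    both are equal. r = x + y is the combined signal intensity.
    """
    return (2.0 / math.pi) * math.atan2(y, x), x + y

# Idealized intensities for an AA homozygote, an AB heterozygote, and a BB homozygote.
for x, y in [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]:
    theta, r = theta_r(x, y)
    print(round(theta, 3), r)
```

In allotetraploid cotton the clusters are often compressed toward one side of the theta axis because a second, monomorphic locus contributes signal, which is why the cluster file matters for automated calling.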

Types of call frequency of SNP markers

Figure 2. Types of call frequency of SNP markers. NormTheta or relative amount of each of the two fluorophore signals is plotted on the X-axis, whereas NormR or signal intensity is plotted on the Y-axis. (A) Failed marker with call frequency = 0. (B) Call frequency 0.500–0.990 with major sample deviations. (C) Call frequency 0.990–0.999 with few uncalled samples. (D) Call frequency = 1 with all called samples. (E) Distribution of call frequencies for all SNP markers on the array.
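Call frequency is the fraction of samples that received a genotype call for a marker. A minimal sketch of computing it from one row of a genotype matrix and placing it in the panel categories above follows. The function names are illustrative, and the treatment of boundary values (0.990 assigned to C, values between 0.999 and 1 also assigned to C, non-zero values below 0.500 left unclassified) is an assumption, because the caption does not define them.

```python
def call_frequency(calls, no_call="--"):
    """Fraction of samples with a genotype call for one marker."""
    return sum(1 for c in calls if c != no_call) / len(calls)

def call_frequency_class(freq):
    """Map a call frequency to the panel letters used in the caption.

    Boundary handling is an assumption; see the note above.
    """
    if freq == 0:
        return "A: failed marker"
    if freq == 1:
        return "D: all samples called"
    if freq >= 0.990:
        return "C: few uncalled samples"
    if freq >= 0.500:
        return "B: major sample deviations"
    return "unclassified"

row = ["TT", "TT", "--", "TT"]  # hypothetical calls for one marker
freq = call_frequency(row)
print(freq, call_frequency_class(freq))  # 0.75 B: major sample deviations
```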

Classification of scorable SNP markers according to Illumina GenTrain score

Figure 3. Classification of scorable SNP markers according to Illumina GenTrain score. NormTheta or relative amount of each of the two fluorophore signals is plotted on the X-axis, whereas NormR or signal intensity is plotted on the Y-axis. (A) Monomorphic marker. (B) Intergenomic or homeo-SNP marker. (C–F) Classification of polymorphic markers based on Illumina GenTrain score. (C) Genome-specific marker representing a single polymorphic locus with GenTrain score >0.60. (D) Marker with GenTrain score 0.30–0.59 on half the plot representing two genomes, one monomorphic and one polymorphic locus. (E) Marker with GenTrain score 0.21–0.29 representing multiple monomorphic loci and one polymorphic locus. (F) Marker with GenTrain score less than 0.20 representing many monomorphic loci and one polymorphic locus. (G) Distribution of cluster types in polymorphic markers based on GenTrain score.
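The GenTrain score ranges in the caption can be applied programmatically to polymorphic markers. The sketch below is only an illustration: the function name is hypothetical, and the handling of scores that fall between the stated ranges (for example exactly 0.60 or 0.20) is an assumption, since the caption leaves them undefined. Panels A and B (monomorphic and homeo-SNP markers) are not determined by the score alone and are not covered.

```python
def gentrain_class(score):
    """Assign a polymorphic marker to panel C, D, E, or F by GenTrain score.

    Ranges from the caption: C > 0.60, D 0.30-0.59, E 0.21-0.29, F < 0.20.
    Assumption: gaps between ranges are closed by using >= on the lower bound.
    """
    if score >= 0.60:
        return "C"  # single polymorphic locus, genome-specific
    if score >= 0.30:
        return "D"  # one monomorphic and one polymorphic locus
    if score >= 0.20:
        return "E"  # multiple monomorphic loci and one polymorphic locus
    return "F"      # many monomorphic loci and one polymorphic locus

print([gentrain_class(s) for s in (0.75, 0.45, 0.25, 0.10)])
# ['C', 'D', 'E', 'F']
```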

Publications

Ashrafi H, Hulse-Kemp AM, Wang F, Yang SS, Guan X, Jones DC, Matvienko M, Mockaitis K, Chen ZJ, Stelly DM, Van Deynze A. A Long-Read Transcriptome Assembly of Cotton (Gossypium hirsutum L.) and Intraspecific Single Nucleotide Polymorphism Discovery. The Plant Genome. 2015 Jul 1;8(2):1-14.

Hulse-Kemp AM, Lemm J, Plieske J, Ashrafi H, Buyyarapu R, Fang DD, Frelichowski J, Giband M, Hague S, Hinze LL, Kochan KJ, Riggs PK, Scheffler JA, Udall JA, Ulloa M, Wang SS, Zhu QH, Bag SK, Bhardwaj A, Burke JJ, Byers RL, Claverie M, Gore MA, Harker DB, Islam MS, Jenkins JN, Jones DC, Lacape JM, Llewellyn DJ, Percy RG, Pepper AE, Poland JA, Mohan Rai K, Sawant SV, Kumar Singh S, Spriggs A, Taylor JM, Wang F, Yourstone SM, Zheng X, Lawley CT, Ganal MW, Van Deynze A, Wilson IW, Stelly DM. Development of a 63K SNP Array for Cotton and High-Density Mapping of Intra- and Inter-Specific Populations of Gossypium spp. G3 (Bethesda, Md.). 2015 Apr 22.

Hulse-Kemp AM, Ashrafi H, Stoffel K, Zheng X, Saski C, Scheffler BE, Fang DD, Chen ZJ, Van Deynze A, Stelly DM. BAC-End Sequence-Based SNP Mining in Allotetraploid Cotton (Gossypium) Utilizing Resequencing Data, Phylogenetic Inferences and Perspectives for Genetic Mapping. G3 (Bethesda, Md.). 2015 Apr 9.

Hulse-Kemp AM, Ashrafi H, Zheng X, Wang F, Hoegenauer KA, Maeda AB, Yang SS, Stoffel K, Matvienko M, Clemons K, Udall JA, Van Deynze A, Jones DC, Stelly DM. Development and bin mapping of gene-associated interspecific SNPs for cotton (Gossypium hirsutum L.) introgression breeding efforts. BMC genomics. 2014 Oct 30; 15(1):945.

Publications using the CottonSNP63K array

Hinze LL, Hulse-Kemp AM, Wilson IW, Zhu QH, Llewellyn DJ, Taylor JM, Spriggs A, Fang DD, Ulloa M, Burke JJ, Giband M, Lacape JM, Van Deynze A, Udall JA, Scheffler JA, Hague S, Wendel JF, Pepper AE, Frelichowski JF, Lawley CT, Jones DC, Percy RG, Stelly DM. Diversity analysis of cotton (Gossypium hirsutum L.) germplasm using the CottonSNP63K Array. BMC Plant Biology. 2017 Feb 3;17(1):37.

Ulloa M, Hulse-Kemp AM, De Santiago LM, Stelly DM, Burke JJ. Insights Into Upland Cotton (Gossypium hirsutum L.) Genetic Recombination Based on 3 High-Density Single-Nucleotide Polymorphism and a Consensus Map Developed Independently With Common Parents. Genomics Insights. 2017, Vol. 10, p1-15.

Huang C, Nie X, Shen C, You C, Li W, Zhao W, Zhang X, Lin Z. Population structure and genetic basis of the agronomic traits of upland cotton in China revealed by a genome-wide association study using high-density SNPs. Plant Biotechnol J. 2017 Nov;15(11):1374-1386.

Sun Z, Wang X, Liu Z, Gu Q, Zhang Y, Li Z, Ke H, Yang J, Wu J, Wu L, Zhang G, Zhang C, Ma Z. Genome-wide association study discovered genetic variation and candidate genes of fibre quality traits in Gossypium hirsutum L. Plant Biotechnol J. 2017 Aug;15(8):982-996.
