TAMU CottonSNP63K Array
High-throughput genotyping arrays provide plant breeding communities with a standardized resource that is useful for a breadth of applications, including high-density genetic mapping, genome-wide association studies (GWAS), genomic selection (GS), complex trait dissection, and studying patterns of genomic diversity among cotton cultivars and wild accessions. The CottonSNP63K, an Illumina Infinium array, contains assays for 45,104 putative intra-specific single nucleotide polymorphism (SNP) markers for use within the cultivated cotton species Gossypium hirsutum L. and 17,954 putative inter-specific SNP markers for use with crosses of other cotton species with G. hirsutum. The SNPs on the array were developed from 13 different discovery sets that represent a diverse range of G. hirsutum germplasm and five other species: G. barbadense L., G. tomentosum Nuttal ex Seemann, G. mustelinum Miers ex Watt, G. armourianum Kearny, and G. longicalyx J.B. Hutchinson & Lee. The array was validated with 1,156 samples to generate cluster positions that facilitate automated analysis of 38,822 polymorphic markers. Two high-density genetic maps containing a total of 22,829 SNPs were generated for two F2 mapping populations, one intra-specific (~7,000 loci) and one inter-specific (~19,000 loci), which share about 3,500 SNP markers. The intra-specific genetic map is the first saturated map for a cross between two G. hirsutum lines that associates into 26 linkage groups, corresponding to the number of cotton chromosomes. The linkage maps were shown to have high levels of collinearity with the JGI G. raimondii Ulbrich reference genome sequence. Together, the array and these maps provide the global cotton research community with a valuable new resource.
Objective 1: to utilize recently identified SNPs to develop a standardized large-scale SNP genotyping array for cotton;
Objective 2: to evaluate the performance and reproducibility of the array on a large set of samples;
Objective 3: to develop a cluster file that can be used to help automate genotyping for allotetraploid cotton;
Objective 4: to produce high-density linkage maps based on two biparental F2 populations in an intraspecific cross (G. hirsutum × G. hirsutum) and an interspecific cross (G. barbadense × G. hirsutum).
Completion of these objectives will enable reliable, standardized high-throughput SNP genotyping on large cotton populations for many diverse applications, including high-density genetic mapping, genome-wide association studies, genomic selection, complex trait dissection, and studying patterns of genomic diversity among cotton cultivars and wild accessions. Information on array-based genotyping will be made available online through CottonGen, which will serve as a community-supported database.
A total of 1,156 lines or samples of 16 different types (Table 1) were used in the development of the array. Seeds for each line were planted in peat pellets (Jiffy, Canada) in greenhouses at Texas A&M University, CIRAD/EMBRAPA, USDA-ARS-SPARC, USDA-ARS-SRRC, or CSIRO. Plants were grown until the first true leaves were available. Young true leaves were sampled, and DNA was extracted using the Macherey-Nagel Plant Nucleo-spin kit (Pennsylvania) according to the manufacturer’s instructions. All DNA samples were quantified using PicoGreen and then diluted to 50 ng/µl. The samples included reference lines from the various SNP development efforts, duplicated DNA samples, individual plants and plant pools from the same seed source, individual plant samples from different seed sources, parent/F1 combinations, segregating samples from wild Gossypium species, inbred cultivar lines, wild G. hirsutum lines, and two mapping populations, one intraspecific and one interspecific (Table 1). This material represented samples from most of the cross-compatible range of G. hirsutum. Mapping samples included 93 F2 lines from a G. hirsutum cv. Phytogen 72 × G. hirsutum cv. Stoneville 474 population, designated “PS”, for intraspecific mapping and 118 F2 lines from a G. barbadense doubled haploid line 3-79 × G. hirsutum cv. Texas Marker-1 population, designated “T3”, for interspecific mapping. Samples for the mapping populations were chosen randomly from germinated seed within each population.
Leaf tissue or high-quality DNA can be provided for each cotton sample to be processed on the CottonSNP63K array. Samples are processed through a three-day pipeline that yields raw fluorescent image data for all 63K functional assays on the array. These raw files are processed in GenomeStudio software with the developed cluster file, which allows automated genotype calling at the indicated cluster positions. The cluster file can be found HERE. A second cluster file is being developed for enhanced automated genotype calling of SNPs associated with diploid germplasm introgressed into G. hirsutum (contact Dr. David Stelly for details).
To produce standardized genotype output that can be deposited into the CottonGen database, please use the following procedure in GenomeStudio.
This ensures that protocols and reporting formats are identical across groups and that genotypes can be compared directly in the database. (In some cases you may wish to manually adjust cluster positions for polymorphisms that are specific to your population and were not represented in the samples used to develop the cluster file. If so, please alert the database curator, Dr. Jing Yu, when submitting genotype data.)
Standardized Genotype Output
To import Cotton Array Data in GenomeStudio
You will need the sample sheet you intend to use for the project, the iDAT files for the samples included in the sample sheet, the cluster file available HERE, and the SNP Manifest file from Illumina. Genotyping projects that use the developed cluster file should use a GenCall cutoff of 0.05 to allow accurate calling of polyploid-type loci that exhibit nearby clusters.
Step-by-step instructions for generating a New Project in GenomeStudio can be found HERE.
To export Cotton Array Data from GenomeStudio
You should get a report that is similar to this (head of file):
GSGT Version 1.9.4
Processing Date 5/7/2015 4:34 PM
Num SNPs 63058
Total SNPs 63058
Num Samples 7
Total Samples 7
s1 s2 s3
For inclusion in the CottonGen Database, two letter genotypes have been converted to single letter using IUPAC notation.
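The two-letter to single-letter conversion mentioned above can be sketched as follows. This is a minimal illustration using the standard IUPAC nucleotide ambiguity codes, not the script used by CottonGen; the function name `to_iupac` and the choice to report no-calls as `N` are assumptions made for the example.

```python
# Standard IUPAC codes for single bases and two-base heterozygotes.
# Keys are sets of alleles, so 'AG' and 'GA' map to the same code.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
}


def to_iupac(call, missing="N"):
    """Convert a two-letter genotype call such as 'AG' to one IUPAC letter.

    Assumption: no-calls (for example '--') and any unexpected value are
    reported with the `missing` symbol; the database may use another code.
    """
    call = call.strip().upper()
    alleles = frozenset(call)
    if len(call) != 2 or alleles not in IUPAC:
        return missing
    return IUPAC[alleles]


if __name__ == "__main__":
    for genotype in ["AA", "AG", "GA", "TC", "--"]:
        print(genotype, "->", to_iupac(genotype))
```

Using allele sets as keys keeps the lookup independent of allele order, so a heterozygote is coded the same way whichever allele is listed first.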
Table 1. Samples included for array validation and cluster file development (Hulse-Kemp et al. 2015)
Table 2. Datasets utilized in intraspecific content design on the CottonSNP63K array (Hulse-Kemp et al. 2015)
Table 3. Datasets utilized in interspecific content design on the CottonSNP63K array (Hulse-Kemp et al. 2015)
Figure 1. Overlap of SNPs among species (Hulse-Kemp et al. 2014)
SNP markers shared across five species included on the CottonSNP63K array from the TAMU/UC-Davis Inter RNA-seq discovery set. The overlap and specificity of the Class I and Class II SNPs are shown for G. barbadense cv. 3–79, G. tomentosum, G. mustelinum, G. armourianum, and G. longicalyx.
Genotyping with the array
DNA standardized at 50 ng/µl for each of the cotton lines described above was processed according to Illumina protocols and hybridized to the CottonSNP63K array at Texas A&M University or CSIRO. Single-base extension was performed and the chips were scanned using the Illumina iScan. Image files were saved for cluster file analysis. All image files were uploaded into a single GenomeStudio project containing 1,156 individual samples. Of the 70,000 SNPs targeted for manufacture on the array, 6,942 markers failed to meet standards for bead representation and decoding metrics during the array construction process at Illumina and were removed from the manifest, leaving 63,058 markers. Data for the remaining markers were clustered using the GenomeStudio Genotyping Module (V 1.9.4, Illumina, Inc.). All markers were viewed and manually curated, taking into account the sample type and known segregation ratios, to construct the best cluster file for genotyping tetraploid cotton (available HERE).
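Curation of the cluster file was manual, but the idea of checking a marker against a known segregation ratio can be illustrated numerically. The sketch below assumes a codominant marker in an F2 population, for which the expected genotype ratio is 1:2:1, and applies a chi-square goodness-of-fit test. The function names and the 0.05 significance level are illustrative assumptions, not part of the published procedure.

```python
# Chi-square critical value for 2 degrees of freedom at alpha = 0.05.
CRITICAL_DF2_P05 = 5.991


def f2_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic for observed F2 genotype counts against 1:2:1."""
    total = n_aa + n_ab + n_bb
    if total == 0:
        raise ValueError("no called genotypes for this marker")
    expected = (total * 0.25, total * 0.50, total * 0.25)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def fits_f2_ratio(n_aa, n_ab, n_bb):
    """True if the counts do not deviate significantly from 1:2:1."""
    return f2_chi_square(n_aa, n_ab, n_bb) <= CRITICAL_DF2_P05


if __name__ == "__main__":
    # Hypothetical counts for a population of 93 F2 lines.
    print(fits_f2_ratio(23, 47, 23))   # close to 1:2:1
    print(fits_f2_ratio(60, 20, 13))   # strongly distorted
```

A marker whose clusters yield strongly distorted counts in a mapping population is a candidate for a second look at its cluster positions, although true segregation distortion can also occur.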
Illumina Infinium Platform
The Infinium platform detects SNP alleles by adding a fluorescently labeled, allele-specific nucleotide via single-base extension and then detecting the fluorescent color. The Illumina Infinium chips are sophisticated silicon-based array devices. If you are interested in learning about the technology that drives the Infinium genotyping arrays, watch the following video:
Types of call frequency of SNP markers
Figure 2. Types of call frequency of SNP markers. NormTheta or relative amount of each of the two fluorophore signals is plotted on the X-axis, whereas NormR or signal intensity is plotted on the Y-axis. (A) Failed marker with call frequency = 0. (B) Call frequency 0.500–0.990 with major sample deviations. (C) Call frequency 0.990–0.999 with few uncalled samples. (D) Call frequency = 1 with all called samples. (E) Distribution of call frequencies for all SNP markers on the array.
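The call frequency categories in this caption can be expressed as a small classifier. The sketch below assumes that call frequency is the fraction of samples with a genotype call and that no-calls are written as `--`; both function names are hypothetical. The caption does not describe frequencies between 0 and 0.500, and it lists 0.990 in two panels, so those boundary choices are assumptions.

```python
def call_frequency(calls, no_call="--"):
    """Fraction of samples that received a genotype call for one marker."""
    if not calls:
        raise ValueError("marker has no samples")
    called = sum(1 for c in calls if c != no_call)
    return called / len(calls)


def call_frequency_class(freq):
    """Assign a call frequency to panel A-D of the figure."""
    if freq == 0:
        return "A: failed marker"
    if freq == 1:
        return "D: all samples called"
    if freq >= 0.990:          # assumption: 0.990 itself is placed in panel C
        return "C: few uncalled samples"
    if freq >= 0.500:
        return "B: major sample deviations"
    return "unclassified"      # (0, 0.500) is not covered by the caption


if __name__ == "__main__":
    marker_calls = ["AA", "AB", "--", "BB"]   # hypothetical calls for 4 samples
    freq = call_frequency(marker_calls)
    print(freq, call_frequency_class(freq))
```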
Classification of scorable SNP markers according to Illumina GenTrain score
Figure 3. Classification of scorable SNP markers according to Illumina GenTrain score. NormTheta or relative amount of each of the two fluorophore signals is plotted on the X-axis, whereas NormR or signal intensity is plotted on the Y-axis. (A) Monomorphic marker. (B) Intergenomic or homeo-SNP marker. (C–F) Classification of polymorphic markers based on Illumina GenTrain score. (C) Genome-specific marker representing a single polymorphic locus with GenTrain score >0.60. (D) Marker with GenTrain score 0.30–0.59 on half the plot representing two genomes, one monomorphic and one polymorphic locus. (E) Marker with GenTrain score 0.21–0.29 representing multiple monomorphic loci and one polymorphic locus. (F) Marker with GenTrain score less than 0.20 representing many monomorphic loci and one polymorphic locus. (G) Distribution of cluster types in polymorphic markers based on GenTrain score.
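The GenTrain score ranges in this caption can be applied to a marker table with a short function. This is a sketch for polymorphic markers only; the function name and the handling of values that fall between the published ranges (exactly 0.60, exactly 0.20, and scores between 0.29 and 0.30 or 0.59 and 0.60) are assumptions.

```python
# Descriptions paraphrase panels C-F of the figure caption.
PANEL_DESCRIPTION = {
    "C": "genome-specific marker, single polymorphic locus",
    "D": "two genomes: one monomorphic and one polymorphic locus",
    "E": "multiple monomorphic loci and one polymorphic locus",
    "F": "many monomorphic loci and one polymorphic locus",
}


def gentrain_class(score):
    """Map a GenTrain score of a polymorphic marker to panel C, D, E, or F."""
    if score >= 0.60:      # caption: > 0.60 (0.60 itself assigned here by assumption)
        return "C"
    if score >= 0.30:      # caption: 0.30-0.59
        return "D"
    if score >= 0.20:      # caption: 0.21-0.29 (0.20 itself assigned here by assumption)
        return "E"
    return "F"             # caption: less than 0.20


if __name__ == "__main__":
    for s in [0.85, 0.45, 0.25, 0.10]:   # hypothetical scores
        panel = gentrain_class(s)
        print(s, panel, PANEL_DESCRIPTION[panel])
```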
Publications using the CottonSNP63K array