abstract
-
In genome-wide prediction, models are evaluated using various cross-validation (CV) methods. These methods provide accuracy estimates that set expectations for the genetic gain likely to be achieved through genomic selection (GS). Conventionally, CV is performed by randomly dividing a data set of n cases into K folds, training the model on K-1 folds, and using the remaining fold as the test set. Such schemes often fail to make proper use of environmental information (e.g. years, locations, seasons, management practices or cycles) or relatedness information (e.g. populations or families). As a result, the accuracy estimates are biased, which can lead to surprises in the amount of genetic gain actually achieved. In the presence of strong genotype x environment interaction (GxE) and population stratification, prediction accuracy across years, locations, cycles or populations in plants is often very low, owing to issues such as shifting relatedness, differences in the underlying QTL, and QTL x environment effects. Cross-population, cross-year, cross-location and cross-cycle prediction accuracy decreases as the proportion of shared QTL decreases. In addition, the likelihood of shared QTL decreases when the populations or environments are distantly related. One difficulty is that the effective QTL are very hard to detect in known circumstances (e.g. for a given location or population) or to forecast in unknown ones (e.g. an unknown year). To obtain reliable accuracy estimates, the magnitude of GxE or population stratification must be considered when performing CV. In this study, using simulated and real data, we discuss different CV schemes and their potential usefulness in the design of genomic selection experiments.
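To make the contrast between the two kinds of scheme concrete, the following is a minimal sketch, not the implementation used in this study. It compares conventional random K-fold splitting with a grouped scheme that holds out one whole environment or population at a time. The function names `random_kfold` and `leave_one_group_out` and the toy `years` labels are assumptions introduced purely for illustration.

```python
import numpy as np

def random_kfold(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for random K-fold CV over n cases."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)  # K roughly equal folds
    for i in range(k):
        # Train on the K-1 folds other than fold i; test on fold i
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, folds[i]

def leave_one_group_out(groups):
    """Yield (train_idx, test_idx, group), holding out one group at a time.

    `groups` holds one label per case, e.g. a year, location, cycle or
    population, so each test set is an entire environment or population.
    """
    groups = np.asarray(groups)
    for g in np.unique(groups):
        test = np.flatnonzero(groups == g)
        train = np.flatnonzero(groups != g)
        yield train, test, g

# Hypothetical labels: 12 records from 3 years (illustrative only)
years = np.repeat([2019, 2020, 2021], 4)
random_splits = list(random_kfold(len(years), k=3))
year_splits = list(leave_one_group_out(years))
```

Under random K-fold, records from the same year can fall in both the training and test sets, so the estimate reflects within-environment prediction. Under the grouped scheme, the test year is never seen during training, which is closer to the across-year prediction problem described above.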