abstract
-
The ability to accurately predict complex trait phenotypes from ‘omics’ data is critical for the implementation of precision agriculture. However, achieving high prediction accuracy for most complex traits remains a major challenge because of linear and/or nonlinear genetic effects, their interactions, and genotype-by-environment interactions. Advances in modeling have enabled the application of several linear and nonlinear algorithms to the prediction of complex traits, but the performance of these algorithms has not been systematically evaluated. Here, using a range of traits with different genetic architectures and properties, together with genome-wide SNP markers from a durum wheat association mapping panel, we evaluated six linear models, three tree-based machine learning (ML) models and three network-based deep learning (DL) models. We found that model accuracy is largely determined by the properties of the data, and no single model outperformed the others for all tested traits. The highest accuracies were achieved for pigment and cadmium, followed by yield and protein, while leaf spotting and maturation had the lowest accuracy. ML models, including Random Forest (RF) and XGBoost, outperformed linear models for multi-environment prediction, indicating their strength in accounting for nonlinear effects on complex traits. DL models showed lower accuracy than ML and linear models in most tests. With the exception of RF, fine-tuning of hyperparameters was critical for the predictive performance of ML and DL models. These findings establish a benchmark for a genomic prediction framework that uses both linear and nonlinear algorithms to accurately predict complex traits in durum wheat breeding programs.