13.6 K-fold Cross Validation

  • k-fold cross-validation (aka k-fold CV) is a resampling method that randomly divides the training data into k groups (aka folds) of approximately equal size.

  • The model is fit on k−1 folds and then the remaining fold is used to compute model performance.

  • This procedure is repeated k times; each time, a different fold is treated as the validation set.

  • This process results in k estimates of the generalization error.

  • The k-fold CV estimate is computed by averaging these k validation errors, which approximates the error we might expect on unseen data.
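
The averaging step in the last bullet can be written as a formula. The symbols are our own notation, chosen for illustration: E_i is the error measured on the i-th held-out fold, and CV_(k) is the resulting k-fold estimate.

```latex
\[
\mathrm{CV}_{(k)} = \frac{1}{k} \sum_{i=1}^{k} E_i
\]
```

For a regression model, E_i would typically be a measure such as the mean squared error or RMSE on fold i.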

K-fold CV in R

  • The rsample and caret packages both provide functions for creating k-fold CV splits.
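library(rsample)  # provides vfold_cv()
set.seed(999)     # make the random fold assignment reproducible
# using rsample package
cv1 = vfold_cv(d_bhp[2], v = 10)  # v is the number of folds
cv1  # 10 folds; each split prints as [rows used for fitting/rows held out]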
#  10-fold cross-validation 
# A tibble: 10 x 2
   splits           id    
   <list>           <chr> 
 1 <split [588/66]> Fold01
 2 <split [588/66]> Fold02
 3 <split [588/66]> Fold03
 4 <split [588/66]> Fold04
 5 <split [589/65]> Fold05
 6 <split [589/65]> Fold06
 7 <split [589/65]> Fold07
 8 <split [589/65]> Fold08
 9 <split [589/65]> Fold09
10 <split [589/65]> Fold10
# using caret package
library(caret)  # provides createFolds()
cv2 = createFolds(d_bhp$BHP.AX.Close, k = 10)
cv2$Fold01  # row indices held out in the first of the 10 folds
 [1]  28  33  38  42  44  52  71  85  97 119 121 122 125 126 135 160 161 168 191
[20] 194 197 201 222 227 231 239 241 246 265 284 292 298 302 310 319 331 336 344
[39] 353 362 368 384 386 387 402 403 406 430 466 471 484 500 532 533 539 554 567
[58] 570 581 585 610 612 633 641 642
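
The splits created above only define the folds; the model fitting and averaging still have to be done. The following is a minimal sketch of the full procedure using rsample, not the method used elsewhere in these notes. It assumes the built-in mtcars data as a stand-in for d_bhp, and the linear model mpg ~ wt and the choice of RMSE as the error measure are illustrative only.

```r
library(rsample)  # vfold_cv(), analysis(), assessment()

set.seed(999)
# mtcars (32 rows) is a stand-in data set; swap in your own data frame
folds <- vfold_cv(mtcars, v = 4)

# For each split: fit on the analysis rows, score on the assessment rows
fold_rmse <- sapply(folds$splits, function(split) {
  train <- analysis(split)     # the k-1 folds used for fitting
  valid <- assessment(split)   # the single held-out fold
  fit   <- lm(mpg ~ wt, data = train)   # illustrative model
  pred  <- predict(fit, newdata = valid)
  sqrt(mean((valid$mpg - pred)^2))      # RMSE on the held-out fold
})

fold_rmse        # one error estimate per fold
mean(fold_rmse)  # the k-fold CV estimate
```

Each element of fold_rmse corresponds to one of the k repetitions described earlier, and their mean is the k-fold CV estimate.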