## 13.4 Stratified Sampling

• Simple random sampling does not control the proportions of the response variable's values in the resulting samples.
• Machine learning methods may need the response variable to be distributed similarly in the training and test sets; otherwise one of the sets can end up with an unrepresentative or imbalanced response.
• Stratified sampling produces training and test sets with similar distributions of the response variable.
• It can be applied to both classification and regression problems.
• With a continuous response variable, stratified sampling segments Y (the response variable) into quantile-based bins and randomly samples from each bin. This helps ensure a balanced representation of the response distribution in both the training and test sets.
• The rsample package can be used to create stratified samples.
• The following code demonstrates this on a dataset suitable for classification.
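library(caret)    # provides the GermanCredit data
library(rsample)  # provides initial_split(), training(), testing()

set.seed(999)
data("GermanCredit")  # credit risk data from the caret package
idx4 = initial_split(GermanCredit, prop = 0.7, strata = "Class")  # Class is the binary response variable
train4 = training(idx4)
test4 = testing(idx4)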
# check the proportion of outcomes
prop.table(table(train4$Class))  # training set
#       Bad      Good
# 0.3004292 0.6995708

prop.table(table(test4$Class))  # testing set
#       Bad      Good
# 0.2990033 0.7009967
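
For a continuous response, the quantile-based approach described in the bullets above can be sketched by hand in base R. This is only an illustration of the idea, not the code rsample runs internally: the simulated data frame `df`, its column `y`, the choice of four quartile bins, and the 70/30 split are all assumptions made for this example.

```r
set.seed(999)

# Hypothetical data: a right-skewed continuous response (assumption for illustration)
df <- data.frame(y = rlnorm(1000, meanlog = 0, sdlog = 1))

# Segment y into quartile bins (bin labels 1 to 4)
breaks <- quantile(df$y, probs = seq(0, 1, by = 0.25))
df$bin <- cut(df$y, breaks = breaks, include.lowest = TRUE, labels = FALSE)

# Randomly sample 70% of the row indices within each bin
rows_by_bin <- split(seq_len(nrow(df)), df$bin)
train_idx <- unlist(
  lapply(rows_by_bin, function(i) sample(i, size = round(0.7 * length(i)))),
  use.names = FALSE
)

train <- df[train_idx, ]
test  <- df[-train_idx, ]

# Compare how the quartile bins are represented in each set
prop.table(table(train$bin))
prop.table(table(test$bin))
```

Because every bin contributes the same fraction of its rows to the training set, the training and test sets cover the low, middle, and high ranges of `y` in similar proportions. In practice, passing a numeric column as `strata` to `initial_split()` lets rsample handle the binning, so a manual version like this is mainly useful for seeing what the stratification does.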