Sampling extracts a limited number of observations from a large population for the purpose of modeling or analysis. It is primarily done for two reasons:
- Traditionally, large datasets did not fit into computer memory, and even when they did, analysis programs ran slowly.
- Even when the entire dataset fits into memory, it is good practice to run through the analysis quickly on a limited amount of data, focusing on debugging the process before running it on the full dataset.
There is a third reason why sampling may be used. Most classification algorithms, such as decision trees and regression, do not perform well when the frequency of the target category of the dependent variable is small (searching for a needle in a haystack). In this case, we may want to use balanced sampling to zoom in on the low-frequency category and allow the classification algorithm to pick up the pattern of those values.
Figure 1 shows a schematic of the three most commonly used methods of sampling.
Figure 1: Sampling methods
In random sampling, we select a subset of the population using random selection without replacement. The benefit of random sampling is that, given a sufficient number of records, the sample will most likely resemble the population, as long as the sampling process is truly random (or pseudo-random).
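A minimal sketch of random sampling without replacement, using a hypothetical population of numeric measurements (the column name `x` and all sizes are illustrative, not from the source):

```python
import numpy as np
import pandas as pd

# Hypothetical population: 10,000 records with one numeric measurement.
rng = np.random.default_rng(42)
population = pd.DataFrame({"x": rng.normal(loc=50.0, scale=10.0, size=10_000)})

# Draw 1,000 records at random, without replacement.
sample = population.sample(n=1_000, replace=False, random_state=42)

# With a sufficiently large sample, summary statistics approximate the population.
print(f"population mean: {population['x'].mean():.2f}")
print(f"sample mean:     {sample['x'].mean():.2f}")
```

Because sampling is without replacement, no record appears twice, and the sample mean should land close to the population mean.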
In balanced sampling, we over-sample records containing one or more of the categories of a variable of interest. The sample is drawn without replacement. In most cases, the variable of interest is the variable being modeled (the dependent variable). This method is commonly used when the frequency of one of the categories of the target variable is so small that random sampling would yield very few records of that category.
Because balanced sampling results in a dataset that does not represent the population, it is essential to define a weight variable that will be used in all modeling calculations to scale the sample to the same proportions that exist in the population.
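The two steps above can be sketched as follows, on hypothetical data where the target category occurs in roughly 2% of records (column names, the 2% rate, and the 50/50 balance are illustrative assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20_000
# Hypothetical population: target category y=1 occurs in ~2% of records.
population = pd.DataFrame({
    "y": (rng.random(n) < 0.02).astype(int),
    "x": rng.normal(size=n),
})

minority = population[population["y"] == 1]
majority = population[population["y"] == 0]

# Balanced sample: all minority records plus an equal number of majority
# records, drawn without replacement.
balanced = pd.concat([
    minority,
    majority.sample(n=len(minority), replace=False, random_state=0),
])

# Weight variable: scale each record so the weighted sample reproduces the
# class proportions that exist in the population.
# weight = (population share of the class) / (sample share of the class)
pop_share = population["y"].value_counts(normalize=True)
samp_share = balanced["y"].value_counts(normalize=True)
balanced["weight"] = balanced["y"].map(pop_share / samp_share)
```

Passing `weight` to a modeling routine (most libraries accept a sample-weight argument) restores the population proportions in all downstream calculations.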
In stratified sampling, we split the data using the categories (or values) of one or more variables that have specific significance to the problem domain. The resulting data slices are called strata, and a separate model is developed for each stratum. The example in Figure 1 uses the State (name) variable as the stratification variable.
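A sketch of the stratify-then-model-per-stratum idea, using a hypothetical `state` column as the stratification variable and a simple linear fit as the per-stratum model (the state names, slopes, and choice of model are all illustrative assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 3_000
# Hypothetical data: each state has a different x-to-y relationship,
# so a single pooled model would blur the per-state patterns.
df = pd.DataFrame({
    "state": rng.choice(["NY", "CA", "TX"], size=n),
    "x": rng.uniform(0, 10, size=n),
})
true_slope = pd.Series({"NY": 1.0, "CA": 2.0, "TX": 3.0})
df["y"] = df["state"].map(true_slope) * df["x"] + rng.normal(scale=0.1, size=n)

# Split into strata by state, then fit a separate model per stratum.
models = {
    state: np.polyfit(group["x"], group["y"], deg=1)  # (slope, intercept)
    for state, group in df.groupby("state")
}
for state, (b1, b0) in models.items():
    print(f"{state}: fitted slope = {b1:.2f}")
```

Each stratum's model recovers its own slope, which is the point of stratification: patterns specific to a stratum are not averaged away.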