The Rationale for Sampling
Sampling is used to extract a smaller dataset from a large population of data for the purpose of analysis and modeling. Working on a small dataset, as opposed to the whole population, allows us to use computers with fewer resources, such as RAM and disk space, and to perform the analysis faster.
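As a minimal sketch, drawing a simple random sample with pandas might look like the following (the population, column names, and sampling fraction are all illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Illustrative population of 100,000 records (columns are hypothetical)
population = pd.DataFrame({
    "income": rng.lognormal(mean=10, sigma=0.5, size=100_000),
    "region": rng.choice(["north", "south", "east", "west"], size=100_000),
})

# Draw a 1% simple random sample for analysis and modeling
sample = population.sample(frac=0.01, random_state=42)

print(len(sample))  # 1,000 rows instead of 100,000
```

The fixed `random_state` makes the sample reproducible, which matters when the sample later has to be validated against the population.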
Sampling is a well-established technique that has been around for more than 100 years. In recent years, big data platforms have been developed and have seen rapid adoption. With these platforms, the call to use all the data and abandon sampling is growing stronger. The justification for this attitude is the belief that, by using all the data, we will not miss the small ripples of interest that may be smoothed out or ignored when we sample. I have two thoughts regarding this argument:
- Very often we are not really interested in the small ripples; we are only interested in the overall trend. If we model the small ripples in the data, the models will overfit and will not generalize, meaning the model will not be robust.
- Sometimes we are interested in the small ripples precisely because we are searching for rare events. In this case, there is a good argument against sampling.
In common applications of predictive modeling, such as credit scoring and marketing, we are only interested in learning the trend, and sampling is still a valid option. However, in some special applications, such as fraud analysis, we are more focused on rare events than on the overall trend, and we should take advantage of big data platforms.
There is, however, another justification for big data platforms. The argument is based on organizations' need to have one version of the truth. Currently, the data in most organizations are scattered over many data warehouses, data marts, and small databases and files across the functional areas of the business. Therefore, it is not uncommon for different departments to produce different reports on the same set of metrics. The vision behind big data repositories, such as data lakes, is to centralize all the organization's data and run all of its operational and analytical workloads on it. This vision also results in better security and tighter control over data access across the organization.
Integrity Checks and Sample Validation
Once a sample is drawn from a population, the first task the analyst must perform is to run integrity checks and ascertain that the sample represents the population. These checks cover the following key items:
- Range of values and distribution: the range of values and their distributions for all the variables in the sample should be similar to those in the population. Specific attention should be given to nominal variables: all the distinct categories that exist in the population should be represented in the sample. There are many methods to test the similarity between distributions of values; we will dedicate a separate article to the subject.
- Relationships between variables: in addition to testing each variable on its own, we should test that the sample preserves the relationships between the variables. Correlation analysis and decision trees are commonly used for this.
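The checks above can be sketched in code. This example uses the two-sample Kolmogorov-Smirnov test from scipy for numeric distributions, a set comparison for nominal categories, and a correlation-matrix comparison for pairwise relationships; the dataset and column names are illustrative:

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
population = pd.DataFrame({
    "income": rng.lognormal(10, 0.5, 50_000),
    "age": rng.integers(18, 80, 50_000),
    "region": rng.choice(["north", "south", "east", "west"], 50_000),
})
sample = population.sample(frac=0.02, random_state=0)

# 1. Numeric variables: compare distributions with the two-sample KS test
for col in ["income", "age"]:
    stat, p = ks_2samp(population[col], sample[col])
    print(f"{col}: KS statistic={stat:.3f}, p-value={p:.3f}")

# 2. Nominal variables: every category in the population should appear in the sample
missing = set(population["region"]) - set(sample["region"])
print("missing categories:", missing or "none")

# 3. Relationships: the correlation matrices should be close
diff = (population[["income", "age"]].corr()
        - sample[["income", "age"]].corr()).abs().max().max()
print(f"max correlation difference: {diff:.3f}")
```

A small KS statistic (and a large p-value), no missing categories, and a small correlation difference are the signals that the sample resembles the population on these variables.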
A frequently asked question concerns the minimum sample size. There are many formulas that provide the minimum sample size needed to capture the statistical properties of a population. However, these formulas are restricted to simple cases of one or two variables of restricted types. Unfortunately, they do not help with the real datasets encountered in business applications, which contain hundreds if not thousands of variables of mixed types. In this case, one has to rely on experience and trial and error to find an acceptable sample size.
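One hypothetical way to automate that trial-and-error search is to grow the sample until a distribution-similarity statistic falls below a chosen tolerance. The function name and the KS-statistic threshold below are assumptions, not standard values:

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
population = pd.DataFrame({"income": rng.lognormal(10, 0.5, 100_000)})

def find_sample_size(population, col, start=500, tol=0.02):
    """Double the sample size until the KS statistic against the
    population drops below `tol` (a tolerance chosen by the analyst)."""
    n = start
    while n <= len(population):
        sample = population.sample(n=n, random_state=1)
        stat, _ = ks_2samp(population[col], sample[col])
        if stat < tol:
            return n, stat
        n *= 2
    # Fall back to the whole population if no smaller size was acceptable
    return len(population), stat

n, stat = find_sample_size(population, "income")
print(f"accepted sample size: {n} (KS statistic {stat:.4f})")
```

In a real project this loop would be run over several variables at once, and the tolerance would be tuned to the accuracy requirements of the downstream model.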
The good news is that we do not need to validate the integrity of the sample in terms of all the variables; we can validate it using only the variables of interest. These are the variables that will end up being used in the model(s) and in creating the deployment strategy. Therefore, one pragmatic approach to sample validation is to draw the sample, build the model, identify the variables that will be used in modeling, scoring, and reporting, and validate the sample using only those variables.
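That workflow can be sketched as follows. As a stand-in for a fitted model, the example ranks predictors by their absolute correlation with the target to pick the variables of interest, then validates the sample on those columns only; the data and the two-variable cutoff are illustrative:

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(2)
n = 50_000
population = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
# Hypothetical binary target driven by x1 and x2 but not by noise
population["target"] = (population["x1"] + 0.5 * population["x2"]
                        + rng.normal(scale=0.5, size=n) > 0).astype(int)

sample = population.sample(frac=0.05, random_state=2)

# Identify the variables of interest: here, a simple proxy based on
# correlation with the target (a real project would use the fitted model)
corr = sample.drop(columns="target").corrwith(sample["target"]).abs()
variables_of_interest = corr.nlargest(2).index.tolist()
print("variables of interest:", variables_of_interest)

# Validate the sample on those variables only
for col in variables_of_interest:
    stat, _ = ks_2samp(population[col], sample[col])
    print(f"{col}: KS statistic = {stat:.3f}")
```

The point of the shortcut is that variables the model never uses, like `noise` here, never need to pass the validation checks at all.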