Data preparation is the backbone of any analysis, and many varied procedures are available to access and shape data into an appropriate representation for modelling or reporting.
Data preparation is, as those involved will attest, a time-consuming task. A range of figures has been quoted for the proportion of total project time it takes, with estimates as high as 80-90% (Mayo, 2016).
The benefits of investing such time in data preparation cannot be overstated. However, in this day and age of convenience, speed and not bothering to take the time to read manufacturers’ instructions … there is a need to be explicit.
To convey the importance of data preparation, consider the simple analogy of painting a room.
For those of us who are not professional painters and decorators it’s a bit of an adventure:
- Get the paint
- Paint it!
These two steps may suffice. However, with this dive-in approach, areas that needed sanding and were ignored may now be prominent, and holes that should have been filled in remain evident. The dust and cobwebs that were not removed from the corners are painted over; they are subsumed, but to the trained eye they stick out like the proverbial sore thumb.
The outcome can be boiled down simply: results may vary, and there may be a need to do the whole lot again!
Now, consider the approach a pro takes. For example, to paint a room the pro will:
- Remove all furniture
- Remove cobwebs from corners and rub down the walls
- Use dust covers for flooring
- Check everything
Only when these elements have been considered and addressed can painting begin. Of course, the implications are profound:
- Taking the time to prepare the room ensures a better awareness of its characteristics: walls, unevenness, holes, areas to be cautious with, areas that require a little more attention, etc.
- When the time comes to apply the final coat, given the preparation applied, the process should be smooth and even and results should be as expected with no surprises … as any will have been dealt with.
Relating this to model building or reporting requires no gigantic leap: familiarity with the data leads to better understanding and accuracy.
To be more precise, the act of data preparation and assessment (ensuring correct joins, aggregations and histories, creating new fields, identifying values to process and, ultimately, building a usable view) is paramount to success in modelling or reporting.
Short-cutting these processes is possible but generally leads to poor data understanding, under-performing models, and inaccurate reporting.
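As a minimal sketch of what checking joins and aggregations can look like, the plain-Python example below uses two hypothetical tables (customers and orders, with made-up field names and values) to look for duplicate join keys and unmatched records before aggregating. It is an illustration under those assumptions, not a prescribed method.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: table and field names are illustrative only.
customers = [
    {"customer_id": 1, "region": "North"},
    {"customer_id": 2, "region": "South"},
    {"customer_id": 2, "region": "South"},  # accidental duplicate key
    {"customer_id": 3, "region": "East"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 20.0},
    {"order_id": 11, "customer_id": 2, "amount": 35.0},
    {"order_id": 12, "customer_id": 2, "amount": 15.0},
    {"order_id": 13, "customer_id": 4, "amount": 50.0},  # no matching customer
]


def duplicate_keys(rows, key):
    """Return key values that appear more than once (these fan out a join)."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def unmatched_keys(left, right, key):
    """Return key values present in `left` with no partner in `right`."""
    right_keys = {row[key] for row in right}
    return sorted({row[key] for row in left} - right_keys)


def deduplicate(rows, key):
    """Keep the first row seen for each key value."""
    seen = {}
    for row in rows:
        seen.setdefault(row[key], row)
    return list(seen.values())


def total_by_customer(order_rows, customer_rows):
    """Inner-join orders to customers, then sum the amount per customer."""
    by_id = defaultdict(list)
    for customer in customer_rows:
        by_id[customer["customer_id"]].append(customer)
    totals = defaultdict(float)
    for order in order_rows:
        # One contribution per matching customer row, as a real join would give.
        for _ in by_id.get(order["customer_id"], []):
            totals[order["customer_id"]] += order["amount"]
    return dict(totals)


dupes = duplicate_keys(customers, "customer_id")
orphans = unmatched_keys(orders, customers, "customer_id")
naive_totals = total_by_customer(orders, customers)
clean_totals = total_by_customer(orders, deduplicate(customers, "customer_id"))

print("duplicate customer keys:", dupes)
print("orders without a customer:", orphans)
print("totals before de-duplication:", naive_totals)
print("totals after de-duplication:", clean_totals)
```

In this toy data the duplicated customer record silently doubles that customer’s total, and one order drops out of the join altogether. These are the kinds of surprises that preparation is meant to surface before modelling or reporting begins.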
A direct consequence of short-cutting data preparation when developing predictive models is the immediate assumption that the modelling technique is wrong or not good enough, and that a ‘better’ alternative, one that predicts records more accurately, should be sought.
This assumption is an instance of a phenomenon known colloquially as the big black button:
The big black button is a panacea. You only need to think of what you desire, close your eyes and press it, and it will perform all the necessary data preparation, modelling and reporting … the only problem is that it doesn’t currently exist … BUT this does not stop people assuming it’s here!
The point here is that if an adequate degree of data preparation has been performed on a dataset, accuracy should be comparable across modelling techniques. If one modelling technique is not performing well, an alternative technique may well perform equally poorly!
Again, the underlying issue is with data familiarity and data preparation, not the modelling technique chosen.
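A toy sketch can make this concrete. The example below is an assumption-laden illustration, not evidence: it invents a small dataset in which three target values were recorded as a -999 ‘missing’ code, then fits two deliberately different techniques (a least-squares line and a nearest-neighbour average) to the raw and to the prepared training data. The sentinel value, the data and both models are hypothetical choices made for the illustration.

```python
# Toy data: y = 2x + 1, with three targets recorded as a -999 "missing" code.
# The sentinel value and both models are assumptions made for illustration.
SENTINEL = -999
rows = [(x, SENTINEL if x in (4, 10, 16) else 2 * x + 1) for x in range(20)]
train = [(x, y) for x, y in rows if x % 2 == 0]
test = [(x, 2 * x + 1) for x in range(20) if x % 2 == 1]  # clean hold-out


def fit_line(data):
    """Ordinary least squares for one feature; returns a predict function."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_knn(data, k=2):
    """k-nearest-neighbour regression: average the k closest targets."""
    def predict(x):
        nearest = sorted(data, key=lambda row: abs(row[0] - x))[:k]
        return sum(y for _, y in nearest) / k
    return predict


def mean_abs_error(predict, data):
    """Average absolute difference between predictions and true targets."""
    return sum(abs(predict(x) - y) for x, y in data) / len(data)


# The "preparation" step: drop rows whose target is the missing-value code.
prepared = [(x, y) for x, y in train if y != SENTINEL]

errors = {
    (label, name): mean_abs_error(fit(data), test)
    for label, data in (("raw", train), ("prepared", prepared))
    for name, fit in (("line", fit_line), ("knn", fit_knn))
}
for key, value in errors.items():
    print(key, round(value, 2))
```

On this toy data both techniques are badly wrong when trained on the raw values and both are close to the truth once the sentinel rows are removed, so swapping one technique for the other would not have fixed the underlying problem.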
The takeaways from this little stroll down data prep street are straightforward, and hopefully shed a little light on the relevance and primacy of adequate data preparation as an avenue to better data understanding, which should lead to more accurate modelling and reporting.
The fact remains that adequate data preparation is a time-consuming task, but it leads to a better understanding of the data and, consequently, to more robust models and reporting.
Mayo, M. (2016, Oct). KD nuggets. Retrieved from KD Nuggets News: