Storage vs. Analysis
Computer scientists and software developers who designed databases and data management systems did so with the objective of minimizing disk space and making access to the data easy and fast. For example, Oracle Database currently allows 26 field types to select from when defining a table. From the point of view of modeling, on the other hand, things are much simpler. There are three main variable types: nominal, ordinal, and continuous. These types reflect how we view the behavior or the meaning captured by each variable.
One of the main steps during the phase of data acquisition and understanding is to decide which of the three variable types each variable should be interpreted as.
Nominal variables are the most basic and least flexible in terms of the mathematical operations allowed on them. They represent a set of categories with no notion of order or distance between them. For example, the categories of “Residence Type” could be “House”, “Apartment”, or “Other”. We cannot establish any meaningful order between these categories or define a distance metric.
It is important to note that the values of nominal variables can be numbers or strings, but the numbers do not carry their usual numerical meaning. For example, we could have coded the “Residence Type” categories as “1”, “2”, and “3” and still not associate any order or distance metric with these values.
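A minimal sketch of the point about numeric codes, assuming a hypothetical coding of “Residence Type” and made-up data. Counting categories and finding the most frequent one are meaningful; averaging the codes runs without error but the result has no interpretation.

```python
from collections import Counter

# Hypothetical coding of "Residence Type"; the numbers are labels only.
RESIDENCE_LABELS = {1: "House", 2: "Apartment", 3: "Other"}

# Illustrative values, not taken from any real dataset.
residence_codes = [1, 2, 2, 3, 1, 2]

# Meaningful for a nominal variable: frequency counts and the mode.
counts = Counter(RESIDENCE_LABELS[code] for code in residence_codes)
mode_label, mode_count = counts.most_common(1)[0]

# Computable, but meaningless: there is no "average residence type".
meaningless_mean = sum(residence_codes) / len(residence_codes)

print(counts)
print(mode_label, mode_count)
```

Recoding the categories as 3, 1, 2 would change `meaningless_mean` but leave the counts and the mode unchanged, which is one way to see that only the latter describe the data.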
When we add the notion of order to nominal variables, they become ordinal. The categories of ordinal variables have an order relationship but not a distance metric. For example, the levels of risk of default assigned to a credit card account may take the values “High”, “Medium”, and “Low”. We implicitly add the assumption that High > Medium > Low. But we cannot say whether the distance between “High” and “Medium” is the same as that between “Medium” and “Low”.
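The risk-level example can be sketched as follows. The rank numbers are an assumption of this sketch and serve only to encode the order; comparing, sorting, and taking a median are meaningful, while subtracting ranks is not.

```python
# Ordered from lowest to highest risk. Only the ordering matters;
# the integer ranks derived from it are arbitrary.
RISK_ORDER = ["Low", "Medium", "High"]
RISK_RANK = {level: rank for rank, level in enumerate(RISK_ORDER)}


def riskier(a: str, b: str) -> bool:
    """Return True if risk level `a` is strictly higher than `b`."""
    return RISK_RANK[a] > RISK_RANK[b]


# Illustrative account risk levels (made-up data, odd length on purpose).
accounts = ["Medium", "High", "Low", "Medium", "High"]

# Sorting and the median are well defined for ordinal data.
sorted_accounts = sorted(accounts, key=RISK_RANK.__getitem__)
median_level = sorted_accounts[len(sorted_accounts) // 2]

print(sorted_accounts)
print(median_level)
```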
Continuous variables are the most common type, and they allow all mathematical operations, including order and distance metrics, to be well defined. One can think of continuous variables as ordinal variables with the concepts of distance and ratio defined on them.
Figure 1 summarizes the relationship between the above three variable types.
Figure 1: Summary of variable types
Other Variable Types
There are some other types of variables that can be derived from the above three basic types. This is usually done by imposing some constraints on the values or the allowed operations. For example, a variable that represents the age of a person would appear at first consideration as a continuous variable. But we usually impose the additional constraint that the age can only be positive. This additional constraint will result in additional data preparation steps to check that a mathematical operation on the age does not result in a negative age value.
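As a rough illustration of such a data preparation check, the sketch below flags values that violate the positivity constraint after a transformation. The function name `find_non_positive` and the centering step are hypothetical choices made for this example.

```python
def find_non_positive(values):
    """Return the indices of values that violate the 'must be positive' constraint."""
    return [i for i, v in enumerate(values) if v <= 0]


# Illustrative ages in years (made-up data).
ages = [34.0, 51.5, 27.0]

# A transformation that is harmless for an unconstrained continuous
# variable, but can break the constraint on age.
shifted_ages = [age - 30 for age in ages]

print(find_non_positive(ages))          # no violations in the raw data
print(find_non_positive(shifted_ages))  # the third value is now negative
```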
Why do we care?
Identifying the most appropriate variable type for each field in the data is essential in determining the role that the variable would play during the modeling phase. For example, in a regression or a neural network model, nominal variables will need to be transformed into another format that will allow them to be used in such a model. In this case, one common approach is to generate dummy binary variables representing the unique categories.
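A minimal sketch of dummy-variable generation in plain Python. The helper `dummy_encode` and the `is_` column prefix are assumptions of this example, not part of any specific tool. The optional `drop_first` flag omits one category as a reference level, which is a common choice in regression models that include an intercept.

```python
def dummy_encode(values, drop_first=False):
    """Map each nominal value to a dict of 0/1 indicator variables."""
    categories = sorted(set(values))
    if drop_first:
        # The dropped category becomes the reference level (all zeros).
        categories = categories[1:]
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


residence = ["House", "Apartment", "Other", "House"]

full = dummy_encode(residence)
reduced = dummy_encode(residence, drop_first=True)

print(full[0])
print(reduced[1])
```

With three categories, the full encoding produces three indicators per record and the reduced encoding produces two; an “Apartment” record in the reduced encoding has all indicators set to zero.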
Another common situation where the variable type matters is when we employ weight of evidence (WOE) transformations. The monotonicity of the resulting WOE values comes into question only in the case of ordinal or continuous variables, because the categories of a nominal variable have no order against which a trend could be checked.
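The sketch below assumes one common convention for WOE, the natural log of a bin's share of “good” cases divided by its share of “bad” cases; sign conventions differ between references. The bin counts are made up, and the code assumes every bin contains at least one good and one bad case.

```python
import math


def weight_of_evidence(bins):
    """bins: list of (label, n_good, n_bad) tuples. Returns {label: WOE}."""
    total_good = sum(good for _, good, _ in bins)
    total_bad = sum(bad for _, _, bad in bins)
    return {
        label: math.log((good / total_good) / (bad / total_bad))
        for label, good, bad in bins
    }


def is_monotonic(values):
    """True if the sequence never increases or never decreases."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


# Bins listed in their ordinal order (made-up counts).
risk_bins = [("Low", 400, 10), ("Medium", 300, 30), ("High", 100, 60)]

woe = weight_of_evidence(risk_bins)
woe_in_order = [woe[label] for label, _, _ in risk_bins]

print(woe)
print(is_monotonic(woe_in_order))
```

The monotonicity check only makes sense because the bins are listed in a meaningful order; for a nominal variable any listing order would be equally valid.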
A final note: with some algorithms, the resulting model depends on how we interpret the variable type. For example, a decision tree algorithm produces different groupings and splits depending on how it interprets the variable type. A good decision tree implementation, such as Angoss’ KnowledgeSEEKER or KnowledgeSTUDIO, allows the user to define how the algorithm interprets the variable type.
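To see why the interpretation changes the splits, consider the candidate binary splits a tree could evaluate for one variable. This is a simplified sketch under the assumption of binary splits and distinct category labels; it does not describe the algorithm of any particular product. Treated as ordinal, k levels allow k - 1 threshold splits; treated as nominal, the same k categories allow 2^(k-1) - 1 groupings.

```python
from itertools import combinations


def ordinal_splits(ordered_levels):
    """Binary splits that respect the order: cut between adjacent levels."""
    return [
        (ordered_levels[:i], ordered_levels[i:])
        for i in range(1, len(ordered_levels))
    ]


def nominal_splits(categories):
    """All ways to divide the categories into two non-empty groups."""
    first, rest = categories[0], list(categories[1:])
    splits = []
    # Fix the first category on the left side so that mirrored
    # splits are not counted twice.
    for size in range(len(rest)):
        for chosen in combinations(rest, size):
            left = [first, *chosen]
            right = [c for c in rest if c not in chosen]
            splits.append((left, right))
    return splits


levels = ["Low", "Medium", "High", "Very High"]

print(len(ordinal_splits(levels)))  # candidates when treated as ordinal
print(len(nominal_splits(levels)))  # candidates when treated as nominal
```

The nominal interpretation permits groupings such as “Low” together with “High”, which an ordinal interpretation rules out, so the two settings can lead to different trees on the same data.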