What grade of data are you using?



You are driving down a snow-covered rural road. You keep glancing at your fuel gauge. The needle is getting really close to the "E" warning marker. The information display is about to yell "LOW FUEL LEVEL". Your anxiety level is rising as the sun disappears below the horizon. You need gasoline now! You drive several more miles when, suddenly, a general store appears out of nowhere on the right, and they have a fuel pump! You pull in, park your vehicle and get out. As you approach the pump, you notice there is only one grade of fuel available, and the octane rating sticker is unreadable.

I hope you never end up in a similar scenario, but if you did, how comfortable would you be? If you drive a diesel vehicle, probably not very. Or perhaps the only fuel available is diesel? We need the right fuel for a given vehicle. Similarly, we need the right data for a given machine learning model.

The VISUAI Index

Back in 2017, I added a global data index to the suite of tools I use at Dion Research. Just like an octane rating ((RON + MON) / 2) gives us an overall value to help us select fuel that is appropriate for our vehicle, the VISUAI index allowed me to make sure the data met a minimum quality standard to build models, and provided further detailed metrics on the types of issues I would face when training, testing and deploying my data driven models and applications.



Early in 2018, I started licensing VISUAI (a web-based SaaS platform) to companies. You can learn more about this AI driven data quality, risk auditing and anomaly detection tool by visiting visu.ai, through dionresearch.com, or by contacting me on LinkedIn to book a demo.

Once licensed, you can also securely share the VISUAI index score card with customers so they feel better about the solutions you are building, and with your partners, so they understand the importance of sending you quality data.

But why is data quality so important to data driven initiatives?

Garbage In, Model Out?


Just recently, I presented the talk "Garbage In, Model Out?" at DataSciCon.tech in Atlanta. Besides introducing the audience to various data quality metrics, such as bias, risk, anomaly and sparsity, I demonstrated the effects of bad data quality on steps further down the data science pipeline. Many people were surprised by how sensitive the feature engineering step can be to bad data.

Even more importantly, I demonstrated how models can quickly turn from Jekyll to Hyde based on the imputation technique used for missing values, or on very small changes in the validity of the data.
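To make this concrete, here is a minimal, self-contained sketch (synthetic data of my own, not from the talk) of how the choice of imputation technique alone can shift a model's coefficients:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: y depends linearly on x (true slope = 3), with 20% of x missing.
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + rng.normal(0, 1, size=200)
x_missing = x.copy()
x_missing[rng.choice(200, size=40, replace=False)] = np.nan

def slope_after_imputation(x_vals, y_vals, fill_value):
    """Impute missing values with a constant, then fit ordinary least squares."""
    filled = np.where(np.isnan(x_vals), fill_value, x_vals)
    return np.polyfit(filled, y_vals, 1)[0]

slope_mean = slope_after_imputation(x_missing, y, np.nanmean(x_missing))
slope_zero = slope_after_imputation(x_missing, y, 0.0)

print(f"slope with mean imputation: {slope_mean:.2f}")  # close to the true 3
print(f"slope with zero imputation: {slope_zero:.2f}")  # visibly biased low
```

Same data, same model, different imputation: mean imputation leaves the slope near the truth, while constant-zero imputation drags it well below it. That is the Jekyll-or-Hyde effect in nine lines.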

On any given day ...



Do you even know how much invalid data you process, or build charts, reports and models with? "I feel pretty good about my data quality", I hear you say. "It has to be 95% good data".

If you are not measuring, and do not specifically KNOW what you are measuring, you are guessing. And your guess is more than likely off by a wide margin, assuming you are even measuring the right thing (what are you measuring?).
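Measuring is not complicated to start with. Here is a minimal sketch of the kind of metric I mean (the sample data and range check are hypothetical; VISUAI's actual metrics are considerably more involved):

```python
import pandas as pd

# Hypothetical sample: one missing longitude and one out-of-range longitude.
df = pd.DataFrame({
    "longitude": [-122.3, -121.9, None, 310.0, -122.1],
    "bathrooms": [2.0, 1.5, 3.0, None, 2.5],
})

def sparsity(series):
    """Fraction of missing values (0.0 means 100% complete)."""
    return series.isna().mean()

def validity(series, lo, hi):
    """Fraction of non-missing values inside the expected range."""
    return series.dropna().between(lo, hi).mean()

print(f"longitude sparsity: {sparsity(df['longitude']):.0%}")   # 1 of 5 missing
print(f"longitude validity: {validity(df['longitude'], -180, 180):.0%}")
```

Two numbers per column, and you already know more about your data than a gut feeling of "95% good".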

... 1/10th of a percent is more than enough


Going back to the talk I presented at the DataSciCon.tech conference, I demonstrated a data science pipeline predicting house prices from at most 9 features, using a well understood and easily explained Linear Regression model (with some feature engineering). The data measured 0% sparsity (in other words, 100% complete) and 100% validity. Features and coefficients were noted, and they explained fairly well what we would expect for real estate: location matters (the further away from peak prices / downtown, the lower the expected price), and waterfront properties command drastically higher prices.
lat_from_center_eng_   -1268195.034
long_from_center_eng_   -218764.288
bathrooms                 30786.124
condition                 52972.204
view                      69006.872
grade                    113730.003
renovated                119986.357
sqft_living_log_eng_     199563.355
waterfront               600740.084
The same pipeline was then run with 1% invalid data in the longitude, before any feature was engineered or selected. The results are quite different. While latitude retains its position, new features are selected by the same process, and longitude no longer makes the top 9. The coefficients also differ, so the model outputs results quite different from the original model.
lat_from_center_eng_   -1301388.179
floors                   -42421.114
bathrooms                 42414.199
condition                 50044.539
view                      69908.741
grade                    118055.693
renovated                126089.640
sqft_living_log_eng_     178092.278
waterfront               601941.993


The same holds true at 1/10th of one percent, and at 1/100th of one percent! Worse, in all these cases, if the model built on the original features received only the 9 features identified from the new data with a few invalid longitudes, it would error out, since the features it expects are no longer all present.
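The talk's actual pipeline is not reproduced here, but the mechanism can be sketched on synthetic data (the price formula, ranges and 0.0 sentinel below are my assumptions for illustration): a handful of out-of-range longitudes inflates the engineered feature's variance, which collapses the correlation a typical selection step relies on.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic stand-in for house-price data: price drops as the property
# gets further from a "peak price" longitude around -122.0.
longitude = rng.uniform(-122.5, -121.5, n)
sqft = rng.uniform(500, 4000, n)
dist = np.abs(longitude + 122.0)
price = 200 * sqft - 600_000 * dist + rng.normal(0, 20_000, n)

def selection_score(feature):
    """Absolute Pearson correlation with price, a common selection score."""
    return abs(np.corrcoef(feature, price)[0, 1])

clean_score = selection_score(np.abs(longitude + 122.0))

# Corrupt 1/10th of one percent of longitudes with an invalid value (0.0).
bad = longitude.copy()
bad[rng.choice(n, size=n // 1000, replace=False)] = 0.0
dirty_score = selection_score(np.abs(bad + 122.0))

print(f"selection score, clean longitudes:    {clean_score:.3f}")
print(f"selection score, 0.1% bad longitudes: {dirty_score:.3f}")
```

Five bad rows out of five thousand are enough to make a genuinely predictive feature look like noise, so an automated selection step silently drops it in favor of weaker features, exactly the reshuffling shown in the coefficient lists above.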

Measure and visualize, do not guess


The point of this exercise was not to say that 1/100th of 1 percent validity issues will always be a problem. In this case, longitude was a critical data element; in another scenario, the threshold might be 1%, 5% or 10%. No, the real point was to illustrate that data quality metrics (along with model metrics) and visualizations have to constantly support any data science pipeline, especially a production pipeline.

Don't feed data to your pipeline without measuring it, just like you don't fill the gas tank of your vehicle with any unidentified liquid you stumble upon.

Francois Dion
Chief Data Scientist
@f_dion
