Success in building a predictive model often hinges on your data.  A central element is how the data, and the resulting prediction, is framed around the business challenge – the outcome you’re interested in forecasting.  As we apply machine learning to empirically identify the drivers of outcomes in our historical data, the value of our findings depends on how well they apply to current data.  Is the past relevant to today? Is that data representative?

We’ve seen our customers apply several strategies to create robust training data sets (i.e., historical data) and thereby get the most from these models when applying them to real-world business challenges. Let’s explore a few of those tactics and the key questions to ask when approaching predictive analytics.

Think about your prediction problem today. What do you know?

Thinking about what you know today is a good proxy for what you need to capture in your historical dataset. If attributes existed historically but are no longer part of your data collection, they likely shouldn’t be used, because the model won’t have them available when it makes predictions today.

When do you know it? /  Is your data timestamped?

This is one of the more critical considerations in preparing data for a predictive problem, because it helps us avoid look-ahead bias: training on information that would not yet have been known when the prediction was made. Historically, we are likely to have observed data at different points, with different timestamps.

Framing a forecast around what was known at that time is key; doing so creates a true point-in-time dataset.  It reflects the data available to a decision maker at that exact point in time. As we train a model, a machine takes the place of this hypothetical decision maker and learns the set of attributes that led to positive outcomes.

Framing the problem this way gives us a faithful representation of the data as it would have appeared had we been placed in that historical period.  As such, it helps mimic our current environment and makes our historical learnings more robust and applicable to today.

As an example, suppose we’re building a customer retention model, and we know a customer’s attributes as of 12/31/2014 and that they canceled on 6/30/2015. We can’t use data (fees, payments, service calls, offers, etc.) that occurred from 1/1/2015 – 6/30/2015, because we wouldn’t have known those data points at the point in time we were interested in making a decision or forecast (12/31/2014).
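
To make the cutoff concrete, here is a minimal Python sketch of a point-in-time filter. The event log, its field layout, and the `known_as_of` helper are hypothetical, invented for illustration; only the 12/31/2014 decision date comes from the example above.

```python
from datetime import date

# Hypothetical event log for one customer: (event_date, event_type, amount).
events = [
    (date(2014, 10, 15), "payment", 50.0),
    (date(2014, 12, 20), "service_call", 0.0),
    (date(2015, 2, 3), "fee", 15.0),
    (date(2015, 5, 28), "service_call", 0.0),
]


def known_as_of(events, cutoff):
    """Keep only events dated on or before the cutoff (the decision date)."""
    return [event for event in events if event[0] <= cutoff]


# Decision date from the retention example: 12/31/2014.
cutoff = date(2014, 12, 31)
feature_events = known_as_of(events, cutoff)

# Only the two 2014 events remain; the 2015 fee and service call are
# excluded because they happened after the forecast date.
for event_date, event_type, amount in feature_events:
    print(event_date, event_type, amount)
```

The same filter would be applied per customer, each with their own cutoff, before any features are computed.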

This extends to attributes that aggregate data over time, such as averages (which should only aggregate up to that point in time), as well as class variables. For the latter, whether or not the customer is receiving our monthly newsletter might be an attribute that helps predict retention.  But unless we know when they started to receive it, using it is likely to yield inaccurate results due to look-ahead bias. If they started to receive it on 3/31/2015, and we didn’t timestamp that event, our use of this variable would assume that they were receiving our newsletter on 12/31/2014.  Because they canceled, a naïve algorithm might infer a spurious relationship between newsletter subscription and customer turnover.
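
A rough sketch of both cases, again in Python. The payment history and the helper names `average_as_of` and `subscribed_as_of` are assumptions made up for this illustration; the dates mirror the newsletter example above.

```python
from datetime import date


def average_as_of(payments, cutoff):
    """Average payment amount using only payments dated on or before cutoff."""
    amounts = [amount for (paid_on, amount) in payments if paid_on <= cutoff]
    return sum(amounts) / len(amounts) if amounts else None


def subscribed_as_of(start_date, cutoff):
    """Was the customer subscribed at the cutoff?

    Returns None when the start date was never recorded, so the flag is
    treated as unknown rather than silently assumed to be True.
    """
    if start_date is None:
        return None
    return start_date <= cutoff


cutoff = date(2014, 12, 31)

# Hypothetical payment history; the 2015 payment must not leak into the average.
payments = [
    (date(2014, 11, 1), 40.0),
    (date(2014, 12, 1), 60.0),
    (date(2015, 2, 1), 200.0),
]
avg_payment = average_as_of(payments, cutoff)

# Newsletter started on 3/31/2015, after the cutoff.
newsletter_flag = subscribed_as_of(date(2015, 3, 31), cutoff)
# Start date never timestamped.
unknown_flag = subscribed_as_of(None, cutoff)
```

Returning an explicit "unknown" for untimestamped attributes is one design choice among several; dropping the attribute entirely is another reasonable option.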

Rest assured, many companies applying predictive analytics are grappling with similar issues. In the near term, you may identify data points that would be useful for predictive modeling but that can’t be used without proper time context.

This may naturally prompt better data collection initiatives, which is a positive outcome: predictive models are a perpetual work in progress, and there’s no bigger input to quality than the data underlying a model.

Determine at what point you could take action from a forecast.

Do some inputs compile at month-end? Are different data points available at different times? Data is frequently known, available, and therefore actionable at various points in time.  Think through the optimal point to “strike” your data and take action today.  Knowing that, construct your historical training data in the same manner.  Waiting longer for key data points delays our ability to act; acting earlier without key data points can be just as perilous.
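
One way to reason about the strike point is to record when each input actually becomes available and check which inputs a given strike date would include. The input names and availability dates below are hypothetical placeholders, not drawn from any real system.

```python
from datetime import date

# Hypothetical inputs mapped to the date each one becomes available.
availability = {
    "december_billing_total": date(2015, 1, 5),   # compiled after month-end
    "service_calls_to_date": date(2014, 12, 31),  # available immediately
    "survey_score": date(2015, 1, 20),            # arrives weeks later
}


def usable_inputs(availability, strike_date):
    """Return the names of inputs available on or before the strike date."""
    return sorted(
        name for name, available_on in availability.items()
        if available_on <= strike_date
    )


early_strike = usable_inputs(availability, date(2014, 12, 31))
later_strike = usable_inputs(availability, date(2015, 1, 5))

print(early_strike)
print(later_strike)
```

Striking on 12/31 leaves only the service-call count; waiting until 1/5 adds the billing total at the cost of five days. Whichever date you choose, the historical training rows should be built with the same availability rule.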

Also think about the window of time your forecast covers. Can you take the same actions on an outcome forecast for the next week as on one forecast for the next year? If you can’t mobilize within a week, then a one-week prediction timeframe is of no use, since you can’t act on it.

Consider extending out the forecast horizon to be more attuned to the base churn rate.

Use the historical base rate as a guide. If you build an employee churn model with a forecast horizon of 1 day (i.e., who’s likely to resign tomorrow?) when your annual turnover rate (i.e., the base rate) is <10%, the forecast horizon and the natural tendency of churn are incompatible.  Turnover is so infrequently observed on a daily basis that identifying its drivers becomes difficult.  As a result, the model is unlikely to reach meaningful levels of accuracy.
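
A quick back-of-the-envelope calculation shows how sparse the outcome becomes at short horizons. This sketch assumes resignations are spread evenly through the year (a constant rate, which real turnover rarely follows exactly) and uses 10% as an illustrative annual figure; `horizon_base_rate` is a hypothetical helper.

```python
def horizon_base_rate(annual_rate, horizon_days, days_per_year=365):
    """Convert an annual base rate into the base rate for a shorter horizon.

    Assumes a constant rate through the year, so the probability of
    "surviving" the horizon is the annual survival probability raised to
    the fraction of the year covered.
    """
    return 1 - (1 - annual_rate) ** (horizon_days / days_per_year)


annual_turnover = 0.10  # illustrative upper bound from the example above

one_day = horizon_base_rate(annual_turnover, 1)
one_quarter = horizon_base_rate(annual_turnover, 90)

print(f"1-day base rate:  {one_day:.4%}")
print(f"90-day base rate: {one_quarter:.2%}")
```

Under these assumptions the 1-day base rate is roughly 0.03%, or about 3 resignations per 10,000 employee-days, while a 90-day horizon brings it to roughly 2.6%, which gives a model far more positive examples to learn from.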

Framing data around the outcome you’re hoping to model is a key component of building a predictive forecast. When constructed properly, it allows you to understand what drives those outcomes and therefore which areas to focus on to positively alter them.  The tactics and strategies outlined here will go a long way toward ensuring success in this endeavor.  If you are interested in learning more or if you have any questions, please feel free to contact us today at Big Squid.