You read the title and it piqued your interest enough to get this far. Bully for me. I got your attention.

Maybe “fails” is a bit strong. But I do believe, after performing program evaluations for executive education programs for a long time, that most evaluation designs leave a lot of opportunity on the table because they were never designed with analytics in mind.

Here’s what I mean.  If you’ve been to any kind of executive education – maybe a leadership program, required training on computer security, diversity training, or anything similar – you probably were asked to complete some sort of questionnaire or survey immediately after you finished. Call it an “end of program” evaluation.  The questions probably looked something like this:

  • Overall, how satisfied were you with your training?
  • Did your facilitators present the information in an effective way?
  • Was the content relevant to your situation?
  • Would you recommend this training to a friend or colleague?

This looks familiar, yes? Of course it does. These are bog-standard questions used on these evaluations. End of program evaluations, or EOPs, are often referred to as “smile sheets” and they only capture minimal data.

In the Kirkpatrick model – which many evaluators, myself included, use as a framework – there are four levels of evaluation.

Level 1: Reaction – The degree to which participants find the training favorable, engaging, and relevant to their jobs.

Level 2: Learning – The degree to which participants acquire the intended knowledge, skills, attitude, confidence, and commitment based on their participation in the training.

Level 3: Behavior – The degree to which participants apply what they learned during training when they are back on the job.

Level 4: Results – The degree to which targeted outcomes occur because of the training and the support and accountability package.

Far too many EOPs, or smile sheets, only get to that first level, largely because they take place immediately after the training ends. The participants haven’t had time to apply the things they learned, and the evaluation may not even have assessed what they learned. Getting to levels 2, 3, and 4 isn’t likely to happen on these evaluations, but that doesn’t mean you can’t still get valuable information about the first two levels from the end of program evaluation.

In addition to the levels of evaluation used in the model, there are also levels of analysis you can apply to get more from the data you capture. In my own model, the one I use when I design evaluations, they look something like this:

Level 1: Cursory Review – This is when an evaluator or other stakeholder just does a quick look at the data to confirm things look “good” overall. As one old colleague of mine used to say, “if I see more fives than fours, I don’t need to see anything else.”  I’m not going to lie. I died a little inside over that comment. This isn’t even data gathering, let alone analysis.

Level 2: Simple Descriptives – This is the most common sort of analysis done with evaluation data: the mean score on items or scales, maybe a distribution of scores, and some percentages of choices made on multiple choice items. You might also do some simple work to review and examine text responses, like a quick word cloud. Now you’re getting data and some information. You often see this in client reports or even in dashboards if those are in use. A side note here: if you’re an evaluator with these kinds of repetitive programs, dashboards really should be used.

Level 3: Inferential Analysis – This is a level of true analysis, where you’re no longer just looking at numbers; you’re looking at significance. This type of analysis allows you to do nifty stuff like compare cohorts or different training methods (e.g., virtual vs. face to face) and see whether they are truly different, and to what degree. You can’t use this all the time, but many evaluators don’t do it even when they can. This is when you truly shift from just data gathering to information gathering. In addition, you might start doing more in-depth text analysis, like topic coding. This is also the type of analysis that requires some advanced skills in statistics.

Level 4: Advanced Analytics – This is the gold standard and the true shift from gathering information from those EOPs to gaining knowledge. This might include predictive analytics of some kind and more robust text analytics, such as topic modeling combined with sentiment analysis. Using this type of analysis leads to greater learning from your evaluations…but it requires some deliberate design choices in your evaluations.
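To make the Level 3 idea a little more concrete, here is a minimal Python sketch of one way to compare two cohorts: a permutation test on the difference in mean ratings. The ratings, the function name permutation_test, and the choice of test are all assumptions for illustration, not data or methods from a real program. A t-test or another approach may suit your data better.

```python
import random


def permutation_test(group_a, group_b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Returns (observed difference, approximate p-value).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n_a = len(group_a)
    observed = sum(group_a) / n_a - sum(group_b) / len(group_b)
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_resamples):
        # Shuffle the labels: if delivery method made no difference,
        # any split of the pooled ratings is equally plausible.
        rng.shuffle(pooled)
        a, b = pooled[:n_a], pooled[n_a:]
        diff = sum(a) / len(a) - sum(b) / len(b)
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # The +1 correction keeps the estimated p-value above zero.
    return observed, (extreme + 1) / (n_resamples + 1)


# Hypothetical 1-5 ratings from two cohorts of the same program.
virtual = [3, 3, 4, 2, 3, 4, 3, 2, 3, 3]
face_to_face = [5, 4, 5, 5, 4, 5, 4, 5, 5, 4]

difference, p_value = permutation_test(virtual, face_to_face)
print(f"Difference in means: {difference:.2f}, p-value: {p_value:.4f}")
```

The difference in means speaks to "to what degree," while the p-value speaks to whether the gap is larger than shuffled labels would typically produce.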

How would you get to that last level of analytics? I’ll explain using an example I often employ when I teach my graduate students research methods.

Let’s say you own a restaurant, and as the owner, you naturally want to know how your customers experience your establishment. Is it a quick bite on a business trip? (For those of us misanthropic people who really don’t like to spend time around others, a “quick bite” is often all we’ll tolerate.) Or is it a family affair with 20 people? Each experience is different, but equally important to understand.

If you’ve ever filled out a comment card at a restaurant or maybe from an online order, you know the usual routine is something like this:

  1. On a scale of 1-10, how would you rate your overall satisfaction with this visit to our restaurant?
  2. Is there anything you’d like to tell us about your experience? (open text).

Often, that’s all you get. So, the owner can roll up that data, get an average score on the 1-10 scale, and read some comments. If they’re a little more sophisticated, they might generate a standard deviation and do some simple text analysis like the word cloud I mentioned earlier. Most often they don’t – and can’t – go beyond level 2 analysis because the evaluation design doesn’t allow level 3 or 4 analysis.

When you get high scores, you may think that is great. And it is, but you don’t know anything beyond that score. You don’t know why you got it.

A concept like “satisfaction” is complex. In our restaurant example the things that might impact satisfaction include:

  • Taste of the food
  • Portion size
  • Quality of the food
  • Price
  • Service
  • Atmosphere
  • Presentation
  • Wait times

Think of satisfaction as a concept you want to measure, and those items in the list are indicators of that concept. Combined, they help you understand or “explain” satisfaction. After looking at that list and the items that impact satisfaction, just knowing a single number with the label “satisfaction” doesn’t tell you much.

You could dig through the comments, but there are two issues with that approach.

  • First, it’s quite common for people to not leave a comment at all, and the comments you do get often come from customers who are either very unhappy or very happy, who are a small portion of the whole.
  • Second, it takes expertise to do quality text analytics, and unless it’s a large chain with a dedicated analyst, it’s unlikely that skill is readily available.

A more data-savvy owner might very well ask about all those items we listed, having diners rate each on a 1-5 scale. They can then get average scores for all aspects impacting the experience and better understand what’s going on in the restaurant. For example, what would this data tell you?

Item                   Score out of five
Taste of the food      4.2
Portion size           2.5
Quality of the food    4.3
Price                  2.1
Service                4.8
Atmosphere             3.0
Presentation           4.5
Wait times             4.0


If you gather from this that those diners like the taste, quality, and presentation of the food, but are not happy with the price and portion sizes, you’re on the right track. And certainly, this is miles beyond what you get with a simple “how satisfied are you” question. This is a step in the right direction and lands you solidly in the second level of analysis.
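If you keep scores like these in a spreadsheet or script, that reading of the table can be automated. The short Python sketch below uses the scores from the table above; the 3.0 cut-off for "needs attention" is purely my assumption for illustration, so pick whatever threshold makes sense for your scale.

```python
# Average item scores from the restaurant table (1-5 scale).
scores = {
    "Taste of the food": 4.2,
    "Portion size": 2.5,
    "Quality of the food": 4.3,
    "Price": 2.1,
    "Service": 4.8,
    "Atmosphere": 3.0,
    "Presentation": 4.5,
    "Wait times": 4.0,
}

ATTENTION_THRESHOLD = 3.0  # hypothetical cut-off, not from any standard

# Rank items from lowest to highest average score.
ranked = sorted(scores.items(), key=lambda pair: pair[1])

# Flag anything strictly below the cut-off.
needs_attention = [item for item, score in ranked if score < ATTENTION_THRESHOLD]

for item, score in ranked:
    flag = "  <-- look here" if item in needs_attention else ""
    print(f"{item:<22}{score:.1f}{flag}")
```

With this cut-off, price and portion size are flagged, which matches the reading above.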

You can do still more, and quite a bit more.

If you include as an “outcome” variable a single question about overall satisfaction, or a typical Net Promoter Score (NPS) question (a question about the likelihood of recommending the restaurant, on a scale of 0 to 10), you now have what you need to do some truly interesting analytics. With that additional step you can perform a relative importance analysis, or what is sometimes called a driver analysis. This is the type of analysis you would do for level 4.

The idea is that each of those items in the table is a driver or “predictor,” and what they drive or predict is the fluctuation in that overall satisfaction or recommendation item. This works because each person’s ratings on the items are related: if a person gives you high ratings on the drivers, they will likely give a high rating on the satisfaction question.

Fluctuations in the individual “drivers” help predict fluctuations in the “outcome.”  You can see which factors “drive” satisfaction, and more importantly, by how much. This is a much more effective method of understanding what factors drive satisfaction. With this information you know where to focus your energy and effort.
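For readers who want to see the mechanics, here is a minimal Python sketch of one common way to run a driver analysis: averaging each driver's contribution to R-squared over every order in which the drivers could be added to a regression (often called a Shapley or LMG decomposition). This is one technique among several, and I am not claiming it is the only or best one. The ratings, the three drivers, and the function names are made up for illustration.

```python
from itertools import permutations
from statistics import mean


def r_squared(y, columns):
    """R-squared of an ordinary least squares fit of y on the given columns."""
    if not columns:
        return 0.0
    n, k = len(y), len(columns)
    y_mean = mean(y)
    yc = [v - y_mean for v in y]  # centering stands in for the intercept
    xc = []
    for column in columns:
        m = mean(column)
        xc.append([v - m for v in column])
    # Augmented normal equations [X'X | X'y], solved by Gauss-Jordan elimination.
    a = [
        [sum(xc[i][t] * xc[j][t] for t in range(n)) for j in range(k)]
        + [sum(xc[i][t] * yc[t] for t in range(n))]
        for i in range(k)
    ]
    for p in range(k):
        best = max(range(p, k), key=lambda r: abs(a[r][p]))
        a[p], a[best] = a[best], a[p]
        for r in range(k):
            if r != p:
                factor = a[r][p] / a[p][p]
                a[r] = [x - factor * z for x, z in zip(a[r], a[p])]
    beta = [a[i][k] / a[i][i] for i in range(k)]
    fitted = [sum(beta[i] * xc[i][t] for i in range(k)) for t in range(n)]
    ss_res = sum((yc[t] - fitted[t]) ** 2 for t in range(n))
    ss_tot = sum(v * v for v in yc)
    return 1.0 - ss_res / ss_tot


def shapley_importance(y, predictors):
    """Average gain in R-squared from adding each driver, over all orderings."""
    names = list(predictors)
    totals = dict.fromkeys(names, 0.0)
    orderings = list(permutations(names))
    for ordering in orderings:
        used, previous = [], 0.0
        for name in ordering:
            used.append(name)
            current = r_squared(y, [predictors[d] for d in used])
            totals[name] += current - previous
            previous = current
    return {name: totals[name] / len(orderings) for name in names}


# Hypothetical ratings from ten diners (drivers on 1-5, outcome on 0-10).
drivers = {
    "Taste": [5, 4, 4, 3, 5, 2, 4, 3, 5, 4],
    "Price": [2, 3, 1, 2, 4, 1, 3, 2, 5, 3],
    "Service": [5, 5, 4, 4, 5, 3, 5, 4, 4, 5],
}
overall = [8, 8, 6, 5, 10, 3, 8, 5, 10, 8]

importance = shapley_importance(overall, drivers)
total_r2 = sum(importance.values())  # equals the R-squared of the full model
shares = {name: value / total_r2 for name, value in importance.items()}

print(f"R-squared: {total_r2:.2f}")
for name, share in shares.items():
    print(f"{name:<10}{share:.2f}")
```

Trying every ordering gets slow as the number of drivers grows, so with a longer list of items you would likely reach for a statistics package rather than a hand-rolled loop like this one.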

Here’s an example of what you might see.

Overall Satisfaction   R² = .75
Taste of the food      .21
Portion size           .07
Quality of the food    .18
Price                  .22
Service                .14
Atmosphere             .06
Presentation           .02
Wait times             .10


I recognize this may look intimidating. But it’s actually quite simple.

The first number – the R² – is the total percentage of variance in the overall satisfaction score – the “outcome” – that can be explained by all the other “predictor” variables. The remaining 25% is unexplained or caused by other factors we didn’t measure. The good news is that 75% is quite high, and you shouldn’t be concerned if you can’t account for 100% of the variability in the outcome. I’d be very surprised if you could.

The other numbers show how much of that 75% is explained by each driver. In this case, the main drivers of satisfaction are the taste of the food (21%) and the price (22%). Now the owner understands the two factors that are most important and where to focus energy and effort.
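The arithmetic behind that reading can be checked in a few lines of Python. The driver weights in the example add up to 1.00, so this sketch assumes each weight is a share of the explained 75%; multiplying a share by R² then gives that driver's slice of the total variance in satisfaction. That interpretation is my assumption about the example table, not a rule for every driver analysis output.

```python
r_squared = 0.75  # share of variance in overall satisfaction that is explained

# Driver weights from the example table.
shares = {
    "Taste of the food": 0.21,
    "Portion size": 0.07,
    "Quality of the food": 0.18,
    "Price": 0.22,
    "Service": 0.14,
    "Atmosphere": 0.06,
    "Presentation": 0.02,
    "Wait times": 0.10,
}

# The weights account for all of the explained variance.
assert abs(sum(shares.values()) - 1.0) < 1e-9

# Each driver's slice of the TOTAL variance (assumed: share times R-squared).
of_total_variance = {name: share * r_squared for name, share in shares.items()}

# The two strongest drivers, highest share first.
top_two = sorted(shares, key=shares.get, reverse=True)[:2]

for name in top_two:
    print(f"{name}: {shares[name]:.0%} of explained variance, "
          f"{of_total_variance[name]:.1%} of total variance")
```

Under that assumption, price accounts for about 16.5% of all the variation in satisfaction and taste for about 15.8%, with the unexplained 25% sitting outside any of the drivers.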

Congratulations, you’ve just learned a semester’s worth of statistics in about two minutes. Neat, huh?

This type of analysis can be easily applied to program evaluation, and at Talent Dimensions that’s exactly what we do. Our evaluations include various drivers, from learning objectives and facilitator ratings to eagerness to attend and overall perceived value and applicability. By doing this analysis using the NPS item as the outcome, we know with much more clarity what exactly drives success in our programs and where to focus our efforts. It allows us to review large amounts of evaluation data and determine in detail what aspects of training are most important to driving value.  This includes application, the relevance of the content and the ability of the facilitator to effectively explain and connect that content to existing work challenges.

We’ve compiled detailed data on all these factors by leveraging a standard set of evaluation questions.  At times we do ask additional or slightly different questions to conduct hypothesis testing on other factors that might impact successful training, but those situations are the exception because most of this analysis can be done with standard evaluation data. If I can do it, well, any evaluator should be able to do it, too.

If you are an evaluator, don’t just think about those Kirkpatrick levels when you design your evaluation. The levels are a great place to start, but then think about what you want to know, what analytics you can use, and build the evaluation accordingly. You will be amazed at what you can learn and use to better your training experiences and improve the impact on your participants.