Final Product: The Metrics Report Card.
Chapter 9 introduced the metrics I developed to answer the leadership's question for our organization. If only the service provider and the executive leader were to be viewers of the results, I could have moved directly to publishing the metric. But since the metrics would go through rigorous review throughout the management chain, and would also be seen by customers, I had to find a way to make the results usable (if not readable) at each level.
I worked with a team of consultants and the service providers to develop what became the Report Card. This means of interpreting the metrics allows them to be viewed at different levels, and at different degrees of aggregation, while still preserving the concept of expectations.
Now that the basic measures were developed, we (the team of consultants, the service providers, and me) needed a way to turn them into a Report Card. What we had was a set of charts. But we needed to find a way to report on the overall service health, while still preserving the individual measure categories of Delivery, Usage, and Customer Satisfaction. Being at an academic institution, it seemed logical to use the Report Card concept as a template.
In the case of a Report Card, students can take totally disparate courses-everything from forensic anthropology to fine art printmaking. The grades obtained can be based on totally different evaluation criteria, but the grades are still understandable. An A means the student is exceeding expectations, doing very well; a B means doing well; a C means the student is surviving; below a C means the student is failing to meet expectations. Quizzes, tests, research papers, and classroom participation can be used to make up the grade. Other less conventional evaluation methods can be used as well-like reviews of art produced, presentations to committees, and panel reviews of materials produced in the course of the class.
In all cases, the student gets a letter grade that can be transferred to a number for grade point averages. We wanted something similar that a high-level stakeholder could glance at and grasp immediately.
We settled on E, M, and O. We had to figure out how to evaluate each measure and piece of information as either Exceeding Expectations, Meeting Expectations, or an Opportunity for Improvement. Availability was already expressed as a percentage (of abandoned calls, or of calls abandoned in less than 30 seconds), so we could smoothly transition it to a common measurement view.
Delivery.
Remember, for this discussion I've broken delivery into "availability," "speed," and "accuracy."
Availability.
For each measure, expectations have to be identified. Table 10-1 shows the expectations for Availability.
This made the results simple to analyze. Placing the measures into the grid would give us a "grade" of O, M, or E. This could then be used to develop a grade point average. We could roll up the grades at the Availability level. We could also roll up the grade at the Delivery level (Availability, Speed, and Accuracy together). And hopefully we would be able to roll up the grade to an overall Service grade.
The first decision we had to make was how to roll up two different grades. We opted to give each grade (E, M, or O) a value. It was important to us to have our results be beyond reproach. Since we knew errors might seep in from many different directions, we had to ensure our intentions were never in question. Trying to make what could be a complex problem into something manageable, if not simple, we worked off a 10-point value scale. An E was worth 10 points; an M was worth 5 points; and an O was worth zero. We then averaged the numerical values. So with two values to combine, as in the case of Availability, we would get the following (E = 10, M = 5, O = 0):
An E and an M averaged 7.5 points. An E and an O averaged 5 points. An M and an O averaged 2.5 points.
Now we took the calculated grade and turned it back into an evaluation against expectations. A grade of 8 or greater would be an E. A grade of at least 5 but less than 8 would be an M. A grade below 5 would be an O. So, another way to look at it is as follows: an E and an M averaged 7.5 points, which was an M; an E and an O averaged 5 points, which was an M; an M and an O averaged 2.5 points, which was an O.
We liked the way this worked. You had to exceed in a measure to balance out an Opportunity for Improvement. We felt this was in the spirit of "erring on the side of excellence," as my friend Don would say.
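The value scale and translation just described can be captured in a few lines of code. Here is a minimal Python sketch; the function and variable names are mine, purely for illustration:

```python
# The 10-point value scale and the translation back to a letter grade,
# as described above. Names are illustrative.

GRADE_VALUES = {"E": 10, "M": 5, "O": 0}

def to_letter(points):
    """Translate an averaged point value back into a letter grade."""
    if points >= 8:
        return "E"
    if points >= 5:
        return "M"
    return "O"

def roll_up(grades):
    """Average the point values of several grades, then re-grade the result."""
    average = sum(GRADE_VALUES[g] for g in grades) / len(grades)
    return to_letter(average)
```

So `roll_up(["E", "M"])` gives an M (7.5 points), `roll_up(["E", "O"])` an M (5 points), and `roll_up(["M", "O"])` an O (2.5 points), matching the worked examples above.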
Err on the side of excellence.
A possible drawback was that it made the Opportunity for Improvement a "bad" thing and the Exceeding Expectations a "good" thing. As I covered earlier, they are both anomalies. What we want is Meets Expectations. Another drawback was that the combined grades could hide anomalies. If you had an equal number of Es and Os, they would roll up to Ms and make it look like you were doing just what was wanted-meeting expectations.
The positives were that we could show an overall grade, giving a "feel" for the health of the item. Another positive was that we could deal with the "hidden" grades by simply flagging any Ms or Es that had Os buried in the data. That would allow the metric customer to know where to dive deeper to find the Os and see what was happening in those cases.
The bad-vs.-good thing we could not overcome as easily. In the end we decided to deal with it on a case-by-case basis, ensuring that we stressed that both were anomalies. We accepted this because, no matter how much we stressed the negatives of Exceeding Expectations, in the end, the fact that expectations were exceeded wasn't in itself a bad thing. It was only "bad" based on how you achieved the grade-for example, if you neglected other important work or services, or applied too many resources to attain it. But if you achieved this grade through a process improvement, or simply by focusing properly on a different area, it was not only a good thing; we could even change the customers' expectations, because we would be able to deliver at this higher level consistently. It could become a marketing point for our services over our competitors'. Opportunity for Improvement, the other anomaly, couldn't be viewed the same way. If you failed to meet expectations, most times the fact that you weren't meeting the customers' expectations made it a bad thing. Even if you found that it was due to natural disasters or things outside your control, the customer still saw it as a negative. So while I wanted both anomalies to elicit the same response-further investigation-the purpose of that investigation was clearly different.
One investigation was essentially conducted to see if the occurrence could be avoided in the future, while the other was to see if it could be replicated.
This led to the decision that we would roll up the values using the translation, and if there were Os below the level we showed, we marked the grade with an icon. (Note: As we continually worked to improve our tools and processes, we adapted the icon to appear when there was any anomaly-an E or an O-hiding in the grades!) Figure 10-1 is the Translation Grid we used to convert a numeric grade to a letter grade and back again. We originally colored the values Green for Exceeds, Blue for Meets, and Red for an Opportunity. This made it very hard to convince anyone that an Opportunity for Improvement (red) was just as much an anomaly requiring investigation as an Exceeds Expectations (green).
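The roll-up-with-icon decision amounts to one extra check on top of the averaging: translate, average, re-grade, and flag the result if an O is buried below. A Python sketch, with illustrative names:

```python
# Roll up grades as before, but flag the result when an O hides in the
# underlying grades (the later refinement flagged any non-M anomaly).
# Names are illustrative.

GRADE_VALUES = {"E": 10, "M": 5, "O": 0}

def roll_up_with_flag(grades):
    """Return (letter, flagged): the averaged grade plus an anomaly marker."""
    average = sum(GRADE_VALUES[g] for g in grades) / len(grades)
    if average >= 8:
        letter = "E"
    elif average >= 5:
        letter = "M"
    else:
        letter = "O"
    flagged = "O" in grades  # mark the grade with an icon if an O hides below
    return letter, flagged
```

So `roll_up_with_flag(["E", "O"])` yields `("M", True)`: a grade that meets expectations overall, but one the metric customer knows to dive into.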
Figure 10-1. Translation Grid
So looking at the Availability charts, we added the expectations so that we could visually tell where we were in terms of the health of availability. This visual depiction happens at the measurement level, before we aggregate the grades with other measures of availability (to create a final grade for the category) and before we look to roll up grades into Delivery. Figures 10-2 and 10-3 show the abandoned call rate and calls abandoned in less than 30 seconds, with expectations.
Figure 10-2. Abandoned call rate with expectations.
Figure 10-3. Percentage of calls abandoned in 30 seconds or less.
Speed.
Speed wasn"t as simple as Availability. We could measure how many cases were responded to (or resolved) faster than expected, within expectations, and slower than expected. The problem was determining what that meant. What was good? We could have said any case that fell out of the Meets Expectations range (above or below) was an anomaly and should be investigated. That sounds logical, but since there were thousands of cases, it was not practical. And when I interviewed the department, it was clear that anomalies would happen from time to time. There were times when the a.n.a.lysts would take longer to respond than expected. And other times they would pick up on the first ring. This was a natural byproduct of the nature of the work and the environmental factors that influenced performance as well as workload.
So for these cases, we decided to determine what was expected by the customer at a second level. We asked the following:
What percentage of cases does the customer feel should exceed expectations?
What percentage of cases does the customer feel should meet expectations?
What percentage of cases does the customer feel is acceptable to fail to meet expectations?
So we looked to define the expectations in the form of length of time to respond and the time to resolve.
Time to Respond
Exceed: Responds in less than 5 seconds
Meets: Responds in 6 to 30 seconds
Opportunity for Improvement: Responds in greater than 30 seconds
Time to Resolve
Exceed: Resolved in one hour or less
Meets: Resolved in 24 hours or less
Opportunity for Improvement: Resolved in five days or less
For each we needed to determine the customers' expectations: what percentage of the cases would the customer expect to fall into each of the categories listed, as shown in Table 10-2.
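To make the second level of expectations concrete: each case is first graded against the time thresholds, and then the percentage of cases landing in each category is what gets compared to Table 10-2. A Python sketch with illustrative names (note the thresholds above leave the 5-to-6-second boundary unstated, so here anything from 5 through 30 seconds counts as Meets-my assumption):

```python
# Grade each case by its time to respond, then compute what percentage
# of cases fell into each category (for comparison against Table 10-2).
# Names are illustrative; the 5-6 second boundary is an assumption.

def respond_category(seconds):
    """Grade a single case against the Time to Respond thresholds."""
    if seconds < 5:
        return "Exceeds"
    if seconds <= 30:
        return "Meets"
    return "Opportunity"

def category_percentages(response_times):
    """Percentage of cases landing in each category."""
    counts = {"Exceeds": 0, "Meets": 0, "Opportunity": 0}
    for t in response_times:
        counts[respond_category(t)] += 1
    total = len(response_times)
    return {cat: 100.0 * n / total for cat, n in counts.items()}
```

The same two-step shape works for Time to Resolve; only the thresholds change.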
Figure 10-4 shows percentage of cases resolved in less than one hour. While this is a good measure, all three are necessary to get the full picture. Looking at only this measure would give a skewed view of how healthy the service was (in terms of speed).
Figure 10-4. Percentage of calls resolved in less than one hour
This second level of expectations allowed us to use percentages, and allowed us to look at anomalies only when they added up to a significant (as defined by the expectations) number of cases. It's worthwhile to note that the third measure for Speed, Time to Respond, moves in the opposite direction of the other measures. This will also be the case with rework, where less is better.
Accuracy.
Rework turned out to be the best measure of accuracy for the service desk. Figure 10-5 shows Rework in the form of percentage of cases.
Figure 10-5. Percentage of Rework
It may be worth noting that the picture or impression the viewer of your metric gets can be affected by the way you present it. Let's look quickly at a couple of different representations of the exact same data for Rework. Figures 10-6 and 10-7 have the same data as 10-5, but I've changed the coloring on the first and the scale on the second.
Figure 10-6. Percentage of Rework, background colors reversed
Figure 10-7. Percentage of Rework, with the scale increased from a 0 to 6% range to a 0 to 25% range
You can imagine the many permutations possible that can affect the viewer's perception of the data. One may make you feel the data is "bad," another that the data is "all right," and another may make you see it as "good."
The interpretations of a metric based on how it is presented must be limited. This is done through consistent and thorough communication about what you are looking for-anomalies. You cannot get caught up in how things look: not in how the charts look, nor in how they make you or your unit look.
It is critical that you understand expectations and how to evaluate the charts presented. You're looking for trends and anomalies, not a feeling of "goodness" based on the colors or values. This is the reason I have attempted to give these charts only neutral colors and even leave off any obvious demarcations of the values that constitute an Opportunity for Improvement vs. Exceeding Expectations. Table 10-3 shows the expectations for percentage of rework.
We again were able to use percentages-which provided a consistent view of the different measures. While we used various measures (triangulation), we simplified it all by using a consistent form, a consistent view (the customers'), and a consistent set of "grades" for each. We were able to keep to this established set of norms with Usage.
Usage.
Usage was defined by the number of unique customers (Table 10-4). The number (data) by itself was meaningless. But simply putting it into context in the form of a measure, percentage of the customer base, made it more useful. When we started measuring it, we looked at a year at a time. We showed the number of unique customers per month (a running total) so we could distinguish slow months from more active ones. When we showed the measures over time, we started with a blank slate each year. This always put us below expectations at first, followed by a steady increase over the year until, at the end of the year, we were well within expectations (see Figure 10-8).
Since this always gave the impression (and grade) of an Opportunity for Improvement until the second quarter, we took another look at the presentation of the information. Not because it "looked bad," but because it was telling the wrong story. Our usage wasn't lower at the beginning of the year-we were incorrectly starting with a clean slate each year.
Figure 10-8 shows how the usage, when viewed over a year"s time, gives the impression that there is an anomaly for the first three months.
Figure 10-8. Usage: first-time callers, cumulative over time
So, a better representation was to show the measures over a running period of time. We could show them over a full year (since we had enough historical data) or over a smaller span of time. Options included a running three-month or six-month total as well as the twelve-month total. Whichever we chose, we'd have to determine the expectations-what percentage of unique customers do we expect to use our services in a year, half a year, or a quarter?
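A running total of unique callers can be computed from the same monthly data without ever resetting at year end. A sketch, assuming the data is organized as a month-to-callers mapping (my layout, for illustration only):

```python
# Count distinct callers over a trailing window of months instead of
# resetting each January. The month-to-callers data layout is illustrative.

def rolling_unique(callers_by_month, window=3):
    """For each month, count distinct callers across the trailing window."""
    months = sorted(callers_by_month)  # ISO-style keys sort chronologically
    totals = {}
    for i, month in enumerate(months):
        recent = months[max(0, i - window + 1): i + 1]
        unique = set()
        for m in recent:
            unique |= callers_by_month[m]  # union of callers in the window
        totals[month] = len(unique)
    return totals
```

Changing `window` from 3 to 6 or 12 gives the other running totals discussed above; dividing each total by the customer base turns the count into the percentage measure.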
Another factor to consider was the expected frequency of use for the service. For services that were likely to be used only once a year (like a car tune-up), the service provider would measure first-time callers over an annual period.
If the service were a semi-annual one, say more like an oil change, it would make sense to measure it at that interval. The point is, it would depend on how often you would expect customers to come back to use your service or buy your product. Even a restaurant would work in this manner. An upscale restaurant may expect to see repeat customers on a monthly or quarterly basis, while a fast-food restaurant may expect a higher frequency.
The measure of unique customers can easily be combined with repeat customers. Most businesses rely on repeat customers for the majority of their income. Repeat-customer rates speak directly to the relationship a business has with its customers. To grow, much less survive, the business must satisfy its customer base so that it earns their trust. If customers like your services or products, they will eventually come to your business again.
In the case where you are selling only one product or service, and the need for repeat purchases is rare, the satisfied customer is still your best salesperson. Rather than measure repeat customers, you may want to measure referrals. Remember the story of my laptop purchase (Chapter 7)? A week later, because of the selection, price, but mostly the customer service, I brought my daughter to the same store to buy her laptop. Normally I wouldn't buy another computer for three or four years.
Of course, this store sold more than computers, and they should measure whether I return to buy other technology from them. A purer example would be my first book. I'm proud of Why Organizations Struggle So Hard to Improve So Little. It is a good read. But how many sales should I expect to the same customer? In this case you might think total sales are the only measure I need. But I could also learn from looking at usage measures. Imagine if I could get the number of books ordered by one person or one organization, or the number of referrals-sales in which the buyer was influenced to buy my book by the encouragement of another reader. Another measure could be the number of reviews and the ratings that accompany those reviews. Of course, if I sell another book (like this one), I would want to measure repeat customers if I could. How many people buy both books? If readers liked one, hopefully they liked it enough to read the other, expecting a certain level of quality and information.
The reason Fred Reichheld's predictors of promoters and detractors have merit is that word-of-mouth advertising-the kind you can't buy-is critical to a business's growth. New customers are nice to have, but repeat customers become your foundation for continued success and future growth.
In the case of our Service Desk, we expected customers to run into information technology issues on a quarterly basis. If they were calling only once or twice a year, it might indicate that they were using a different source for solving their IT problems. This might include just trying to solve their issues on their own. If they were calling weekly, it might mean that the organization's product line and service catalog had too many defects or faults-requiring frequent assistance.
The expectations can be the same, but since we were originally looking at the expectations for a full year, we logically shouldn't expect as high a percentage of first-time callers (unique users) for a three-month period. Since the department felt that the usage was healthy for the period reported, we chose to review the data first. If you aren't sure, a simple tool is letting the measures tell you what the expectations should be.
Figure 10-9 gives us the picture without expectations so that we can use the data to determine what is "normal."
Figure 10-9. First-time callers: three-month running totals (without expectations)
Based on the picture presented by the measures on a running three-month total, the norm looked to be between 5 percent and 15 percent. When I spoke with the team, they felt that 5 percent was too low. Even though this would create a picture that showed them having an Opportunity for Improvement more often, this felt "right" to them. They also felt that Exceeding Expectations should be set at 20 percent, making the expectations range from 10 percent to 20 percent. Because of the measures, I pushed them to explain why they chose 20 percent. The answer was that "15 percent was just a little too low." So, I pressed some more. "Then why 20 percent?" And as you may have guessed, the answer was, "it's the next value."
So I set the range at 10 percent to 17.5 percent. Don't let conventions keep you from setting the correct expectations.
Don"t let conventions keep you from doing the right thing.
Figure 10-10 shows the three-month running total expectations for first-time callers.
Figure 10-10. First-time callers: three-month running totals (with expectations)
Table 10-5 captures the expectations in table format.
Once we finished determining the proper measures for unique customers, and their expectations, we looked at our other measure for usage. Another choice was the percentage of respondents who chose the Service Desk as their primary service provider in our annual survey. Table 10-6 shows the expectations we identified for the survey results.
Figure 10-11 shows the percentage of survey respondents who listed the Service Desk as their first choice when seeking assistance with their Information Technology problems.
Figure 10-11. Percentage of customers who preferred the Service Desk for assistance
Although the results for 2008 were not an anomaly, they were so different from the following year's that we had to investigate why. The only cause we could find was that in 2008 respondents were asked to pick their top three preferences. The Service Desk was listed as one of the top three choices by 59 percent of the respondents. The next two years the question allowed only one answer-the top choice/preference. You've seen how charting can change the perception of the measures. This is another strong example of how measures can be totally different based on how the data is collected.
As mentioned, we could also add a measure on repeat customers. If we wanted to stick with percentages, we could produce a measure of the percent of repeat customers compared to the total customers for the given period. Again this could be on a three-, six-, or twelve-month cycle. The really good news is that this measure could be derived from the same data set that gave us the count of unique customers. Where unique customers were compared to total possible customers (customer base), repeat customers would be compared to the total customers for a given period.
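Both measures fall out of a single pass over the period's caller records: unique customers are compared to the customer base, repeat customers to the period's total customers. A sketch with illustrative names and data layout:

```python
from collections import Counter

# Derive both usage measures from one set of caller records for a period.
# Field names are illustrative.

def usage_measures(caller_ids, customer_base_size):
    """caller_ids holds one entry per case; customers may appear repeatedly."""
    counts = Counter(caller_ids)
    unique = len(counts)                                # distinct customers
    repeats = sum(1 for n in counts.values() if n > 1)  # called more than once
    return {
        "unique_pct_of_base": 100.0 * unique / customer_base_size,
        "repeat_pct_of_period": 100.0 * repeats / unique,
    }
```

The same call works for a three-, six-, or twelve-month cycle; only the records passed in change.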
The final category of measure we used was what most people think of as the first to collect-customer satisfaction surveys.
Customer Satisfaction.
Again we were lucky, since the data was already being collected, compiled, and analyzed for us. As mentioned, we had to decide which view of the data worked best. Basically, we had to determine the measure to build from the data.
The department had been reporting this measure for a long time, but only as an average on the 5-point Likert scale. Our leadership wondered what to make of the average grade-how to interpret it. Being problem solvers, they didn't leave the solution to the department-instead they asked for benchmarks to compare the average to. They reasoned that if they were better than the "national average" or the average of their competitors, they were doing well. In reality this would only tell them that they were doing better than the average. They could claim to be in the top half. If we were really lucky, we'd find the following:
Our peers were using the same questions that we were. (The third-party provider had an impressive list of customers, but it did not have all of our peers or a monopoly on the service.)
Our peers were using the same 5-point scale that we were.
Our peers were willing to share their average grades.
Our peers determined the average the same way we did-they might discount repeat customers within a short time frame, categorize their customers differently, or exclude certain customers, like internal users of the Service Desk. Another difference could be if they surveyed only a sampling while we surveyed all users.
This magical alignment of stars was unlikely to happen, much less stay in alignment for an extended period of time. So what we'd have instead as a benchmark would be a sampling of some of our peers. This was my first argument against using benchmarks to try to make the measures meaningful.
My second argument was that even when compared to a valid benchmark, it would only show how well we did versus the standard selected-not how well we were satisfying our customers. We could feasibly be at the "top of our class" and still be well below our customers' expectations. If you bring home a B- average on your Report Card, chances are your parents will not be happy. If you tell them that you are the top student in your class, they may be a little more impressed, or they may decide to change schools. A Report Card should tell you how well you are doing regardless of your position in your graduating class. Once you know how well you are doing (or have done), it is a bonus to know how you rank against your peers.
My third and last argument was based on principle: measures should be meaningful to the organization before you go looking for benchmarks. Benchmarks should only be used as an enhancement to the information-not act as the definer of it. The measures had to be meaningful on their own.
Figure 10-12 shows the average grade. Even though I argued against this as being less meaningful than other views, when you add the expectations, even this measure becomes useful.
Figure 10-12. Average Customer Satisfaction grade
I still wanted a better representation of the measures. We tried using promoters and detractors, but since the vendor wasn't going to change to a 10-point scale, we had to translate the 5-point scale into Reichheld's method for determining where a customer fell on the range of support. We ended up making 5s promoters, and 1s, 2s, and 3s detractors. This was an attempt to match Reichheld's 1s through 6s (detractors), 7s and 8s (neutral), and 9s and 10s (promoters). While this was not perfect (or optimal), I believe it was valid, and if anything we again were "erring on the side of excellence." But showing this ratio (highly satisfied vs. not satisfied) proved problematic. While it was more meaningful than the average rating, it was still difficult for management to interpret.
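The translation we settled on is easy to state in code: 5s count as promoters, 1s through 3s as detractors, and 4s are neutral and left out of the ratio. A sketch with illustrative names:

```python
# Translate 5-point survey ratings into promoters and detractors:
# 5s promote, 1s-3s detract, 4s are neutral and excluded.
# Names are illustrative.

def promoter_detractor_ratio(ratings):
    """Return (promoters, detractors, ratio) for a list of 1-5 ratings."""
    promoters = sum(1 for r in ratings if r == 5)
    detractors = sum(1 for r in ratings if 1 <= r <= 3)
    ratio = promoters / detractors if detractors else None
    return promoters, detractors, ratio
```

Twelve 5s against one 2 gives a twelve-to-one ratio of promoters to detractors.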
The conversation would go something like the following: "So, for every 1, 2, or 3 we received, we had twelve 5s?"
"Yes, your ratio of promoters to detractors was twelve to one."
"What about the 4s? Why aren"t we counting 4s?"
"Because 4s are being considered neutral. We can"t tell how they"ll "talk" about our service. They may say it was good or they may not."
"I thought 3s were neutral?"
"Threes are in the middle, neither satisfied nor dissatisfied, and we believe that if someone can"t say they were satisfied (4 or 5) then they will definitely talk badly about our service-they will detract from our reputation."
"Well if we leave out 4s, we"re missing data...so it"s not a complete picture."