Collecting measurements is a core activity for all data-driven enterprises, engineering endeavors, and scientific inquiries. Measurement objectives are fundamental for effective management and engineering. Meeting these objectives means understanding measurement capabilities and limitations, recognizing that all collected data is shaped both by what is being measured and by how the measurement is done, and mastering the properties that define a “good” measurement system while actively avoiding “bad” practices.
Collecting measurements is challenging. Flawed measurement, born from well-intentioned but poorly designed initiatives, can create a form of organizational “debt”. This debt manifests as wasted resources, demotivated teams, and poor strategic decisions based on data that is misleading, inaccurate, or simply irrelevant.
Identifying bad measurement practices, and the statistical fallacies that underpin them, is not incidental. Many organizations are suffering not from a lack of measurement but from its misapplication. The challenge is not simply to begin measuring, but to re-ground the organization’s measurement system in a repeatable and reliable approach. This paper serves both as a diagnostic guide to the pathologies of failed measurement systems and as a source of strategies and tactics for effective, reliable, repeatable, and trustworthy measurement.
Measurement does not directly “satisfy a specific customer” or “create a specific product or service”. We do not measure for the sake of measuring; we measure to gain understanding and insight that informs action.
The value obtained from measuring is indispensable but indirect. A measurement system provides the only possible means for moving beyond subjective “gut feel” or group consensus, instead providing a consistent, repeatable, and objective tool for understanding the work being done. This shared, objective model of reality is the prerequisite for all rational action, comparison, and improvement.
A measurement system provides three benefits:
A famous aphorism, variously attributed to Peter Drucker, W. Edwards Deming, and Lord Kelvin, states: “What gets measured, gets done”. This statement captures the balancing act that management must navigate: without measurement, little will be accomplished, but measuring the wrong thing, or measuring it incorrectly, leads to unintended results.
Measurement is a powerful tool for focusing organizational energy, signaling strategic priorities, and aligning the behavior of individuals with the goals of the enterprise. By establishing a metric, leadership communicates what it values and the organization works to optimize for that metric.
This same principle is also measurement’s single greatest risk. The power to direct behavior is also the power to unintentionally create perverse incentives. People will always act in a way to make measurement data favorable, whether that action is achieving the desired result or manipulating the system in order to appear to be achieving the result. The power to motivate and the power to create unintended, counter-productive behaviors are not separate.
Our first step is understanding the building blocks of measurement. This section establishes the technical foundations that separate a good measurement system from naive counting.
Agile software development practitioners advocate for “napkin metrics” where “rough numbers are good enough”. This Agile philosophy is not an argument for sloppiness, but one for cost-benefit analysis in measurement. A “rough” number is the desired result as long as its implicit “error bars” are small enough that any resulting decision would be identical to the one made with a higher-precision, higher-cost number. This is excellent advice for how we define a good measurement: good enough to support fact-based decisions, but no better than that.
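The “good enough” test can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the threshold-style decision rule, the function names, and the numbers are all hypothetical. It treats a rough number as adequate when both ends of its error bar lead to the same decision.

```python
def decision(estimate: float, threshold: float) -> str:
    """Hypothetical decision rule: act when the estimate exceeds the threshold."""
    return "act" if estimate > threshold else "wait"


def rough_is_good_enough(estimate: float, error: float, threshold: float) -> bool:
    """True when both ends of the error bar (estimate +/- error) give the same decision.

    Checking only the two endpoints is sufficient because the decision rule
    above changes at a single threshold.
    """
    low, high = estimate - error, estimate + error
    return decision(low, threshold) == decision(high, threshold)


# A rough estimate of 120 +/- 20 against a threshold of 80: the decision cannot change.
print(rough_is_good_enough(120, 20, 80))  # True
# A rough estimate of 90 +/- 20 straddles the threshold: more precision is worth paying for.
print(rough_is_good_enough(90, 20, 80))   # False
```

When the check returns False, the rough number is not wrong; it is simply not precise enough for this particular decision.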
Expanding on “good enough”, “cost-benefit”, and “fact-based”, a practical, working definition of a measure is: “A method agreed upon by the parties using it for assigning a value to an artifact or activity in order to compare the artifact or activity to a standard or to other artifacts or activities”.
This definition can be divided into its key components:
Before collecting any value, we need to understand that all measurement data falls into one of four scales. These scales establish the properties of the data and what can and cannot be done with it.
Table 1: The Four Scales of Measurement

| Scale | Definition | Example | Basis for Comparison |
|---|---|---|---|
| Nominal | The data consists of labels, such as names. No indisputable ordering of the data exists. | People’s names. Properties of software object functions (security, queue management). | Equivalence (one name is the same as another) |
| Ordinal | The data consists of values that can be placed in an order, but differences between the values are not defined. | Names of days of the week. Severity of software defects (critical, high, medium, low). 1-5 Star User Rating. | Equivalence, Order |
| Interval | The data consists of values that can be placed in order and among which differences can be calculated. However, there is no indisputable origin (0 point) and ratios are not defined. | IP addresses (192.168.1.15, 192.168.1.73). Calendar date. | Equivalence, Order, Difference |
| Ratio | The data consists of values that can be placed in order and among which differences and ratios can be calculated. An origin (0 point) for the values is defined. | Response Time in Milliseconds. Lines of Code (LOC). Staff-Hours. Number of Defects in a product. | Equivalence, Order, Difference, Ratio (division) |

Every metric collected using a measurement system has an associated measurement scale. The scale restricts the set of valid mathematical and statistical operations that can be performed on the data. A failure to respect these restrictions, known as a scale violation, can be an expensive error in measurement. This error leads to analyses that are not just wrong, but misleading and statistically meaningless.
For example, “story points” and “defect severity” are classic ordinal scale measurements. They establish a rank (e.g., “critical” is worse than “high”; a 5-point story is “harder” than a 3-point story), but the distance between the ranks is undefined. It is statistically invalid to claim that a “critical” defect is twice as bad as a “high” one, or that two 3-point stories are equivalent to one 6-point story. Treating such ordinal data as if it were ratio data—by adding, subtracting, or, most commonly, averaging it—is a scale violation that produces a nonsensical result.
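A small Python sketch shows why averaging ordinal data is unsafe. The defect lists and both numeric codings below are hypothetical; the only thing the ordinal scale fixes is the rank order. Two codings that both respect that order give opposite answers about which team has the “worse” average, while the median is the same under either coding.

```python
from statistics import mean, median_low

# Rank order only; the spacing between ranks is undefined on an ordinal scale.
ORDER = ["low", "medium", "high", "critical"]


def median_severity(labels):
    """Median label, computed from rank positions only."""
    ranks = [ORDER.index(label) for label in labels]
    return ORDER[median_low(ranks)]


def mean_under_coding(labels, coding):
    """Arithmetic mean after mapping labels to numbers (the scale violation)."""
    return mean([coding[label] for label in labels])


team_x = ["low", "low", "critical"]
team_y = ["high", "high", "high"]

# Two hypothetical codings, both consistent with the rank order.
coding_a = {"low": 1, "medium": 2, "high": 3, "critical": 4}
coding_b = {"low": 1, "medium": 2, "high": 5, "critical": 50}

x_worse_a = mean_under_coding(team_x, coding_a) > mean_under_coding(team_y, coding_a)
x_worse_b = mean_under_coding(team_x, coding_b) > mean_under_coding(team_y, coding_b)

print(x_worse_a, x_worse_b)                               # False True
print(median_severity(team_x), median_severity(team_y))   # low high
```

The conclusion drawn from the mean depends entirely on an arbitrary choice of numbers, which is what makes the result meaningless.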
Table 2: Valid Statistical Operations by Measurement Scale

| Scale | “Average” | “Spread” | Statistical Test | Common Misuse Scenarios |
|---|---|---|---|---|
| Nominal | Mode (most frequent value) | Undefined | Chi-Squared Test | Cannot be ordered or averaged. Stating that the color “brown” is inherently better than the color “green” has no meaning. |
| Ordinal | Median (middle value) | Undefined | Non-parametric tests (Sign, Run) | Cannot be meaningfully added, subtracted, or averaged. Treating a “5-point scale” as interval data is a common but invalid practice. A 5-point story less a 3-point story does not equal a 2-point story–if that is even defined. |
| Interval | Arithmetic Mean | Variance (standard deviation) | T-test, F-test, ANOVA | One measurement cannot be divided by another. The result of dividing 20°C by 10°C doesn’t mean anything. |
| Ratio | Arithmetic, Geometric, Harmonic Means | Variance (standard deviation) | T-test, F-test, ANOVA | All mathematical operations are valid. |

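The interval-scale restriction in Table 2 can be checked with a short worked example in Python. The sketch converts the same two temperatures between Celsius and Fahrenheit (both interval scales) and Kelvin (a ratio scale with a true zero); the helper names are illustrative only.

```python
def c_to_f(celsius: float) -> float:
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


def c_to_k(celsius: float) -> float:
    """Convert degrees Celsius to kelvins."""
    return celsius + 273.15


a_c, b_c = 20.0, 10.0

# Differences survive a change of interval scale (up to the unit factor 9/5).
diff_c = a_c - b_c                   # 10.0
diff_f = c_to_f(a_c) - c_to_f(b_c)   # 18.0, which is 10.0 * 9 / 5

# Ratios do not: the "answer" depends on where each scale puts its zero.
ratio_c = a_c / b_c                  # 2.0
ratio_f = c_to_f(a_c) / c_to_f(b_c)  # 68 / 50 = 1.36

# On a ratio scale the quotient has physical meaning.
ratio_k = c_to_k(a_c) / c_to_k(b_c)  # about 1.035

print(diff_c, diff_f)
print(ratio_c, ratio_f, round(ratio_k, 3))
```

The same pair of temperatures is “twice as hot” in Celsius and “1.36 times as hot” in Fahrenheit, so neither ratio says anything about the temperatures themselves.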
Some measurement systems are better than others at providing consistent data, avoiding confusion, and supporting analysis that leads to decisions, all while avoiding measurement for measurement’s sake.
Building on the measurement scales, a “good” measure must possess three distinct technical properties: precision, accuracy, and sensitivity.
Finally, because measurement is an act of sampling from a complex reality, all conclusions drawn from it are probabilistic and carry the risk of error. In a statistical hypothesis-testing framework—that is, a mathematical viewpoint on making decisions—these risks are formalized into two error types.
Table 3: The Hypothesis Testing Error Matrix

| | Reality: Hypothesis is true | Reality: Hypothesis is false |
|---|---|---|
| Decision: Reject the hypothesis | Type I Error (False Positive) | Correct Conclusion (True Positive) |
| Decision: Fail to reject the hypothesis | Correct Conclusion (True Negative) | Type II Error (False Negative) |

These are not just academic concepts; they form the business risk framework for all data-driven decisions.
A good measurement system must be designed to manage and account for the costs of both error types. The consequences of making bad decisions, weighed against the cost of the measurement system, dictate whether type I or type II error is more important. For example, these consequences are a key consideration in a judicial system: is convicting an innocent person charged with murder (a false positive, type I) worse than letting a killer go free (a false negative, type II)? Judicial systems based on English Common Law, which include the US judicial system, tend to treat type I errors as worse than type II errors.
Given a particular management or technical decision to be made, whether a one-and-done decision or a series of ongoing decisions, the basics of implementing a good measurement system to support the decision(s) are:
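One way to reason about which error type matters more is to compare expected costs. The Python sketch below is illustrative only: the probabilities, the costs, and the two decision rules (“strict” and “lenient”) are invented for the example, and the formula assumes the standard definitions of alpha as the chance of rejecting a true hypothesis and beta as the chance of failing to reject a false one.

```python
def expected_error_cost(p_true, alpha, beta, cost_type1, cost_type2):
    """Expected cost per decision from type I and type II errors.

    p_true: assumed probability that the hypothesis is actually true
    alpha:  P(reject | hypothesis true), the type I error rate
    beta:   P(fail to reject | hypothesis false), the type II error rate
    """
    return p_true * alpha * cost_type1 + (1 - p_true) * beta * cost_type2


# Two hypothetical decision rules trading one error type against the other.
strict = {"alpha": 0.01, "beta": 0.30}
lenient = {"alpha": 0.10, "beta": 0.05}

# Scenario 1: type I errors are expensive, so the strict rule is cheaper.
s1_strict = expected_error_cost(0.9, cost_type1=1000, cost_type2=200, **strict)
s1_lenient = expected_error_cost(0.9, cost_type1=1000, cost_type2=200, **lenient)

# Scenario 2: type II errors are expensive, so the lenient rule is cheaper.
s2_strict = expected_error_cost(0.9, cost_type1=10, cost_type2=5000, **strict)
s2_lenient = expected_error_cost(0.9, cost_type1=10, cost_type2=5000, **lenient)

print(s1_strict < s1_lenient)  # True
print(s2_strict > s2_lenient)  # True
```

The same two rules swap places once the relative costs change, which is why the consequences of each error type have to be stated before the measurement system is designed.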
“All happy families are alike; each unhappy family is unhappy in its own way,” Leo Tolstoy, Anna Karenina.
Given the properties that we want our measurement system to have, consider what happens when measurement goes poorly.
There is no comprehensive list of all the ways that a measurement program can fail; each failed program, like each unhappy family, has its own special way of failing. The pathologies listed below are common examples of failure. Most failed programs exhibit some combination of these pathologies, to a greater or lesser degree. Given a particular decision objective and the need to provide useful data and analysis in support of that objective, efficiently and effectively implementing a measurement system requires knowledge of both the characteristics described previously and common measurement mistakes. This section provides examples of these mistakes.
The following examples provide diagnostic archetypes of failed measurement systems:
Case 1: Unactionable data–given the data, the condition(s) indicating that action is needed is unclear
Example
Case 2: Incomplete/insufficient data–either not enough data or not all of the different data needed are collected
Case 3: Wrong measurement scale–the measurement system collects data corresponding to a scale that will not support the needed analysis
Case 4: Discarding collected data after analysis
Case 5: Inter-coder reliability
Case 8: Correlation versus causation
Case 9: Perverse incentives and unintended behaviors–the measurement system is manipulated to appear to meet a desired result
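For Case 5, agreement between two people coding the same items can be quantified. One common statistic for this (an addition here, not something named in the cases above) is Cohen's kappa, which compares observed agreement with the agreement expected by chance. The Python sketch below is a minimal implementation with made-up severity labels.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labeling the same items in the same order."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(coder_a)
    # Fraction of items on which the two coders agree.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, from each coder's label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical data: two reviewers classify the same four defects.
reviewer_1 = ["high", "high", "low", "low"]
reviewer_2 = ["high", "low", "low", "low"]

print(cohens_kappa(reviewer_1, reviewer_2))  # 0.5
```

A value near 1 suggests the coders are applying the definition consistently; a low value suggests the definition leaves too much to individual judgment.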
Good measurement systems don’t happen by mistake. A robust, valuable, and reliable measurement system is not an ad-hoc creation but is instead an engineered system. Its implementation follows a deliberate, high-level process.
This process is most effectively structured around the “Goal-Question-Metric-Indicator” (GQMI) framework. GQMI is a widely accepted technique for ensuring that all measurement activity is traceably linked to a strategic business need. The process includes defining goals, defining the measures, developing collection, storage, and analysis mechanisms, and finally, communicating and acting on the results.
A measurement system begins by understanding mission or business information needs and objectives. The GQMI framework derives measurement objectives from the organization’s top-level goals. The core concept is traceability.
We build traceability between business/mission goals and measurements by identifying the business or mission need for which we need more information. This need will provide the motivational answer of “Why are we measuring this?” This goal-first step is also how we avoid measurement mistakes like “Measuring Because We Can” and “We Measured But We Cannot Act”. This step is straightforward–without stating anything about measurement or measuring, answer the questions:
Once we know the questions that need to be answered to understand whether we are meeting (or can meet) the business/mission goals, we can start building the measures to answer those questions. Of course, before we take the expensive step of creating new measures, we should check whether any existing measures, perhaps with a little refinement or in combination with other measures, will answer the questions.
If no existing measures can be refined, adapted, or supplemented, then we can proceed with defining new measures to answer the questions generated by the business/mission goal. We create these additional measures by developing an operational definition of each new measure. The operational definition is the single most critical artifact in a measurement system.
An operational definition is the specification that provides “sufficient detail to enable the measure to be collected in practice”. Operational definitions help us to avoid the error of the measurement system being manipulated to appear to meet a desired result. An operational definition is the mechanism by which a measure becomes “repeatable” (avoiding bias) and “agreed upon” (avoiding inter-coder reliability problems).
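The traceability from goal to question to measure can be represented as a small data structure and checked mechanically. The Python sketch below is one possible shape, not part of the GQMI framework itself; the class names, fields, and sample goals are hypothetical. It flags measures that answer no question and questions that no existing measure covers.

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    qid: str
    text: str
    goal: str  # the business/mission goal this question serves


@dataclass
class Measure:
    name: str
    answers: list[str] = field(default_factory=list)  # qids this measure helps answer


def untraceable_measures(measures, questions):
    """Measures that answer no known question ("measuring because we can")."""
    known = {q.qid for q in questions}
    return [m.name for m in measures if not known.intersection(m.answers)]


def unanswered_questions(measures, questions):
    """Questions no existing measure addresses; candidates for new measures."""
    covered = {qid for m in measures for qid in m.answers}
    return [q.qid for q in questions if q.qid not in covered]


questions = [
    Question("Q1", "Are releases getting slower to deliver?", "Shorten time to market"),
    Question("Q2", "Are escaped defects increasing?", "Improve product quality"),
]
measures = [
    Measure("lead_time_days", answers=["Q1"]),
    Measure("commits_per_day"),  # collected, but tied to no question
]

print(untraceable_measures(measures, questions))  # ['commits_per_day']
print(unanswered_questions(measures, questions))  # ['Q2']
```

The first list identifies measures to retire or justify; the second identifies where existing measures fall short and a new one may be needed.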
Although developing an operational definition involves a lot of work, an operational definition provides sufficient detail to enable the measure to be collected in practice while avoiding wasteful rework and false starts:
There are many potential error cases that an operational definition of a measure exposes and helps us to resolve. For example, consider the step “defines the format of the data to be collected”. If we are collecting time data, such as a timestamp for when an event occurred, then:
Likewise, for the source of the data, what clock is providing the timestamp—a central clock, time on a local computer, . . . ?
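As an illustration of pinning down a timestamp format, the Python sketch below assumes the operational definition requires ISO 8601 strings with an explicit UTC offset and stores everything in UTC at one-second resolution. Those choices, and the function name, are assumptions made for the example rather than requirements from the text.

```python
from datetime import datetime, timezone


def normalize_timestamp(raw: str) -> str:
    """Parse an ISO 8601 timestamp with an explicit offset and return it in UTC.

    Timestamps without an offset are rejected instead of guessed at, because
    the local time zone of the collecting machine is exactly the kind of
    ambiguity an operational definition should remove.
    """
    parsed = datetime.fromisoformat(raw)
    if parsed.tzinfo is None:
        raise ValueError(f"timestamp {raw!r} has no UTC offset")
    return parsed.astimezone(timezone.utc).isoformat(timespec="seconds")


print(normalize_timestamp("2024-03-01T09:30:00-05:00"))  # 2024-03-01T14:30:00+00:00
```

Rejecting ambiguous input at collection time is a design choice; a different definition might instead record the source clock alongside each value.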
Working through questions like these, particularly by looking for places where the definitions might be vague or inconsistent, produces good operational definitions. Going the extra step of testing the operational definitions by actually collecting data with them on a limited basis will help verify that the measurements work as intended.
The final step involves translating the results of the measurement analysis back into the decision-supporting “information” and “insight” that answer the questions developed in step 1.
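A limited trial of an operational definition can be partly automated. In the Python sketch below, the record layout, the field names, and the response-time example are hypothetical; the idea is simply to write the definition down as data and run a small sample of collected values against its format rule.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class OperationalDefinition:
    name: str
    scale: str    # nominal, ordinal, interval, or ratio
    unit: str
    source: str   # where the value comes from
    is_valid: Callable[[object], bool]  # format rule for one collected value


def pilot_check(definition, samples):
    """Return the sampled values that violate the definition's format rule."""
    return [value for value in samples if not definition.is_valid(value)]


def non_negative_number(value) -> bool:
    """Accept ints and floats that are zero or greater (booleans excluded)."""
    return (
        isinstance(value, (int, float))
        and not isinstance(value, bool)
        and value >= 0
    )


response_time = OperationalDefinition(
    name="response_time",
    scale="ratio",
    unit="milliseconds",
    source="server access log",  # hypothetical source
    is_valid=non_negative_number,
)

# Values from a hypothetical limited collection run.
print(pilot_check(response_time, [12.5, 0, -3, "fast"]))  # [-3, 'fast']
```

Each rejected value points either to a collection problem or to a gap in the definition that should be closed before full-scale collection begins.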
Measurement results, and the entire measurement system, are useful if and only if they answer the business/mission goal questions. Answering these questions depends not only on the technically generated information produced by measurement, but also on expressing that information when it is needed and in a format that the owners of the business/mission goals can understand. Unless needed to support validation of the measurement analysis, few goal owners care about the technical details of a measure’s operational definition. The key is expressing the measurement results, in response to the business/mission goals, in a way that is (1) as brief as possible, (2) understandable, and (3) actionable, meaning that it leads to a decision.
Measurement is not a one-time, “define and done” task. It is a continuous, cyclical system of inquiry. A healthy measurement program isn’t about accumulating data indefinitely, but about answering questions and, once those particular questions from mission/business goals have been answered, moving forward to the next set of information needs. From a macroscopic viewpoint, good measurement systems are driven by the pragmatic Agile software development principle: “if it isn’t changing, who cares?” A measurement program should constantly refine its “Goal-Question-Metric-Indicator” loop to reflect new, emerging business/mission needs. By adhering to a process, from goal-setting to operational definition to analysis to communication, an organization can move from practicing measurement poorly to leveraging it as a core strategic asset for objective understanding and continuous improvement.