Relative Estimates

10/21/2021

Relative Estimates

In my past articles related to project and sprint planning, we touched on the concept of relative estimates. Those articles were focused more on the planning aspect and the usage of the estimates and less on the actual process of estimation. So let's talk about estimation techniques my colleagues and I found useful.

Exact estimate

I already touched on this before, there is a huge misunderstanding in what makes a feature development estimate exact. People intuitively think that an exact estimate is a precise number with no tolerance. Something like 23.5 man-days of work. Not a tad more or less.

How much can we trust that number? I think we all feel that not much unless we know more about how the estimate was created. What precise information did the estimator base his estimate on? What assumptions did he make about future progress? What risks did he consider? What experience does he have with similar tasks?

We use this knowledge to make our own assessment on how likely it is that the job's duration will vary from the estimate. What we do is make our own estimation of a probable range, where we feel the real task's duration is going to be.

It is quite a paradoxical situation, isn't it? We force someone to come up with precise numbers so that we can do our own probability model around it. Wouldn't it be much more useful for the estimate to consider this probability in the first place?

That also means that (in my world) a task estimate is never an exact number, but rather a qualified prediction of the range of probability in which a certain job’s duration is going to land. The more experience with similar tasks the estimator has, the narrower the range is going to be. A routine task that one has already done hundreds of times can be estimated with a very narrow range.

But even with a narrow range, there are always variables. You might be distracted by someone calling you. You mistype something and have to spend time figuring it out. Even though those variables are quite small and will not likely alter the job's duration by an order of magnitude, it still makes an absolutely precise estimate impossible.

Linear and non-linear estimates

On top of all that, people are generally very bad at estimating linear numbers due to a variety of cognitive biases. I mentioned some of them here [link: Wishful plans - Planning fallacies]. So (not just) from our experience, we proved that it is generally better to do relative estimates.

What is it? Basically, you are comparing future tasks against the ones that you already have experience with. You are trying to figure out if a given task (or user story or job or anything else for that matter) is going to be more, less, or similarly challenging compared to a set benchmark. The more the complexity increases, the more unknowns, and risks there generally are. That is the reason why relative estimates use non-linear scales.

One of the well-known scales is the pseudo-Fibonacci numerical series, which usually goes like 0, 1, 2, 3, 5, 8, 13, 20, 40, 100. An alternative would be T-Shirt sizes (e.g. XS, S, M, L, XL, XXL). The point is that the more you move up the scale, the bigger is the increase in difference from the size below. That takes out a lot of the painful (and mostly wildly inaccurate) decision-making from the process. You're not arguing about if an item should be sized 21 or 22. You just choose a value from the list.

Planning poker

We had a good experience with playing planning poker. Planning poker is a process in which the development team discusses aspects of a backlog item and then each developer makes up his mind as to how “big” that item is on the given scale (e.g. the pseudo-Fibonacci numbers). When everyone is finished, all developers present their estimates simultaneously to minimize any mutual influence.

A common practice is that everyone has a deck of cards with size values. When ready, a developer will put his card of choice on the table, card facing down. Once everyone has chosen his card, all of the cards are presented.

Now each developer comments on his choice. Why did he or she choose that value? We found it helpful that everyone answers at least the following questions:

What are similarly complex backlog items that the team has already done in the past?
What makes the complexity similar to such items?
What makes the estimated item more complex than already done items, which were labeled with a complexity smaller by one size degree?
What makes the estimated item less complex than already done items, which were labeled with a complexity higher by one size degree?

A few typical situations can arise.

1) Similar estimates

For a matured team and well-prepared backlog items, this is a swift process, where all the individual estimates are fairly similar, not varying much. The team can then discuss and decide together as to what value it will agree on.

2) An outlying individual estimate

Another situation is that all individual estimates are similar, but there is one or two, which is completely different. This might have several causes. Either that outlying individual has a good idea, that no-one has figured out or he misunderstands the backlog item itself. Or he has not realized all the technical implications of the development of that particular item. Or he sees a potential problem that the others overlook.

In such situations we usually took the following approach. People with lower estimates explain the work they expect to be done. Then the developers with higher estimates state the additional work they think needs to be done in comparison to the colleagues with lower estimates. By doing this, the difference in their assumptions can be identified and now it is up to the team to decide if that difference is actually necessary work.

After the discussion is finished, the round of planning poker is repeated. Usually, the results are now closer to the first case.

3) All estimates vary greatly

It can also happen, that there is no obviously prevailing complexity value. All the estimates are scattered across the scale. This usually happens, when there is a misunderstanding in what is actually a backlog item's purpose and its business approach. In essence, one developer imagines a simple user function and another sees a sophisticated mechanism that is required.

This is often a symptom of a poorly groomed backlog that lacks mutual understanding among the devs. In this case, it is usually necessary to review the actual backlog item's description and goal and discuss it with the product owner from scratch. The estimation process also needs to be repeated.

Alternatively, this can also happen to new teams with little technical or business experience of their product in the early stages of development.

It's a learning process

Each product is unique, each project is unique, each development environment is different. That means the development team creates their perception of complexity references anew when they start a project. It is also a constant process of re-calibration. A few backlog items that used to serve as a benchmark reference size at the beginning of a project usually need to be exchanged for something else later on. The perception of scale shifts over time.

The team evolves and gains experience. That means the team members need to revisit past backlog items and ask themselves if they would have estimated such an item differently with the experience they have now. It is also useful, at the end of a sprint, to review items that in the end were obviously far easier or far more difficult than the team initially expected.

What caused the difference? Is there any pattern we can observe and be cautious in the future? For instance, our experience from many projects shows that stuff that involves integrations to outer systems usually turns out to be far more difficult in comparison to what the team anticipates. So whenever the devs see such a backlog item, the team knows it needs to think really carefully about what could go wrong.

Don't forget the purpose

In individual cases, the team will sometimes slightly overestimate and sometimes slightly underestimate. And sometimes estimates are going to be completely off. But by self-calibrating using retrospective practices and the averaging effect over many backlog items, the numbers can usually be relied on in the long run.

Always bear in mind that the objective of estimating backlog items is to produce a reasonably accurate prediction of the future with a reasonable amount of effort invested. This needs to be done as honestly as possible given the current circumstances. We won't know the future better unless we actually do the work we're estimating.