The Estimates in #NoEstimates

A couple of weeks ago, I tweeted a paraphrase of something that David J. Anderson said at the London Lean Kanban Day: “Probabilistic forecasting will outperform estimation every time”. I added the conference hashtag, and, perhaps most controversially, the #NoEstimates one.

The conversation blew up, as conversations on Twitter are wont to do, with a number of people, perhaps better schooled in mathematics than I am, claiming that the tweet was ridiculous and meaningless. “Forecasting is a type of estimation!” they said. “You’re saying that estimation is better than estimation!”

That might be true in mathematics. Is it true in ordinary, everyday English? Apparently, so various arguments go, the way we’re using that #NoEstimates hashtag is confusing to newcomers and making people think we don’t do any estimation at all!

So I wanted to look at what we actually mean by “estimate”, when we’re using it in this context, and compare it to the “probabilistic forecasting” of David’s talk.

Defining “Estimate” in English

While it might be true that a probabilistic forecast is a type of estimate in maths and statistics, the commonly used English definitions are very different. Here’s what Wikipedia says about estimation:

Estimation (or estimating) is the process of finding an estimate, or approximation, which is a value that is usable for some purpose even if input data may be incomplete, uncertain, or unstable.

And here’s what it says about probabilistic forecasting:

Probabilistic forecasting summarises what is known, or opinions about, future events. In contrast to single-valued forecasts … probabilistic forecasts assign a probability to each of a number of different outcomes, and the complete set of probabilities represents a probability forecast.

So an estimate is usually a single value, and a probabilistic forecast is a range of possible outcomes, each with a likelihood attached.

Another way of phrasing that tweet might have been, “Providing a range of outcomes along with the likelihood of those outcomes will lead to better decision-making than providing a single value, every time.”
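To make the distinction concrete, here’s a tiny Python sketch (all the numbers are invented purely for illustration):

    # A single-valued estimate: one number, with no sense of how wrong it might be.
    estimate_weeks = 10

    # A probabilistic forecast: a set of possible outcomes, each with a likelihood.
    # The probabilities across all the outcomes add up to 1.
    forecast = {
        "8 weeks or fewer": 0.25,
        "9 to 10 weeks": 0.35,
        "11 to 12 weeks": 0.25,
        "13 weeks or more": 0.15,
    }

    # The forecast lets us answer questions the single value can't, for example
    # "how confident are we of delivering within 10 weeks?"
    confidence_within_10_weeks = forecast["8 weeks or fewer"] + forecast["9 to 10 weeks"]
    print(confidence_within_10_weeks)  # 0.6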

And that might have been enough to justify David’s assertion on its own… but it gets worse.

Defining “Estimate” in Agile Software Development

In the context of Software Development, estimation has all kinds of horrible connotations. It turns out that Wikipedia has a page on Software Development Estimation too! And here’s what it says:

Software development effort estimation is the process of predicting the most realistic amount of effort (expressed in terms of person-hours or money) required to develop or maintain software based on incomplete, uncertain and noisy input.

Again, we’re looking at a single value; but do notice the “incomplete, uncertain and noisy input” there. Here’s what the page says later on:

Published surveys on estimation practice suggest that expert estimation is the dominant strategy when estimating software development effort.

The Lean / Kanban movement has emerged (and possibly diverged) from the Agile movement, in which this strategy really is dominant, mostly thanks to Scrum and Extreme Programming. Both of these suggest the use of story points and velocity to create the estimates. The idea is that you can then use previous data to provide a forecast; but again, that forecast is largely based on a single value. It isn’t probabilistic.

Then, too, the “expertise” of the various people performing the estimates can often be questionable. Scrum suggests that the whole team should estimate, while XP suggests that developers sign up to do the tasks, then estimate their own. XP, at least, provides some guidance for keeping the cost of change low, meaning that expertise remains relevant and velocity can be approximated from the velocity of previous sprints. I’d love to say that most Scrum teams are doing XP’s engineering practices for this reason, but a lot of them have some way to go.

I have a rough-and-ready scale that I use for estimating uncertainty, which helps me work out whether an estimate is even likely to be based on expertise. I use it to help me make decisions about whether to plan at all, or whether to give something a go and create a prototype or spike. Sometimes a whole project can be based on one small idea or piece of work that’s completely new and unproven, the effort of which can’t even be estimated using expertise (because there isn’t any), let alone historical metrics.

Even when we have expertise, the tendency is for experts to remember the mode, rather than the mean or median value. Since we often make discoveries that slow us down but rarely make discoveries which speed us up, we are almost inevitably over-optimistic. Our expertise is not merely inaccurate; it’s biased and therefore misleading. Decisions made on the basis of expert estimates have a horrible tendency to be wrong. Fortunately everyone knows this, so they include buffers. Unfortunately, work tends to expand to fill the time available… but at least that makes the estimates more accurate, right?
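To see roughly why that bias matters, here’s a small Python sketch (the distribution and its parameters are made up for illustration): if task durations are right-skewed, the most common outcome, which is the one an expert tends to remember, sits well below the average that actually drives delivery dates.

    import math
    import random
    import statistics

    random.seed(42)

    # Pretend task durations follow a lognormal distribution: most tasks finish
    # quickly, but a few drag on for a very long time.
    durations = [random.lognormvariate(2.0, 0.7) for _ in range(100_000)]

    mode = math.exp(2.0 - 0.7 ** 2)  # analytic mode of a lognormal: e^(mu - sigma^2)
    print(f"mode   ~ {mode:.1f} days")                        # the 'typical' task an expert remembers
    print(f"median ~ {statistics.median(durations):.1f} days")
    print(f"mean   ~ {statistics.mean(durations):.1f} days")  # what the plan actually needs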

One of the people involved in the Twitter conversation suggested we should be using the word “guess” rather than “estimate”. That might well be mathematically more precise, and if we called them that, people might look for different ways to inform the decisions we need to make.

But they don’t. They’re called “estimates” in Scrum, in XP, and by just about everyone in Agile software development.

But it gets worse.

Defining “Estimate” in the context of #NoEstimates

Woody Zuill found this very early tweet from Aslak Hellesøy using the #NoEstimates hashtag, possibly the first:

@obie at #speakerconf: “Velocity is important for budgeting”. Disagree. Measuring cycle time is a richer metric. #kanban #noestimates

So the movement started with this concept of “estimate” as the familiar term from Scrum and XP. Twitter being what it is, it’s impossible to explain all the context of a concept in 140 characters, so a certain level of familiarity with the ideas around that tag is assumed. I would hope that newcomers to a movement would approach it with curiosity, and hopefully this post will make that easier.

Woody confessed to being one of the early proponents of the hashtag in the context of software development. In his post on the #NoEstimates hashtag, he defines it as:

#NoEstimates is a hashtag for the topic of exploring alternatives to estimates [of time, effort, cost] for making decisions in software development.  That is, ways to make decisions with “No Estimates”.

And later:

It’s important to ask ourselves questions such as: Do we really need estimates? Are they really that important?  Are there options? Are there other ways to do things? Are there BETTER ways to do thing? (sic)

Woody and Neil Killick, another proponent of the hashtag, both question the need for estimates in many of the decisions made on a lot of projects.

I can remember getting the Guardian’s galleries ready in time for the Oscars. Why on earth were we estimating how long things would take? In retrospect, that time would have been much better spent getting as many of the features complete as we could. Nobody was going to move the Oscars for us, and the safety buffer we’d decided on to make sure that everything was fully tested wasn’t changing in a hurry, either. And yet, there we were, mindlessly putting points on cards. We got enough features out in time, of course, as well as some fun extras… but I wonder if the Guardian, now far more advanced in their ability to deliver than they were in my day, still spend as much time in those meetings as we used to.

I can remember asking one project manager at a different client, “These are estimates, right? Not promises,” and getting the response, “Don’t let the business hear you say that!” The reaction to failing to deliver something to the agreed estimates was to simply get the developers to work overtime, and the reaction to that was, of course, to pad the estimates. There are a lot of posts around on the perils of estimation and estimation anti-patterns.

Even when the estimates were made in terms of time, rather than story points, I can remember decisions being unchanged in the face of the “guesses”. There was too much inertia. If that’s going to be the case, I’d rather spend my time getting work done instead of worrying about the oxymoron of “accurate estimates”.

That’s my rant finished. Woody and Neil have many more examples of decisions that are often best made with alternatives to time estimation, including much kinder, less Machiavellian ones such as trade-off and prioritization.

In that post above, Neil talks about “using empiricism over guesswork”. He regularly refers to “estimates (guesses)”, calling out the fact that we do use that terminology loosely. That’s English for you; we don’t have an authoritative body which keeps control of definitions, so meanings change over time. For instance, the word “nice” used to mean “precise”, and before that it meant “silly”. It’s almost as if we’ve come full circle.

Defining “Definition”

Wikipedia has a page on definition itself, which points out that definitions in mathematics are different to the way I’ve used that term here:

In mathematics, a definition is used to give a precise meaning to a new term, instead of describing a pre-existing term.

I imagine this refers to “define y to be x + 2,” or similar, but just in case it’s not clear already: the #NoEstimates movement is not using the mathematical definition of “estimate”. (In fact, I’m pretty sure it’s not using the mathematical definition of “no”, either.)

We’re just trying to describe some terms, and the way they’re used, and point people at alternatives and better ways of doing things.

Defining Probabilistic Forecasting

I could describe the term, but sometimes, descriptions are better served with examples, and Troy Magennis has done a far better job of this than I ever have. If you haven’t seen his work, this is a really good starting point. In a nutshell, it says, “Use data,” and, “You don’t need very much data.”
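As a flavour of it, here’s a minimal Monte Carlo sketch in the spirit of Troy’s approach (it isn’t his tooling, and the throughput numbers and backlog size are made up): resample a handful of past weekly throughputs to simulate lots of possible futures, then report a range of finish times with likelihoods rather than a single date.

    import random

    random.seed(1)

    # Made-up historical data: stories completed in each of the last six weeks.
    # This is the "you don't need very much data" part.
    weekly_throughput = [3, 5, 2, 6, 4, 3]
    backlog = 40        # stories still to do (also made up)
    trials = 10_000

    def weeks_to_finish(backlog, history):
        """Simulate one possible future by resampling past weekly throughput."""
        done, weeks = 0, 0
        while done < backlog:
            done += random.choice(history)
            weeks += 1
        return weeks

    results = sorted(weeks_to_finish(backlog, weekly_throughput) for _ in range(trials))

    # Report a range with likelihoods instead of a single answer.
    for percentile in (50, 85, 95):
        print(f"{percentile}% chance of finishing within "
              f"{results[trials * percentile // 100 - 1]} weeks")

Run against real throughput or cycle time data, those few lines give the kind of “85% confident by week such-and-such” answer that supports far better decisions than a single date.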

I imagine that when David’s talk is released, that’d be a pretty good thing to watch, too.


A Dreyfus model for Agile adoption

A couple of people have asked for this recently, so just posting it here to get it under the CC licence. It was written a while ago, and there are better maturity models out there, but I still find this useful for showing teams a roadmap they can understand.

If you want to know more about how to create and use your own Dreyfus models, this post might help.

What does an Agile Team look like?

Novice

We have a board
We put our stories on the board
Every two weeks, we get together and celebrate what we’ve done
Sometimes we talk to the stakeholders about it
We think we might miss our deadline and have told our PM
Agile is really hard to do well

Beginner

We are trying to deliver working software
We hold retrospectives to talk about what made us unhappy
When something doesn’t work, we ask our coach what to do about it
Our coach gives us good ideas
We have delegated someone to deal with our offshore communications
We have a great BA who talks to the stakeholders a lot
We know we’re going to miss our deadline; our PM is on it
Agile requires a great deal of discipline

Practitioner

We know that our software will work in production
Every two weeks, we show our working software to the stakeholders
We talk to the stakeholders about the next set of stories they want us to do
We have established a realistic deadline and are happy that we’ll make it
We have some good ideas of our own
We deal with blockers promptly
We write unit tests
We write acceptance tests
We hold retrospectives to work out what stopped us delivering software
We always know what ‘done’ looks like before we start work
We love our offshore team members; we know who they are and what they look like and talk to them every day
Our stakeholders are really interested in the work we’re doing
We always have tests before we start work, even if they’re manual
We’ve captured knowledge about how to deploy our code to production
Agile is a lot of fun

Knowledgeable

We are going to come well within our deadline
Sometimes we invite our CEO to our show-and-tell, so he can see what Agile looks like done well
People applaud at the end of the show-and-tell; everyone is very happy
That screen shows the offshore team working; we can talk to them any time; they can see us too
We hold retrospectives to celebrate what we learnt
We challenge our coach and change our practices to help us deliver better
We run the tests before we start work – even the manual tests, to see what’s broken and know what will be different when we’re done
Agile is applicable to more than just software delivery

Expert

We go to conferences and talk about our fantastic Agile experiences
We are helping other teams go Agile
The business outside of IT is really interested in what we’re doing
We regularly revisit our practices, and look at other teams to see what they’re doing
The company is innovative and fun
The company are happy to try things out and get quick feedback
We never have to work late or weekends
We deploy to production every two weeks*
Agile is really easy when you do it well!

* Wow, this model is old.


What is BDD?

At #CukeUp today, there’s going to be a panel on defining BDD, again.

BDD is hard to define, for good reason.

First, because to do so would be to say “This is BDD” and “This is not BDD”. When you’ve got a mini-methodology that’s derived from a dozen other methods and philosophies, how do you draw the boundary? When you’re looking at which practices work well with BDD (Three Amigos, Continuous Integration and Deployment, Self-Organising Teams, for instance), how do you stop the practices which fit most contexts from being adopted as part of the whole?

I’m doing BDD with teams which aren’t software, by talking through examples of how their work might affect the world. Does that mean it’s not really BDD any more, because it can’t be automated? I’m talking to people over chat windows and sending them scenarios so they can be checked, because we’re never in the same room. Is that BDD? I’m writing scenarios for my own code, on my own, talking to a rubber duck. Is that it? I’m still using scenarios in conversation to explore and define requirements, with all the patterns that I’ve learnt from the BDD world. I’m still trying to write software that matters. It still feels like BDD to me. I can’t even say that BDD’s about “writing software that matters” any more, because I’ve been using scenarios to explore life for a while now.

I expect in a few years the body of knowledge we call “BDD” will also include adoption patterns, non-software contexts, and a whole bunch of other things that we’re looking at but which haven’t really been explored in depth. BDD is also the community; it’s the people who are learning about this and reporting their learning and asking questions, and the common questions and puzzles are also part of BDD, and they keep changing as our understanding grows and the knowledge becomes easier to access and other methods like Scrum and Kanban are adopted and enable BDD to thrive in different places.

Rather than thinking of BDD as some set of well-defined practices, I think of it as an anchor term. If you look up anything around BDD, you’re likely to find conversation, collaboration, scenarios and examples at its core, together with suggestions for how to automate them. If you look further, you’ll find Three Amigos and Outside-In and the Given / When / Then syntax and Cucumber and Selenium and JBehave and Capybara and SpecFlow and a host of other tools. Further still we have Cynefin and Domain Driven Design and NLP, which come with their own bodies of knowledge and are anchor terms for those, and part of their teaching overlaps part of what I teach, as part of BDD, and that’s OK.

That’s why, when I’m asked to define BDD, I say something like, “Using examples in conversation to illustrate behaviour.” It’s where all this started, for me. That’s the anchor. It’s where everything else comes from, but it doesn’t define the boundaries. There are no boundaries. The knowledge, and the understanding, and especially the community that we call “BDD” will keep on growing.

One day it will be big enough that there will be new names for bits of it, and maybe those new names will be considered part of BDD, and maybe they won’t. And when that happens, that should be OK, too.

NB: I reckon the only reason that other methods are defined more precisely is so that they can be taught consistently at scale, especially where certification is involved. Give me excellence, diversity and evolution over consistency any day. I’m pretty sure I can sell them more easily… and so can everyone else.


The Shallow Dive into Chaos

For more on the Chaotic domain and subdomains, read Dave Snowden’s blog post, “…to give birth to a dancing star.” The relationship between separating people that I talk about here, and the Chaotic domain, can be seen in Cynthia Kurtz’s work with Cynefin’s pyramids, as seen here.

On one of my remote training courses over WebEx, I asked the participants to do an exercise. “Think of one way that people come to consensus,” I requested, “and put it in the chat box.”

Here’s what I got back…

1st person: Voting

2nd person: Polling

3rd person: Yeah, I’ll go with polling too

And then I had to explain to them why I was giggling so much.

This is, of course, a perfect demonstration of one of the ways in which people come to consensus: by following whoever came first. We might follow the dominant voice in the room, or the person who’s our boss, or the one who brought the cookies for the meeting, or the one that’s most popular, or the one who gets bored or busy least quickly.

We might even follow the person with the most expertise.

MACE: We’ll have a vote.
SEARLE: No. No, we won’t. We’re not a democracy. We’re a collection of astronauts and scientists. So we’re going to make the most informed decision available to us.
MACE: Made by you, by any chance?
SEARLE: Made by the person qualified to understand the complexities of payload delivery: our physicist.
CAPA (Physicist): …shit.

— “Sunshine”

If you’re doing something as unsafe to fail as nuclear payload delivery, getting the expert to make a decision might be wise. (Sunshine is a great film, by the way, if you’re into SF with a touch of horror.)

If you’re doing something that’s complex, however, which means that it requires experiment, the expertise available is limited. Experiments, also called Safe-To-Fail Probes, are the way forward when we want our practices and our outcomes to emerge. This is also a fantastic trick for getting out of stultifying complicatedness or simplicity, and generating some innovation.

But… if you stick everyone in a room and ask them to come up with an experiment, you’ll get an experiment.

It just might not be the best one.

More ideas mean better ideas

In a population where we know nothing about the worth of different ideas, the chance of any given idea being above average is 50%. If we call those “good ideas”, then we’ve got a 50% chance of something being good.

Maybe… just maybe… the idea that the dominant person, or the first person, or the expert, or the person with the most time comes up with will be better than average. Maybe.

But if you have three ideas, completely independently generated, what’s the chance of at least one of them being above average?

Going back to my A-level Maths… it’s 1 – (chance of all ideas being below average) which is 1 – (1/2 x 1/2 x 1/2) which is 87.5%.

That’s a vast improvement. Now imagine that everyone has a chance to generate those ideas.
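More generally, with n independently generated ideas, the chance of at least one being above average is 1 – (1/2)^n, and it climbs fast. Here’s a quick check (still using the simplifying assumption that each idea independently has a 50% chance of being above average):

    # Chance of at least one above-average idea out of n independent ideas,
    # assuming each idea independently has a 50% chance of being above average.
    for n in (1, 2, 3, 5, 10):
        print(n, 1 - 0.5 ** n)
    # 1 0.5
    # 2 0.75
    # 3 0.875
    # 5 0.96875
    # 10 0.9990234375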

If you want better ideas, stop people gaining consensus too early.

For this to work, the experiments that people come up with have to be independent. That means we have to separate people.

Now, obviously, if you have a hundred and twenty people and want to get experiments from them, you might not have time to go through a hundred and twenty separate ideas. We still want diversity in our ideas, though (and this is why it’s important to have diversity for innovation; because it gives you different ideas).

So we split people into homogeneous groups.

This is the complete opposite of Scrum’s cross-functional teams. We want diversity between the groups, not amongst them. This is a bit like Ronald S. Burt’s “Structural Holes” (Jabe Bloom’s LKUK13 talk on this was excellent); innovation comes from the disconnect; from deliberately keeping people silo’d. We put all the devs together; the managers together; the senior staff together; the group visiting from Hungary together; the dominant speakers together… whatever will give you the most diversity in your ideas.

Once people have come up with their experiments, you can bring them back together to work out which ones are going to go ahead. Running several concurrently is good too!

If you’ve ever used post-its in a retrospective, or other forms of silent work to help ensure that everyone’s thoughts are captured, you’re already familiar with this. Silent work is an example of the shallow dive into chaos!

Check that your experiments are safe-to-fail

Dave Snowden and Cognitive Edge reckon you need five things for a good experiment:

  • A way to tell it’s succeeding
  • A way to tell it’s failing
  • A way to amplify it
  • A way to dampen it
  • Coherence (a reason to think it might produce good results).

If you can think of a reason why your experiment might fail, look to see if that failure is safe; either because it’s cheap in time or effort, or because the scale of the failure is small. The last post I wrote on using scenarios for experiment design can help you identify these aspects, too.

An even better thing to do is to let someone else examine your ideas for experiment. Cognitive Edge’s Ritual Dissent pattern (requires free sign-up) is fantastic for that; it’s very similar to the Fly-On-The-Wall pattern from Linda Rising and Mary Lynn Manns’ “Fearless Change”.

In both patterns, the idea is presented to the critical group without any questions being asked, then critiqued without eye contact; usually done by the presenter turning around or putting on a mask. Because as soon as we make eye contact… as soon as we have to engage with other people… as soon as we start having a conversation, whether with spoken language or body language… then we automatically start seeking consensus.

And consensus isn’t always what you want.


A Stakeholder goes to St. Ives

As I was trying to resolve my problem, I met a portfolio team with seven programmes of work.

Each programme had seven projects;

Each project had seven features;

Each feature had seven stories;

Each story had seven scenarios.

How many things did I need to resolve?


Using Scenarios for Experiment Design

In the complex domain, cause and effect are only correlated in retrospect, and we cannot predict outcomes. We can see them and understand them in retrospect, but the complex domain is the domain of discovery and innovation. Expect the unexpected! Every time we do something new which hasn’t been done before, or hasn’t been done within the given context, there are going to be complex aspects to it.

The only discovery-free project would be the same project, done with the same people, the same technology and the same requirements. That never happens!

Because of this, analysis doesn’t work for every aspect of a project. People who try to do analysis in the complex domain commonly experience analysis paralysis, thrashing, two-hour meetings led by “experts” who’ve never done it before either, and arguments about who’s to blame for the resulting discoveries.

Instead, the right thing to do is to probe; to design and perform experiments from which we can learn, and which will help to uncover information and develop expertise.

There are a few things we can do to design experiments well, with thanks and credit for the strategies going to Cognitive Edge. (They have more material on this, available in their library if you sign up for free). Suggestions around scenarios are mine.

Amplification strategy; recovery strategy

For our experiment to work, we have to know how we’re going to amplify it. That may mean adding it to corporate processes, communicating it to a wider audience, automating it, releasing it to production, etc. In the complex space, doing the same thing over and over results in different outcomes because of the changing contexts; but once we understand cause and effect, we can start to test out that correlation in different or larger contexts, and develop expertise, moving the work into the complicated domain.

We also need to have a strategy for recovery in case of failure. This doesn’t mean that we avoid failure!

I’ve seen a lot of people try to analyze their way out of every failure mode. One of my clients said, “Oh, but what if people don’t have the skills to do this experiment?” Does it matter? If the investment in the experiment is small enough (which is also part of being safe to fail) then all we need to know is that failure is safe; that we can recover from people not having skills. We don’t have to put everything in place to ensure success… and perhaps good things will happen from people being forced to gain skills, or using their existing skills to create a novel approach! This is the nature of experiments; that we don’t know the outcome, only that it has coherence, which means a reason for believing its impact, whatever it is, might be positive. More on that later.

If you can think of a scenario in which the experiment succeeds, can you think of how to make it succeed a second time, and then a third?

If you can think of a scenario in which it fails, can you think of how to make that failure safe (preferable to worrying about how to avoid the failure)? I find the evil hat very handy for thinking up failure scenarios.
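As a made-up illustration (the experiment and all of its details are invented), suppose the experiment is letting one team deploy a low-risk internal service without going through the change board. A scenario for success might be:

Given the team has been deploying the internal reporting service without the change board
When a month has passed
Then the lead time for changes should have dropped

and a scenario in which failure is safe might be:

Given the team has been deploying the internal reporting service without the change board
When a bad deployment takes the service down
Then the previous version should be restored within an hour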

Indications of failure; indications of success

In order to put into place our amplification or recovery strategies, we need to be able to tell whether an experiment is succeeding or failing. Metrics are fantastic for this. Don’t use them as tests, though, and definitely don’t use them as targets! They’re indicators; they may not behave as expected. We can understand the indicators in retrospect, but cause and effect won’t always be correlated until then.

As an example, one group I met decided to experiment to see if they could reduce their bug count by hiring a new team member and rotating an existing team member each week into a bug-fixing role. Their bug count started to go down! So they took another team member and hired another new person… but the bug count started to go up again.

It turned out that the users had spotted that bugs were being fixed, so they’d started reporting them. The bugs were always there! And the count of known bugs going up was actually a good thing.

Rather than thinking of tests, think of scenarios in which you can see the experiment succeeding or failing. Those things which allow you to see it – specific, measurable, relevant signs – will make for good indicators. These indicators will have to be monitored.

Rationale for Experiment

The experiment should be coherent.

This means that there should be a reason for believing the impact will be good, or as Dave Snowden puts it, “a sufficiency of evidence to allow us to progress”.

If you can come up with some realistic scenarios in which the experiment has a positive impact, you have coherence. The more likely the scenario is – and the more similar it is to scenarios you’ve seen in the past – then the more coherent it becomes, until the coherence is predictable and you have merely complicated problems, solvable with expertise, rather than complex ones.

To check that your scenarios are realistic, imagine yourself in the future, in that scenario. Where are you when you realise that the experiment has worked (or, if checking for safe failure, failed)? Who’s around you? What can you see? What can you hear? What colour are the walls, or if you’re outside, what else is around? Do you have a kinesthetic sense; something internal that tells you that you’ve succeeded, like a feeling of pride or joy? This well-formed outcome will help you to verify that your scenario is realistic enough to be worth pursuing.

If you can’t come up with any scenarios in which you can imagine a positive impact, then your experiment is not coherent, and you might want to think of some other ideas.

Look out for a blog on how to do that with a shallow dive into chaos soon!


A Little Tense

Following on from my last blog post about deriving Gherkin from conversations, I wanted to share some tips on tenses. This is beginner stuff, but it turns out there are a lot of beginners out there! It also isn’t gospel, so if you’re doing something different, it’s probably OK.

Contexts have happened in the past

When I phrase a context, I often put it in the past tense:

Given Fred bought a microwave

Sometimes the past has set up something which is ongoing in the present, but it’s not an action as much as a continuation. So we’ll either use the present continuous tense (“is X-ing”) or we’ll be describing an ongoing state:

Given Bertha is reading Moby Dick

Given Fluffy is 1 1/2 months old

It doesn’t matter how the context was set up, either, so often we find that contexts use the passive voice for the events which made them occur (often “was X-ed” or “has been X-ed”, using the past participle of “X”):

Given Pat’s holiday has been approved

Given the last bar of chocolate was sold

Events happen in the present

The event is the thing which causes the outcome:

When I go to the checkout

When Bob adds the gig to his calendar

I sometimes see people phrase events in the passive voice:

When the last book is sold

but for events, I much prefer to change it so that it’s active:

When we sell the last book

When a customer buys the last book

This helps to differentiate it from the contexts, and makes us think a bit harder about who or what triggers the outcome.

Outcomes should happen

I tend to use the word “should” with outcomes these days. As well as allowing for questioning and uncertainty, it differentiates the outcome from contexts and events, which might otherwise have the same syntax and be hard to automate in some frameworks as a result (JBehave, for instance, didn’t actually care whether you used Given, When or Then at the beginning of a step; the keyword just told it that there was a step to run).

Then the book should be listed as out of stock

Then we should be told that Fluffy is too young

I often use the passive voice here as well, since in most cases it’s the system producing the outcome, unless it’s pixies.
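Putting those together, a complete scenario following these tenses might read something like this (a made-up bookshop example):

Given the last copy of Moby Dick has been sold
When a customer searches for Moby Dick
Then the book should be listed as out of stock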

And that’s it!
