This week at Redacted Financial we had a “town hall” style meeting that included the entire software engineering organization. The meeting was run lean coffee style. No work-related topic was off limits. Our director made himself available for 2 hours to answer any questions about why our organization is run the way it is.
- Why we have a separate project management group.
- The rationale behind our work-from-home policy.
- Why we tend toward project teams rather than product teams.
- How we can address technical debt & product direction using project teams.
- Why our teams are sized the way they are.
- Maintenance and dissemenation of current best practices, standards, and guidelines for different kinds of softare.
This meeting is similar to a retrospective at the end of a project or sprint. It is different in that there are things that we can’t change because they are constraints imposed on us by the business, but understanding where those lines are and what we can change was valuable to everyone. One of the things that we are changing is our work-from-home policy. Those who advocated for a more liberal policy are involved in a small team working with the Director to write up a new policy.
Our Director offered to have a version of this meeting annually. The team immediately objected “We should have this meeting quarterly!” So we are. We’re going to allocate 1hr per quarter for this style of meeting.
The value-add as I see it is that the team can be involved in changing how their organization works within the larger enterprise. Where there are barriers, the team can be informed about what they are and why they’re there. For those who feel strongly that a given barrier shouldn’t be there, they can direct their attention to problem-solving how to remove the barrier while still accomplishing the business objectives that caused the barrier to be erected in the first place. A side-effect of empowering team members to make change is that it reduces the amount of work management has to do to identify opportunities for change and push the through.
The town hall meeting was a very positive experience for our team. I wonder if anyone else has done anything similar and what their results have been.
I recently came across this tweet:
The most scathing (and generally valid) criticism of "Agile" and Scrum I've ever read. Please read and RT.https://t.co/Pc85bBdiiA
— Michael Ibarra (@bm2yogi) April 5, 2016
Read the entire article. It’s fascinating. Here are some choice quotes:
work gets atomized into “user stories” and “iterations” that often strip a sense of accomplishment
from the work, as well as any hope of setting a long-term vision for where things are going.
Instead of working on actual, long-term projects that a person could get excited about, they’re
relegated to working on atomized, feature-level “user stories” and often disallowed to work on
improvements that can’t be related to short-term, immediate business needs (often delivered from
the sorts of projects that programmers want to take on, once they master the basics of the craft,
are often ignored, because it’s annoying to justify them in terms of short-term business value.
Technical excellence matters, but it’s painful to try to sell people on this fact if they want,
badly, not to be convinced of it.
Under Agile, technical debt piles up and is not addressed because the business people calling the
shots will not see a problem until it’s far too late or, at least, too expensive to fix it.
Moreover, individual engineers are rewarded or punished solely based on the optics of the current
two-week “sprint”, meaning that no one looks out five “sprints” ahead. Agile is just one mindless,
near-sighted “sprint” after another: no progress, no improvement, just ticket after ticket after ticket.
I can summarize my own views on Scrum by saying “Scrum is a process that teaches a team how not to need Scrum.” I’ll likely expand on that idea later, but it is not the focus of this post.
Short Term Planning
A common theme running through many of Church’s criticisms is the failure to plan long-term. Larger engineering tasks do not get done because they cannot fit into an arbitrarily short time-frame. Technical debt gets ignored. Software rots and it is regarded as normal.
Church places the blame on Scrum and appears to argue that engineers should be in charge of the project planning so that they can take on the larger, longer-term, more-expensive iniatives that will ultimately save the company money.
Checks & Balances
We’ve tried that. That was the default state of the industry since its inception. Software engineering was akin to black-magic to most people, but for others it’s just grunt-work. (I once had a supervisor say to me in all seriousness “I had a class in C in college. A loop is a loop right? Anybody can do this.”) Software engineers would spend their time building systems and frameworks for features that were ultimately not needed. (They still do. Watch this talk by Christin Gorman for a real-world example in a modern project.) This tendency of engineers to “gold-plate” their code or work on “what’s cool” resulted in huge expense-overruns for software that was delivered late, over-budget, and often did not do what was expected of it.
In an attempt to get some kind of control over engineering projects The Business started taking more direct control over software projects. Since they could not trust the engineers to stay focused on delivering business value, they ruthlessly began cutting anything that did not directly contribute to features they could see. This was a problem because they did not often understand the tradeoffs they were making which resulted in software that did what it was supposed to in v1, but became harder and harder to maintain over time due to poor engineering.
Agile/Scrum attempts to address this problem by creating a clear separation of responsibilities. The Business is responsible for feature definition and prioritization. Development is responsible for estimation and implementation. The Business decides what is built and when but Development is in control of how. Instead of trying to teach The Business software engineering, Development communicates about alternatives in terms of estimates and risks.
Development: If you choose strategy ‘A’, we estimate it will take this long with these risks.
If you choose strategy ‘B’, it will take longer, but have fewer risks.
In order for this division of labor to be successful, each party must commit to it. Introducing a Scrum (or any other) process to a team does not make the team agile. Just because the engineers practice TDD and practices Continuous Integration, Continuous Deployment and other good engineering practices does not mean the team is agile.
The Business is part of the team. It is the Business’ responsibility to do the long-term planning. It is the engineer’s responsibility to communicate about estimates and risk. If either party fails to do their part, it is not a failure of “Agile,” but a failure of the people involved.
You might be tempted to accuse me of the “No True Scotsman” fallacy at this point. If so, then you are missing my point. The Agile Software movement is focused on a philsophy for interacting with The Business which values the contributions of all stakeholders and encourages trust. It is not a prescription for particular processes. No process will succeed if the stakeholders do not buy into the underlying philosophy.
Agile is about the people. It’s right there in the manifesto.
Individuals and interactions over processes and tools
An observant critic might respond at this point by pointing to another pieces of the manifesto:
Responding to change over following a plan
Isn’t this an endorsement of short-term thinking? No. At the time the manifesto was written, it was much more likely that software would be written in months or years-long iterations. This created a problem in that business circumstances would change in ways that diminished the value of planned features before they were delivered. It is not an injunction against planning per se. I call your attention to the last sentence in the manifesto:
That is, while there is value in the items on the right, we value the items on the left more.
To adapt a popular adage, “Plan long-term, work short-term.”
I don’t know many software engineers who entered the profession because they wanted to work with people. However, working with people is an absolute must in just about every profession. If we want the business to think long-term and make solid engineering choices we must learn to communicate with them.
The Responsibility of Communication is bi-directional. We must learn to communicate about estimates and risks. The Business must learn to consider risks as well as the cost. The Business should learn engineering concepts at a high-level (e.g., we can scale better if we use messaging).
It’s my observation that there often is a long-term plan but it is not necessarily communicated. As engineers, we must endeavor to find out what it is. We must be honest when we think we are being pushed to make an engineering mistake. If we feel strongly about an option, we must become salesmen and sell our perspective.
How do I Communicate with The Business?
I once had a product-owner come to me and ask for a feature. My team estimated 2 weeks to implement and test the new feature. He needed the feature in 3 days. He had some technical knowledge and offered a solution that might get the feature done faster. His solution would work, but it involved a lot of bad practices. We reached an agreement in which we would deliver the solution in 3 days using the hacky solution, but our next project would be to implement the feature correctly.
We were able to have this conversation because my team and I had a track record of delivering what was asked for in a working, bug-free state on a consistent basis. In other words, we had trust. It took some time to build that relationship of trust, but not as much as you might think. When I’m communicating with The Business about engineering alternatives, I make sure I answer 3 questions.
- What problem are we trying to solve?
- How long will each alternative take to deliver?
- What are the risks associated with each alternative?
I make sure that The Business and I are in agreement on the answers to all 3 of these questions. Then I accept their decision.
What if They are Still Myopic?
If you’re sure you’ve done everything you can to communicate clearly about short vs. long-term options and risks and if The Business insists on always taking the short-term high-risk solution, then you might be forced to conclude that you don’t like working for that particular business. It might be time to move on.
I hate saying it, but most of the companies I worked for in the early part of my career are places I would not go back to. I started my career in Greenville, SC. The first company I truly enjoyed working for was in Washington DC. I ended up in Seattle, WA where I found a company that does embrace most of my values. Do we have problems? Absolutely. However, with reason and communication as our tools, we are addressing them.
I’ve had a blog for years but my blogging frequencey is intermittent. I wanted to see what a successful blogger would say, so I took John Sonmez’ Simple Programmer Blogging Course. I can’t say that there were any lightning bolt insights. If I had to boil the course down in four
words it would be “get off your ass,” along with some helpful tricks to overcome common roadblocks to getting that next blog written.
Two of my biggest bottlenecks have been trying to figure out what to write about, and perfectionism. “Did I cover everything relevant to [topic]? Did I correctly render the forest and the trees? Am I certain about the advice I gave?” Sonmez’ course gave me techniques to deal with these bottlenecks for which I’m grateful.
If you are at all interested in blogging but a) are not sure how to get started, or b) have felt stalled, I would recommend that you take the course. You’ve got nothing to lose–it’s free!
The Normalization of Deviance is a concept that
describes a situation in which an unacceptable practice has gone on for so long without a serious problem or disaster that this deviant practice actually becomes the accepted way of doing things.
credit: challenger explosion
You’re familiar with this concept already. You’ve encountered everytime you’ve seen someone doing they know to be wrong but justifying it with “we’ve never had a problem before.” This is the person who consistently buys things s/he can’t afford with credit cards. This is the person goes out drinking and drives home. This is the software engineer who writes code without tests. This is the business that ignores through continual deferment technical debt issues raised by its engineers.
As software engineers, we know we should remove dead code in projects. We know that we should automate software deployment. We know we should provide reliable automated tests for our features. We know we should build and test our software on a machine other than our personal dev box. We know that we should test our software in a prod-like environment that is not prod.
Do you do these things in your daily work? Does your organization support your efforts?
How do you identify the Normalization of Deviance?
One of the challenges you will face identifying Normalization of Deviance is the fact that things you do on a daily basis are… normal.
There are several “smells” that could indicate that your organization is having problems.
- You have “official” policies that do not describe how you actually do work.
- You have automation that routinely fails and requires handholding to reach success.
- You have unit or integration tests that fail constantly or intermittently such that your team
does not regard their failure as a “real” problem.
- You have difficult personalities in key positions who turn conversations about their effectiveness
into conversations about your communication style.
All of these issues are “of a kind,” meaning that they are all examples of routinely accepted failure. This is obviously not an exhaustive list.
Why is it a problem?
You will eventually have a catastrophic failure. Catastrophic failures seldom occur in a vacuum. Usually there are a host of seemingly unrelated smaller problems that part of daily life. Catastrophic failure usually occur when the stars align and the smaller issues coalesce in such a way that some threat vector is allowed to completely wreck a process. This is known as the Swiss Cheese Model of failure. I first learned about the Swiss Cheese Model in a book called The Digital Doctor, which is a holistic view of the positive and negative effects of software in the medical world. This book is well-worth reading. The section on the deadly consequences of “alert fatigue” would be of special interest to software engineers and UX designers.
A Real World Example
A fascinating case of software failure that destroyed a company overnight is the story of Knight Capital. In a writeup, Doug Seven lays the blame at the lack of an automted deployment system. I agree, though I think the problems started much, much earlier. Doug writes:
The update to SMARS was intended to replace old, unused code referred to as “Power Peg” – functionality that Knight hadn’t used in 8-years (why code that had been dead for 8-years was still present in the code base is a mystery, but that’s not the point).
It’s my point. The Knight Capital failure began 8 years before when dead code was left in the system. I’ve had conversations with The Business where I’ve tried to justify removing dead code. It’s hard to make them understand what the danger is. “It’s not hurting anything is it? It hasn’t been a problem so far has it?” No, but it will be.
The second nail in Knight Capital’s coffin was that they chose to repurpose an old flag that had been used to activate the old functionality. As Doug Seven writes:
The code was thoroughly tested and proven to work correctly and reliably. What could possibly go wrong?
The final nail is that Knight Capital used a manual deployment process. They were to deploy the new software to 8 servers. The deployment technician missed one. I don’t actually know this, but I can just imagine the technician staying after-hours to do the upgrade and wanting nothing more than to go home to his/her family or happy-hour or something.
At 9:30 AM Eastern Time on August 1, 2012 the markets opened and Knight began processing orders from broker-dealers on behalf of their customers for the new Retail Liquidity Program. The seven (7) servers that had the correct SMARS deployment began processing these orders correctly. Orders sent to the eighth server triggered the supposable repurposed flag and brought back from the dead the old Power Peg code.
There were more issues during their attempt to fix the problem, but none of it would have happened except that these 3 seemingly minor problems coalesced into a perfect storm of failure. The end result?
Knight Capital Group realized a $460 million loss in 45-minutes... Knight only has $365 million in cash and equivalents. In 45-minutes Knight went from being the largest trader in US equities and a major market maker in the NYSE and NASDAQ to bankrupt.
How do you fix it?
I’m still figuring that out. Luckly, Redacted Inc doesn’t have too many of these sorts of problems so my opportunities for practice are few, but here are my thoughts so far.
The biggest challenge in these scenarios is that people are so used to accepting annoying or unreliable processes as normal that they cease to see them as daily failure. It’s not until after disaster has struck that it’s clear that accepted processes were in fact failures. Nobody at Knight Capital was thinking “jeez, that dead code is really hurting us.”
There’s an old management adage: “You can’t manage what you can’t measure.” You can start addressing NOD issues by identifying risky patterns and practices that your organization uses in its daily standard operating procedure. If you can find a way, assign a cost to them. Consider ways in which these normal failures could align to cause catastrophy. If you have a sympathetic ear in management, start talking to them about this. Introduce your manager to these concepts. Tell them about Knight Capital. Your goal is to get management and The Business to see the failures for what they are. By measuring the risks and costs to your organization of acceptable failure you will have an easier time getting your voice heard.
Most importantly, come up with a plan to address the issues. It’s not enough to say “this is a problem.” You need to say “This is a problem and here are some solutions.” Go further still, show how you get your team from “here” to “there.” Try to design solutions that make the day to day work easier, not harder. Jeff Atwood calls this the Pit of Success. His blog scopes this concept to code, but it applies to processes as well. You want your team to “fall into” the right way to do things.
Another potential source of positive feed back are new members to your team. It may be hard to get them to open up for fear of crossing the wrong person, but they are new to your organization and they will see more clearly the things that look like failures waiting to happen. Nothing you are doing is normal to them yet.