Mistakes were made

I generally avoid Reddit; I figure I have enough things in my life sucking up my time. But from time to time a link comes across my screen that I find interesting. This is one of them.

The user accidentally deleted a production database. Now, I think we can all agree that deleting a database in production is a “bad thing”. But, whose fault is this really?

Yes, one could argue the employee should have been more careful, but let’s back up.

The respondents in the thread raise several good points.

  • Why did the values in the documentation point to a LIVE, production database? Why not point to a dev copy or, better yet, a database that doesn’t really exist? They expect the person to update the parameters anyway, so the worst case, if someone leaves the fake values in, is that nothing happens.
  • Why didn’t Production have backups? This is a completely separate question, but a very important one!
  • Why fire him? As many pointed out, he had just learned a VERY valuable lesson, and taught the company a very valuable lesson too!

I’ll admit, I’d done something similar in my career at one of my employers. Except I wasn’t an intern, I was the Director of IT, and my goal in fact WAS to do something on the live database. The mistake I made was a minor one in execution (I reversed the direction of an arrow in the GUI before hitting the Execute button) but disastrous in terms of impact. And of course there wasn’t a recent enough backup.

I wasn’t fired for that. I did develop and enforce our change control documents after that, and always ensured, as much as possible, that we had adequate backups. (Later in my career, a bigger muckup did get me… “given the opportunity to apply my skills elsewhere”, but there were other factors involved, and I arguably wasn’t fired JUST for the initial issue.)

As the Director of IT, I made a point of telling my employees that story. And I explained to them that I expected them to make mistakes; if they didn’t, they probably weren’t trying hard enough. But I told them the two things I wouldn’t accept were lying about a mistake (trying to cover it up, blaming others, etc.) and repeatedly making the same mistake.

I wrote in an earlier post that mistakes were avoidable. But as I pointed out, it’s actually more complex than that. Some mistakes are avoidable, or at least they can be managed. For example, it is unfortunately likely that at some point, someone, somewhere, will munge production data. Perhaps they won’t delete it all, or perhaps they’ll make a White Ford Taurus type mistake, but it will happen. So you have safeguards in place. First, limit the number of people in a position to make such a mistake. Second, have adequate backups. There are probably other steps you can take to reduce the chance of error and mitigate it when it does eventually happen. Work on those.

But don’t fire the person who just learned a valuable lesson. They’re the one least likely to make that mistake again. Me, I’d probably fire the CTO for not having backups, for having production values in documentation like that, AND for firing the new guy.

Small Disasters

Today was an interesting confluence of events. I was exchanging emails with an associate who is in the middle of getting a Master’s in Disaster Management and we were talking about scale and scope of disasters.

At about the same time I was monitoring email from one of my clients. The thread started out with a fairly minor report: Viewpoint Drive – Water Main Break. Not a huge, earth-shattering disaster. Simply a notice that there was a water line break in a nearby road, asking people to let management know if they noticed any issues.

Within an hour there was a follow-up email stating that there was no longer adequate water pressure in the building and that folks should go home and finish their workday there. Furthermore, employees were told that for the next day the company was securing water bottles for drinking water and would be bringing in portable toilets.

Now, when people think about disasters, they often think about fires and other things that might destroy a building. But that’s pretty rare. It’s the other things that companies don’t necessarily plan for. Your company may have adequate backups of all its servers (but are you sure?), but does it have a plan for not having water?

I’ve worked with managers who have basically said, “eh, we can work around that.” Truth is, legally in most cases they can’t. If the building doesn’t have potable water and working sanitation facilities many municipalities won’t allow it to be occupied.

So does your company have a plan? Are the people who can authorize expenditures in the loop? Who is going to declare a disaster and put the plan into motion? Who will sign for the porta-potties when they show up? These are some of the things you have to think about.

So disasters are about more than just a good set of backups. Sometimes it’s about the toilets. Think about that.

 

Who’s Flying the Plane

I mentioned in an earlier post my interest in plane crashes. I had been toying with a presentation based on this concept for quite a while.

A little over a month ago, at the local SQL Server user group here in Albany, I offered to present at the February meeting. I gave them a choice of topics: a talk on Entity Framework and how its defaults can be bad for performance, or a talk on plane crashes and what IT can learn from them. They chose the latter. I guess plane crashes are more exciting than a dry talk on EF.

In any event, the core of the presentation is based on the two plane crashes mentioned in the earlier post: Eastern Airlines Flight 401, the L-1011 that crashed in Florida in 1972, and US Airways Flight 1549, the Miracle on the Hudson in 2009.

I don’t want to reproduce the entire talk here (in part because I’m hoping to present it elsewhere) but I want to highlight one slide:

Flight 401 vs 1549

  • Flight 401 – Perfectly good aircraft
  • Flight 1549 – About as bad as it gets
  • Flight 401 – 101 Fatalities/75 Survivors
  • Flight 1549 – 0 Fatalities

Flight 401 had nothing worse than a burned-out nose gear indicator light, and it crashed.

Flight 1549 had two non-functional engines and everyone got off safely.

The difference was good communication, planning, and a focus at all times on who was actually flying the airplane.

Think about this the next time you’re in a crisis. Are you communicating well? How is your planning? Is someone actually focused on making sure things don’t get worse because you’re focusing on the wrong problem? I touch upon that here when I talk about driving.

The moral: always make sure someone is “flying the plane”.

On Call

I want to pass on a video I’ve finally gotten around to watching:

Dave O’Conner speaks

I’ve managed a number of on-call teams to various levels of success. One point I’d add that makes a difference is good buy-in from above.

He addresses several good points, most of which I fully agree with and have, at various times, adopted at my various jobs.

One thing he mentions is availability. Too often folks claim they need 99.999% uptime. My question has often been “Why?”, followed by “Are you willing to pay for that?” Often the why boils down to “umm… because…”, and the answer on paying for it was “no”, at least once they realized the true cost.
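To make that conversation concrete, here’s a quick back-of-the-envelope sketch (Python, purely illustrative) of how little downtime each additional “nine” actually allows per year:

    # Rough downtime budget per year for a few common availability targets.
    MINUTES_PER_YEAR = 365.25 * 24 * 60

    for availability in (0.99, 0.999, 0.9999, 0.99999):
        downtime_minutes = MINUTES_PER_YEAR * (1 - availability)
        print(f"{availability:.3%} uptime allows ~{downtime_minutes:,.1f} minutes of downtime per year")

Five nines works out to roughly five minutes a year; seeing that number is usually where the “Are you willing to pay for that?” conversation ends.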

I also had a rule that I sometimes used: “If there was no possible response or no response necessary, don’t bother alerting!”.

An example might be traffic flow. I’ve seen setups where, if traffic exceeds a certain threshold even once in, say, a one-hour period (with monitoring every 5 seconds), a page would go out. Why? By the time you respond, it’s gone and there’s nothing to do.

A far better response is to automate it such that if it happens more than X times in Y minutes, THEN send an alert.
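A minimal sketch of that idea (Python, with made-up class and threshold names rather than any particular monitoring tool) is just a sliding window of recent breaches:

    from collections import deque
    import time

    class ThresholdAlerter:
        """Alert only when a condition trips more than max_events times within window_seconds."""

        def __init__(self, max_events, window_seconds):
            self.max_events = max_events
            self.window_seconds = window_seconds
            self.breaches = deque()  # timestamps of recent threshold breaches

        def record_breach(self, now=None):
            """Record one breach; return True only if it is time to page someone."""
            now = time.time() if now is None else now
            self.breaches.append(now)
            # Forget breaches that have aged out of the window.
            while self.breaches and now - self.breaches[0] > self.window_seconds:
                self.breaches.popleft()
            return len(self.breaches) > self.max_events

    # Example: page only if traffic exceeds the threshold more than 3 times in 10 minutes.
    alerter = ThresholdAlerter(max_events=3, window_seconds=600)
    if alerter.record_breach():
        print("send page")  # stand-in for whatever actually notifies the on-call person

A single spike gets quietly recorded and forgotten; only a sustained pattern wakes anyone up.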

In some cases, simply retrying works.  In the SQL world I’ve seen re-index jobs fail due to locking or other issues.  I like my sleep.  So I set up most of my jobs to retry at least once on failure.

Then, later, I’ll review the logs. If I see a constant pattern of retries, I’ll schedule time to fix it.

At one client, we had an issue where a job would randomly fail maybe once a month.  They would page someone about it, who would rerun the job and it would succeed.

I looked at the history and realized that simply adding a delay of about 5 minutes on failure, then retrying, would reduce the number of times someone had to be called from about once a month to about once every 3 years. Fifteen minutes of reviewing the problem during a normal 9-5 timeframe, plus 5 minutes of checking the math and implementing the fix, meant the on-call person could get more sleep every month. A real win.
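Here’s a minimal sketch of both the fix and the math (Python, with a hypothetical job callable standing in for the real work; in practice the scheduler’s built-in retry settings, such as the retry attempts and retry interval on a SQL Server Agent job step, do the same thing). It assumes the retry’s chance of failing is roughly independent of the first attempt’s:

    import time

    RETRY_DELAY_SECONDS = 5 * 60  # wait about five minutes before the lone retry

    def run_with_retry(job, retries=1, delay=RETRY_DELAY_SECONDS):
        """Run a job; on failure, wait and retry before letting it page anyone."""
        for attempt in range(retries + 1):
            try:
                return job()
            except Exception:
                if attempt == retries:
                    raise  # out of retries -- now it is worth waking someone up
                time.sleep(delay)

    # The math: a nightly job that fails about once a month fails on roughly
    # 1 run in 30.  If the retry five minutes later fails independently at the
    # same rate, both fail on about 1 run in 900 -- roughly every 2.5-3 years.
    p_fail = 1 / 30
    runs_between_pages = 1 / (p_fail ** 2)   # about 900 nightly runs
    print(f"~{runs_between_pages / 365:.1f} years between pages")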

Moral of the story: not everything is critical, and when something is, handle it as if it is, not as an afterthought.

Git ‘r Done (part 2)

Someone recently forwarded the following article to me: “Get Shit Done: The Worst Startup Culture Ever”. Before reading it, I was a bit ready to disagree. (See my previous post on getting stuff done.)

But after reading it, I have to agree with the article’s premise, and point out that I think there are two different ways of looking at what “Get Stuff Done” can mean.

At my current assignment, a coworker and I were joking about how some people have so many letters after their names: PMP, CAPM, PMI-SP, and the like.

So we joked that we needed some letters of our own, and we settled on GSD – Get Stuff Done. At times on this particular project we seemed to be the only ones accomplishing much, or caring about accomplishing much. We had one person who was more concerned with the agenda of the daily meeting (yes, daily meetings to see why the project wasn’t getting done; with 5-6 people in that room, that’s 25 or more person-hours per week spent discussing why things weren’t getting done).

So in that context (decide what your goal is, and actually GET IT DONE), I think “Get ‘r Done” is an important concept.

On the other hand, I have seen (and fallen prey to myself, both as a manager and as an employee) the “Get ‘r Done” attitude described in the article above.

The project I was working on above never got done. It wasn’t for lack of effort on my part, or on the part of several others. It was, though, for lack of effort on the part of management. At one point they asked me what could be done to make sure the project was completed on time. I gave them several examples of areas where they could put some pressure on another group to streamline some procedures.

I was basically told that wasn’t going to happen, and that I had to work harder and “get ‘r done”.  At this phase of the project, I needed 4-5 items from another group and the other group had a policy that each item needed a separate ticket.  Each ticket had to be done sequentially and could only be submitted when the previous ticket was closed out.  Oh, and their policy was 2 weeks per ticket.  Period.

So, by my math, that’s 8-10 weeks. That assumes every ticket goes smoothly, which had not been our experience with this other group.

The project due date was in 6 weeks.

So I was being told to get things done in an impossible fashion. Talk about demotivating.

In general, I’ve been at my best as a manager when I’ve been given the tools to let my team get the job done. It may be buying them dinner one night as a morale boost. It may be making sure no extra work gets thrust upon them, or keeping certain other managers from trying to add to their work queue. In one case, it was buying a new NAS so we had enough storage space that we weren’t getting paged every night about disk space issues. When properly motivated, people can move mountains and, better yet, can often do it in a normal work week.

So, if you want to get it done, make sure the people on your team have the tools to do their jobs, aren’t being distracted, and aren’t being given reasons to have low morale. They’ll move mountains for you. But ask them to work harder without any of the above, and sooner or later you’ll find yourself without a team, with your boss simply asking you to work harder!

By the way, about that NAS: I think that $5K investment probably helped keep a key employee of mine from jumping ship for greener pastures. It was probably a better investment than offering him another $5K to stay happy despite the lack of sleep from all the pages and other issues.

Moral: if you want them to “get ‘r done”, give them the tools they need, remove barriers, and keep morale up. They’ll get it done.