Don’t Break the Chain!

If one backup is good, two are better, right?

Not always.

Let me start by saying I’ve often been very skeptical of SQL Server backups done by third-party tools. There are really two reasons. For one, many years ago (when I first started working with SQL Server) they often simply weren’t good; they had issues with consistency and the like. Over time, and with the advent of services like VSS, that issue has become moot (though I’ll admit old habits die hard).

The second reason is that I hate relying on things I don’t have complete control over. As a DBA, I feel it’s my responsibility to make sure backups are done correctly AND are usable. If I’m not completely in the loop, I get nervous.

Recently, a friend had a problem that brought this issue to light. He was asked to go through their SQL Server backups to find the time period when a particular record was deleted so they could develop a plan for restoring the data deleted in the primary table and in the subsequent cascaded deletes. Nothing too out of the ordinary. A bit tedious, but nothing too terrible.

So he did what any DBA would do: he restored the full backup of the database for the date in question. Then he found the first transaction log backup and restored that. Then he tried to restore the second transaction log.

The log in this backup set begins at LSN 90800000000023300001, which is too recent to apply to the database. An earlier log backup that includes LSN 90800000000016600001 can be restored.

Huh? Yeah, apparently there’s a missing log.  He looks at his scheduled tasks. Nope, nothing scheduled. He looks at the filesystem.  Nope, no files there.

He tries a couple of different things, but nope, there’s definitely a missing file. Anyone who knows anything about SQL Server backups knows that you can’t break the chain. If you do, you won’t get very far. This can work both ways: I once heard of a situation where the full backups weren’t recoverable, but the team was able to create a new, empty database and apply five years’ worth of transaction logs. Yes, five years’ worth.

This was the opposite case. They had the full backup they wanted, but couldn’t restore even five hours’ worth of logs.
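For anyone who hasn’t had to walk a chain by hand, it’s just a sequence of restores like the following. This is only a sketch; the database name and file paths are made up.

-- Restore the full backup, leaving the database able to accept log backups
RESTORE DATABASE SalesDB
   FROM DISK = N'D:\Backups\SalesDB_Full.bak'
   WITH NORECOVERY;

-- Apply each transaction log backup, in order, with no gaps
RESTORE LOG SalesDB
   FROM DISK = N'D:\Backups\SalesDB_Log_1.trn'
   WITH NORECOVERY;

RESTORE LOG SalesDB
   FROM DISK = N'D:\Backups\SalesDB_Log_2.trn'
   WITH NORECOVERY;   -- this is the step that throws the LSN error above if a log backup is missing

-- Once the last log you need has been applied, bring the database online
RESTORE DATABASE SalesDB WITH RECOVERY;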

So where was that missing transaction log backup?

My friend did some more digging in the backup history tables in msdb and found this tidbit:

backup_start_date:    11/9/2016 0:34
backup_finish_date:   11/9/2016 0:34
first_lsn:            90800000000016600000
last_lsn:             90800000000023300000
physical_device_name: NUL
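(For the record, a query along these lines against msdb’s backup history tables is how you find this sort of thing. It’s a sketch; the database name is hypothetical.)

-- One row per backup for the database, with the device each was written to
SELECT  bs.backup_start_date,
        bs.backup_finish_date,
        bs.type,                        -- D = full, I = differential, L = log
        bs.first_lsn,
        bs.last_lsn,
        bmf.physical_device_name
FROM    msdb.dbo.backupset AS bs
JOIN    msdb.dbo.backupmediafamily AS bmf
        ON bmf.media_set_id = bs.media_set_id
WHERE   bs.database_name = N'SalesDB'   -- hypothetical database name
ORDER BY bs.backup_start_date;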

There was the missing transaction log backup. It was taken a few minutes after the full backup, and it definitely wasn’t part of the scheduled backups he had set up. The best he can figure is that the sysadmin had configured the SAN snapshot software to take a full backup at midnight and then, for some reason, a transaction log backup just minutes later.

That would have been fine, except for one critical detail. See that last value, physical_device_name? It’s set to NUL. So the missing backup wasn’t written to tape, or to another spot on disk, or anyplace like that. It was sent to the great bit bucket in the sky. In other words, my friend was SOL, simply out of luck.
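For what it’s worth, it takes exactly one statement to create that kind of black-hole backup, which is why it’s so easy for snapshot or third-party tooling to do it without anyone noticing (database name hypothetical):

-- A "real" log backup as far as SQL Server is concerned: it truncates the log
-- and records itself in msdb, but the backup itself is thrown away forever
BACKUP LOG SalesDB TO DISK = 'NUL';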

Now, fortunately, the original incident, while a problem for his office, wasn’t a major, business-stopping one. And while he can’t fix the original problem he was facing, he discovered the issues with his backup procedures long before a major incident occurred.

I’m writing about this incident for a couple of reasons. For one, it emphasizes why I feel so strongly about realistic DR tests. Don’t just write your plan down. Run through it once in a while. Make it as realistic as it can be.

BTW, one of my favorite tricks, which I use for multiple reasons, is to set up log shipping to a second server. Even if that second server can never be used for production because it lacks the performance, you’ll know very quickly if your chain is broken.
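If you do go that route, a quick sanity check on the secondary is to ask msdb when logs were last restored there. A rough sketch, assuming the secondary database is named SalesDB:

-- Most recent restores applied on the secondary; if this date stops
-- moving forward, the chain (or the copy/restore jobs) is broken
SELECT TOP (10)
        rh.destination_database_name,
        rh.restore_type,     -- D = database, L = log
        rh.restore_date
FROM    msdb.dbo.restorehistory AS rh
WHERE   rh.destination_database_name = N'SalesDB'
ORDER BY rh.restore_date DESC;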

Also, I thought this was a great example of how doing things twice doesn’t necessarily make things more resistant to disaster. Yes, had this been set up properly, it would have resulted in two separate full backups being taken, in two separate places. That would have been better. But because of a very simple mistake, the setup was worse than if only one backup had been written.

I’d like to plug my book, IT Disaster Response, due out in a bit over a month. Pre-order now!


Small Disasters

Today was an interesting confluence of events. I was exchanging emails with an associate who is in the middle of getting a Master’s in Disaster Management and we were talking about scale and scope of disasters.

At about the same time I was monitoring email from one of my clients. The thread started out with a fairly minor report: Viewpoint Drive – Water Main Break. Not a huge, earth-shattering disaster. Simply a notice that there was a water main break in a nearby road, asking people to let management know if they noticed any issues.

Within an hour there was a follow-up email stating that there was no longer adequate water pressure in the building and that folks should go home and finish their workday there. Furthermore, employees were told that for the next day the company was securing water bottles for drinking water and would be bringing in portable toilets.

Now, when people think about disasters, they often think about fires and other things that might destroy a building. But that’s pretty rare. It’s the other things that companies don’t necessarily plan for. Your company may have adequate backups of all its servers (but are you sure?), but does it have a plan for not having water?

I’ve worked with managers who have basically said, “Eh, we can work around that.” Truth is, legally, in most cases they can’t. If the building doesn’t have potable water and working sanitation facilities, many municipalities won’t allow it to be occupied.

So does your company have a plan? Are the people who can authorize expenditures in the loop? Who is going to declare a disaster and put the plan into motion? Who will sign for the porta-potties when they show up? These are some of the things you have to think about.

So disasters are about more than just a good set of backups. Sometimes they’re about the toilets. Think about that.

 

Testing

This ties in with the concept of experimentation. Thomas Grohser related a story the other night of a case of “yeah, the database failed and we tried to do a restore and found out we couldn’t.”

Apparently their system could somehow make backups but couldn’t restore them. BIG OOPS. (They did manage to create an empty database, replay 4.5 years of transaction logs, and recover their data. That’s impressive in its own right.)

This is not the first time I’ve worked with a client, or heard of a company, whose disaster recovery plan didn’t survive its first actual use. It may sound obvious, but companies need to test their DR plans. I’m in fact working with a partner on a new business to help companies think about their DR plans. Note, we’re NOT writing or creating DR plans for companies; we’re going to focus on how companies go about actually implementing and testing their DR plans.

Fortunately, right now I’m working with a client that had an uncommon use case. They wanted a restore of the previous night’s backup to a different server every day.

They also wanted to log-ship the database in question to another location.

This wasn’t hard to implement.
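The nightly refresh boils down to a scheduled job on the second server running something like this (the names, paths, and file layout here are all made up):

-- Overwrite yesterday's copy with last night's full backup and bring it online
RESTORE DATABASE SalesDB_Copy
   FROM DISK = N'\\backupshare\SalesDB\SalesDB_Full.bak'
   WITH MOVE N'SalesDB'     TO N'E:\Data\SalesDB_Copy.mdf',
        MOVE N'SalesDB_log' TO N'F:\Logs\SalesDB_Copy_log.ldf',
        REPLACE,     -- overwrite the existing copy on the test server
        RECOVERY;    -- leave it online so it can actually be queried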

But what’s very nice about this setup is that every 15 minutes we get a built-in, automatic test of their log backups. If for some reason log backups stop working, or a log backup is corrupt, we’ll know in fairly short order.

And, with the database copy, we’ll know within a day if their backups fail.  They’re in a position where they’ll never find out 4.5 years later that their backups don’t work.

This client’s DR plan needs a lot of work; they actually have nothing formal written down. However, they know for a fact their data is safe. This is a huge improvement over companies that have a DR plan but have no idea whether their data is safe.

Moral of the story: I’d rather know my data is safe and my DR plan needs work than have a DR plan but not have safe data.

Documentation

Do it, it’s important.

Ok, I suppose I should expand a bit upon that and in this case add an actual example.

So last night I again attended the local SQL Server User Group meeting. The talk this month was by Ray Kim and was on Documentation for Techies. While we all agree that documentation is good, it’s sort of interesting how rarely most techs actually do it. Ray’s talk covered some of this and went on to discuss exactly how valuable documentation is. In addition, several audience members spoke about how proper documentation saved their companies a great deal of money, simply by allowing their tech support people to answer questions far faster.

I got to thinking about some of the clients I’ve worked for and how I’ve wanted to document stuff, but often they have very little actually set up in the way of procedures for handling documentation. This is unfortunate, because it can cost them money. For example, I’m currently automating a task for a client. It turns out there’s not much documentation, so I’m basically struggling to figure things out as I go.

One thing you hear tech folks say a lot is “oh, the code is self-documenting.” And sometimes it is. Since I work in SQL, it’s often, but not always, clear what the code is doing. For example:

Select firstname, lastname from Clients where ClientID=@ClientID

probably doesn’t need a comment saying what it does. It’s pretty clear. But a more complex query might need some commenting, or it may need some explanation of why a particular approach was taken. For example, I was recently writing a stored procedure where the WHERE clause was not quite what one would expect if one were to write it in the most obvious manner. The obvious version would have resulted in a table scan of a very large table; by writing it the way I did, I could ensure a seek would occur.
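I can’t share the actual procedure, but here’s a made-up example of the general idea. Both queries return the same rows; the first wraps the column in a function, which prevents an index seek, while the second expresses the same logic as a range the optimizer can seek on.

-- Obvious, but not sargable: the function on OrderDate forces a scan
SELECT OrderID, ClientID
FROM   dbo.Orders
WHERE  YEAR(OrderDate) = 2016;

-- Same logic written as a range: an index on OrderDate can be used for a seek
SELECT OrderID, ClientID
FROM   dbo.Orders
WHERE  OrderDate >= '20160101'
  AND  OrderDate <  '20170101';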

I also had a habit which, after thinking about it last night and testing it today, I’m going to modify a bit. Often I’d write procedures such as:

-- Usage: Exec FOO
-- Author: Greg D. Moore
-- Date: 2016-03-15
-- Version: 1.0
-- This simply returns bar when executed
if OBJECT_ID('foo', 'p') is not null drop procedure foo
go
create procedure foo
as
select 'bar'
go

Now, technically this is a T-SQL script that will drop and then create the procedure, so it’s more than just the procedure definition. But it’s useful for me because I can ensure I’m running the latest and greatest, dropping the old version if it exists before creating the new one.

But last night got me thinking: what happens if, three years down the road, someone comes along and needs to edit my code? Let’s say the client didn’t do a good job of keeping track of source code, and they have to extract the scripts that create the procedures from SQL Server itself using, say, SSMS.

The results end up looking much more like this:

USE [Baz]
GO
/****** Object:  StoredProcedure [dbo].[foo]    Script Date: 03/15/2016 10:47:22 ******/
IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[foo]') AND type in (N'P', N'PC'))
DROP PROCEDURE [dbo].[foo]
GO
USE [Baz]
GO
/****** Object:  StoredProcedure [dbo].[foo]    Script Date: 03/15/2016 10:47:22 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
create procedure [dbo].[foo]
as
select 'bar'
GO

Ignore the extra USE statements and the SSMS-generated comments and SET statements. Notice that my comments are gone. This actually makes sense: in the first script, the comments occur before a GO statement, so SQL Server treats them as part of the preceding batch, completely separate from the batch that creates the actual stored proc. They’re never stored as part of the procedure definition, so all my useful comments are now history.

BUT, there’s a simple solution. Move the comments to after the first GO statement.

if OBJECT_ID('foo', 'p') is not null drop procedure foo
 
go
 
-- Usage: Exec FOO
-- Author: Greg D. Moore
-- Date: 2016-03-15
-- Version: 1.0
-- This simply returns bar when executed
-- Version: 1.1
-- Comments moved below GO statement
 
create procedure foo
as
 
select 'bar'
go

Now if I use SSMS to generate my script I get:

USE [Baz]
GO

/****** Object: StoredProcedure [dbo].[foo] Script Date: 03/15/2016 10:48:53 ******/
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[foo]') AND type in (N'P', N'PC'))
DROP PROCEDURE [dbo].[foo]
GO

USE [Baz]
GO

/****** Object: StoredProcedure [dbo].[foo] Script Date: 03/15/2016 10:48:53 ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

-- Usage: Exec FOO
-- Author: Greg D. Moore
-- Date: 2016-03-15
-- Version: 1.0
-- This simply returns bar when executed
-- Version: 1.1
-- Comments moved below GO statement

create procedure [dbo].[foo]
as

select 'bar'

GO

Now my great documentation is preserved. It’s a small thing, but down the road it could save the next developer a lot of trouble.

So, stop and think about not only documentation, but how to make sure it’s preserved and useful in the future.

Who’s Flying the Plane

I mentioned in an earlier post my interest in plane crashes. I had been toying with a presentation based on this concept for quite a while.

A little over a month ago, at the local SQL Server User Group here in Albany, I offered to present at the February meeting. I gave them a choice of topics: a talk on Entity Framework and how its defaults can be bad for performance, or a talk on plane crashes and what IT can learn from them. They chose the latter. I guess plane crashes are more exciting than a dry talk on EF.

In any event, the core of the presentation is based on the two plane crashes mentioned in the earlier post: Eastern Airlines Flight 401, the L-1011 crash in Florida in 1972, and US Airways Flight 1549, the Miracle on the Hudson in 2009.

I don’t want to reproduce the entire talk here (in part because I’m hoping to present it elsewhere) but I want to highlight one slide:

Flight 401 vs 1549

  • Flight 401 – Perfectly good aircraft
  • Flight 1549 – About as bad as it gets
  • Flight 401 – 101 Fatalities/75 Survivors
  • Flight 1549 – 0 Fatalities

Flight 401 had a burned-out nose gear indicator light and crashed.

Flight 1549 had two non-functional engines and everyone got off safely.

The difference was good communication, planning, and a focus at all times on who was actually flying the airplane.

Think about this the next time you’re in a crisis. Are you communicating well? How is your planning? And is someone actually focused on making sure things don’t get worse because you’re focusing on the wrong problem? I touch upon that here when I talk about driving.

The moral: always make sure someone is “flying the plane”.

GIGO

A huge tenet of programming is GIGO: Garbage In/Garbage Out.

Years ago I was rehearsing for a play (Night of January the 16th by Ayn Rand), in which I played the bailiff. At one point in the play I’m handed a copy of a check that is evidence, and I’m supposed to “read” what’s on the check. Of course, since it’s a play, I have my lines memorized.

But during one dress rehearsal I was given a piece of paper with actual writing on it. Unfortunately, it was just some random writing. My brain went into a segfault and I stopped. Part of my brain wanted to read what was on the piece of paper; part of my brain wanted to say my lines, but it could no longer remember them.

It was a perfect example of how easy it is to scramble the input for our brains.  In the actual performances we made sure the piece of paper was actually blank.

I was reminded of this the other night when Steve Harvey made his gaffe on live television. I was curious how he could make such a mistake, but I had my suspicions. And I was right. The cue card apparently was VERY poorly designed, and his visual input system (i.e., his eyes and brain) screwed up. Read here for more details. Bad input led to bad output.

These are humorous examples, but in the software world, these can be very dangerous.

At one point during the shuttle program, they found an error where the arm thought it had rotated more than 360 degrees, a physical impossibility. This link has some details (though in my recollection the issue was not a rounding error, but that the code went from 0-360 instead of 0-359 or 1-360). Garbage in could have led to potentially bad garbage out.

Much more recently, however, here’s an example of intentional “garbage” in. This is part of the encryption software used in many firewalls. Your bank or other financial institution, for example, may be using this code.

Ironically, true garbage, as in a purely random number, might have been better. But here it seems someone poisoned the input with their own specific number and then set things up to use the results in a dangerous manner. I say dangerous because the third party using this code may not realize that they’re completely vulnerable to having all their data seen. About the only thing worse than unencrypted data is data you think is encrypted but isn’t; with the former, I’m at least going to pay far more attention to who has access. I’ll add, too, that some of us suspect the NSA had a hand in this.

This is, by the way, why I highly recommend folks not write their own encryption. Unless you’re an expert, you’re liable to screw it up.

Moral: be careful with your inputs; they definitely influence your outputs, both in code and in your brain.

Never run out of a plan

I’ve actually been meaning to blog about this for a while, but have been putting it off, so here goes.

I’ve mentioned in the past my analogy of “flying the plane”. Lately I’ve been spending a lot of time on a site called Quora. It’s quite a fun site and I’ve learned quite a bit.

But this particular question I think is a great one for life in general.

Scrolling down, you’ll see a post from Jim Mantle. I want to take a quote from his answer:

There have been many air crashes where a problem was being worked by both pilots, neither was flying the aircraft, and they had a Very Bad Day.

If you read about the L-1011 crash, you’ll see the real mistake was failing to actually fly the plane. The crew was so engrossed in solving the problem of a burnt-out landing gear indicator light that they missed the fact that the plane was flying into the ground. A simple burned-out bulb, and 101 people died.

Compare that to the Miracle on the Hudson where the pilots had a MUCH worse problem (lack of power in either engine) and managed to bring the plane down safely without any loss of life.

He also has good advice that he repeats often: “Keep calm.”

I also want to quote Dirk Van Der Walk who later says:

You can run out of height, you can run out of engine, but one thing you can never run out of, is a plan. You must always have a Plan B.

I had a client a few years ago that had called me in to implement a specific change in their infrastructure.  There was also a fairly specific timetable by which it had to be done.

I met with the CTO about once a month to go over the status of the project. At one point it became clear that, due to certain corporate policies, it would take about 12 weeks to get to a certain milestone in the project. Unfortunately, the schedule demanded we be there in about 8 weeks.

He asked me what we could do.  I explained I had no control over the corporate policies and that we should start to consider a Plan B.  I’m quite proud that I kept my jaw from hitting the floor when he uttered his next sentence.

There is no plan B and there can’t be a plan B.

This is an example of taking the mantra “Failure is not an option” to a whole new level.

Ironically, I was there about a month later when the CTO was basically called on the carpet over the status of the project, and when it was clear he had no plan B, the corporate folks spent the next 24 hours designing one.

In part this wasn’t too hard, because the internal people on the project already had several plan B’s in mind.

It was only because others did have a plan B that we were able to save any real semblance of the original goal.

Moral of the story: always have a backup plan.  And start thinking about a backup plan to the backup plan.