Tuesday, October 05, 2010

Make your data always available - 30 minutes to set up

Having worked in technical support for a number of years, I can tell you the one thing that always stressed me out the most: the sound of a pager in the wee hours of the morning. That beeping sound is the most annoying one I've experienced in my IT career so far; it normally signaled an emergency that needed to be fixed asap, no matter how many hours it would take.

I don't think DBAs like being paged in the middle of the night for a system failure any more than I did. With DB2 database servers, you can minimize disruptions to your life, and keep your boss and your company happy, by investing 30 minutes of your time; it's that simple! Thirty minutes is all it takes to set up the HADR (High Availability Disaster Recovery) feature available with DB2. The technology used in HADR has been around for more than a decade. We based it on Informix (another IBM database server) and it is very robust.

The basic requirements for HADR are:

  • Two servers running DB2 (at least DB2 Express edition) with the same operating system and the same storage setup.
  • The DB2 fix pack levels and the operating system maintenance levels may differ between the two servers.
  • A network connection between the two servers.

Now let me explain how it works with a few figures. HADR works at the database level. Figure 1 shows a client at the top, and two servers using HADR and in sync at the bottom.

Figure 1 - How HADR works - Part 1

In the figure, the Primary server (on the left) is the DB2 server that handles all your transactions; the Standby server (on the right) is the DB2 server ready to take over the workload in case the primary server goes down.

In (1) the client performs transactions which are sent to the DB2 primary server. In (2) the DB2 primary server processes the transactions and stores the modified data on disk. The operations performed, for example an UPDATE statement, are also logged in files (log files), which are also stored on disk. In (3), the log files are sent to the Standby server which processes them in (4). This way the Standby is kept in sync with the Primary.
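To make steps (1) to (4) concrete, here is a minimal toy model in Python. It is only a sketch of the idea, not how DB2 implements HADR: the `Server` class, the `run_transaction` helper and the key/value "table" are all hypothetical names made up for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical stand-in for a database server: a table plus a log."""
    name: str
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def apply(self, record):
        """Apply one log record (key, value) and keep it in the local log."""
        key, value = record
        self.data[key] = value
        self.log.append(record)


def run_transaction(primary, standby, key, value):
    """Process a change on the primary, then ship it to the standby."""
    record = (key, value)
    primary.apply(record)   # step (2): primary updates data and writes its log
    standby.apply(record)   # steps (3) and (4): record is shipped and replayed


primary = Server("primary")
standby = Server("standby")
run_transaction(primary, standby, "balance:42", 100)  # step (1): client work
run_transaction(primary, standby, "balance:42", 75)

print(standby.data == primary.data)
```

Because the standby replays exactly the same records in the same order, it ends up with the same data as the primary, which is what "in sync" means in Figure 1.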

Figure 2 illustrates what happens when, for any reason, the Primary server crashes, as shown in (1).

Figure 2 - How HADR works - Part 2


In (2), Tivoli System Automation (TSA), which is included with the HADR feature, detects that the primary server has crashed and executes a "takeover" command. The takeover command switches the roles of the primary and standby servers. Note that in the figure, the server on the right is now identified as the primary. In (3), when the client's attempts to reach the server on the left fail, the "Client reroute" feature included with DB2 kicks in and automatically reroutes the client's work to the server on the right. Work then continues as usual, and in (4) transactions are processed and stored on disk. This whole process takes only 15 seconds!
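The failover sequence in Figure 2 can be sketched with another toy model in Python. Everything here (`Server`, `Client`, `takeover`) is a hypothetical name for illustration only; it is not the DB2 or TSA API. One simplification to keep in mind: this toy client retries the failed work by itself, while a real application may need to resubmit the interrupted transaction after being rerouted.

```python
class Server:
    """Hypothetical server that only accepts work while it is the primary."""

    def __init__(self, name, role):
        self.name = name
        self.role = role      # "primary" or "standby"
        self.up = True
        self.data = {}

    def execute(self, key, value):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        if self.role != "primary":
            raise RuntimeError(f"{self.name} is not the primary")
        self.data[key] = value


def takeover(standby):
    """Step (2): promote the standby once the primary is detected as down."""
    standby.role = "primary"


class Client:
    """Hypothetical client that knows one alternate server to fall back on."""

    def __init__(self, server, alternate):
        self.server = server
        self.alternate = alternate

    def execute(self, key, value):
        try:
            self.server.execute(key, value)
        except ConnectionError:
            # Step (3): reroute to the alternate server and retry once.
            self.server, self.alternate = self.alternate, self.server
            self.server.execute(key, value)


left = Server("left", "primary")
right = Server("right", "standby")
client = Client(left, right)

client.execute("order:1", "new")
right.data = dict(left.data)      # assume the standby was kept in sync

left.up = False                   # step (1): the primary crashes
takeover(right)                   # step (2): the standby becomes primary
client.execute("order:2", "new")  # steps (3) and (4): rerouted and processed
```

After the last call the client is talking to the server on the right, which now holds both orders.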

Figure 3 illustrates what happens when the server on the left is fixed and restarted. The server on the left is now the Standby and the server on the right is now the Primary. The steps shown in Figure 3 are the same as in Figure 1, with the roles reversed. The Standby server on the left first catches up by applying the log files generated on the Primary since the crash happened.

Figure 3 - How HADR works - Part 3
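The catch-up step described for Figure 3 can be pictured as replaying every log record the old primary missed while it was down. The sketch below is a toy illustration under that assumption; `catch_up` and the list-based log are hypothetical and not how DB2 stores or ships its logs.

```python
def catch_up(primary_log, standby_data, applied):
    """Replay log records from position `applied` onward.

    Returns the new log position of the standby.
    """
    for key, value in primary_log[applied:]:
        standby_data[key] = value
    return len(primary_log)


# State at the moment the server on the left crashed.
log = [("a", 1), ("b", 2)]
left_data = {"a": 1, "b": 2}
applied = len(log)

# While the left server is down, the new primary keeps logging changes.
log += [("a", 5), ("c", 3)]

# The left server comes back as the standby and replays what it missed.
applied = catch_up(log, left_data, applied)
```

Only the records written after the crash are replayed, so the restarted server does not have to be rebuilt from scratch.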

You can also use the standby server for read-only operations, for example reporting. This is an added benefit because it lets the primary server focus on transactional (OLTP) workloads while the standby takes care of reporting (DSS/OLAP) workloads. If a takeover operation is required, readers are blocked from the standby.

Another added benefit of HADR is that it allows you to perform rolling upgrades without disrupting your operations. This means you can suspend HADR to apply maintenance to one server first (say the standby) while operations are continuing on the primary. You can then start HADR again so the two servers are put in sync, and then switch the roles with a manual takeover command. Suspend HADR again, and then apply maintenance on the other server.
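The order of operations in a rolling upgrade matters, so here is a small Python sketch that walks through it. The function, the dictionary layout and the step descriptions are hypothetical and only mirror the paragraph above; the final "restart HADR" step after the second server is upgraded is my assumption of what naturally follows.

```python
def rolling_upgrade(primary, standby, new_level):
    """Upgrade both servers one at a time; return the new roles and the steps."""
    steps = []

    def upgrade(server):
        steps.append(f"suspend HADR and apply maintenance to {server['name']}")
        server["level"] = new_level
        steps.append(f"restart HADR so {server['name']} gets back in sync")

    upgrade(standby)                     # the primary keeps serving clients
    steps.append(f"manual takeover: {standby['name']} becomes the primary")
    primary, standby = standby, primary  # the roles are now switched
    upgrade(standby)                     # the old primary is upgraded last
    return primary, standby, steps


server_a = {"name": "A", "level": 1}
server_b = {"name": "B", "level": 1}
new_primary, new_standby, steps = rolling_upgrade(server_a, server_b, 2)

for number, step in enumerate(steps, start=1):
    print(number, step)
```

At every point in the sequence one of the two servers is acting as the primary, which is why clients see no interruption beyond the takeover itself.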

This feature provides a lot of value to any company and can be easily set up in minutes from the DB2 Control Center or IBM Data Studio. It allows for high availability, as described above, when both servers are in the same location. If you want this feature to support Disaster Recovery, all you need to do is place the standby server in another location (another building, state, country or continent). This is why the feature is called HADR.

There are many other options and scenarios with HADR. For example, you can set up HADR in the Cloud. I will also be talking about HADR next week at the developerWorks CC4D (Cloud Computing for developers) event. My session is titled "Database design for multi-tenancy and resiliency". HADR fits the "resiliency" part.

Though HADR is not available with DB2 Express-C, it is available with DB2 Express, which also has other benefits.

For more details about HADR and other high availability and disaster recovery scenarios, watch the replay of this Chat with the labs event.

Cheers, Raul.

3 comments:

  1. fantastic read!
    well written crisp primer on HADR....

  2. Hope now I am clear with HADR...

  3. please note that HADR Read On Standby Feature is only available after V97 Fixpack 1.
    So if you're still on v97 , you won't be able to use this feature.
