Resource Reservation for High Availability Applications

This section describes example approaches that illustrate how to use resource reservation in ThinkAgile CP to ensure High Availability for your critical application instances in the event of node failure.

When a node fails, the virtual machines (VMs), or application instances, from that node are moved to other nodes in the migration zone that have enough resources available. ThinkAgile CP looks at the memory requirements of all application instances on the target node, adds the memory needs of the VM being moved, and restarts the VM on the target node only if the total memory usage would remain below 100%. The VM restart fails if no such node is available.
Note: In the ThinkAgile CP platform, a VM is referred to as an application instance. For more information about application instances, see Working with application instances.
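
The restart rule described above can be sketched in a few lines of Python. This is an illustration only, not ThinkAgile CP code: the function name, the node dictionary, and the first-fit search order are all assumptions made for the example. The only rule taken from the text is that a VM restarts on a node only if total memory usage there stays below 100%.

```python
from typing import Dict, Optional, Tuple


def pick_restart_node(
    vm_mem_gb: float,
    nodes: Dict[str, Tuple[float, float]],
) -> Optional[str]:
    """Return the first surviving node that can host the VM, or None.

    `nodes` maps a hypothetical node name to (used_gb, total_gb).
    First-fit ordering is an assumption; the platform's real selection
    order is not described in this section.
    """
    for name, (used_gb, total_gb) in nodes.items():
        # Restart only if memory usage would stay strictly below 100%.
        if used_gb + vm_mem_gb < total_gb:
            return name
    return None  # no node has room: the restart fails


survivors = {"node-b": (120.0, 128.0), "node-c": (90.0, 128.0)}
small_vm_target = pick_restart_node(16.0, survivors)
large_vm_target = pick_restart_node(64.0, survivors)
```

In this example the 16 GB instance lands on node-c, while the 64 GB instance has no target and its restart would fail.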

The following sections describe three example approaches that attempt to avoid restart failures for critical application instances. Approach 1 is the highest-confidence approach. Approach 2 and Approach 3 work most of the time and provide higher levels of utilization. For simplicity, the examples for Approaches 2 and 3 assume that all nodes in a migration zone are identical (same number of cores and same amount of memory). These approaches would need to be modified appropriately if this assumption does not hold.

In all cases, ensure that the memory is not oversubscribed; that is, make sure the memory subscription factor for all nodes in the migration zone is less than or equal to 1.
Note:

For more information about the concepts of oversubscription and the subscription factor, see Oversubscription in ThinkAgile CP.

Approach 1

In this example approach, all critical application instances are placed in a special category of the migration zone, called CRITICAL. The subscription factor for memory in the CRITICAL category must never exceed 0.5 minus a small safety factor. The safety factor accounts for variations in how much memory the hypervisor itself consumes. An adequate safety factor might be 0.05, which means that the subscription factor for memory in the CRITICAL category must never exceed 0.45.

In our example, let n be the number of nodes in the CRITICAL category. We start with n=2. As more critical application instances are added and the subscription factor of the CRITICAL category reaches 0.45, we need to add another pair of nodes for handling critical application instances, making the number of nodes in the CRITICAL category n=4. This allows more room to add critical application instances, and the subscription factor of this category drops well below 0.45 with the addition of the two nodes.

Note:

In the ThinkAgile CP UI, you would go to the Migration Zone, go to the Allocations tab, look at the row for the CRITICAL category, and divide the in-use memory by the total memory to make sure it is less than 0.45.

New critical application instances may now be added. Once again, this will continue until the subscription factor of this category reaches 0.45, at which time we will need to add another pair of nodes, and so on. With this approach, the nodes in the CRITICAL category are not oversubscribed, but the remaining nodes in the migration zone that are running non-critical application instances may be oversubscribed.
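
As a rough illustration of the Approach 1 check, the sketch below divides in-use memory by total memory for the CRITICAL category and reports when another pair of nodes is due. The function names and the 256 GB node size are hypothetical; the 0.5 ceiling and the 0.05 safety factor come from the text above.

```python
def critical_limit(safety_factor: float = 0.05) -> float:
    """Ceiling for the CRITICAL category's memory subscription factor."""
    return 0.5 - safety_factor


def needs_node_pair(in_use_gb: float, total_gb: float,
                    safety_factor: float = 0.05) -> bool:
    """True once in-use / total memory reaches the ceiling."""
    return in_use_gb / total_gb >= critical_limit(safety_factor)


NODE_GB = 256.0  # assumed memory per node, for illustration only

# n=2 nodes in the CRITICAL category, 200 GB in use: 200/512 is about 0.39.
ok_with_two = not needs_node_pair(200.0, 2 * NODE_GB)

# 235 GB in use: 235/512 is about 0.46, so a pair of nodes must be added.
must_grow = needs_node_pair(235.0, 2 * NODE_GB)

# After growing to n=4 the same load gives 235/1024, about 0.23.
ok_with_four = not needs_node_pair(235.0, 4 * NODE_GB)
```

The in-use and total values correspond to the figures read from the CRITICAL row of the Allocations tab.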

Approach 2

In this example approach, you keep the subscription factor for a migration zone below (n-1)/n, where n is the number of nodes in the migration zone. With n=4, we never use more than 75% of the resources until a node failure happens, at which time resource usage goes up to 100%. Similarly, with n=8, we never use more than 87.5% of the resources until a node failure happens. This approach is not guaranteed to work: although enough space is reserved across the system as a whole to handle all the critical application instances, there may not be enough space on any single node to handle a specific critical application instance.
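
The sketch below illustrates both the (n-1)/n limit and the reason Approach 2 can still fail. It is a simplified model under stated assumptions: identical 100 GB nodes, a hypothetical largest-first, first-fit placement order, and the "below 100%" restart rule from the start of this section. It is not the platform's actual placement algorithm.

```python
from typing import List


def approach2_limit(n: int) -> float:
    """Subscription factor ceiling (n-1)/n for a zone of n nodes."""
    return (n - 1) / n


def failed_restarts(vms_gb: List[float], free_gb: List[float]) -> List[float]:
    """Return the VMs from the failed node that fit on no surviving node.

    Placement order (largest VM first, first node that fits) is assumed.
    """
    free = list(free_gb)
    failed = []
    for vm in sorted(vms_gb, reverse=True):
        for i, room in enumerate(free):
            if vm < room:  # usage must stay strictly below 100%
                free[i] -= vm
                break
        else:
            failed.append(vm)
    return failed


# Four 100 GB nodes, each 74% used, so the zone is below the 0.75 limit.
# One node fails; the three survivors each have 26 GB free (78 GB total).
stranded = failed_restarts([40.0, 20.0, 14.0], [26.0, 26.0, 26.0])
```

Here the survivors have 78 GB free in aggregate, more than the 74 GB that must move, yet the 40 GB instance is stranded because no single node has room for it.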

Approach 3

This example approach offers good utilization and works most of the time, but it is not foolproof. In this approach, we need to ensure that the memory load for a migration zone does not get close to (n-2)/(n-1), where n is the number of nodes in the migration zone and n must be 3 or larger. For example, with n=3, we need to make sure the migration zone load does not get close to 50%. Using 10% as a safety factor, we need to make sure the migration zone load never gets above 50-10 = 40%. Similarly, with n=4, we need to make sure the migration zone load does not exceed 67-10 = 57% (where 67% is 2/3, rounded).

The ThinkAgile CP UI allows you to see the migration zone load over the last year. Therefore, it is easy to ensure that the highest load over the last year of running has not exceeded ((n-2)*100/(n-1)) - 10%. Any subscription factor that keeps the migration zone load below this threshold should be acceptable.
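
The Approach 3 threshold is simple arithmetic, sketched below. The function names and the sample load history are hypothetical; the formula ((n-2)*100/(n-1)) - 10 and the requirement that n be at least 3 come from the text above.

```python
from typing import Iterable


def approach3_limit_pct(n: int, safety_pct: float = 10.0) -> float:
    """Maximum acceptable migration zone load, in percent."""
    if n < 3:
        raise ValueError("Approach 3 requires at least 3 nodes")
    return (n - 2) * 100 / (n - 1) - safety_pct


def peak_load_ok(loads_pct: Iterable[float], n: int) -> bool:
    """True if the highest observed load stayed at or below the limit."""
    return max(loads_pct) <= approach3_limit_pct(n)


limit_3 = approach3_limit_pct(3)  # 50 - 10
limit_4 = approach3_limit_pct(4)  # about 66.7 - 10

# Hypothetical load samples (percent) taken from the last year.
history = [31.0, 38.5, 44.0, 36.2]
fits_3_nodes = peak_load_ok(history, 3)
fits_4_nodes = peak_load_ok(history, 4)
```

With this sample history, the 44% peak exceeds the 40% limit for a three-node zone but stays within the roughly 57% limit for a four-node zone.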