Minimizing catastrophic failures in ThinkAgile CP

Instead of clusters, we use the notion of a migration zone (MZ) in ThinkAgile CP. A migration zone is a set of compute nodes among which an application (VM) may migrate.

Unlike clusters, a migration zone can contain thousands of nodes. Adding a node to a migration zone is very simple, simpler than adding a node to a cluster [1].

We have designed migration zones so that a catastrophic failure that takes out an entire migration zone is even less likely than the loss of an entire cluster in a clustered design. Metadata is distributed across the storage controller pairs and the SaaS portal. Losing both storage controllers in a pair can cause a localized outage of that storage block, but not a migration zone-wide outage, because the other storage blocks remain operational.

Portal metadata about a migration zone is stored across multiple availability zones and backed up every four hours. It would take a catastrophic failure across two availability zones of a public cloud provider such as AWS for ThinkAgile CP to lose portal metadata related to migration zones. Even then, we can still recover migration zone metadata from the backup. We may lose up to four hours' worth of UI actions (such as creating a VM, creating a vDisk, or creating a new firewall rule), but even in this catastrophic scenario we would not lose all customer metadata, only the metadata related to vDisks created in the last four hours.
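The four-hour exposure window described above can be illustrated with a small sketch. It is not ThinkAgile CP code: the function names, the timestamps, and the assumption that backups run on a fixed four-hour schedule starting at a known time are all hypothetical, chosen only to show which recent actions would fall after the last backup.

```python
from datetime import datetime, timedelta

# Assumption: backups run on a fixed four-hour schedule (the interval is from
# the text; the fixed schedule and start time are illustrative only).
BACKUP_INTERVAL = timedelta(hours=4)


def last_backup_before(failure_time, first_backup):
    """Return the most recent scheduled backup time at or before failure_time."""
    completed_intervals = (failure_time - first_backup) // BACKUP_INTERVAL
    return first_backup + completed_intervals * BACKUP_INTERVAL


def actions_at_risk(actions, failure_time, first_backup):
    """Return the names of actions recorded after the last backup."""
    cutoff = last_backup_before(failure_time, first_backup)
    return [name for name, when in actions if cutoff < when <= failure_time]


# Hypothetical timeline: backups at 00:00, 04:00, 08:00; failure at 10:30.
first_backup = datetime(2024, 1, 1, 0, 0)
failure_time = datetime(2024, 1, 1, 10, 30)
actions = [
    ("create VM", datetime(2024, 1, 1, 7, 15)),
    ("create vDisk", datetime(2024, 1, 1, 9, 0)),
    ("create firewall rule", datetime(2024, 1, 1, 10, 10)),
]

restore_point = last_backup_before(failure_time, first_backup)
lost = actions_at_risk(actions, failure_time, first_backup)
exposure = failure_time - restore_point  # never exceeds BACKUP_INTERVAL
```

In this made-up timeline the restore point is the 08:00 backup, so the VM created at 07:15 is covered, while the two later actions fall inside the exposure window of two and a half hours.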

[1] When a node is added to a migration zone, the other nodes do not need to know about it. In contrast, cluster implementations require every node in a cluster to know about every other node, and they maintain the cluster with high-frequency heartbeats between nodes, typically over a private cluster network.
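The contrast in the note above can be made concrete with a back-of-the-envelope count. This sketch does not model ThinkAgile CP or any particular cluster product; the function names are hypothetical, and it only counts the pairwise relationships implied when every node must know about every other node.

```python
def full_mesh_pairs(node_count):
    """Pairwise node relationships when every node knows every other node."""
    return node_count * (node_count - 1) // 2


def pairs_added_by_new_node(existing_nodes):
    """New relationships created when one node joins a full mesh."""
    return existing_nodes


# Illustrative sizes only.
for n in (8, 64, 1000):
    print(n, full_mesh_pairs(n), pairs_added_by_new_node(n))
```

The pair count grows quadratically with node count in the full-mesh case, which is the overhead the migration zone design avoids by not requiring existing nodes to learn about a new one.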