Planning for High Availability

High availability means that a system runs continuously with as few interruptions (so-called downtime) as possible, i.e. the system is available whenever it is needed.

For many business-critical systems, a high level of availability is a must, so that a single point of failure (SPOF) cannot interrupt operations and cause costly downtime. In the case of location-based services, the system components and the solution architecture will determine the level and type of availability that is needed.

This article will cover some of the typical methods for increasing system availability.

Redundancy and Failover

Redundancy is the addition of duplicate components to the system that can take over if a primary component fails. Failover is the transfer of control to the duplicate when a failure occurs. When the two are combined, a single component can no longer cause the failure of the whole system environment. For example, you can add a second server to your network setup to take over in cases where your primary server is damaged or loses connection. With automated failover (i.e., an automatic switchover between servers), the secondary server constantly monitors the primary server and takes over immediately if it detects a problem. This way, the system keeps running with little or no downtime, even if the primary server fails.
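The automated failover described above can be sketched as a heartbeat check. This is a minimal illustration, not a description of any particular product: the class name FailoverMonitor, the timeout_s parameter and the heartbeat mechanism itself are assumptions made for the example, and times are passed in explicitly so that the logic is easy to follow.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Hypothetical monitor that runs on the secondary server."""

    timeout_s: float               # assumed: longer silence means the primary has failed
    last_heartbeat_s: float = 0.0  # time of the last heartbeat from the primary
    active: str = "primary"        # which server currently handles traffic

    def heartbeat(self, now_s: float) -> None:
        """Record a heartbeat received from the primary at time now_s."""
        self.last_heartbeat_s = now_s

    def check(self, now_s: float) -> str:
        """Promote the secondary if the primary has been silent for too long."""
        silent_for = now_s - self.last_heartbeat_s
        if self.active == "primary" and silent_for > self.timeout_s:
            self.active = "secondary"  # failover: the duplicate takes control
        return self.active


monitor = FailoverMonitor(timeout_s=3.0)
monitor.heartbeat(now_s=1.0)
print(monitor.check(now_s=2.0))   # heartbeat is recent, primary stays active
print(monitor.check(now_s=10.0))  # 9 s of silence, secondary takes over
```

In this sketch the secondary stays active once it has taken over. Whether and how control is handed back to a recovered primary is a separate design decision that the example leaves open.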

Similarly, duplicate infrastructure components can be added to mission-critical areas to ensure constant coverage, even in the case of unexpected component failure (for more about coverage, see: Bluetooth for RTLS: Coverage vs Range). For example, the Locator constellation can be carefully planned to ensure that the system can continue to track items even if unexpected infrastructure failures occur.

The failover between the components, either automatic or manual, needs to be reliable and designed to make sure unwanted downtime does not occur.

Telemetry and Monitoring Systems

Telemetry and monitoring systems also play an important part in making sure that systems run smoothly. Telemetry automatically collects and transfers data to monitoring systems for analysis and action. Both passive and active telemetry and monitoring systems can be effective in increasing system availability.

Passive systems react to failures that have already occurred. In other words, the system triggers an alert once it detects that a component or the whole network has gone down. For example, a power outage at a warehouse could cause the server and the network to shut down, which in turn could cause the asset tracking system to shut down. At this point, the monitoring system would identify that a failure has occurred and send an alert to the user. This automatic alert can significantly reduce downtime because the user can take action to remedy the situation immediately.
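As a rough sketch of the passive approach, the function below compares two consecutive status polls and reports the components that have just gone down. The function name detect_failures, the component names and the idea of polling a simple up/down status are all assumptions made for illustration.

```python
from typing import Dict, List


def detect_failures(previous: Dict[str, bool], current: Dict[str, bool]) -> List[str]:
    """Return the components that were up in the previous poll and are down now."""
    return sorted(
        name
        for name, is_up in current.items()
        # A component that was not seen before is treated as previously up.
        if not is_up and previous.get(name, True)
    )


# Hypothetical status snapshots taken before and after a power outage.
previous = {"server": True, "network": True, "tracking": True}
current = {"server": False, "network": False, "tracking": True}

for name in detect_failures(previous, current):
    print(f"ALERT: {name} is down")
```

Comparing against the previous poll means that each failure is reported once, when it happens, rather than on every poll while the component remains down.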

Active systems, on the other hand, proactively and continuously monitor system performance in search of potential weaknesses, so that they can be fixed before actual failures occur. For example, the system could identify that disk space is running low and automatically trigger an alert before the disk is full. This enables predictive maintenance (replacing the disk, in this case) to take place so that the failure and the resulting unplanned downtime are avoided.
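The disk space example can be sketched with Python's standard library. The 80 % warning threshold and the function names are assumptions chosen for illustration; a real deployment would pick a threshold that leaves enough time to act.

```python
import shutil


def disk_usage_fraction(path: str = ".") -> float:
    """Return the fraction (0.0 to 1.0) of the disk holding `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage.used / usage.total


def needs_attention(used_fraction: float, warn_at: float = 0.8) -> bool:
    """True once usage reaches the (assumed) warning threshold."""
    return used_fraction >= warn_at


if needs_attention(disk_usage_fraction(".")):
    print("WARNING: disk space is running low")
```

Run on a schedule, a check like this raises the warning while there is still free space left, which is what distinguishes it from a passive alert that fires only after the disk is full.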

Regular Updates

In general, it is good practice to plan for systematic updates of all of the components in the system environment. Running the latest software and firmware releases can improve system performance and prevent unwanted interruptions. Up-to-date releases are also typically a prerequisite for receiving support services from the system provider.

To reduce uncertainty about how new updates will affect your system, it is recommended that you set up a test environment to assess their effects before upgrading commercial environments.

Being Prepared for the Unexpected

While building redundancy and adding telemetry and monitoring systems improve availability, it is still a good idea to have a plan B in mind to keep your operations running smoothly. For example, keeping a small stock of spare parts for key components such as networking equipment, Locators and tags allows quick turnaround times if a device is damaged unexpectedly. This helps avoid delivery delays and allows for immediate physical installation.

It is also a good idea to think about what happens in case of emergencies or other unforeseeable circumstances. For example, what happens if a fire starts in your facility? Creating redundancy by adding duplicate devices is good for everyday operations, but if both devices are installed in the same physical space (e.g., the same room), the fire is likely to damage both of them. Where possible, the failover replica should be installed in a different location, far away from the original, so that the system can keep running even if the primary component is destroyed by the fire.

Planning for power cuts might also be necessary to reduce unwanted downtime. For example, in many healthcare settings, having life-saving equipment shut down due to a power cut is simply not an option. In these cases, uninterruptible power supplies (UPS) or dedicated power generators are often used so that the system can continue to run on the backup power supply until the power outage is over.

High availability is valuable for many businesses. When evaluating the level of availability needed for your system, it is good to weigh the additional investment against the potentially much higher financial cost of system downtime. Some key questions to ask at this point are:

How much unplanned downtime can your business tolerate?
How much can your business invest in increasing system availability?
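To make the two questions above more concrete, the sketch below converts an availability figure into expected downtime per year and shows how a redundant pair changes the result. The 99 % availability and the cost per hour are made-up example values, and the redundancy formula assumes that the duplicates fail independently of each other, which does not hold if, for example, both are installed in the same room.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def annual_downtime_hours(availability: float) -> float:
    """Expected downtime per year for an availability between 0.0 and 1.0."""
    return (1.0 - availability) * HOURS_PER_YEAR


def redundant_availability(availability: float, copies: int = 2) -> float:
    """Availability of `copies` duplicates, assuming they fail independently."""
    return 1.0 - (1.0 - availability) ** copies


def annual_downtime_cost(availability: float, cost_per_hour: float) -> float:
    """Expected yearly cost of downtime at a given cost per hour."""
    return annual_downtime_hours(availability) * cost_per_hour


single = 0.99                          # hypothetical availability of one server
pair = redundant_availability(single)  # two such servers with failover
cost_per_hour = 1000.0                 # hypothetical cost of one hour of downtime

print(f"single: {annual_downtime_hours(single):.1f} h/year")
print(f"pair:   {annual_downtime_hours(pair):.2f} h/year")
saving = annual_downtime_cost(single, cost_per_hour) - annual_downtime_cost(pair, cost_per_hour)
print(f"expected yearly saving: {saving:.0f}")
```

With these example values, the single server is expected to be down for roughly 87.6 hours per year and the redundant pair for less than one hour. Comparing the resulting saving with the price of the second server gives a first estimate of whether the investment pays off.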

The answers to these questions will be different for every business. For help in assessing your case requirements, please contact your local Quuppa Partner.