
How can a business recover after an IT incident?

DEAC | 03.11.2020

Preventing every interruption in the operation of information systems and infrastructure is virtually impossible. Suffice it to say that as cloud technologies become more sophisticated, so do cyber attacks such as data theft, DDoS, malware and spam, and the number of such attacks is rising as well.

A server crashing can have different effects on a business. An entire IT system becoming unavailable could have catastrophic consequences for almost half of companies affected by it.

Standby system elements

To reduce the effect of incidents on the availability of IT systems, companies must keep track of the measures they adopt to control business risks and maintain business continuity. These measures usually fall under specialised disciplines such as business continuity management, business continuity planning and disaster recovery planning.

Disaster recovery planning is more than just data backups: it is a whole set of measures to mirror the critical IT infrastructure, including its physical servers, applications, storage facilities, hypervisors, databases, electric power supply, switches etc. See our previous post to learn about the difference between data backup and disaster recovery.

When setting up disaster recovery, one typically connects the elements in parallel, so that the system can only fail if all of its elements fail at the same time. This approach to building a system is referred to as ‘standby’. Disaster recovery often involves adding individual system elements (only key elements, according to the standard) as standbys. Depending on your goals and needs, you can distinguish the three most frequently mentioned types of standby for disaster recovery: cold, warm and hot.

Cold standby	Warm standby	Hot standby
Backup copies are transferred to the standby data centre at set intervals so that systems and data can be recovered within the required time. The virtual environment of the standby data centre runs only as necessary and engages only if there is an incident.	The data and system backups are sent to a geographically remote disaster recovery data centre. The standby data centre is a smaller version of the main one, and the virtual servers operate as the key elements of disaster recovery.	More than one data centre works as a standby environment. A separate environment fully matches the main operating environment of the business (mirror standby), and whenever there is a failure, the company can fully and immediately switch over to the mirrored infrastructure. Backup copies of the entire system and data are generated, updated and stored in real time, keeping the recovery delay minimal.
Advantage: this is a budget option.	Advantage: the standby environment does not require any complex settings, and it is often more economical than hot standby.	Advantage: whenever any system element fails, the system switches to another within seconds, taking no longer than the time specified.
Disadvantage: very slow recovery requiring manual configuration after the incident.	Disadvantage: expensive to maintain.	Disadvantage: expensive to build and maintain.
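
The effect of connecting elements in parallel, as described before the table, can be illustrated with a short calculation. This is only a sketch: it assumes that element failures are independent and that each element has a known availability. The function name and the 99% figures are illustrative assumptions, not values taken from any standard or from a real system.

```python
from math import prod


def parallel_availability(availabilities):
    """Availability of a system that fails only when all elements fail at once.

    Assumes element failures are independent, which is an idealisation.
    """
    availabilities = list(availabilities)
    if not availabilities:
        raise ValueError("at least one element is required")
    for a in availabilities:
        if not 0.0 <= a <= 1.0:
            raise ValueError("availability must be between 0 and 1")
    # Probability that every element is down at the same time.
    all_down = prod(1.0 - a for a in availabilities)
    return 1.0 - all_down


# Hypothetical figures: one element versus the same element with one standby.
single = parallel_availability([0.99])
with_standby = parallel_availability([0.99, 0.99])
print(f"single element:   {single:.4%}")
print(f"with one standby: {with_standby:.4%}")
```

Under these assumptions, one standby element reduces the probability of the whole system being down from 1% to 0.01%. Real failures are rarely fully independent (a fire can take out both elements in the same room), which is one reason why standby elements are often hosted in a separate location.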

Also interesting: find out about a complete disaster recovery case in a restaurant chain business.

What happens during an incident?

Let’s take a look at a few incident scenarios, from minor to major, assuming that the affected subsystems and their elements are covered by disaster recovery.

Incident	Measures to restore operation	Preventive measures before the incident
A power supply unit of the data storage system (DSS) fails	The system immediately switches over to the other power supply units. No interruption in the operation of the data storage system.	Addition of standby power supply units for the DSS can increase the level of availability.
One or two disk drives of the DSS fail	The RAID protection mechanisms engage, and while the DSS might slow down, it functions without interruptions.	Additional disks must be provided when configuring the RAID.
The entire DSS fails	The system immediately switches over to its mirror version in the standby infrastructure. The operation of the system slows down, but is not interrupted.	Mirroring the DSS on standby infrastructure keeps the operation continuous. The standby DSS can be hosted on separate IT infrastructure. For the best security, the recommendation is to host it in a data centre in a different location.
Interruption in the data channel between the main and standby data centre	The system switches over to the standby channel.	More than one data channel is set up. There must be a channel for data synchronisation between the data centres. In order to increase the level of security, you need to set up separate channels between each of the data centres; you can also use a dedicated L2 uplink channel to access the data if the main data centre’s channel fails.
Complete shutdown of one platform due to an emergency (e.g. a fire)	The entire system operation is transferred to the standby platform, with a small delay. If the emergency took place in the standby data centre, then once it is over, the synchronisation of data must be restored.	Having more than one standby platform will provide the business with the continuity and availability it needs.
Complete shutdown of one platform due to an emergency, with a failure in the data channel connecting it to the standby platform	The system operation is switched over to a third platform (third data centre) located outside the country.	The greater the distance to that standby data centre, the lower the risk of the same incident affecting it. Having more than one standby data centre increases the level of continuity and process stability.
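
The last two scenarios in the table amount to a simple priority rule for choosing where the system should run. The sketch below illustrates that rule only. The `Platform` class, its fields and the platform names are hypothetical, and a real failover mechanism would also have to handle health checks, data synchronisation and switching back.

```python
from dataclasses import dataclass


@dataclass
class Platform:
    name: str
    healthy: bool          # the platform itself is operational
    link_ok: bool = True   # the data channel to this platform is usable


def choose_active(platforms):
    """Return the name of the first platform, in priority order, that can run the system.

    The first entry is the main platform and only needs to be healthy.
    Standby platforms must be healthy and reachable over their data channel.
    Returns None if no platform is available.
    """
    for index, platform in enumerate(platforms):
        if index == 0:
            if platform.healthy:
                return platform.name
        elif platform.healthy and platform.link_ok:
            return platform.name
    return None


# Hypothetical scenario: the main platform is down and the channel
# to the standby platform has failed, so the third platform takes over.
platforms = [
    Platform("main", healthy=False),
    Platform("standby", healthy=True, link_ok=False),
    Platform("third", healthy=True),
]
print(choose_active(platforms))
```

Listing the platforms in priority order keeps the rule easy to extend: adding another standby data centre means adding one more entry to the list.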

Hope for the best; prepare for the worst

It is not the individual failures and interruptions, which experienced IT specialists are prepared for, that cause the worst effects for business; it is the emergencies that cause multiple parallel systems to crash. Such emergencies include fires, floods, failures in the financial system of the country, stock market crashes, military conflicts, epidemics (such as the recent coronavirus epidemic) etc.

Preventing irreversible consequences in these situations requires fully isolating the company’s standby assets from these emergencies. Standby assets will also help in running the key services of your business in an emergency. The correct response to an IT incident must be specified in the company’s disaster recovery plan, which should be developed in advance, printed in multiple copies, well practised and available to all the parties involved. Once the plan is implemented, it is important to keep regularly testing all the systems involved for possible interruptions or failures.

All is fair in love and war, and you can even deploy a standby platform for disaster recovery at the company manager’s house. Companies that get more than half of their business online, and that suffer immediately from any IT system failure, may find it more economical to launch a standby platform or two to soften the impact of practically any emergency than to build their own comprehensive solutions and rely on the robustness of their infrastructure.

Cloud infrastructure for better speed and reliability

The ability to set up virtual machines and even entire clusters of servers in a cloud, using only the resources you need, is useful not only for various development activities but also for hosting disaster recovery measures. With a reliable cloud, you can set up private connections between the data centres and the infrastructure of the business. The advantages of a cloud (and, importantly, a private cloud) include fast operation, higher security and reliability, and better performance. This is why the disaster recovery measures that require high levels of availability are typically hosted in virtual environments, so that transferring large amounts of data (e.g. when restoring a production environment database after an incident) can take place without any unexpected delays.

Modern network traffic travels thousands of kilometres in milliseconds; even so, latency can affect connection quality and speed. If your business is looking for a remote cloud data centre in a different jurisdiction and a different geographic location, nearby regions such as Eastern Europe and the Baltics are the better place to look. We recommend asking your potential data centre operator for some time to test the speed at which the channels can deliver data to your physical sites.
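
Before running such a test, a rough estimate can show what to expect. The sketch below assumes that light in optical fibre covers roughly 200 km per millisecond, and it ignores routing detours, equipment delays and protocol overhead, so real figures will be worse. The distance, database size and bandwidth are hypothetical example values.

```python
FIBRE_KM_PER_MS = 200.0  # approximate speed of light in optical fibre


def round_trip_ms(distance_km):
    """Lower bound on round-trip time, from propagation delay alone."""
    return 2.0 * distance_km / FIBRE_KM_PER_MS


def transfer_hours(size_gb, bandwidth_mbps):
    """Idealised time to move size_gb gigabytes over a bandwidth_mbps link."""
    megabits = size_gb * 8_000.0  # 1 GB = 8,000 megabits in decimal units
    return megabits / bandwidth_mbps / 3600.0


# Hypothetical example: a data centre 1,000 km away, a 500 GB database
# to restore, and a 1 Gbit/s channel.
print(f"round trip: at least {round_trip_ms(1000):.0f} ms")
print(f"restore:    at least {transfer_hours(500, 1000):.1f} hours")
```

Estimates like these only set a floor. The measured speed of the operator’s channels to your physical sites is what should go into the recovery time in your disaster recovery plan.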
