High Availability (HA) is applicable to the entire IT stack. General knowledge in networking, hardware, software, system architecture, application and services design, databases, VPNs, firewalls, disaster recovery, cloud, virtualization, and more is required.
This document covers how to make Iguana highly available. You will undoubtedly bring in the technologies and techniques you are already familiar with, and Interfaceware is interested in adapting to your technologies as needed.
To better understand Iguana HA strategies and design, here are some key concepts and terminology:
Application, Interface, or Service: The entity which a user expects to use, or a web service client expects to consume. It may be made up of many modules or components, but it has a single entry point. At the network layer an application or service is the IP and port combination on which it’s served. Both definitions are relevant. These three terms are synonyms for the most part and they’re used interchangeably here.
Active-Active: A pair of servers, both servicing requests; in effect, a cluster of two. In Iguana, From HTTP channels can be active-active. From LLP HL7 feeds usually use persistent connections, so active-active does not make sense for them. Active-active is intended for very high volume, so it does not make sense for polling channels either: pollers fetch their own data, which would create race conditions over which poller processes which data, and when. The statements regarding clusters also apply to active-active.
Active-Passive: A pair of servers, one active and one on standby. It is used when an application, such as a database server, is not suitable for clustering. When Iguana is operated in a “traditional” way (persistent connections, queue data is critical), active-passive is more suitable.
Load Balancer: A server or appliance which sits in front of a cluster or a pair of active-active servers. It is used for horizontally scaling loads well beyond what we expect to see in typical healthcare. The load balancer is configured to health-check the members and to remove them from, or add them to, the cluster. We use the HAProxy load balancer for failover only, which allows the Iguanas to run on any platform. A primary reason for using HAProxy is that it is easily dropped into existing networks.
Fault Tolerance: (1) A system/service does not go down if a single component/dependency fails — key components are redundant. Downtime is minimal — operations may be impeded but there is no outage. (2) Replicated physical components/specialized hardware, e.g., a secondary CPU on a motherboard prevents component failure in the first place. Costly, but capable of true zero-downtime.
High Availability: Cost-efficient hardware, software, and shared resources, cooperating to guarantee the availability of a service or application. Highly available systems are composed of fault-tolerant (first definition) system components. The goal is not zero downtime but quick recovery, so as to mask the failure from users.
Disaster Recovery: Unlike high availability (HA), disaster recovery (DR) focuses on re-establishing IT services, including components such as infrastructure, telecommunications, systems, applications, and data, at an alternate site following a disruption of IT services. Disaster recovery is geographically diversified; it is not just redundancy at the system or datacenter level. It focuses on re-establishing services after an incident; it is not just failover. It addresses multiple failures in a datacenter or site, while high availability addresses a single predictable failure. Lastly, disaster recovery includes the people and processes necessary to execute recovery, while high availability focuses on technology design and implementation.
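The health-checking, failover-only role of the load balancer described above can be illustrated with a minimal Python sketch. It is not how HAProxy is implemented or configured; the names `Backend`, `is_listening`, and `pick_backend` are hypothetical, and the check is a plain TCP connect against an IP and port pair.

```python
import socket
from typing import Optional, Sequence, Tuple

Backend = Tuple[str, int]  # hypothetical type: (host, port) of one Iguana listener


def is_listening(backend: Backend, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to the backend succeeds."""
    try:
        with socket.create_connection(backend, timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat the member as down.
        return False


def pick_backend(backends: Sequence[Backend]) -> Optional[Backend]:
    """Failover-only selection: first healthy backend in priority order."""
    for backend in backends:
        if is_listening(backend):
            return backend
    return None  # no member is currently reachable
```

Listing the primary first and the backup second means traffic only moves to the backup when the primary stops accepting connections, which is the failover-only behaviour rather than load distribution.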
Understand Iguana High Availability
Server Failures vs Channel Failures: The Iguana HA solution handles server failures, NOT channel failures. Channel failure is handled as it always has been. There is an option in Iguana to decide whether a channel should stop on error or skip the message. Known conditions are handled in the code, and a decision is made as to whether a bad message should stop the channel. The important point is that bad data or a bug is causing the failure. If a channel failed for one of these reasons and an HA failover then occurred to another copy of the channel, the backup channel would fail on the exact same condition.
Iguana HA failover only occurs at the server level. If a server vanishes or is shut down, the other is promoted. When a channel is designed, we plan how failover should work if an Iguana service fails and how the backup should pick up processing again (e.g. halfway through a batch processing job), but we don’t use failover for error handling.
Server Failover vs Service Failover: Like other integration engines, Iguana is a middleware application. On the network, an application is represented by an IP and port number pair. The IP identifies the host the service runs on, and the port identifies the process on that host which provides the service. So each From LLP/HTTP is an application on the network, and any code which runs as a result of messages coming in on any From LLP/HTTP must be taken into consideration.
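To make the point about channel failures concrete, here is a small Python sketch of a stop-on-error versus skip-the-message policy. It is an illustration only, not Iguana's implementation; `OnError`, `run_channel`, and `parse` are hypothetical names, and the "bad data" rule (a message must start with `MSH`) is an assumption chosen for the example.

```python
from enum import Enum
from typing import Callable, Iterable, List


class OnError(Enum):
    STOP = "stop"  # halt the channel at the failing message
    SKIP = "skip"  # record the failure and continue with the next message


def parse(message: str) -> None:
    """Hypothetical processing step: reject anything that is not an MSH message."""
    if not message.startswith("MSH"):
        raise ValueError("bad message: " + message)


def run_channel(messages: Iterable[str],
                process: Callable[[str], None],
                on_error: OnError) -> List[str]:
    """Process messages in order and return the ones that failed."""
    failed: List[str] = []
    for message in messages:
        try:
            process(message)
        except ValueError:
            failed.append(message)
            if on_error is OnError.STOP:
                break  # later messages stay unprocessed
    return failed
```

Because `parse` is deterministic, running the same messages on a backup copy of the channel fails on the same message, which is why failover is not used for error handling.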
HA is about making an application highly available, yet Iguana is not a single application. It’s one-to-many applications — an application managing many applications — each varying in degrees of complexity.
That means that the configuration, complexity, and planning required to make Iguana highly available grow as interfaces are added. It’s an ongoing process as well: every time a new interface is added, its failover behaviour should be planned and tested. For this reason it’s helpful to conceptually separate infrastructure component failover from service failover.
HA is ensuring that the service is available. Using industry-proven tools, we can ensure that an Iguana service is always listening at a certain IP address, ready to process new messages. This is what any adequate HA solution does. Here are some common demands from our clients:
Instant Failover: This is one of the most requested HA features. Instant failover means failing over automatically and seamlessly in order to keep Iguana highly available (i.e., to maximize uptime). To achieve instant failover, both the primary and the backup Iguana need to be running. When the primary Iguana becomes unavailable, the load balancer detects the change and routes incoming requests to the backup Iguana. This detection and failover process is expected to happen instantly.
Automated Queue Recovery or “Guaranteed” Message: Automated recovery of the Iguana queue is a much requested feature. This is actually data integrity and recovery, not HA. Usually some form of replication is used between the primary and backup, either using a third-party tool or a native feature. For example, MySQL has built-in replication and RabbitMQ has a queue mirroring feature.
Iguana does not currently have a built-in feature like queue mirroring. Third-party solutions are platform-specific, and a goal of Iguana HA is to remain platform independent. To achieve queue recovery, we can use shared storage and an Iguana channel called the Instance Manager.
Guaranteed Message Sequence: Another requested Iguana HA feature is that the order in which messages are sent to Iguana is maintained, and that they are guaranteed to be processed in that order. There is some downtime: about 10 seconds plus the time it takes Iguana to start up and possibly rebuild its indexes.
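The idea of recovering a queue from shared storage while keeping message order can be sketched in Python as an append-only journal plus a saved position. This is only an illustration under stated assumptions: it is not Iguana's queue format or the Instance Manager's logic, all function and file names are hypothetical, and each message is assumed to fit on one line with no embedded newline.

```python
import os
from typing import Callable


def append_message(journal_path: str, message: str) -> None:
    """Append one message to the journal and force it to disk."""
    with open(journal_path, "a", encoding="utf-8") as journal:
        journal.write(message + "\n")
        journal.flush()
        os.fsync(journal.fileno())


def read_position(position_path: str) -> int:
    """Return how many messages were already processed (0 if none)."""
    try:
        with open(position_path, "r", encoding="utf-8") as f:
            return int(f.read().strip() or 0)
    except FileNotFoundError:
        return 0


def write_position(position_path: str, count: int) -> None:
    """Save the processed count atomically via a temporary file."""
    tmp = position_path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        f.write(str(count))
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, position_path)


def resume(journal_path: str, position_path: str,
           handle: Callable[[str], None]) -> int:
    """Process unhandled messages in journal order; return the new position."""
    done = read_position(position_path)
    with open(journal_path, "r", encoding="utf-8") as journal:
        for index, line in enumerate(journal):
            if index < done:
                continue  # already processed before the failure
            handle(line.rstrip("\n"))
            done = index + 1
            write_position(position_path, done)
    return done
```

If both files live on shared storage, a backup node calling `resume` continues from the first unprocessed message in the original order. A crash between `handle` and `write_position` would cause that one message to be handled again, so this sketch gives at-least-once rather than exactly-once processing.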
Iguana High Availability Solutions
The Instant Failover HA design focuses on having the backup Iguana take over instantly when the primary Iguana fails. Iguana is installed on each cluster/VM server with its own configuration and logs. There is no shared storage. When the active node fails, the standby node takes over by switching the Virtual IP (VIP). See diagram below:
LB1 and LB2: Load balancers. LB1 is primary and LB2 is backup.
WK1 and WK2: Iguana workers. WK1 is primary and WK2 is backup.
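The decision of when the VIP should move from the active worker to the standby can be sketched as a small state tracker in Python. This is a hedged illustration, not the actual HA implementation: `FailoverMonitor`, `record_check`, and the `fall` threshold of consecutive failed checks are assumptions introduced for the example, and the actual VIP switch is left out.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Tracks which worker should hold the VIP, based on health-check results."""
    fall: int = 3          # assumed: consecutive failed checks before failover
    active: str = "WK1"
    standby: str = "WK2"
    failures: int = 0

    def record_check(self, active_is_healthy: bool) -> str:
        """Feed one health-check result; return the worker that should own the VIP."""
        if active_is_healthy:
            self.failures = 0
            return self.active
        self.failures += 1
        if self.failures >= self.fall:
            # Promote the standby; the failed node becomes the new standby.
            self.active, self.standby = self.standby, self.active
            self.failures = 0
        return self.active
```

Requiring several consecutive failures avoids moving the VIP on a single dropped check. Note that this sketch does not fail back automatically when the original primary recovers.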
Iguana HA on the Cloud
Iguana HA supports both AWS and Azure cloud solutions. Here are some of the cloud-specific technologies we use for Iguana HA:
- Amazon EC2: A web service that provides secure, resizable compute capacity in the cloud. Dedicated Instances (e.g., for HIPAA compliance) are available.
- Regions and Zones: Amazon EC2 is hosted in multiple locations worldwide. These locations are composed of regions and Availability Zones.
- VPC and Subnet: A virtual private cloud (VPC) is a virtual network dedicated to your AWS account. It is logically isolated from other virtual networks in the AWS Cloud. You can add one or more subnets (e.g., an IPv4 CIDR block) in each Availability Zone.
- EBS/EFS Encryption: Elastic Block Store (EBS) and Elastic File System (EFS) seamlessly offer data encryption with HIPAA compliance.
- Elastic IP and VIP: An Elastic IP address can be associated with any instance or network interface in any VPC in your account. On the AWS cloud, the Elastic IP can act as a Virtual IP (VIP) that is reassigned to the secondary server during failover.
- Internet Gateway, Router and NAT Gateway: Horizontally scaled, redundant, and highly available VPC components used for Amazon EC2 instances' internal communication.
- Elastic Load Balancing: Automatically distributes incoming LLP or HTTP messages to Iguana workers.
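Reassigning an Elastic IP to the secondary server during failover can be sketched in Python as follows. This is a minimal illustration, not the procedure Interfaceware uses: the function name `move_elastic_ip` is hypothetical, the `ec2_client` argument is assumed to be a boto3 EC2 client (for example one created with `boto3.client("ec2")`), and the allocation and instance IDs are placeholders supplied by the caller.

```python
from typing import Any, Dict


def move_elastic_ip(ec2_client: Any, allocation_id: str,
                    standby_instance_id: str) -> str:
    """Re-associate an Elastic IP with the standby instance; return the association ID."""
    response: Dict[str, Any] = ec2_client.associate_address(
        AllocationId=allocation_id,
        InstanceId=standby_instance_id,
        # Allow the address to be taken over from the failed instance.
        AllowReassociation=True,
    )
    return response["AssociationId"]
```

Passing the client in as a parameter keeps the sketch free of credentials and region settings, and lets the failover step be exercised against a stub before it is pointed at a real account.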
- Virtual Machine: Azure Virtual Machines are on-demand, scalable computing resources that give the flexibility of virtualization without having to buy and maintain the physical hardware that runs them. Dedicated VMs (e.g., for HIPAA compliance) are available.
- Regions and Zones: Azure is hosted in multiple locations worldwide. These locations are composed of regions and zones.
- VNet and Subnet: An Azure Virtual Network (VNet) is a representation of your own network in the cloud. It is a logical isolation of the Azure cloud dedicated to your subscription.
- Disk: An Azure managed disk is a virtual hard disk (VHD). Azure managed disks offer data encryption with HIPAA compliance.
- Public IP: A public IP address can be associated with any virtual machine or load balancer. A public IP can act as a Virtual IP (VIP).
- Internet Gateway, RouteTable: Horizontally scaled, redundant, and highly available VNet components used for Azure virtual machines' internal communication.
- Azure Load Balancer: Automatically distributes incoming LLP or HTTP messages to Iguana workers.
High Level Architecture Design
Since Azure and AWS have different cloud technologies, the actual design and implementation will differ between them. Here is the high-level design for Iguana HA on the cloud: