Clustering ongoing calls with OpenSIPS 2.4

Dialog replication in OpenSIPS has been around since version 1.10, when it became clear that sharing real-time data through a database was no longer feasible in a large VoIP platform. Further steps in this direction were made in 2.2, with the advent of the clusterer module, which manages OpenSIPS instances and their inter-communication. But have we been able to achieve a true and complete solution for clustering dialog support? In this article we look into the limitations of distributing ongoing calls in previous OpenSIPS versions, and at how we overcame them and added new possibilities in 2.4, based on the improved clustering engine.

Previous Limitations

Up until this point, distributing ongoing dialogs essentially consisted only of sharing the relevant internal information with all other OpenSIPS instances in the cluster. To optimize the communication, whenever a new dialog is created (and confirmed) or an existing one is updated (state changes etc.), a binary message about that particular dialog is broadcast.

Limiting the data exchange to be driven by runtime events leaves an instance with no way of learning all the dialog information from the cluster when it boots up or at a particular moment in time. Consider what happens when we restart a backup OpenSIPS: any failover that we hope to be able to handle on that node will have to be delayed until it gets naturally in sync with the other node(s).

But the more painful repercussion of just sharing data without any other distributed logic is the lack of a mechanism to coordinate certain data-related actions between the cluster nodes. For example, in a typical High-Availability setup with an active-passive node configuration, although all dialogs are duplicated to the passive node, each of the following must be performed exactly once, by a single node:

  • generate BYE requests and/or produce CDRs (Call Detail Records) upon dialog expiration;
  • send Re-Invite or OPTIONS pings to end-points;
  • send replication packets on dialog events;
  • update the dialog database (if it is still used as a failsafe for binary replication, e.g. both nodes crash).

Usage scenarios

Before actually diving into how OpenSIPS 2.4 solves the aforementioned issues, let’s first see the most popular scenarios we considered when designing the dialog clustering support:

  • Active – Backup setup for High Availability using Virtual IPs. The idea here is to have a Virtual IP (or floating IP) facing the end-users. This IP is automatically moved from a failed instance to a hot-backup server by tools like vrrpd, Keepalived or Heartbeat.
  • Active – Active setup, or a double cross Active-Backup. This is a more “creative” approach using two Virtual IPs, each server being active for one of them and backup for the other, and still sharing all the dialogs, in order to handle both VIPs when a server fails.
  • Anycast setup for Distributed calls (High Availability and Balancing). This relies on the full Anycast support newly introduced in OpenSIPS 2.4. You can find more details in the dedicated article.

Dialog Clustering with OpenSIPS 2.4

The new dialog clustering support in OpenSIPS 2.4 addresses all the limitations mentioned above and fully covers the typical clustering scenarios. But first, let’s look at the concepts newly introduced in OpenSIPS 2.4 when it comes to clustering dialogs.

Data synchronization

In order to address our first discussed issue, the improved clustering under-layer in OpenSIPS 2.4 offers the capability of synchronizing a freshly booted node with the complete data set from the cluster in a fast and transparent manner. This way, we can minimize the impact of restarting an OpenSIPS instance, or plugging a new node in the cluster on the fly, without needing any DB storage or having to accept the compromise of lost dialogs. We can also perform a sync at any time via an MI command, if for some reason the dialog data got desynchronized on a given instance.
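To make the difference concrete, here is a minimal Python sketch comparing a restarted backup that only receives event-driven updates with one that also pulls the full data set. It is a toy model, not OpenSIPS code: the `Node` class, its method names and the call IDs are all made up for illustration.

```python
class Node:
    """Toy stand-in for one OpenSIPS instance; holds dialogs keyed by Call-ID."""

    def __init__(self, name):
        self.name = name
        self.dialogs = {}

    def on_replication_event(self, call_id, state):
        # Event-driven replication: only dialogs that change while we are up arrive.
        self.dialogs[call_id] = state

    def sync_from(self, donor):
        # Bulk sync: copy the donor's full dialog set, without overwriting
        # entries that were already received through events.
        for call_id, state in donor.dialogs.items():
            self.dialogs.setdefault(call_id, state)


active = Node("node1")
for call_id in ("call-a", "call-b", "call-c"):
    active.on_replication_event(call_id, "confirmed")

# The backup restarts: its in-memory dialogs are gone.
backup = Node("node2")

# One new call is created after the restart and replicated as an event.
active.on_replication_event("call-d", "confirmed")
backup.on_replication_event("call-d", "confirmed")
missing_before_sync = sorted(set(active.dialogs) - set(backup.dialogs))

backup.sync_from(active)
missing_after_sync = sorted(set(active.dialogs) - set(backup.dialogs))
```

With events alone, the three calls established before the restart stay unknown to the backup until they change state or end; after the bulk sync nothing is missing.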

Dialog ownership mechanism

The other big improvement that OpenSIPS 2.4 introduces for distributing dialogs is the capability to precisely decide which node in the cluster is responsible for a dialog – responsible in the sense of triggering certain actions for that dialog. This is necessary because some of the dialogs are locally created on an instance, some are temporarily handled in place of a failed/inactive node and others are just kept as backup. As such, the concept of dialog “ownership” was introduced.

The basic idea of this mechanism is that, at any time, a single node in the dialog cluster (where all the calls are shared) is “responsible” for a given dialog, in terms of taking actions for it. When the node owning the dialog goes down, another node becomes its owner and handles its actions.

But how is this ownership concept concretely implemented in OpenSIPS 2.4?

Sharing tags

In order to be able to establish an ownership relationship between the nodes and the dialog, we introduced the concept of tags, or sharing tags as we call them. Each dialog is marked with a single tag; a node, in turn, is actively responsible for (owns) a tag, and indirectly all the dialogs marked with that tag. A tag may be present on several nodes, but only a single node sees the tag as active; the other nodes aware of that tag see it in standby/backup mode.

So each node may be aware of multiple sharing tags, each with an active or backup state. Each tag can be defined with an initial state at OpenSIPS startup or set directly at runtime, and all this information is shared between the cluster nodes. When we set a sharing tag to active on a certain node, we are practically making that node the owner of all its known dialogs that are marked with that particular tag. At the same time, if another node was active for that tag, it has to step down.
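The rules above (one tag per dialog, one active node per tag, activation forcing the previous owner to step down) can be captured in a small Python sketch. This is only a conceptual model under our own naming: `Cluster`, `set_active` and `owner_of` are hypothetical and do not correspond to OpenSIPS internals.

```python
ACTIVE, BACKUP = "active", "backup"


class Cluster:
    """Toy model of sharing tags: per node, a map of tag -> state."""

    def __init__(self):
        self.tags = {}      # node name -> {tag: state}
        self.dialogs = {}   # call-id -> tag the dialog is marked with

    def add_node(self, node, **tag_states):
        self.tags[node] = dict(tag_states)

    def set_active(self, node, tag):
        # Activating a tag on one node makes every node that knows it step down first.
        for states in self.tags.values():
            if tag in states:
                states[tag] = BACKUP
        self.tags[node][tag] = ACTIVE

    def owner_of(self, call_id):
        # The owner is the single node that sees the dialog's tag as active.
        tag = self.dialogs[call_id]
        owners = [n for n, s in self.tags.items() if s.get(tag) == ACTIVE]
        return owners[0] if owners else None


cluster = Cluster()
cluster.add_node("node1", vip=ACTIVE)
cluster.add_node("node2", vip=BACKUP)
cluster.dialogs["call-a"] = "vip"

owner_before = cluster.owner_of("call-a")
cluster.set_active("node2", "vip")   # e.g. after the VIP moved to node2
owner_after = cluster.owner_of("call-a")
```

Note that the dialog itself never changes: ownership moves only because the state of its tag changes on the nodes.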

To better understand this, we will briefly describe how the sharing tags should be used in the previously mentioned scenarios, considering a simple two node cluster:

  1. in an active-backup cluster with a single VIP, we would only need a single sharing tag corresponding to the VIP address; the node that holds the VIP will also have the tag set to active and perform all the dialog-related actions;
  2. in an active-active cluster with two VIPs, we would need two sharing tags, corresponding to each VIP, and whichever node holds the given VIP, should have the appropriate tag set as active;
  3. in an anycast cluster setup, we will have one sharing tag corresponding to each node (because the dialog is tied to the node on which it was first created, as opposed to an IP). If a node is up, it should have its corresponding tag active; otherwise, any other node can take the tag over.
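The three layouts can be written down as plain data and checked against the "exactly one active node per tag" rule. The Python sketch below is illustrative only: the node names are made up, and listing an explicit backup state for every tag on every node is an assumption of this model, not a statement about what OpenSIPS requires in its configuration.

```python
from collections import Counter

# Normal-operation tag states for a two-node cluster (names are illustrative).
SCENARIOS = {
    "active-backup": {
        "node1": {"vip": "active"},
        "node2": {"vip": "backup"},
    },
    "active-active": {
        "node1": {"vip1": "active", "vip2": "backup"},
        "node2": {"vip1": "backup", "vip2": "active"},
    },
    "anycast": {
        "node1": {"node_1": "active", "node_2": "backup"},
        "node2": {"node_1": "backup", "node_2": "active"},
    },
}


def active_counts(layout):
    """Count how many nodes see each tag as active."""
    counts = Counter()
    for states in layout.values():
        for tag, state in states.items():
            counts[tag] += state == "active"
    return dict(counts)


def is_consistent(layout):
    """True when every tag has exactly one active node."""
    return all(n == 1 for n in active_counts(layout).values())


# A split-brain layout: both nodes believe they own the same tag.
split_brain = {"node1": {"vip": "active"}, "node2": {"vip": "active"}}
```

A check like `is_consistent` flags `split_brain`, the situation in which the exactly-once actions listed earlier would be performed twice.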

Configuration

Setting up dialog replication in OpenSIPS 2.4 is very easy and, in the following, we will exemplify our discussed scenarios with the essential configuration:

1. Active-backup setup

Let’s use the tag named “vip” which will be configured via the dlg_sharing_tag module parameter. When starting OpenSIPS, you need to check the HA status of the node (by inspecting the HA system) and to decide which node will start as owner of the tag:

modparam("dialog", "dlg_sharing_tag", "vip=active")

if active, or:

modparam("dialog", "dlg_sharing_tag", "vip=backup")

if standby.

During runtime, depending on state changes in the HA system, the tag may be moved (as active) to a different node by using MI commands (see the “Changing sharing tags state” section).

At script level, all we need to do, on each node, is to mark a newly created dialog with the sharing tag, using the set_dlg_sharing_tag() function:

if (is_method("INVITE")) {
    create_dialog();
    set_dlg_sharing_tag("vip");
}

2. Active-active setup

Similar to the previous case, but we will use two tags, one for each VIP address. We will define the initial tag state for the first VIP, on the first node:

modparam("dialog", "dlg_sharing_tag", "vip1=active")

The second node will initially be responsible for the second VIP, so on node id 2 we will set:

modparam("dialog", "dlg_sharing_tag", "vip2=active")

Now, on each node, depending on which VIP the initial INVITE was received on, we mark the dialog appropriately:

if (is_method("INVITE")) {
    create_dialog();
    # mark the dialog according to the VIP the INVITE was received on
    if ($Ri == "10.0.0.1")        # VIP 1
        set_dlg_sharing_tag("vip1");
    else if ($Ri == "10.0.0.2")   # VIP 2
        set_dlg_sharing_tag("vip2");
}

So, calls established via the VIP1 address will be marked with the “vip1” tag and handled by the node having the “vip1” tag as active – this will be node 1 in normal operation.

The calls established via the VIP2 address will be marked with the “vip2” tag and handled by the node having the “vip2” tag as active – this will be node 2 in normal operation.

If node 1 fails, the HA system will move VIP1 to node 2. The HA system is also responsible for instructing the OpenSIPS instance running on node 2 to become the owner of the “vip1” tag, so node 2 will start to actively handle the calls marked with “vip1” as well.

3. Anycast setup

Each node has its own corresponding tag and it starts with the tag as active. So on node 1 we will have:

modparam("dialog", "dlg_sharing_tag", "node_1=active")

And on the second node, the same as above, but with “node_2=active”.

Now, each node marks the dialogs with its own tag, for example on node 1:

if (is_method("INVITE")) {
    create_dialog();
    set_dlg_sharing_tag("node_1");
}

And, conversely, node 2 marks each created dialog with the “node_2” tag.

If node 1 fails, the monitoring system (also responsible for the Anycast management and BGP updates) will pick one of the remaining nodes in the anycast group and activate the “node_1” tag on it. This node will thus become the owner of, and responsible for, the calls created on the former node 1.

Changing sharing tags state

All that remains to be discussed is how we can take over the ownership of the dialogs flagged with a certain sharing tag at runtime. This is of course needed when our chosen mechanism of node availability detects that a node in the cluster is down, or when we do a manual switch-over (e.g. for maintenance). For this purpose, all we have to do is issue the MI command dlg_set_sharing_tag_active, which sets a certain sharing tag to the active state. For example, in the single VIP scenario, with a sharing tag named “vip”, after we have re-pointed the floating IP to the current machine, we would run:

opensipsctl fifo dlg_set_sharing_tag_active vip
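In practice this command is usually run by the HA tool rather than by hand. As a rough sketch, a hook written in Python could wrap the exact command line shown above. The function names and the `runner` parameter are our own, and we assume `opensipsctl` is on the `PATH` of the user running the hook.

```python
import subprocess


def mi_activate_tag_argv(tag):
    """Build the command line shown above for a given sharing tag."""
    if not tag or any(ch.isspace() for ch in tag):
        raise ValueError("sharing tag must be a non-empty token without whitespace")
    return ["opensipsctl", "fifo", "dlg_set_sharing_tag_active", tag]


def activate_tag(tag, runner=subprocess.run):
    """Run the MI command for `tag`.

    With the default runner, check=True makes subprocess raise
    CalledProcessError if the command exits with a non-zero status.
    """
    return runner(mi_activate_tag_argv(tag), check=True)
```

The HA tool's "became master" hook would then call `activate_tag("vip")` once the floating IP has been moved. Passing the arguments as a list avoids shell quoting issues, and the injectable `runner` keeps the helper testable without a running OpenSIPS.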

Conclusions

The new dialog clustering support in OpenSIPS 2.4 is a complete one as it not only takes care of dialog replication/sharing, but also of dialog handling in terms of properly triggering dialog-specific actions.

The implementation also tries to provide a consistent solution, by following and addressing the most used scenarios in terms of dialog clustering – these are real world scenarios answering to real world needs.

Moreover, the work on dialog clustering was closely correlated with the work on Anycast support, so it will be easy for the user to build an integrated anycast setup taking care of both the transaction and dialog layers.

Need more practical examples? Join us at the OpenSIPS Summit 2018 in Amsterdam and see the interactive demos of the clustering support in OpenSIPS 2.4.
