
The clustering support in OpenSIPS is a challenging area, under continuous evolution. Even years after its initial version, we still discover challenging scenarios which need to be understood and addressed. Production environments, especially those involving large amounts of data, are typical melting pots for such challenges. Below are some of our findings from the last few months.
Conflict Between Cluster Sync and Live SIP Traffic
The sync mechanism of OpenSIPS Cluster ensures a freshly booted OpenSIPS node has the same dataset (e.g. SIP endpoints) as the rest of the nodes within the same cluster. This is useful, for example, in HA scenarios, where you want a hot backup box readily available to take over the SIP traffic processing should the active box ever catch fire.
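As a point of reference, a minimal clustered registration setup looks roughly like the sketch below. This is only an illustration: the node ID, BIN listener address, database URL and cluster ID are placeholder values, and the cluster topology itself is normally provisioned in the clusterer database table.

    # BIN listener used for inter-node cluster traffic
    listen = bin:10.0.0.10:5566

    loadmodule "proto_bin.so"
    loadmodule "clusterer.so"
    loadmodule "usrloc.so"

    # this node's ID within the cluster (the topology is read from the DB)
    modparam("clusterer", "my_node_id", 1)
    modparam("clusterer", "db_url", "mysql://opensips:opensipsrw@localhost/opensips")

    # share SIP registrations across all nodes of cluster 1
    modparam("usrloc", "working_mode_preset", "full-sharing-cluster")
    modparam("usrloc", "location_cluster", 1)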
Here we discovered that, on rare occasions, if you restarted an OpenSIPS node while it was handling live SIP traffic, the cluster sync operation could get stuck because of a conflicting SIP REGISTER being processed at the same time!
While only a rare occurrence, this scenario implied a temporary service outage unless proper monitoring was in place to forcefully restart the faulty OpenSIPS node. The issue was addressed shortly after the report, so let’s move ahead.
Incomplete Sync of Large Data on Startup
If you are running a clustered OpenSIPS v3.4.8 or older and also have lots of SIP endpoints registering into the cluster, you might have noticed the occasional incomplete sync after restarting any of your instances. For example, the donor OpenSIPS node has 25000 registered endpoints, but the freshly restarted instance sometimes only pulls approx. 9000 endpoints, before declaring itself as “synced”. Afterwards, the sysadmin runs a manual ul_cluster_sync MI command and all 25000 registrations are correctly pulled this time. But why did the initial sync abort midway?
In short, the problem was rooted in the seed_fallback_interval setting of the clusterer module, which did not take into account real-life startup delays, such as loading your routing rules, dialplan or dispatcher destinations from the database, or similar caching operations that may each take a few seconds to complete.
This was a frequently occurring scenario, amplified by the need to replicate large data sets (such as registrations), and it caused a temporary loss of data on the newly booted node. Once we understood the issue, fixing it was just a formality.
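If you cannot upgrade yet, one possible mitigation (under the assumption that the restarted node acts as a seed node) is to give it more breathing room by raising this interval, and to trigger a manual re-sync via the ul_cluster_sync MI command (e.g. opensips-cli -x mi ul_cluster_sync) should a node still end up with partial data:

    # give a slow-starting node more time before it falls back to
    # considering its own (possibly partial) data set as synced
    # (value in seconds; illustrative only)
    modparam("clusterer", "seed_fallback_interval", 30)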
Cluster Breakage During Large Data Sync
Finally, we addressed another report where the OpenSIPS cluster would temporarily split during a large data sync, only to rejoin shortly afterwards. It turned out that the more data you synced without increasing your cluster’s ping_timeout, the more likely you were to break the cluster during the sync, which would also abort the sync.
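Before these changes, the only practical workaround was to bump the cluster ping timeout so that it outlasted the longest expected sync, along the lines of the sketch below (the value is illustrative; the parameter is expressed in milliseconds):

    # wait longer for ping replies before declaring a peer node down,
    # so that a lengthy data sync does not split the cluster
    # (no longer necessary on 3.4.10+ / 3.5.3+)
    modparam("clusterer", "ping_timeout", 5000)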
A couple of clever changes later, this report had also been addressed; sysadmins no longer need to keep increasing the ping_timeout as the number of SIP endpoints grows.
This was a very disruptive, highly probable scenario: each node restart (even of a backup) or manual MI “sync” command would often break the consistency of the OpenSIPS cluster for a few seconds, and the synced data might not even fully arrive. Again, all of this happens in conjunction with large amounts of data to be replicated.
Conclusion
The latest round of OpenSIPS stable minor releases from Dec 18th, namely 3.5.3 and 3.4.10, includes a series of critical stability improvements to the clusterer module when working with large amounts of data, among many other fixes in other parts of OpenSIPS.
Packaging for these releases is already available via both APT and YUM repositories, while the latest source code can be downloaded via GitHub, as always.
But how useful are these improvements to you?
