Hunting Down Complex OpenSIPS Bugs in Production Environments

opensips-code

Three key principles have guided the development of OpenSIPS since its very beginnings: performance, flexibility and modularity. The resulting codebase and feature set all reflect this statement: multiple process management, shared memory, async I/O, module APIs, optimized code, etc. While these are trivial to use and benefit from once everything is nice and working, not the same can be said about the efforts required to extend and/or troubleshoot the underlying code.

How did we get here?

With a lot of good spirit and hard work, of course! Over the course of the years, as SIP adoption has become more widespread, its usage cases have become more complex. New SIP extension drafts have emerged, while existing ones have been either revised or obsoleted. During the same time, there has been a constant push to integrate OpenSIPS with a broad range of related VoIP software/hardware. All of this, in turn, has resulted into a constantly growing complexity of the OpenSIPS C code behind SIP transactions, dialogs, accounting, presence, etc. What used to be a handful of SIP testing scenarios has now reached numbers well into the hundreds!

But this linearly growing number of supported SIP scenarios/drafts is not the core problem – after all, a good suite of system tests should be enough to confirm the stability of a release. The problem is when you pass the tests, yet you fail to catch that subtle concurrency bug that only manifests at high call volumes and is close to impossible to replicate in an artificial SIP traffic environment (aka. good ol’ sipp testing). Needless to say that the bug will constantly trigger in production, leading to a steady stream of emails – all of them eagerly requesting a quick fix.

Ok, so let’s recap the problem statement:

we want to track down the event history of a C structure stored in shared memory, manipulated by multiple processes
the runtime environment is highly concurrent: tenths of cores, thousands of CPS, production-level SIP traffic
we cannot run OpenSIPS in debug mode
we are dealing with a class of bugs which produce thwarted corefile backtraces (showing the effect of the bug, not its cause)

A Quick Solution

The naive solution would be to sprinkle some INFO log level debugging statements (Tip: INFO is one level above DEBUG, in terms of verbosity) throughout the code one intends to troubleshoot, in critical areas. Although it may definitely help resolve some of the crashes, here is where this solution fails to deliver:

not all environments will support logging on such a scale. Imagine having to push logs for all the transitions / references happening around each transaction, dialog or TCP connection when running 2000 CPS through OpenSIPS. rsyslogd would just jump into 100% CPU usage and hog the entire system.
as it is, this patch would not make it into upstream OpenSIPS. Although INFO-level logging is right above DEBUG, it isn’t that verbose and many deployments make use of it in production. We cannot just pollute INFO-level logging with C structure debugging statements. We’d also prefer not to do the following, as it litters the codebase:
```
#ifdef DBG_TCPCON
LM_INFO("tcp_conn_get, (%d)\n", c->refcnt);
#endif
```

The Definitive Solution

Now that we have a better idea for what the requirements actually are, here is a breakdown of the features / design decisions that make up what we now call “struct history support”:

possibility to perform in-memory only logging
- this immediately requires a gliding window mechanism (e.g. “the history logs of only the last 100K SIP transactions”), since one cannot indefinitely dump logs to memory, as opposed to syslog.
possibility to log struct events to syslog
- also the possibility to easily match and walk through the logs of one structure, although its log lines may be intertwined with others
possibility to completely disable the history logging statements at compile time out of the code that is using them
ensure that this code does not get compiled by default default into the OpenSIPS binary (only by enabling -DDBG_STRUCT_HIST in Makefile.conf)

With the above in mind, this is the resulting C API (more info under lib/dbg/struct_hist.h):

struct struct_hist_list *shl_init(char *obj_name, int window_size);

void shl_destroy(struct struct_hist_list *shl);

struct struct_hist *sh_push(void *obj, struct struct_hist_list *list);

void sh_unref(struct struct_hist *sh, struct struct_hist_list *list);

int sh_log(struct struct_hist *sh, enum struct_hist_verb verb, char *fmt, ...);

Action Time! How We Troubleshooted a TCP Engine Bug

First, let me quickly describe the problem: a reference counting bug was causing a TCP connection to be freed while another OpenSIPS process was making use of it. Needless to say that the resulting gdb backtrace was hardly of any use, showing little more than a process trying to modify what it considered to be a TCP connection, but which was actually a freshly allocated SIP message!

This is the compromise of using shared memory with a custom memory allocator: performance goes up, but so do the hours you spend debugging memory corruption bugs if you don’t put in the time to also write proper debugging tooling to go along with the allocator.

Getting back to our problem, now that we had the struct history support, we were only left to add a series of sh_log() statements in critical areas (TCP connect, accept, read, write, pass to reader, pass to writer, back to main, etc.), after which we deployed straight to production. A few hours later, the crash occurred, along with the much-needed connection tracking logs, which resembled the following:

$7540 = {verb = TCP_REF, t = 48360500000, pid = 49219, log = 0x7f95acf6ea60 "connect, (1)"}
$7541 = {verb = TCP_UNREF, t = 48360500000, pid = 49219, log = 0x7f95acfb8300 "send 6, (1)"}
$7542 = {verb = TCP_REF, t = 48360600000, pid = 49313, log = 0x7f95acf8e4a0 "tcpconn read, (1)"}
$7543 = {verb = TCP_SEND2CHILD, t = 48360600000, pid = 49313, log = 0x7f95acf8e4f0 "read"}
...
$8186 = {verb = TCP_REF, t = 51580700000, pid = 49276, log = 0x7f95ad0e7be0 "tcp_conn_get, (2)"}
$8187 = {verb = TCP_UNREF, t = 51580700000, pid = 49276, log = 0x7f95ad0e7c30 "send 6, (2)"}
$8188 = {verb = TCP_REF, t = 51590600000, pid = 49236, log = 0x7f95ad0bace8 "tcp_conn_get, (2)"}
$8189 = {verb = TCP_UNREF, t = 51590600000, pid = 49236, log = 0x7f95ad0bad38 "send 6, (2)"}
$8190 = {verb = TCP_REF, t = 51601000000, pid = 49307, log = 0x7f95ad0e79f8 "tcp_conn_get, (2)"}
$8191 = {verb = TCP_UNREF, t = 51601000000, pid = 49307, log = 0x7f95ad0d8700 "send 6, (2)"}
$8192 = {verb = TCP_UNREF, t = 51601000000, pid = 49313, log = 0x7f95ad0d8750 "tcpworker async write, (2)"}
$8193 = {verb = TCP_REF, t = 51601000000, pid = 49313, log = 0x7f95ad0e6988 "tcpconn write, (2)"}
$8194 = {verb = TCP_SEND2CHILD, t = 51601000000, pid = 49313, log = 0x7f95ad0e69d8 "write"}
$8195 = {verb = TCP_UNREF, t = 51601200000, pid = 49313, log = 0x7f95ad093e30 "tcpworker destroy, (1)"}
$8196 = {verb = TCP_DESTROY, t = 51601200000, pid = 49313, log = 0x7f95ad093e80 "type=2"}

This info was extracted using a gdb “while” loop

Shortly after, the bug was spotted and a fix was available within an hour. A few days later, we backported it to the 2.3 and 2.2 stable branches, just in time before we put together a new set of stable minor releases for the OpenSIPS community. The struct history support is only available on the master branch (future OpenSIPS 2.4), but it’s trivial to integrate into an older release if required.

To conclude, I can definitely say that there are lots of pieces of OpenSIPS code which could benefit from this lightweight production-ready troubleshooting support: accounting contexts, transactions, dialogs, presence entities, TLS connections, etc. (virtually any short-lived structure). We now have the ability to put each of them under the microscope and carefully backtrack them even under the most unfriendly conditions of high-volume SIP traffic.

Hunting Down Complex OpenSIPS Bugs in Production Environments

How did we get here?

A Quick Solution

The Definitive Solution

Action Time! How We Troubleshooted a TCP Engine Bug

Published by liviuchircu

Leave a comment Cancel reply

How did we get here?

A Quick Solution

The Definitive Solution

Action Time! How We Troubleshooted a TCP Engine Bug

Share this:

Related

Published by liviuchircu

Leave a comment Cancel reply