Troubleshooting missing ACK in SIP

We have all experienced calls that disconnect by themselves after 5-10 seconds. The call is usually torn down by the callee side with a BYE request, but that BYE is not triggered by the person behind the phone; it is generated by the SIP stack/layer itself.

This is one of the most common issues we get in SIP, and one of the most annoying at the same time. But why does it happen?

Getting to the missing ACK

Such a decision to auto-terminate the call (beyond the end-user’s will and control) indicates an error in the SIP call setup. And because the call was partially established (both end-points were able to exchange media), we need to focus on the signalling that takes place after the 200 OK reply (sent when the callee accepts the call). So, what do we have between the 200 OK reply and the full call setup? It is the ACK request, the caller’s acknowledgement of the received 200 OK.

According to RFC 3261 (section 13.3.1.4), a SIP device that does not receive the ACK for its final 2xx reply keeps retransmitting that reply and, once it gives up, should terminate the session by issuing a standard BYE request.

So, whenever you experience such calls disconnecting after about 10 seconds, the first thing to do is take a SIP capture/trace and check whether the callee end-device is actually getting an ACK. It is very important to check for the ACK at the level of the callee end-device, and not at the level of the caller or of intermediary SIP proxies, because the ACK may get lost anywhere on the path from caller to callee.
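As a rough illustration of that check, here is a minimal Python sketch. It assumes you have already exported the SIP messages seen at the callee end-device as a list of raw text messages; the helper names (`parse_sip`, `ack_missing`) are made up for this example, and compact header forms are not handled.

```python
def parse_sip(raw):
    """Split a raw SIP message into its start line and a dict of headers."""
    lines = raw.strip().splitlines()
    if not lines:
        return "", {}
    headers = {}
    for line in lines[1:]:
        if not line.strip():
            break  # blank line: headers are over, the body follows
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return lines[0].strip(), headers


def ack_missing(messages, call_id):
    """True if the capture shows a 200 OK for the INVITE but no ACK for that call."""
    saw_200 = False
    saw_ack = False
    for raw in messages:
        start_line, headers = parse_sip(raw)
        if headers.get("call-id") != call_id:
            continue  # message belongs to another call
        if start_line.startswith("SIP/2.0 200") and headers.get("cseq", "").endswith("INVITE"):
            saw_200 = True
        elif start_line.startswith("ACK "):
            saw_ack = True
    return saw_200 and not saw_ack
```

The important part is where the capture is taken: the same function run on a capture from the caller side would happily report that an ACK was sent, which says nothing about whether it arrived.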

Tracing the lost ACK

In order to understand how and where the ACK gets lost, we first need to understand how the ACK is routed from the caller to the callee’s end-device. Without getting into all the details, the ACK is routed to the callee based on the Record-Route and Contact headers received in the 200 OK reply. So, if the ACK is mis-routed, it is mainly because of wrong information in the 200 OK.
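To make the routing rule concrete, here is a small Python sketch of how a caller derives the ACK routing from the 200 OK. It is a simplification that assumes all proxies are loose routers (their Record-Route URIs carry `;lr`); the function name and the example hosts are hypothetical.

```python
def ack_routing(record_routes, contact_uri):
    """Derive ACK routing from a 200 OK, from the caller's point of view.

    record_routes: Record-Route URIs, in the order they appear in the 200 OK.
    contact_uri:   the URI taken from the Contact header of the 200 OK.
    Assumes every proxy on the path is a loose router (;lr).
    """
    # The caller uses the Record-Route set in reverse order (RFC 3261, 12.1.2).
    route_set = list(reversed(record_routes))
    # The Contact URI becomes the Request-URI of the ACK (the remote target).
    request_uri = contact_uri
    # The first hop is the first Route entry, or the Contact itself if no proxy record-routed.
    next_hop = route_set[0] if route_set else contact_uri
    return {"request_uri": request_uri, "route": route_set, "next_hop": next_hop}


routing = ack_routing(
    ["<sip:proxyB.example.net;lr>", "<sip:proxyA.example.net;lr>"],
    "sip:bob@192.168.1.20:5060",
)
```

In this example the ACK travels through ProxyA and ProxyB without trouble, but the last proxy then has to deliver it to `192.168.1.20`, the address taken from the Contact. If that address is wrong or unreachable, the ACK is lost at the very last hop.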

The Record-Route headers (in the 200 OK) are less to blame, as they are inserted by the visited proxies and not changed by anyone else. Assuming that you do not have some really special scenarios with SIP proxies behind NATs, we can simply discard the possibility of having faulty Record-Routes.

So, the primary suspect is the Contact header in the 200 OK. This header is inserted by the callee’s end-device and can be altered by any proxy in the middle, so there are many opportunities for it to get corrupted. This mainly happens due to wrong handling of NAT on the end-user side: yes, that’s it, a NATed callee device.

Common scenarios

No NAT handling

If the proxy does not properly handle a NATed callee device, it will propagate the callee’s private IP in the 200 OK reply. Of course, this IP will be unusable when it comes to routing the ACK back to the callee: the proxy will have the “impossible” mission of routing to a private IP :). So, the ACK will get lost and the call will get disconnected.
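When reading a trace, a quick way to spot this case is to test whether the Contact of the 200 OK carries a private (RFC 1918) address. The Python sketch below does only that; the helper names are made up for the example, and it handles IPv4 literals only (hostnames and IPv6 are reported as not private).

```python
import ipaddress
import re

# The three RFC 1918 private IPv4 ranges.
RFC1918 = [ipaddress.ip_network(n) for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def contact_host(contact):
    """Extract the host part of the first sip:/sips: URI found in a Contact value."""
    m = re.search(r"sips?:(?:[^@>]*@)?([^:;>]+)", contact)
    return m.group(1) if m else None


def contact_is_private(contact):
    """True if the Contact URI host is an RFC 1918 IPv4 address."""
    host = contact_host(contact)
    if host is None:
        return False
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # a hostname, not a literal IP address
    return any(addr in net for net in RFC1918)
```

A 200 OK leaving your proxy towards the caller with a Contact for which `contact_is_private()` is true is a strong hint that the ACK will not make it back.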

[Figure: sip_flow_missing_ack_err1]

If this is the case, with OpenSIPS you will have to review your logic in the onreply route and perform fix_nated_contact() for the 200 OK if the callee is known to be NATed.
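Conceptually, fixing the Contact means overwriting the host and port of the Contact URI with the source IP and port the reply was actually received from. The Python sketch below models that idea only; it is not the OpenSIPS implementation, the function name is hypothetical, and keeping the username and URI parameters untouched is an assumption of this example.

```python
import re


def rewrite_contact(contact_uri, src_ip, src_port):
    """Replace host[:port] in a SIP URI with the source IP and port of the reply.

    The scheme, the user part and any URI parameters are left as they are.
    """
    # Group 1 captures "sip:" or "sips:" plus the optional "user@" part;
    # what follows, up to ';' or '>', is the host[:port] to be replaced.
    pattern = r"(sips?:(?:[^@>;]*@)?)[^;>]+"
    return re.sub(
        pattern,
        lambda m: f"{m.group(1)}{src_ip}:{src_port}",
        contact_uri,
        count=1,
    )


fixed = rewrite_contact("sip:bob@192.168.1.20:5060;transport=udp", "203.0.113.9", 40312)
# fixed == "sip:bob@203.0.113.9:40312;transport=udp"
```

After such a rewrite, the Contact seen by the caller points to the public side of the callee’s NAT, which is an address the last proxy can actually send the ACK to.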

The correct handling and flow has to be like this:

[Figure: sip_flow_missing_ack_ok]

Excessive NAT handling

While handling NATed end-points is good, you have to be careful not to overdo it. If you see a private IP in the Contact header, you should not automatically replace it with the source IP of the SIP packet. Nor should you do it for every incoming reply (like “let’s do it all the time, just to be sure”).

In more complex scenarios, where a call may visit multiple SIP proxies, the proxies may lose valuable routing information by doing excessive NAT traversal handling. In the scenario below, ProxyA is overdoing it by applying the NAT traversal logic to replies coming from a proxy (ProxyB) as well, and not only to replies coming from an end-point. By doing this, the IP coordinates of the callee are lost from the Contact header, as ProxyA has no direct visibility of the callee (in terms of IP).

[Figure: sip_flow_missing_ack_err2]

In such a case, with OpenSIPS, you will have to review your logic in the onreply route and make sure you perform fix_nated_contact() for the 200 OK only if the reply comes from an end-point and not from another proxy.
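The decision itself can be sketched as a small predicate. This is a Python model of the reasoning, not OpenSIPS script: the function name, the `TRUSTED_PROXIES` set and its address are hypothetical, and treating "private Contact host" as the sign of a NATed end-point is an assumption of the example.

```python
import ipaddress

# The three RFC 1918 private IPv4 ranges.
RFC1918 = [ipaddress.ip_network(n) for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

# Hypothetical: the address of ProxyB as seen by ProxyA.
TRUSTED_PROXIES = {"198.51.100.20"}


def should_fix_contact(status_code, contact_host, reply_src_ip, trusted_proxies=TRUSTED_PROXIES):
    """Decide whether the Contact of a reply should be rewritten with its source IP."""
    if not 200 <= status_code < 300:
        return False  # this sketch only deals with the 2xx reply that triggers the ACK
    if reply_src_ip in trusted_proxies:
        return False  # relayed by another proxy: the source IP is not the callee's
    try:
        addr = ipaddress.ip_address(contact_host)
    except ValueError:
        return False  # Contact carries a hostname, leave it alone
    # Fix only when the end-point advertised a private address.
    return any(addr in net for net in RFC1918)
```

With this guard, a 200 OK received by ProxyA from ProxyB keeps whatever Contact ProxyB has already put in place, while a 200 OK received directly from a NATed phone still gets fixed.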

Conclusions

SIP is complicated and you have to pay attention to all the details, if you want to get it to work. Focusing only on routing the INVITE requests is not sufficient.

If you come across disconnected calls:

  1. get a SIP capture/trace and see if the ACK gets to the callee end-point
  2. if not, check the Contact header in the 200 OK: it must always point to the callee end-point (a public IP)
  3. if not, check the NAT traversal logic you have in the onreply routes – be sure you do the Contact fixing only when it is needed.

In short, be moderate when it comes to NAT handling: not too little and not too much 🙂

10 thoughts on “Troubleshooting missing ACK in SIP”

  1. Hi, very interesting and clear explanation. If in a network calls are not clearing, i.e. they appear to be hung up by the PBX but stay connected when the phones are peer to peer, what would you think is the cause? I know this is not OpenSIPS, but help is appreciated.


    1. I would say the BYE requests (which are also in-dialog requests, like the ACK) are not properly routed between the end-points. The possible causes for this are similar to the ones explained here for the ACK.


      1. Thanks for your reply. Yes, this is a strange scenario. It’s a complicated NHS network with NAT, separate voice VLANs and a firewall between the gateway and the phones on the LAN. It’s good to get the confidence that you feel it’s this, as that’s my hunch too. I just don’t have your kind of insight and knowledge. I will now investigate this further.


  2. I’ve tried fix_nated_contact() and I get the correct IP now, but the carrier uses the phone number (username) as an internal routing number, so I get a rubbish phone number in the OK response. Is there any way to rewrite the complete Contact header with the correct user, not only the IP?


    1. Stefan, there is no such thing as the correct username for the Contact URI. In SIP, the Contact URI is used to pinpoint an end-device by its IP coordinates: nothing more than protocol, IP address and port. In short, the username part of the Contact URI is not relevant for the SIP layer.
      No one should rely on the Contact URI username for routing (or for anything else).


  3. Very good explanation, thanks. I have a hard time understanding the ACK handling behaviour of one of our telco operators. They change the Contact header info after Proxy-Auth. trialing. So the 2nd INVITE and finally the “OK” hold a different public IP than the one in the INVITE, so of course OpenSIPS sends the ACK to the Contact header IP, which doesn’t work (it is their RTP media address !!!!). They claim that I have to send the ACK back to the T-URI I got in the 200 ‘OK’ reply route !!!. (They use the 200 OK Contact header for their own prefix routing inside their system.) Furthermore, the To header has their domain IP in it when calls come from their system to ours. Shouldn’t it be our domain, as their destination domain?
    What can I do as a customer? Can I rewrite the Contact header in the 200 OK response, or do you have another, more correct way to deal with this mess?

    Thank you in advance for your time and superb work.

    /Regards
    Stefan


    1. Stefan, RFC 3261 clearly says that the Contact header in the 2xx reply must point to the IP location of the callee (in terms of signaling, of course). If your carrier does not advertise their IP in the 200 OK Contact, it is wrong and against the SIP standard. And once again, in terms of IP routing, the only information to be used for the ACK is the Contact and Record-Route headers from the 200 OK (what is also called the routing set).
      What you can do is to use the fix_nated_contact() function in onreply route, for the 200 OK you receive from the carrier – by doing that, OpenSIPS will overwrite the Contact URI with the source IP of the reply (so, the IP of the carrier).


      1. THANKS !!! YOU ARE MY HERO !!!!
        I’ve read section 13.2.2.4 of RFC 3261, but I was not sure that I understood it correctly.

        The UAC core MUST generate an ACK request for each 2xx received from
        the transaction layer. The header fields of the ACK are constructed
        in the same way as for any request sent within a dialog (see Section
        12) with the exception of the CSeq and the header fields related to
        authentication. The sequence number of the CSeq header field MUST be
        the same as the INVITE being acknowledged, but the CSeq method MUST
        be ACK. The ACK MUST contain the same credentials as the INVITE.

        Again, thank you soooooo much, and take care..

        //Stefan

