I think I am seeing the behavior described in the title.
We have a Balance 310 with 3 WAN connections. It is running firmware version 8.0.0 build 4203.
Our SIP provider will route calls to the 3 IP addresses used for our PBX in the following order:
- WAN 2 IP address
- WAN 1 IP address
- WAN 3 IP address
We are using the SIP service passthrough in standard mode to accommodate NAT of that SIP traffic on those IPs through to the PBX. We also have a ‘priority’ algorithm outbound policy for traffic from the PBX to the SIP provider, using the same connection order as the trunk priority shown above, so that outbound SIP signalling goes over the appropriate connection.
This has worked for years. In the event of WAN 2 failure, calls flow in over WAN 1, then WAN 3. (WAN 2 is the highest priority for this provider because it has the lowest latency to their RTP servers.)
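To make the expected failover concrete, here is a rough Python sketch of how I understand a ‘priority’ selection to behave. The names `PRIORITY` and `pick_wan` are made up for illustration; this is a model of the intended behavior, not Peplink's actual implementation.

```python
# Hypothetical model of priority-based WAN selection (not Peplink code).
# The order matches the SIP provider's trunk routing priority.
PRIORITY = ["WAN 2", "WAN 1", "WAN 3"]


def pick_wan(priority, available):
    """Return the first WAN in priority order that is currently available."""
    for wan in priority:
        if wan in available:
            return wan
    return None  # no usable connection at all


# All links healthy: the lowest-latency link is chosen.
print(pick_wan(PRIORITY, {"WAN 1", "WAN 2", "WAN 3"}))  # WAN 2
# WAN 2 failed or disconnected: traffic should move to WAN 1.
print(pick_wan(PRIORITY, {"WAN 1", "WAN 3"}))  # WAN 1
```

The point is that both directions (the provider's inbound routing and our outbound policy) should land on the same WAN once WAN 2 is out of the available set.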
Previously, in the event of a ‘bouncing’ connection due to inconsistent health status failures, we would manually disconnect WAN 2 from the dashboard until the issue was resolved, forcing SIP traffic over to WAN 1.
We just had this scenario happen for the first time since upgrading to the current firmware version, and calls would not complete. What we appear to be seeing is SIP INVITEs still being received over WAN 2 and routed to the PBX. The PBX responds, but the Peplink SIP ALG doesn’t do anything with those packets, presumably because it would want to change the PBX IP in the response headers to the WAN 2 public IP and send them out that connection, which is disabled.
After a number of INVITEs with no response, the SIP provider sends INVITEs to the 2nd and 3rd IPs in their trunk routing priority. These make it through to the PBX, but because the Call-ID is the same as that of the INVITE the PBX is still trying to answer from WAN 2, the PBX sees it as a looping issue and responds with 482 (Loop Detected).
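For anyone wondering why the PBX answers 482 here, this is a simplified Python sketch of the merged-request check described in RFC 3261 section 8.2.2.2. The `Invite` class and `response_code` function are hypothetical names for illustration, and real transaction matching also looks at the Via sent-by value, not just the branch. I am also assuming the provider's retry to the next IP carries a new Via branch, which I have not confirmed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Invite:
    call_id: str
    from_tag: str
    cseq: int
    branch: str       # branch parameter of the top Via header
    to_tag: str = ""  # empty for an initial (out-of-dialog) request


def response_code(incoming, pending):
    """Return 482 if `incoming` looks like a merged copy of a pending
    INVITE, otherwise None (meaning: process the request normally)."""
    if incoming.to_tag:
        return None  # in-dialog request, the merged-request check is skipped
    for p in pending:
        same_request = (p.call_id, p.from_tag, p.cseq) == (
            incoming.call_id, incoming.from_tag, incoming.cseq)
        # Same call identifiers but a different transaction: 482 Loop Detected.
        if same_request and p.branch != incoming.branch:
            return 482
    return None


# The INVITE that arrived over WAN 2 is still unanswered on the PBX.
pending = [Invite("abc@provider", "ft1", 1, "z9hG4bK-wan2")]
# The provider retries the same call towards the WAN 1 address.
print(response_code(Invite("abc@provider", "ft1", 1, "z9hG4bK-wan1"), pending))  # 482
```

If that is what is happening, the 482 is the PBX behaving correctly; the real problem is that the first INVITE should never have reached it over a disconnected WAN.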
To add to the confusion, if you physically disconnect WAN 2’s network cable OR create a firewall rule blocking SIP traffic on WAN 2, failover happens properly. It is only disconnecting the WAN on the dashboard that produces this behavior.
When this was occurring, we simply changed the IP priority in the main provider’s trunk routing profile, but that’s not ideal because we have multiple SIP providers, along with other services coming in over WAN 2, so it is a lot simpler to just kill the connection in one place until the internet provider resolves the issue. Disconnecting the physical cable isn’t ideal either since someone isn’t always on site. The firewall rule is a workaround, but doesn’t automatically take care of the outbound priority, so you would need to remember to both disconnect the connection AND turn on a pre-programmed firewall rule.
Since this did work many times previously (normally our operators don’t even know anything is down until they dispatch the connection alerts from the firewall to tech staff), it would seem that the behavior was introduced with some firmware revision. Has anyone seen this before, or is anyone aware of why it would do this?
Here is a chart of the packets captured at the Peplink for a test SIP call with WAN 2 disabled: