SIP ALG (SIP Passthrough) and Outbound Policies

ngefvert · March 25, 2016, 1:20am

It appears that when you are using SIP ALG (Service Passthrough > SIP > Standard Mode), Outbound Policies for WAN connection preference and Outbound NAT mappings are not obeyed for SIP traffic. Has anyone else seen this?

We have 3 WAN connections, and for all traffic originating from our SIP server to the outside world, we have the Outbound Algorithm set to priority, with the following order specified (from highest to lowest):

WAN 2
WAN 1
WAN 3

When SIP traffic goes out of our network, it will apparently randomly select a WAN interface. We had all SIP traffic go out over WAN 3 for a few days, then it switched to going out over WAN 1. WAN 2 is up, and other traffic is going out over it - in fact, any traffic from the SIP server OTHER than SIP traffic goes out over WAN 2, like it should. We did have SIP traffic go out over WAN 2 previously, but after a few days it switched to another WAN connection, just like the behavior described above.

I also tried changing the policy to enforced, and selected WAN 2, and traffic is still going out over WAN 1.

NAT mappings also appear to be ignored for SIP traffic with SIP ALG turned on. Originally we wanted SIP traffic to go out over specific additional IP’s in our IP blocks on our 3 WAN connections, but regardless of NAT Mappings programmed in, SIP traffic would always go out via the interface IP. We figured that SIP ALG was re-writing packet headers and putting in the interface IP, so we decided to remove the NAT mappings and just allow the SIP traffic to go out via interface IP’s.

This was fine, but later on we suddenly started seeing SIP traffic leaving WAN 2 going out one of the other IP’s in that IP block, instead of the interface IP address. Even worse, this was not the IP address we originally had in the NAT mapping for that connection, which had already been deleted at this point - it was another random IP in that block.

We ended up putting NAT mappings back for SIP traffic to just use the interface IP’s for each WAN connection. That worked for a while, until we noticed the above issue with Outbound WAN priority not being followed, and were advised by our support company to remove the NAT mappings, as they may be interfering with the Outbound connection priorities…

So now we are back at just having outbound traffic priority set for SIP traffic, and it not being followed. No NAT mappings.

Is anyone else using SIP ALG and seeing this same issue? Regardless of whether the Peplink is rewriting SIP headers, if I tell it to route SIP traffic out over WAN 2 and WAN 2 is up, traffic should go out over WAN 2.

Don_Ferrario · March 25, 2016, 9:46am

We run a similar setup with SIP and multiple WAN sources, and have none of these problems. You mentioned that you tried the Priority and Enforced algorithms, but did not say how you are identifying the traffic in the rule. You should be able to specify ports 5060-5064 UDP and 10,000-20,000 UDP on Priority to a specific WAN (that takes two outbound rules). While that should work, I also add a route sending all outbound traffic to your VoIP provider’s IP (or IP range) through that same WAN. If your traffic is not going out as intended, you haven’t designed the route correctly.
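As a rough illustration of the two port-based rules described above, here is a minimal Python sketch of first-match rule evaluation. It is not Peplink's implementation: the `OutboundRule` class, the rule names, and the `first_match` helper are all made up for illustration, and the WAN order is simply the one from the original post.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class OutboundRule:
    """Hypothetical model of one port-based outbound rule."""
    name: str
    protocol: str                  # "udp" or "tcp"
    port_low: int
    port_high: int
    wan_priority: Tuple[str, ...]  # highest priority first

    def matches(self, protocol: str, dst_port: int) -> bool:
        return (protocol == self.protocol
                and self.port_low <= dst_port <= self.port_high)


# Two rules, as suggested: one for SIP signalling, one for RTP media.
RULES = (
    OutboundRule("sip-signalling", "udp", 5060, 5064, ("WAN 2", "WAN 1", "WAN 3")),
    OutboundRule("rtp-media", "udp", 10000, 20000, ("WAN 2", "WAN 1", "WAN 3")),
)


def first_match(protocol: str, dst_port: int,
                rules: Sequence[OutboundRule] = RULES) -> Optional[OutboundRule]:
    """Return the first rule that matches, or None if no rule applies."""
    for rule in rules:
        if rule.matches(protocol, dst_port):
            return rule
    return None
```

Anything that returns `None` here would fall through to whatever default rule sits at the bottom of the list.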

NAT mapping won’t help you if the traffic goes out through the wrong WAN. NAT mapping only selects which IP on the WAN, not which WAN. Unlikely you’ll need NAT mapping for VoIP.

How are you authenticating with your VoIP provider for outbound traffic? If you are using IP only authentication, then you must stay with a specific IP (unless your provider allows you to specify multiple source IPs). If you are using username/password authentication, theoretically you should have no trouble with load balancing switching WANs.

Inbound traffic is a different question. Again if you are using IP authentication for inbound, then your VoIP provider is controlling which of your IPs gets the traffic. The router can’t control the inbound source. If you are using username/password for inbound authentication, then the connection starts on your end so yes the router would control it.

At least during testing, when you create the outbound rule, be sure to check “terminate session on link recovery.” If you don’t, the rule won’t change the traffic of a connection that is already open so nothing will change on the Active Sessions page until you break the connection for some other reason. Don’t forget to push the refresh button on the Active Sessions page if you are checking this activity.

There are a lot of possible mistakes to make. I’ve made all of those errors and more. I fought with VoIP problems for years until switching to Peplink routers. You’ve got the right tool.

ngefvert · March 27, 2016, 10:44pm

We are using an outbound policy with the source being the IP of the SIP server - it just keeps everything simple. Nothing else set, so all traffic from that server should go out over the specified WAN connections - which it does - with the exception of 5060 SIP and 10000-20000 RTP - all of that goes out whatever random WAN connection the Peplink picks at the given time. We have also tried setting the source to any and the destination to the IP’s of our VOIP providers, with the thought that perhaps SIP ALG rewriting the packet headers was somehow changing what the Peplink saw as the ‘source IP’ to the WAN IP, and it was not matching on the internal IP in the rule - still the same result though. Here is the original rule:

Correct, NAT mapping alone does not do that, but if you have blocks of IP’s from each provider and wish to use specific IP’s within those blocks for each WAN connection, you do a combination of an outbound route and NAT mapping. We did this for years with Sonicwalls. For example, if you have WAN1, WAN2, WAN3, and the following IP’s on each:

WAN1:
1.1.1.1
1.1.1.2
1.1.1.3

WAN2:
2.2.2.1
2.2.2.2
2.2.2.3

WAN3:
3.3.3.1
3.3.3.2
3.3.3.3

So say I want SMTP traffic to be able to utilize the WAN connections in numeric order, and on each one I want the traffic to go out as the .3 address for each. I set up an outbound policy to utilize those WAN connections in the desired order, then I use NAT mappings to specify which IP to use for that traffic on each WAN connection depending on which one is in use. This works as it should for all other traffic, including the example of SMTP. Where it does not work is for SIP 5060 and RTP 10000-20000.
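To make the combination concrete, here is a small Python sketch of the SMTP example using the addresses listed above. It only models the intent (pick the WAN by priority, then pick the source IP by NAT mapping); the function names are hypothetical, and treating the first address in each block as the interface IP is an assumption made purely for the example.

```python
from typing import Dict, List, Optional, Tuple

# Address blocks from the example above. Treating the first entry as the
# interface IP is an assumption for this sketch.
WAN_IPS: Dict[str, List[str]] = {
    "WAN1": ["1.1.1.1", "1.1.1.2", "1.1.1.3"],
    "WAN2": ["2.2.2.1", "2.2.2.2", "2.2.2.3"],
    "WAN3": ["3.3.3.1", "3.3.3.2", "3.3.3.3"],
}

SMTP_PRIORITY = ["WAN1", "WAN2", "WAN3"]                        # outbound policy
SMTP_NAT_MAPPING = {"WAN1": "1.1.1.3", "WAN2": "2.2.2.3", "WAN3": "3.3.3.3"}


def pick_wan(priority: List[str], wan_up: Dict[str, bool]) -> Optional[str]:
    """Outbound policy step: first healthy WAN in priority order."""
    for wan in priority:
        if wan_up.get(wan, False):
            return wan
    return None


def egress(priority: List[str], nat_mapping: Dict[str, str],
           wan_up: Dict[str, bool]) -> Optional[Tuple[str, str]]:
    """Return (WAN, source IP), or None if every WAN is down."""
    wan = pick_wan(priority, wan_up)
    if wan is None:
        return None
    # NAT mapping step: use the mapped IP, else fall back to the interface IP.
    return wan, nat_mapping.get(wan, WAN_IPS[wan][0])
```

With all three links up this yields `("WAN1", "1.1.1.3")`; with WAN1 down it yields `("WAN2", "2.2.2.3")`. That is the behavior we see for SMTP and expected to see for SIP and RTP as well.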

With your successful setup, are you using SIP ALG? That is the only thing I can think would be causing problems like this. Unfortunately we need that turned on - Peplink’s successful implementation of SIP ALG versus the broken version of it on the Sonicwalls is specifically why we moved to them. Also, what firmware are you on? We are on 6.3.1 build 2256. When we deployed a few weeks ago it was an RC, but I think there may be newer RC’s out at this point.

We are using IP based authentication for outbound traffic, no registration, and one of the providers only allows outbound calls from IP’s that are destinations for inbound calls - hence it being desirable to set the outbound IP for SIP traffic.

Inbound is no issue, that is fine because like you said, the provider chooses the IP to send the call to (in the priority we provide).

I will try that - I thought holding the sessions open was a problem, so in troubleshooting I had disabled the other 2 connections altogether to force traffic over what should be the primary WAN for that traffic, but after bringing them back up, the SIP traffic would start going back out over one of the other connections that had been disconnected. What we want to be the primary connection is the most reliable, and is never down, whereas the other 2 may go down for a few seconds once every 2 weeks or so - so if anything that SIP traffic should be stuck going out what we want to be the primary connection for that traffic.

TK_Liew · March 28, 2016, 12:07pm

Hi ngefvert.

Enabling or disabling SIP ALG will not override the routes you defined in Outbound Policy.

Can you share how you noticed SIP traffic was going out over WAN1 and WAN3? Is it possible to provide a screenshot of this?
Is it possible to share a screenshot of your Outbound Policy (Network > Outbound Policy > Rules)?
I do agree with Don. Please enable Terminate session on link recovery to see if it makes any difference.

Thank you.

ngefvert · March 28, 2016, 11:01pm

These are logs from our SIP providers. Our CDR reports show the source IP for every outbound call we make. For example, I just checked one provider and of 81 calls out in the last 24 hours, all of them went out over WAN 1, which is the secondary priority for that outbound policy. WAN 2, which is the highest priority for that policy, has been up that whole time. I can provide a screenshot but because we present customer ANI’s and the destination numbers are confidential, I’d have to scrub a lot of it.

So for example, I just made a test call out via Flowroute. That call went to sip.flowroute.com, and so should go out via NETCARRIER, which is set to enforced for that policy (so terminate session on link recovery does not come into play). Their logs show the call coming from a COMFIBER IP. Even if for some reason the Peplink did not match that traffic as going to sip.flowroute.com, then since the traffic was originating from the .219 address in the rule below, it should have gone out over NETCARRIER anyways.

For the 81 calls in the last 24 hours I mentioned above, they would have gone out over a different SIP provider, but originated from the .219 address, so that rule should have routed that over NETCARRIER as well, but it went out over COMFIBER.

The 3 Adtrans see the same behavior where they would route over the VZFIOS instead of their primary of COMFIBER, but the mailservers and outbound test machine (first rule) work properly.

I just set this to ON for the FreePBXtoWorld rule, and the SIP traffic appears to be routing properly - but that stirs up some more questions:

‘Terminate session on link recovery’ is not an option for the ‘Enforced’ rule for the Flowroute traffic, but enabling it for the rule below it also affected that. So even though that enforced rule was above the other one, it wasn’t being obeyed because of an established session matching that lower rule?
We have never used this function before (Sonicwall has a similar feature for outbound load balancing) to try to prevent the following case: Primary WAN goes down and so outbound traffic fails to the secondary one. Call is established during that time, and then the primary WAN comes back up, so now that call is interrupted as the Peplink starts routing traffic out over the primary WAN again, and the carrier sees RTP packets coming from an IP that didn’t establish that call. Or will there be another SIP handshake to say ‘now my RTP is coming from this new IP’?
I had thought that it was a problem with the established sessions in the Peplink not expiring and keeping sessions going over the lower priority WAN’s, so one of the things I had tried was disabling the other WAN connections for 10 minutes. Traffic routed over NETCARRIER properly during that time, but then as soon as I enabled them again SIP traffic would resume going out over them… That was with this identical setup, so I have no idea why this was occurring.
Is it the low qualify frequency on our trunks keeping the sessions open? I guess if I set the qualify frequency to at least 2 times the UDP session timeout in the Peplink (what is that by the way) then during low traffic periods overnight the sessions would terminate, and I may not need to worry about the ‘Terminate session on link recovery’ option, correct?
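The reasoning behind that last question can be sketched in a few lines of Python. This is a simplified model of an idle-timeout session table, not the Peplink's actual behavior, and the 120-second timeout used in the usage note is a placeholder since the real value is exactly what I am asking about. The function name is made up for the example.

```python
from typing import List


def idle_expirations(qualify_interval_s: int, udp_timeout_s: int,
                     duration_s: int) -> List[int]:
    """Times at which the UDP session entry would expire on an idle trunk.

    Assumes the only traffic is a qualify (keepalive) packet every
    qualify_interval_s seconds, and that an entry expires once no packet
    has been seen for more than udp_timeout_s seconds.
    """
    expirations = []
    last_packet = 0
    t = qualify_interval_s
    while t <= duration_s:
        if t - last_packet > udp_timeout_s:
            # The entry timed out before this keepalive arrived, so this
            # packet would start a fresh session and be matched against
            # the outbound policy again.
            expirations.append(last_packet + udp_timeout_s)
        last_packet = t
        t += qualify_interval_s
    return expirations
```

With a hypothetical 120 second timeout, `idle_expirations(60, 120, 600)` returns an empty list (the session never expires), while `idle_expirations(300, 120, 600)` returns `[120, 420]`, meaning every keepalive would open a new session.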

Thanks for all the help so far…

TK_Liew · March 29, 2016, 2:49pm

Hi ngefvert,

Was the FreePBXtoWorld rule created recently for troubleshooting purposes, or has it been there all along?

I suspect the problem is a mixture of the 2 reasons below. I assume the FreePBXtoWorld rule was there previously.

sip.flowroute.com was not resolved correctly

How does the system know the IP of a domain defined in an Outbound Policy? It is based on the periodic DNS queries from LAN clients.
We keep the resolved domain and IP in our database for a certain timeframe.
Questions:
Will 192.168.1.219 communicate with sip.flowroute.com periodically?
Will only 192.168.1.219 communicate with sip.flowroute.com from the LAN?
What is the DNS server for 192.168.1.219?
Was sip.flowroute.com cached by the DNS server?

The SIP session does not fall back to WAN2 after WAN2 comes back up

With the default settings of the Priority algorithm, SIP traffic will fail over to the secondary WAN if the primary WAN goes down. The traffic will remain on the secondary WAN even after the primary WAN comes back up.
With Terminate session on link recovery enabled, the system will force the SIP traffic back to WAN2 when it comes back up.
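The second reason can be illustrated with a short Python sketch. It is a simplified model of the behavior described above, not the actual firmware logic, and the function names are made up for the example.

```python
from typing import Dict, List, Optional


def best_wan(priority: List[str], wan_up: Dict[str, bool]) -> Optional[str]:
    """First healthy WAN in priority order, or None if all are down."""
    for wan in priority:
        if wan_up[wan]:
            return wan
    return None


def session_wan(current: Optional[str], priority: List[str],
                wan_up: Dict[str, bool],
                terminate_on_recovery: bool) -> Optional[str]:
    """WAN that an existing session uses after a link state change."""
    if current is None or not wan_up[current]:
        # New session, or the session's link went down: pick by priority.
        return best_wan(priority, wan_up)
    if terminate_on_recovery:
        # The session is terminated and re-established on the best WAN.
        return best_wan(priority, wan_up)
    # Default: the established session stays where it is.
    return current
```

Walking through an outage with priority WAN 2, WAN 1, WAN 3: when WAN 2 goes down the session moves to WAN 1. When WAN 2 recovers, the session stays on WAN 1 with the option off, and returns to WAN 2 with the option on.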

Suggestions:

1. Please use the IP instead of sip.flowroute.com.
2. Enable Terminate session on link recovery if you are using the Priority algorithm.
3. If the problem persists after you follow suggestions 1 and 2 above, please open a ticket for us to do further checking.

Thank you.

ngefvert · March 29, 2016, 11:36pm

Yes, the FreePBXtoWorld rule has always been there - the DNS based matching for Flowroute was added for troubleshooting purposes because I wanted to see if using an enforced rule would force the traffic out the correct WAN.

.219’s DNS servers are a primary and secondary domain controller on the LAN. .219 is resolving the name properly for sip.flowroute.com because the call is getting there. Only .219 will communicate with Flowroute - so I would have put the source for that policy as .219, but since that was not working for the below rule I left it wide open, so ANY traffic out of the Peplink to Flowroute would match the rule. I wanted to rule out the policies not matching the source properly. We are qualifying all of our trunks at the default Asterisk qualify frequency of 60 seconds (which is why the sessions would be kept open without terminate session on link recovery in an outage). The WAN connections in the Peplink each have their own DNS servers for use in rules but I am not sure where I would see that it resolved the address correctly. I have never used DNS based rules in a firewall before - I am not sure if the Peplink looks up that IP when the rule is created then caches it, how often it re-resolves it, etc. - I was just trying it for troubleshooting since I was not sure how often Flowroute changes their host IP’s. The help tip in the Peplink says nothing about how it resolves that. I have removed that DNS-based rule since then, since I have zeroed in on the problem being the terminate session on link recovery option.

I understand what Terminate session on link recovery does, and also that with it off, an established session will continue to go out the WAN connection it is on.

What I do not understand is why disabling the secondary WAN the sessions are established on does not force them back over to the preferred WAN in the outbound policy.

What we would like to happen is as such - traffic goes out over the following WAN’s in this order:

WAN1
WAN2
WAN3

If WAN1 starts bouncing for some reason, traffic fails over to WAN2 (WAN1 is the most reliable connection but hey - internet issues happen). Because I have ‘Terminate session on link recovery’ disabled, the traffic stays on WAN2 indefinitely, since the qualify frequency for the trunks is keeping those sessions open on WAN2. Since I have alerting for WAN failures out of the Peplink, I know this has occurred. I can monitor WAN1, and when it no longer has issues, I can - at a low volume time of my choosing - disable WAN2, forcing traffic back to WAN1. Unfortunately, this does not work properly.

What actually happens with ‘Terminate session on link recovery’ DISABLED and WAN1 experiences outages:

WAN1 fails, traffic fails over to WAN2. WAN1 comes back up, but the sessions stay open on WAN2. I disable WAN2, and traffic fails over to WAN3. I disable WAN3, and traffic finally fails over to WAN1. I re-enable WAN2 and WAN3, and traffic goes back to either WAN2 or WAN3. This is reproducible and does not seem to be the appropriate behavior. I am leaving them disabled for 10 minutes, which is hopefully longer than the UDP session timeout in the Peplink.

What happens with ‘Terminate session on link recovery’ ENABLED - and WAN1 experiences outages:

WAN1 fails, traffic fails over to WAN2. WAN1 comes back up, WAN2 sessions are terminated and go back to WAN1. This whole cycle may repeat 10 times, and if calls are trying to process during that time, that is a lot of extra disruption to calls from the flopping back and forth.

The preferred method is to have the failover occur once, then manually push back over to the primary WAN once the dust settles - this does not work properly though.

How is this accomplished?

TK_Liew · March 31, 2016, 4:54pm

Hi,

The problem you are facing is strange. We can’t replicate this in our lab. Please open a ticket so we can investigate.

Thank you.

Don_Ferrario · April 3, 2016, 1:28pm

I think you are onto this already (some of the above is confusing) but I would not use the domain name of the destination to identify the outbound rule. Yes it should work but if your DNS server burps, your rule is broken. That could be the mysterious source of why your rule works sometimes but not always. Just ping the VoIP provider and see what IP replies, and use that IP for a priority route.
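To show why a domain-based rule can match only some of the time, here is a rough Python sketch based on TK_Liew's description earlier in the thread (the router learns domain-to-IP pairs from LAN clients' DNS queries and keeps them for a certain timeframe). The class, the 300-second retention, and the 203.0.113.10 address are all placeholders; I don't know how the firmware actually stores this.

```python
from typing import Dict, Optional, Tuple


class LearnedDomainCache:
    """Hypothetical model of a router learning IPs from observed DNS answers."""

    def __init__(self, retention_s: int) -> None:
        self.retention_s = retention_s
        self._entries: Dict[str, Tuple[str, int]] = {}  # ip -> (domain, learned_at)

    def observe_dns_answer(self, domain: str, ip: str, now: int) -> None:
        """Record a DNS answer seen passing through from a LAN client."""
        self._entries[ip] = (domain, now)

    def domain_for(self, ip: str, now: int) -> Optional[str]:
        entry = self._entries.get(ip)
        if entry is None:
            return None
        domain, learned_at = entry
        if now - learned_at > self.retention_s:
            return None  # entry aged out; the IP is no longer tied to the domain
        return domain


def domain_rule_matches(cache: LearnedDomainCache, rule_domain: str,
                        dst_ip: str, now: int) -> bool:
    """A domain rule can only match a packet whose destination IP is known."""
    return cache.domain_for(dst_ip, now) == rule_domain
```

In this model the rule misses whenever the lookup was never seen by the router, the entry has aged out, or the provider answered with a different IP. A rule keyed on a numeric IP has none of those failure modes.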

Attached are my rules and a screen shot of my outbound sessions. 198.42.231.135 is the LAN of the Asterisk server. Like you I have a rule that says anything from that LAN IP goes out a particular WAN. That’s probably enough on its own. As insurance, I also have rules that anything going to the VoIP provider (Vitelity) goes out that same WAN, and I did that using the VoIP provider’s numeric IP, not their domain name. Notice my rule includes their entire /24 subnet so if they start using another IP, I’m still covered. It’s not going to hurt me if some other traffic to that subnet gets forced out that WAN. Finally I have everything port 5060-5064 but with the above rules I don’t think that is necessary.
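Since the screenshots may not come through, here is roughly what those three layers of rules amount to, written as a Python sketch with the standard `ipaddress` module. The 203.0.113.0/24 network is a documentation placeholder and not the provider's real range, and the function name is made up.

```python
import ipaddress

PBX_LAN_IP = ipaddress.ip_address("198.42.231.135")      # Asterisk server
PROVIDER_NET = ipaddress.ip_network("203.0.113.0/24")    # placeholder /24


def goes_out_voip_wan(src_ip: str, dst_ip: str, dst_port: int) -> bool:
    """True if any of the three overlapping rules would catch this packet."""
    src = ipaddress.ip_address(src_ip)
    dst = ipaddress.ip_address(dst_ip)
    from_pbx = src == PBX_LAN_IP             # rule 1: anything from the PBX
    to_provider = dst in PROVIDER_NET        # rule 2: anything to the provider /24
    sip_port = 5060 <= dst_port <= 5064      # rule 3: SIP signalling ports
    return from_pbx or to_provider or sip_port
```

Matching on the whole /24 means a new provider IP inside that range is still caught without touching the rule.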

We have another VoIP provider set up as backup, and in the sessions image you can see the connection to them is also going out the correct WAN.

The Terminate Sessions button is vital. Yes that means if your primary WAN drops, calls move to backup, when primary WAN restores any calls in progress will break. The alternative is that you won’t switch back to primary until there are no calls in progress, which for most businesses means you’ll spend the rest of the day on backup. Not good.

The rules screen shot shows the Apply Changes button lit up because I dragged those rules to the top for you. My outbound rule page is pretty long.