Odd Email Delay Issue Behind a Balance Two

VoIP_Route · December 8, 2022, 6:44pm

I am having the craziest problem and I am completely stumped on what the actual root cause is and therefore how to fix it. Here is the long history and explanation of the issue:

I took over a site that was a flat network with an old SonicWall, an old Netgear managed switch, 5 old Aruba APs and some SNAP managed switches for AV items. The location has 2 Hotwire 1 GIG fiber circuits. HW1 wasn’t being used and the other, HW2, was used for all the internet service. The users mostly have newer Apple iPads and iPhones, with a few Dell laptops and a Dell desktop; the rest is all IoT devices like Crestron, SONOS, TVs, thermostats, etc.

I designed a simple set of equipment upgrades, keeping the flat network. I installed a Peplink Balance 2, a few Netgear managed PoE switches and 5 Ubiquiti U6-LR APs. This was all installed and completed while the owners were away for 2 months. I put all the new equipment on the previously unused HW1 1 GIG fiber circuit. Before I moved it over I tested that it was running over 900 Mbps with no issues. I left the SonicWall and the old network fully connected, with a single Dell desktop on that separate network on the HW2 service. Of course I tested everything and it all worked just as I expected. Speeds were fast both hardwired and wireless, with no CRC errors, no packet loss and no DNS issues; every system was working fine.

Here’s where the problem starts:

The owners came back and noticed that their iPads and iPhones had terrible delays opening email. They use hosted Exchange through Intermedia. I couldn’t actually replicate the issue on my own devices but I 100% confirmed the issue on theirs. They would try to open an email and it would be blank for as much as 10 or more seconds, and this wasn’t one email, it was most emails. I did also notice that the Dell desktop had no issue when wired but had the same problem when on WiFi. Also, when they switched their devices from WiFi to cellular they had NO delays at all.

With that being the case I thought it was the WiFi, so I started working on heat mapping channels and frequencies to make some adjustments. It didn’t help, so I started working with Ubiquiti and they stated the following:

We reviewed the logs and found that the client devices are facing tcp_latency.

anomalies=ip_timeout dhcp=r
anomalies=tcp_latency

tcp latency occurs when some clients over consumes the internet bandwidth provided by your ISP. To overcome this situation we recommend enabling QoS features on your router.

The problem I had with their suggestion is the fact that the network is loafing and the users do not come even close to pushing this 1 GIG pipe. Ubiquiti then suggested I replace the switch and/or router. With all that said I tried the following:

Made some QOS changes and tried some Bufferbloat settings - NO CHANGE
Replaced the switches - NO CHANGE
Turned off all but 1 AP’s to reduce possible RF issues - NO CHANGE
Swapped out a Ubiquiti AP for a new WiFi 6 Grandstream AP - NO CHANGE
Put back the original Aruba and Netgear switch behind the new Balance 2 - NO CHANGE
PUT THE ARUBA AP’s BEHIND THE ORIGINAL NETGEAR SWITCH AND SONICWALL (which is also on the other hotwire fiber circuit) - FIXED THE PROBLEM

I can’t fathom how this could be a router issue but I am going there again on Monday to try another router and also to test and swap the Hotwire fiber circuits HW1 & HW2. I really don’t know what else to do but I will take ANY suggestions or advice I can get on this. And if you read all of this, THANKS

jmjones · December 9, 2022, 2:05am

Have you done any testing with dns settings? A slow dns server can cause what you are seeing. It can be especially bad when there are a lot of cnames and aliases - most often used by cloud providers.

VoIP_Route · December 9, 2022, 3:47am

@jmjones thanks VERY much for responding. I usually set my routers with either OpenDNS or Cloudflare and set the router as the DNS server with caching.

I never really ran any DNS timing tests but I can easily change that remotely and see if it makes a difference before Monday.

The interesting thing about your suggestion is I know for sure that the SonicWall on HW2, where everything works, is using the Hotwire default DNS servers.
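
A DNS timing test like the one discussed here can be roughed out in a few lines of Python. This is only a minimal sketch, not a replacement for a proper benchmark: it times lookups through whatever resolver the client was handed (on this network that would be the router), and the function names `time_lookups` and `summarize` are made up for illustration. Repeated lookups may be answered from a cache, so the first result is usually the interesting one.

```python
import socket
import statistics
import time


def time_lookups(hostname, attempts=5, resolver=socket.getaddrinfo):
    """Return per-lookup durations in milliseconds (None for a failed lookup).

    `resolver` defaults to the system resolver; it is a parameter only so the
    function can be exercised without a network.
    """
    durations_ms = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            resolver(hostname, 443)
        except socket.gaierror:
            # Keep failures visible instead of silently dropping them.
            durations_ms.append(None)
            continue
        durations_ms.append((time.perf_counter() - start) * 1000.0)
    return durations_ms


def summarize(durations_ms):
    """Reduce a list of timings to counts plus median and worst case."""
    ok = [d for d in durations_ms if d is not None]
    return {
        "lookups": len(durations_ms),
        "failures": len(durations_ms) - len(ok),
        "median_ms": statistics.median(ok) if ok else None,
        "max_ms": max(ok) if ok else None,
    }
```

Calling `summarize(time_lookups("mail.example.com"))` (a placeholder name; substitute the real mail host) once on WiFi and once wired, on both circuits, would give numbers to compare.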

Rick-DC · December 9, 2022, 11:43am

If DNS slowness is a potential issue I’d recommend testing with a tool such as this one. Having said that, I can’t imagine that the ones you’ve chosen are an issue. (Glad to see you are not using your carrier’s DNS.) Are these inquiries being made via TCP as the error messages suggest? (Sorry, I am not familiar with the Ubiquiti “language.”)

Good testing so far. I’m wondering:

If you might have done any packet captures.
Are you certain you don’t have a spanning tree-type error in there somewhere, perhaps with the newly-added switches? (If so, the slowness to resolve may not be noticed via the IOT/M2M devices.)
If the “problem clients” remain connected does the delay in mail fetch occur each and every time?
Is DHCP configured correctly as seen from the clients which are problems?
What do ping times to the mail host look like via ethernet and wi-fi? (I’d test by unresolved name as well as IP address.)
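
The name-versus-IP comparison in the last question can be approximated with a short Python sketch. One assumption to flag: it times a TCP handshake to port 443 rather than sending ICMP pings, because raw ICMP normally needs elevated privileges. The port and the helper names `tcp_connect_ms` and `compare_name_vs_ip` are illustrative choices, not anything taken from the thread.

```python
import socket
import time


def tcp_connect_ms(host, port=443, timeout=5.0, connect=socket.create_connection):
    """Time one TCP handshake to host:port; return milliseconds, or None on failure.

    `connect` is a parameter only so the function can be exercised offline.
    """
    start = time.perf_counter()
    try:
        conn = connect((host, port), timeout)
    except OSError:
        # Covers refused connections, timeouts and failed name lookups.
        return None
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    conn.close()
    return elapsed_ms


def compare_name_vs_ip(name, ip, port=443, attempts=5, connect=socket.create_connection):
    """Collect handshake timings by hostname and by literal IP address."""
    by_name = [tcp_connect_ms(name, port, connect=connect) for _ in range(attempts)]
    by_ip = [tcp_connect_ms(ip, port, connect=connect) for _ in range(attempts)]
    return {"by_name_ms": by_name, "by_ip_ms": by_ip}
```

If the by-name timings are consistently much larger than the by-IP ones, name resolution is the likely suspect; if both are slow on WiFi but fast on Ethernet, the delay is somewhere else in the path.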

(Side note: I wish you were “local” so I could hand you a Peplink AP to try. A beautiful feature of these is that they are controlled by the router.)

VoIP_Route · December 9, 2022, 1:35pm

@Rick-DC Thanks for the input on this.

I used to use the GRC DNS Benchmark but pretty much stopped after I started using OpenDNS and Cloudflare DNS. I will use it on Monday when I am there testing both the Hotwire circuits HW1 & HW2.

Are these inquiries being made via TCP as the error messages suggest?
Interesting question, and I never thought to ask Ubiquiti, but I don’t see how DHCP could be using TCP. Also they are pretty much just flat out horrible to work with for support: for each question I ask it takes a minimum of 2 to 5 days to get a response, and usually the responses are vague and not very helpful. So after 30 days of working with them I have little to show for it.

If you might have done any packet captures.
I know I should have done this already but ignorantly I haven’t. I will do some on Monday

Are you certain you don’t have a spanning tree-type error
I don’t see any errors at all in the Netgear switches but I can’t see anything in the SNAP AV switches. The SNAP AV switches have always been in the network and when I removed the new Netgear switches for testing I put the original switch back so everything was the same except for the Peplink and the HW1 internet service… yet the problem persisted.

If the “problem clients” remain connected does the delay in mail fetch occur each and every time?
This answer is a bit tricky. The owner that has this issue is VERY IMPATIENT and almost impossible to troubleshoot with. He gives me a few minutes to see his problem, doesn’t let me touch his ipad or iphone and then pretty much just says go fix it. He does state that it is a constant problem and he has been using his cell service and not wifi in order to function.
When I return on Monday I am obviously going to be asking him to confirm the problem as I make changes, but I am also going to be testing using the Dell, which did seem to replicate the issue while on WiFi. I also set up a mailbox using the same Intermedia Exchange server on a clean, perfectly running Windows 10 laptop and I will test with that too as I connect to each of the APs.

Is DHCP configured correctly as seen from the clients which are problems?
I never even thought to check that! What I have done is try both setting a reservation for the two MAC’s and not setting it. (Every known device is on a reservation in defined ranges in the DHCP scope. Only unknown devices get an IP that is general and not defined) I see the IP’s in the router but never checked that on the clients. I will check it when I am there Monday

What do ping times to the mail host look like via ethernet and wi-fi?
Another good point that I didn’t specifically check. I will definitely do it Monday

Funny you mention about the Peplink AP… I almost bought one just to help troubleshoot this darn problem. The only reason I didn’t is because I had a brand new Grandstream AP I got as a demo and figured I could try that to get some answers. Frankly I was shocked when it didn’t resolve the issue because I was so convinced it was a Ubiquiti AP problem.

Rick-DC · December 9, 2022, 2:28pm

Well, this is a fascinating problem, and I know hugely frustrating for you. I’d like to follow this. Please keep us informed. My gut is that this is not a DNS issue as we learn more about this, although @jmjones’s initial guess was good.

Peplink AP’s: The APs really “shouldn’t matter.” The primary reason we use them is because the Balance (and some MAX) routers contain AP controllers and this makes a lot of things easy. One thing we can do with Peplink APs, for example, is invoke RA, remote assistance, and let the “smart guys” take a peek from afar. But having said that, this does not smell to me like an AP issue.

VoIP_Route · December 9, 2022, 3:16pm

@Rick-DC I will definitely update this thread with everything I find on Monday THANKS