ISP DNS Redirect to Cache Makes Google Sites Inaccessible

One customer site recently started to suffer from a strange issue. They have two ISP connections (ISP-1 and ISP-2). When both ISP connections are operational, Google sites (www.google.com, translate.google.com, youtube.com, etc.) become unavailable. However, the moment we disconnect either of the two ISPs, those sites are accessible again.

We investigated the issue and engaged the ISPs' technical support, and found that it is related to their caching solution. Each ISP hosts a Google Cache setup. The ISP detects the source IP of DNS requests for any Google service and responds with IP addresses that direct that client towards the best cache server for its source IP. Moreover, each ISP accepts connections to its cache setup only from its own customers; it denies access once it detects that the source IP address belongs to a foreign provider.

This creates the problem: a client's DNS request for www.google.com may go out over ISP-1 (and hence receive the cache address suitable for ISP-1 clients), while the actual connection to the IP address returned by that DNS request might be routed over ISP-2. The connection then arrives at ISP-1's caching service with a source address in ISP-2's range, and is blocked.

As a temporary workaround, I resolved www.google.com from both ISPs and created rules to force connections to those IPs over the matching ISP. However, the problem is broader: we later found that some images within Facebook personal pages do not get displayed (apparently these are also cached by the ISPs). We asked each ISP to bypass the cache for our traffic, but that request was denied. So I am raising this issue here to get help or any ideas to overcome it.
The way out of this, as I see it, would be for the Peplink to track DNS replies arriving over each WAN connection (perhaps in an in-memory table) and dynamically route outbound connection requests over the WAN that delivered the DNS reply containing the connection's destination address (perhaps as an eighth routing algorithm).
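The proposed table could look something like the sketch below. This is purely illustrative: Peplink firmware exposes no such hook, and the IP addresses are documentation-range placeholders.

```python
class DnsAwareRouter:
    """Track which WAN delivered each DNS answer, then route outbound
    connections over the WAN that supplied the destination address."""

    def __init__(self, default_wan):
        self.default_wan = default_wan
        self.ip_to_wan = {}  # resolved IP -> WAN that carried the DNS reply

    def record_dns_reply(self, wan, resolved_ips):
        # Called whenever a DNS response is observed arriving over a WAN.
        for ip in resolved_ips:
            self.ip_to_wan[ip] = wan

    def route(self, dst_ip):
        # Use the WAN whose DNS reply produced dst_ip; otherwise fall back.
        return self.ip_to_wan.get(dst_ip, self.default_wan)


router = DnsAwareRouter(default_wan="WAN1")
# A DNS reply for www.google.com arrived over WAN2 (ISP-2's cache address):
router.record_dns_reply("WAN2", ["203.0.113.10"])
```

With this in place, a connection to 203.0.113.10 would be forced out WAN2, so ISP-2's cache never sees a foreign source address.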

Hello,

For your unique situation, I think a Weighted Balance rule with the Persistence option would solve the issue. It will still distribute traffic evenly across both WANs while ensuring that each data session stays on the WAN it was initiated on. Regarding the images not always loading, I would ensure “auto” is selected for MTU size in the WAN settings and re-test.

I did not get how the persistence rule will help, since I have to make sure that the client issuing the DNS request uses the same ISP that will later carry the real connection traffic. A further challenge is that all computers are configured with a local DNS server, which resolves internet names to IP addresses on their behalf.

Hi,

  1. If you configure 8.8.8.8 and 8.8.4.4 as the DNS servers on WAN1 and WAN2, can users browse Google domains? (Let's assume DNS resolution and browsing happen over the same WAN.)

  2. May I know whether the Local DNS Proxy (Network > Network Settings > DNS Proxy Settings) is needed in your environment?

Even when we configure Google's DNS servers (8.8.8.8, 8.8.4.4) on both WANs, it looks like each ISP still resolves the names to its own cache.
We have enabled the DNS proxy under the LAN configuration.

Hi,

It looks like the ISPs intercept DNS queries, based on your feedback. I have a suggestion below; do let me know whether it makes sense.

  1. Disable Local DNS Proxy
  • Network > Network Settings > DNS Proxy Settings > Uncheck Enable.
  2. Assign public DNS servers to DHCP clients
  • Network > Network Settings > DHCP Server > DNS Servers > Uncheck Assign DNS server automatically > Enter public DNS for DNS Server 1 & 2
  3. Add a Persistence Outbound Rule (Network > Outbound Policy > Rules)
  • Source = Any
  • Destination = Any
  • Protocol = Any
  • Algorithm = Persistence
  • Persistence Mode = By Source
  • Load Distribution = Auto

With the settings above, all internet traffic (DNS queries and browsing) from a given client will go through the same WAN link. For example, all internet traffic from client A will stick to ISP A, and all internet traffic from client B will stick to ISP B.
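The by-source persistence behavior can be sketched as a deterministic mapping from client IP to WAN. The hashing scheme below is an assumption for illustration only (Peplink's actual persistence algorithm is not documented here); the point is that the DNS query and the later connections from the same client always leave over the same ISP.

```python
import hashlib

WANS = ["WAN1", "WAN2"]

def wan_for_client(source_ip, wans=WANS):
    # Hash the client's source IP so every flow from that client (the DNS
    # query and the subsequent TCP connections) deterministically uses the
    # same ISP, and therefore always reaches a cache address it may use.
    digest = hashlib.md5(source_ip.encode()).digest()
    return wans[digest[0] % len(wans)]
```

Because the mapping depends only on the source IP, a client can never receive a cache address from one ISP and then connect to it over the other.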

Hope this helps.

Thank you TK,
Your suggested logic makes sense; I think it will work this way. The only problem is that all internal client PCs point to the Active Directory DNS for local domain hostname resolution. If we replace the AD DNS with a public DNS, the clients will no longer be able to resolve local domain hosts. Moreover, although the ISP told us this is caching, I suspect it is actually traffic interception for monitoring, to comply with the country's cyber security law published two weeks ago. I mention this here so you can be ready for similar cases in this region.

Hi,

I see. We are left with one last option then. You need to know all domains related to google.com, translate.google.com, and youtube.com before using it. FYI, two domains (google.com, gstatic.com) are loaded in the background when you browse www.google.com. You can verify this with Wireshark.

If you know all the related domains, please configure the Outbound Policy below:

  1. Add an Outbound Policy rule for the domains google.com and gstatic.com
  • Source = Any
  • Destination = google.com
  • Protocol = Any
  • Algorithm = Priority
  • Priority Order = WAN1, WAN2
  • Repeat the rule above for the rest of the related domains.
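The per-domain Priority rules above amount to a small lookup table that pins a domain (and its subdomains) to a fixed WAN order. A minimal sketch of that matching logic, with the rule table as an illustrative assumption:

```python
# Illustrative rule table mirroring the Priority rules described above:
# each entry pins a domain and its subdomains to an ordered WAN list.
RULES = [
    ("google.com", ["WAN1", "WAN2"]),
    ("gstatic.com", ["WAN1", "WAN2"]),
    ("youtube.com", ["WAN1", "WAN2"]),
]

def wans_for(hostname):
    """Return the priority WAN order for a hostname, or None to fall
    through to the normal load-balancing rules."""
    for domain, wan_order in RULES:
        if hostname == domain or hostname.endswith("." + domain):
            return wan_order
    return None
```

Note the suffix check requires a leading dot, so an unrelated name like fakegoogle.com does not match the google.com rule.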

Hope this helps.

@Mohamed_Sabbah did you find a solution to this problem? We seem to be facing a similar issue.

In our case, we:

  • disabled the DNS Proxy and DNS Cache in the Peplink
  • used OpenDNS servers for the entire network and all 4 WANs (each of which is a different ISP)
  • created persistence rules for all key services (http, https, imap, etc) based on Source

However, our problem remains that each computer retains its own DNS cache, and over time some website access becomes very slow; in some cases pages do not load at all. As soon as we flush the DNS cache on the client computer, reliability and speed return. We believe a DNS answer from a prior session on a different WAN/ISP connection was cached on the computer, and the client reuses that “stale” entry for newer sessions on other WAN/ISP connections, resulting in irregular performance and reliability.

While we could prevent all client computers from caching DNS info, that would likely introduce its own bottlenecks. I was wondering if you found a solution for your specific situation.

Thanks.

I'm handling your ticket regarding this issue. Based on the explanation above, can you confirm that disabling DNS caching on the client computers solves the issue? If so, I suspect that certain DNS answers returned via one ISP become invalid for the other ISPs over time. This may be related to ISP-level traffic re-routing, DNS proxying, or Google caching at the ISP level (as mentioned by @Mohamed_Sabbah). Have you checked with your ISPs whether they use the same caching setup that @Mohamed_Sabbah described?

Just curious: does this issue only happen in the UAE because of the extra measures the ISPs apply there? This is rather strange/unique, as we don't see it in other countries.

Have you tried the suggestion by @TK_Liew above? Did it work for your issue?

If we map all google domains (although the problem we face extends to many other domains) to a specific ISP/WAN, then we do not have the problem.

Your description is exactly what we have been trying to communicate since we initially reported this issue. And again, flushing the cache on the local computer immediately makes sites that were inaccessible or especially slow work correctly again.

For your reference, we are in Kuwait (not the UAE), but I am sure each ISP likely uses its own traffic re-routing.

For now, we have set up a DNS cache timeout of 10 minutes on 3 of our local machines as a test. We will update you tomorrow.
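For reference, on Windows clients one common way to apply such a cap is the DNS Client service's `MaxCacheTtl` registry value (in seconds). The snippet below is an illustration of how a 10-minute cap can be set (assuming Windows clients and an elevated prompt), not a record of the exact steps we used:

```
reg add "HKLM\SYSTEM\CurrentControlSet\Services\Dnscache\Parameters" /v MaxCacheTtl /t REG_DWORD /d 600 /f
net stop dnscache
net start dnscache
```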

Thank you for your assistance.

Do keep us updated :+1::+1::+1:

So far, the users whose machines were set to keep local DNS cache entries for only 10 minutes are reporting noticeably better reliability and performance.