Health Check Failing (a tale of two modems)


#1

I have two Spectrum (Time Warner) modems: one is an Arris DG1670 on a business account and the other is a Netgear CM600 on a consumer account.

Health check failures happen on the DG1670 about 3-4 times per week, and never on the CM600.

I’m wondering if there is any way to debug this? My hunch is that the DG1670 is not actually losing connection, but rather there’s just a temporary DNS glitch. Or maybe it’s a bug in the DG1670?

Any tips/ideas/advice?


#2

Please refer to the link below for the explanation and troubleshooting steps.


#3

My HealthCheck is set to use DNS Lookup with the two IPs provided by our ISP. I also have “Include Public DNS servers” enabled. What is odd is that modem 2 has this exact same setup as well (and the same two DNS servers), but modem 2 never fails the health check.

Is there any way to get more detail about the failures, e.g. “ISP DNS failed at 12:01AM but Google DNS succeeded” or similar?
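
One way to get that kind of per-server detail without relying on the router is to probe each resolver from a machine behind the WAN and log the result. Below is a minimal Python sketch of the idea. It is not a Peplink feature: `build_query`, `probe` and `log_round` are hypothetical helper names, the lookup name `example.com` is arbitrary, and the 10-second timeout simply mirrors the health check setting.

```python
import socket
import struct
import time


def build_query(hostname, query_id=0x1234):
    """Build a minimal DNS A-record query packet (RFC 1035) for hostname."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def probe(server_ip, hostname="example.com", timeout=10.0):
    """Return True if server_ip answers the query within timeout seconds."""
    query = build_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server_ip, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable networks
            return False
    # A valid reply echoes the transaction ID from the query.
    return reply[:2] == query[:2]


def log_round(servers, hostname="example.com", timeout=10.0):
    """Probe every server once and return one timestamped line per server."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    return [
        f"{stamp} {name} ({ip}): "
        f"{'ok' if probe(ip, hostname, timeout) else 'FAILED'}"
        for name, ip in servers.items()
    ]


# Usage (addresses are placeholders; substitute the real ISP resolvers):
# servers = {"ISP DNS 1": "192.0.2.1", "Google DNS": "8.8.8.8"}
# while True:
#     print("\n".join(log_round(servers)))
#     time.sleep(10)
```

Note that the probes only exercise a particular WAN if the test machine's traffic is actually routed out through that WAN.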


#4

Another weird behavior: The HealthCheck is set as follows:

  • 10 second timeout
  • 10 second interval
  • 10 retries
  • 1 recovery retry

By my calculations this means that there must be 10 consecutive DNS failures (which takes at least 100 seconds) before the WAN is considered dead.
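
A rough sketch of that arithmetic in Python. How the router actually schedules its checks is not documented in this thread, so the function below (a hypothetical `time_to_mark_down`) models two assumptions: either the interval is measured from the start of one check to the start of the next, or the next check only begins one interval after the previous one timed out.

```python
def time_to_mark_down(timeout_s, interval_s, retries, interval_after_timeout):
    """Seconds until `retries` consecutive checks have all timed out.

    Both scheduling models are assumptions, not documented router behaviour.
    """
    if interval_after_timeout:
        # Next check starts one interval after the previous one timed out.
        gap = timeout_s + interval_s
    else:
        # Checks start every interval, but never before the last one ended.
        gap = max(interval_s, timeout_s)
    # (retries - 1) gaps between check starts, plus the final timeout.
    return (retries - 1) * gap + timeout_s


best = time_to_mark_down(10, 10, 10, interval_after_timeout=False)
worst = time_to_mark_down(10, 10, 10, interval_after_timeout=True)
print(f"WAN marked down after {best} to {worst} seconds of failures")
```

Under either assumption a genuine health check failure needs well over a minute of consecutive timeouts.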

However, every single time this happens, the very next check succeeds, and the WAN comes back online.

Here are the last three examples:

Thu Mar 16 17:31:00 PDT 2017
WAN 1: Disconnected (Link down)
WAN 2: Connected (IP: x.x.x.x) 

Thu Mar 16 17:31:04 PDT 2017
WAN 1: Connected (IP:  x.x.x.x) 
WAN 2: Connected (IP:  x.x.x.x)  

Fri Mar 17 12:59:44 PDT 2017
WAN 1: Disconnected (Link down)
WAN 2: Connected (IP:  x.x.x.x)  

Fri Mar 17 12:59:48 PDT 2017
WAN 1: Connected (IP:  x.x.x.x) 
WAN 2: Connected (IP:  x.x.x.x) 

Sat Mar 18 01:20:50 PDT 2017
WAN 1: Disconnected (Link down)
WAN 2: Connected (IP:  x.x.x.x)

Sat Mar 18 01:20:54 PDT 2017
WAN 1: Connected (IP: x.x.x.x)
WAN 2: Connected (IP: x.x.x.x)

Notice that each time, the reconnection happens exactly 4 seconds later.
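
To double-check the 4-second figure, the timestamps above can be diffed with a few lines of Python. This is only a sketch over the three pairs quoted in this post; the `PDT` token is dropped before parsing because both ends of each pair are in the same time zone.

```python
from datetime import datetime

# (disconnected, reconnected) pairs copied from the log above.
PAIRS = [
    ("Thu Mar 16 17:31:00 PDT 2017", "Thu Mar 16 17:31:04 PDT 2017"),
    ("Fri Mar 17 12:59:44 PDT 2017", "Fri Mar 17 12:59:48 PDT 2017"),
    ("Sat Mar 18 01:20:50 PDT 2017", "Sat Mar 18 01:20:54 PDT 2017"),
]


def parse(stamp):
    """Parse a timestamp like 'Thu Mar 16 17:31:00 PDT 2017'."""
    parts = stamp.split()
    del parts[4]  # drop the time zone name; %Z parsing is platform-dependent
    return datetime.strptime(" ".join(parts), "%a %b %d %H:%M:%S %Y")


outages = [(parse(up) - parse(down)).total_seconds() for down, up in PAIRS]
print(outages)
```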

I find this highly suspicious: what are the odds that each time my WAN is dead, it dies for exactly 100 seconds and is back online at exactly 104 seconds?

This feels more like a bug to me - whether it’s in the Peplink, the modem, or the DNS server, I can’t say.

How can I debug this?


#5

Same issue. I upgraded the firmware to 7.0.0 and now have health check issues with a previously good DSL WAN. I upgraded because I had one Wi-Fi client that would reboot the router whenever it connected. That issue is fixed, but this one particular WAN connection is now useless.


#6

@soylentgreen

WAN health check failure should give you the following error log:
Mar 13 18:12:36 WAN: Maxis-185666 disconnected (WAN failed DNS test)

Based on the given logs, I don’t think this is related to the WAN health check.

Thu Mar 16 17:31:00 PDT 2017
WAN 1: Disconnected (Link down)
WAN 2: Connected (IP: x.x.x.x)

Thu Mar 16 17:31:04 PDT 2017
WAN 1: Connected (IP: x.x.x.x)
WAN 2: Connected (IP: x.x.x.x)

The above logs point more toward a physical connection issue between the router and the modem.

Have you tried the following to isolate the issue?

  1. Set a fixed (static) speed on the router’s WAN port
  2. Set a fixed (static) speed on the modem’s port
  3. Replace the UTP cable between the router and the modem

Thank you


#7

The WAN health check is just a simple program that sends a health packet and waits for a response from the health check target to determine the WAN’s condition. A health check failure always means that health packets were sent but received no response.

If you believe the DSL WAN is good, you can isolate the issue by disabling the WAN health check (assuming the defined health check target is stable). After disabling it, verify the WAN connection from your laptop. If internet access is intermittent, then the WAN itself is intermittent, and that is what causes the health check failures.


Health Check failing after Firmware update
#8

And yet reverting back to 6.2.1 fixed it. I simply reverted and have not lost connection once, with no health check issues. So although I agree that it’s just a simple test, it obviously doesn’t behave in 7.0.0 exactly as it does in firmware 6.2.1; otherwise 6.2.1 would show the same symptoms that I get with 7.0.0.

I suppose I’ll deal with the other disconnect issue that happens in 6.2.1, which is what prompted me to update to 7.0.0. I can live with the router resetting when a remote laptop connects to wifi, since it is only used for about 10 minutes of work once a day, whereas the WAN health issue is constant… Would be nice if this expensive router worked without having to use duct tape and baling wire while standing on one leg with tin foil on my head…


#9

Are you using DHCP to get your WAN address by chance? I can’t explain why the firmware revert fixed it, but if the DHCP lease time is the same as your failure interval, it may be a clue.

Are you using the PPPoE settings in the Peplink or are you using the modem for the authentication of the link?


#10

WAN health check failure should give you the following error log:
Mar 13 18:12:36 WAN: Maxis-185666 disconnected (WAN failed DNS test)
Based on the given logs, I don’t think this is related to the WAN health check.

Aha - indeed when I look at the logs, the message I’m getting is

WAN 1 Disconnected (No cable detected)

So this could be a hardware issue. I will try swapping out the ethernet cable with a new one and see what happens.


#11

In answer to other questions:

This WAN is on a TWC (now Spectrum) business internet account and has 5 static IPs. The Peplink is set up to use the lowest IP address and the other 4 are added as “additional Public IPs”.

Here’s a list of all the disconnections over the past month or so - I’m not sure I can see any pattern there.

Mar 18 01:20:50	WAN: WAN 1 disconnected (No cable detected)
Mar 17 12:59:44	WAN: WAN 1 disconnected (No cable detected)
Mar 16 17:31:01	WAN: WAN 1 disconnected (No cable detected)
Mar 12 08:21:12	WAN: WAN 1 disconnected (No cable detected)
Mar 06 14:34:43	WAN: WAN 1 disconnected (No cable detected)
Mar 06 12:09:43	WAN: WAN 1 disconnected (No cable detected)
Feb 28 22:26:18	WAN: WAN 1 disconnected (No cable detected)
Feb 22 23:36:30	WAN: WAN 1 disconnected (No cable detected)
Feb 20 11:34:34	WAN: WAN 1 disconnected (No cable detected)
Feb 20 03:09:40	WAN: WAN 1 disconnected (No cable detected)
Feb 19 18:22:31	WAN: WAN 1 disconnected (No cable detected)
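
One quick way to look for a pattern is to compute the time between consecutive disconnections; a fixed period (a DHCP lease time, for example) would show up as near-constant gaps. A small Python sketch over the timestamps above, assuming they all fall in 2017 since the router log omits the year:

```python
from datetime import datetime

YEAR = 2017  # assumption: the log lines above do not include the year

DISCONNECTS = """\
Mar 18 01:20:50
Mar 17 12:59:44
Mar 16 17:31:01
Mar 12 08:21:12
Mar 06 14:34:43
Mar 06 12:09:43
Feb 28 22:26:18
Feb 22 23:36:30
Feb 20 11:34:34
Feb 20 03:09:40
Feb 19 18:22:31
"""

# Parse and put the events in chronological order.
events = sorted(
    datetime.strptime(f"{YEAR} {line}", "%Y %b %d %H:%M:%S")
    for line in DISCONNECTS.splitlines()
)

# Time between each disconnection and the next one.
gaps = [later - earlier for earlier, later in zip(events, events[1:])]

for start, gap in zip(events, gaps):
    print(f"{start:%b %d %H:%M:%S}  next drop after {gap}")
print("shortest gap:", min(gaps), "| longest gap:", max(gaps))
```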

#12

Hi Soylent …

In my experience the “no cable detected” means just that – the ethernet path between modem and router has been disrupted (e.g., one pulls the ethernet cable out of router or modem). In each case we’ve tossed the Arris modem in the “test later” bin and replaced it with another and the problem “goes away.” We’ve seen this on a number of Arris [Motorola] modems, and several DSL modems. Incidentally, our experience with the CM600 is good.

I, for one, am quick to blame T-W/Spectrum/whatever they want to call themselves today, but in this case I’d check, in order (1) modem, (2) T-W, (3) ethernet cable, (4) Peplink. Just sayin’ … :smirk:

Rick


#13

My router will also say that when the modem is restarting.

What kind of power source is feeding the modem? Do you have any 1:1 NAT mappings for those additional IP addresses? Not that it should matter though. Did you also have the static IPs set up the same way with the older firmware?


#14

I checked the modem’s Event Logs http://192.168.100.1/cgi-bin/event_cgi and there seems to be no correlation between the No Cable Detected events and anything going on in the modem. The modem hasn’t rebooted, and the only activity I see is shown below; however, the dates & times don’t match up.

2/23/2017 22:53	82000200	3	No Ranging Response received - T3 time-out;
2/25/2017 18:28	68010300	4	DHCP RENEW WARNING - Field invalid in response v4 option;
2/25/2017 18:28	68010600	6	DHCP Renew - lease parameters tftp file-
3/1/2017 6:28	68010300	4	DHCP RENEW WARNING - Field invalid in response v4 option;
3/1/2017 6:28	68010600	6	DHCP Renew - lease parameters tftp file-
3/4/2017 8:42	82000200	3	No Ranging Response received - T3 time-out;
3/4/2017 18:28	68010300	4	DHCP RENEW WARNING - Field invalid in response v4 option;
3/4/2017 18:28	68010600	6	DHCP Renew - lease parameters time server-66.75.x.x
3/8/2017 6:28	68010300	4	DHCP RENEW WARNING - Field invalid in response v4 option;
3/8/2017 6:28	68010600	6	DHCP Renew - lease parameters time server-76.85.x.x;
3/11/2017 18:28	68010300	4	DHCP RENEW WARNING - Field invalid in response v4 option;
3/11/2017 18:28	68010600	6	DHCP Renew - lease parameters time server-76.85.x.x;tftp file-
3/13/2017 16:19	82000200	3	No Ranging Response received - T3 time-out;
3/15/2017 6:28	68010300	4	DHCP RENEW WARNING - Field invalid in response v4 option;
3/15/2017 6:28	68010600	6	DHCP Renew - lease parameters tftp file-
3/18/2017 18:28	68010300	4	DHCP RENEW WARNING - Field invalid in response v4 option;
3/18/2017 18:28	68010600	6	DHCP Renew - lease parameters time server-66.75.x.x;tftp file-
3/20/2017 17:07	82000200	3	No Ranging Response received - T3 time-out;
3/22/2017 6:28	68010300	4	DHCP RENEW WARNING - Field invalid in response v4 option;
3/22/2017 6:28	68010600	6	DHCP Renew - lease parameters time server-76.85.x.x;tftp file-
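
The "don't match up" observation can be checked mechanically by finding, for each router disconnection, the nearest modem log entry. A Python sketch, assuming the router timestamps are from 2017 and that both clocks use the same time zone (neither is confirmed in the thread):

```python
from datetime import datetime

YEAR = 2017  # assumption: the router log omits the year

ROUTER_DROPS = [
    "Mar 18 01:20:50", "Mar 17 12:59:44", "Mar 16 17:31:01",
    "Mar 12 08:21:12", "Mar 06 14:34:43", "Mar 06 12:09:43",
    "Feb 28 22:26:18", "Feb 22 23:36:30", "Feb 20 11:34:34",
    "Feb 20 03:09:40", "Feb 19 18:22:31",
]

# Distinct timestamps from the modem event log above.
MODEM_EVENTS = [
    "2/23/2017 22:53", "2/25/2017 18:28", "3/1/2017 6:28",
    "3/4/2017 8:42", "3/4/2017 18:28", "3/8/2017 6:28",
    "3/11/2017 18:28", "3/13/2017 16:19", "3/15/2017 6:28",
    "3/18/2017 18:28", "3/20/2017 17:07", "3/22/2017 6:28",
]

drops = [datetime.strptime(f"{YEAR} {s}", "%Y %b %d %H:%M:%S")
         for s in ROUTER_DROPS]
modem = [datetime.strptime(s, "%m/%d/%Y %H:%M") for s in MODEM_EVENTS]

# For each router drop, the time distance to the closest modem event.
closest = {drop: min(abs(drop - event) for event in modem) for drop in drops}

for drop, gap in sorted(closest.items()):
    print(f"{drop:%b %d %H:%M:%S}  nearest modem event is {gap} away")
```

Gaps of hours or days for every drop would support the conclusion that the modem log records nothing at the moment the link goes down.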

#15

So, do you have static IPs or dynamic IPs? You said that you have the IPs statically defined in the Peplink, but the modem logs indicate that it is using DHCP. Why would you be pulling information like tftp service location? Are they trying to push a modem firmware image to your modem or something?


#16

I believe that TWC/Spectrum configures their business class modems this way - the modem has a dynamic IP which it pulls via DHCP (which is not customer-facing) but also has Static IPs that are provided for the customer’s use. It’s weird looking but I think a pretty normal configuration.


#17

Does your modem put any log entries in the log when it reboots? From a bit of googling, it looks like those modems will reboot themselves if signal to noise ratios get to a certain level.

Have you tried swapping the two WAN links? That is, business account on WAN1 and consumer on WAN2, swapped to business on WAN2 and consumer on WAN1? That would help identify where the problem is (Peplink or modem). If the problem starts occurring on the consumer link, then the Peplink may be at fault. If the issues continue on the business account, it is the modem or lines.

What happens if you connect a laptop directly to the business modem? Can you run a continuous ping to a site on the web without losing any packets?

I would start trimming the fat to find the error. (Eliminate devices from the path). As long as there are other devices, the cable company has an out (they just blame your gear)


#18

Agreed, FWIW.
I might suggest one additional step. A call to T-W tech support will tell you if the levels seen by the modem are within range. DOCSIS provides remote diag capabilities and they have the tools to assess this. That will help answer one important question. We’ve seen countless cases where, for example, one added a 2/1 splitter in the circuit in front of a cable modem (to add a TV, for example). The 3.5 - 4dB (or so) loss caused the signal to the modem to drop out of spec.


#19

Here are the stats - they are a tiny bit on the low side but not terrible - I believe the upstream level is supposed to be under 52 and I’m sitting about 50-51.

There is one splitter in the line which I could remove which might help. I’ll try that.

	DCID	Freq	Power	SNR	Modulation	Octets	Correcteds	Uncorrectables
Downstream 1	4	585.00 MHz	-5.00 dBmV	40.37 dB	256QAM	16928030252	106	0
Downstream 2	5	591.00 MHz	-5.00 dBmV	40.37 dB	256QAM	5655494268	107	0
Downstream 3	6	597.00 MHz	-4.80 dBmV	40.95 dB	256QAM	5693017493	135	0
Downstream 4	7	603.00 MHz	-4.60 dBmV	40.37 dB	256QAM	5694263911	133	7
Downstream 5	8	609.00 MHz	-4.80 dBmV	40.37 dB	256QAM	5666201180	157	0
Downstream 6	9	615.00 MHz	-5.00 dBmV	40.37 dB	256QAM	5859698728	155	0
Downstream 7	10	621.00 MHz	-4.80 dBmV	40.37 dB	256QAM	5652279815	148	0
Downstream 8	11	627.00 MHz	-4.90 dBmV	40.37 dB	256QAM	5828831770	143	0
Downstream 9	12	633.00 MHz	-5.20 dBmV	40.37 dB	256QAM	5796206643	142	0
Downstream 10	13	639.00 MHz	-5.40 dBmV	40.37 dB	256QAM	5841241525	131	0
Downstream 11	14	645.00 MHz	-5.30 dBmV	40.37 dB	256QAM	5725698858	142	0
Downstream 12	16	657.00 MHz	-5.80 dBmV	38.61 dB	256QAM	5708499888	319	0
Downstream 13	17	663.00 MHz	-5.70 dBmV	38.98 dB	256QAM	6050719546	301	0
Downstream 14	18	669.00 MHz	-5.80 dBmV	40.37 dB	256QAM	6162221901	386	0
Downstream 15	19	675.00 MHz	-5.90 dBmV	40.37 dB	256QAM	6097027424	483	9
Downstream 16	24	705.00 MHz	-5.70 dBmV	40.37 dB	256QAM	6731583047	704	110
Upstream

	UCID	Freq	Power	Channel Type	Symbol Rate	Modulation
Upstream 1	50	23.30 MHz	49.25 dBmV	DOCSIS2.0 (ATDMA)	5120 kSym/s	64QAM
Upstream 2	52	37.00 MHz	51.00 dBmV	DOCSIS2.0 (ATDMA)	5120 kSym/s	64QAM
Upstream 3	51	30.60 MHz	50.50 dBmV	DOCSIS2.0 (ATDMA)	5120 kSym/s	64QAM
Upstream 4	49	18.50 MHz	48.75 dBmV	DOCSIS1.x (TDMA)	2560 kSym/s	16QAM

#20

Do you have sample logs for the issue?