SpeedFusion ever-rising latency curve

breathevalue · July 28, 2014, 6:21am

I’ve got an issue that I don’t know how to debug …

I have a low-bandwidth high-latency link with a committed information rate of 120 kbps down and 96 kbps up on a separate dedicated channel. Latency when not congested is ~600-700ms.

SpeedFusion really doesn’t work well over this link … The SpeedFusion tunnel establishes fine, but performance is abysmal. With a single SMTP session active over SpeedFusion, throughput grinds to a halt, and SpeedFusion shows an ever-rising ‘latency’ measurement during this time … Meanwhile, non-SpeedFusion traffic sees the expected latencies (in the range of 600-1000 ms) …

Is there any way to diagnose this? How can I improve this situation …

Many thanks,

TK_Liew · July 29, 2014, 12:30pm

Hi,

A WAN link with latency above 1000 ms is not recommended for building a SpeedFusion tunnel. If latency for non-SpeedFusion traffic is 600-1000 ms, I believe latency for SpeedFusion traffic will be much higher due to the overhead.

Latency is beyond Peplink’s control. You may need to consider looking for a better WAN link.

Hope this helps.

breathevalue · July 29, 2014, 6:40pm

Hi Liew,

Many thanks for the response. Are you able to view the image I attached in the first post …? The forum isn’t showing it to me now and I’m not sure if it got corrupted somehow …?

As for the WAN, it’s a satellite WAN. Without any traffic pressure, the latency it offers is around 630 ms. The fluctuations in latency as measured with ICMP pings are mostly the result of congestion, of which there can be a lot. It’s a tiny link with a lot of demand … The actual bandwidth is a 512 kbps link on a single channel shared with 4 other sites. The service provider implements some kind of QoS at layer 3+ to give each site a CIR of 120 kbps with bursts up to the maximum channel capacity. The burst capacity is essentially never there when the other sites are powered on. If they are up, we have what appears to TCP connections as a 120 kbps channel …

It seems unusual to me that the latency measurements the SpeedFusion UI presents would rise higher and higher like the picture shows, though. If there is no SpeedFusion activity, the latency measurement is stable and comparable to the latency I would measure with an ICMP ping. But if I start a transfer over the SpeedFusion link, things appear to work for maybe a minute and then the plot shows latency on an ever-increasing curve to absurdly high levels … I’m thinking that rising latency curve is essentially a representation of the size of a SpeedFusion buffer on one of the Peplinks that is waiting to be transferred …? It could be that the Peplink’s SpeedFusion buffer is very large relative to our link capacity …?

I would be curious to know how the latency measurements shown in the plot are made … The details of that might help us figure out whether there are any changes we could make to the ISP’s QoS to get the SpeedFusion traffic to behave better …

For the traffic we move via SpeedFusion, we don’t need the most efficient possible transfers, only to make sure that forward progress is made and that no traffic is generated which can completely hog or oversaturate the link … I haven’t found any way to put a limit on the maximum data rate that SpeedFusion will attempt through a given link, but it would be useful if I could …
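The buffer hypothesis above can be illustrated with a little arithmetic: when a sender offers more than the link can drain, the backlog in a FIFO buffer grows every second, and the delay seen by a newly queued packet grows with it. The following is a minimal sketch of that idea only. It is not a description of how SpeedFusion measures latency; the function name and the 150 kbps offered rate are made-up assumptions, while the 96 kbps drain rate and 630 ms base latency come from the posts.

```python
def queue_delay_ms(offered_kbps, drain_kbps, seconds, base_ms=630.0):
    """Latency (ms) seen by a packet joining a FIFO buffer, sampled once per second.

    Assumes an unbounded buffer and constant rates (both simplifications).
    """
    backlog_kbit = 0.0
    delays = []
    for _ in range(seconds):
        # Backlog grows by the excess of offered over drained traffic, never below zero.
        backlog_kbit = max(0.0, backlog_kbit + offered_kbps - drain_kbps)
        # Queueing delay is the time needed to drain the backlog at link rate.
        delays.append(base_ms + 1000.0 * backlog_kbit / drain_kbps)
    return delays


# Hypothetical sender pushing 150 kbps into a 96 kbps uplink for one minute.
overdriven = queue_delay_ms(offered_kbps=150, drain_kbps=96, seconds=60)
# Hypothetical sender that stays just under the uplink rate.
matched = queue_delay_ms(offered_kbps=90, drain_kbps=96, seconds=60)

print(overdriven[0], overdriven[-1])
print(matched[0], matched[-1])
```

In this model the overdriven case climbs steadily for as long as the sender keeps pushing, while the matched case stays at the base latency. A rate limit on the sender, or a bounded buffer that drops packets, would cap the curve.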

TK_Liew · July 30, 2014, 12:24pm

Hi,

Thanks for the info.

No, I can’t see the attached image.

I believe this is a central site. As mentioned in your first post, you had poor performance over SpeedFusion even with a single SMTP session. May I know how you measured this (single SMTP session)? Does your mail server sit at the central site?

Can you share which services are used between the central and remote sites? Is the traffic mostly downloads from the central site to the remote sites?

breathevalue · July 30, 2014, 2:08pm

Attaching the images again. The whole point of attaching the images is that they are a necessary part of the description of the issue I’m seeing …

The first image shows a period where there is initially no SpeedFusion activity, then a single SMTP-over-SpeedFusion transaction is initiated (it’s an email with an attachment sent via SMTP; the bulk of that traffic flow consists of data moved from the LAN to a SpeedFusion peer). Initially things look semi-normal, then the latency measurement shown on the SpeedFusion plot begins to climb into the stratosphere. It stays up there for a long while after the SMTP transaction is cancelled. Eventually, as shown in the second image, the latency goes back down to normal.

During the whole period, non-SpeedFusion traffic moved normally. ICMP ping round-trip times during the period shown were in the range of 600-1000 ms.

(Note* – I can’t control the image order – the images are shown inline in the order opposite to that described)

TK_Liew · July 31, 2014, 10:51am

Hi,

I’d appreciate it if you could open a ticket here.

breathevalue · August 2, 2014, 5:14am

I actually think I figured out the problem (or a portion of it) and found a suitable workaround.

In the course of investigating another issue, I stumbled across a change which also addressed this. I was seeing worse network performance than I expected when there was ‘heavy upload’ activity on the link (for such a small link, a few active HTTP file uploads were more than sufficient). Such activity was very strongly impacting the achieved receive throughput, much more so than I would’ve expected over a full-duplex link. The issue ended up being the ‘upload bandwidth’ parameter on the Peplink WAN. This parameter was configured to match the upload capacity of the tx channel of our link, 96 kbps. Apparently this parameter has the effect of setting a queue or buffer size, or of having the Peplink enforce some traffic shaping/policing to that configured capacity. I think I will do some experiments to determine exactly what the behavior of that parameter is, but that will have to be at a later time … Whatever the traffic queuing behavior on the Peplink is, it’s not a good choice for our small link’s outbound traffic. With the outbound bottleneck at the Peplink, it was very easy for active flows to gain an outsized advantage over less active flows and essentially starve TCP ACKs and new TCP connections.

I instead configured the ‘upload bandwidth’ on the link to be significantly greater than the actual channel capacity. This has the effect of moving the outbound traffic bottleneck forward to the edge gateway device. The edge gateway device for this link is a Cisco 2811 router. By default, Cisco IOS uses flow-based weighted fair queuing (WFQ) as the queueing mechanism on serial interfaces at E1 speed (2.048 Mbps) and below. This queueing algorithm causes flows to quickly converge on a rough split of the channel capacity while preventing an active flow from starving out new connections and less active flows (ACK responses and new TCP connections). It’s the default at those speeds because its behavior is well suited to very low bandwidth links.
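The starvation effect described above can be shown with a toy model. The sketch below compares a single shared FIFO queue with simple per-flow round-robin service at a bottleneck that sends one packet per tick. Round-robin is a deliberate simplification and not Cisco’s actual WFQ algorithm (which also weights by packet size and IP precedence), and all rates and names here are made-up assumptions chosen only to make the contrast visible.

```python
from collections import deque


def mean_ack_wait(fair, ticks=200, bulk_per_tick=3, ack_every=5):
    """Mean ticks an 'ack' packet waits at a bottleneck that sends 1 packet per tick."""
    flows = ["bulk", "ack"]
    queues = {name: deque() for name in flows}  # per-flow queues (fair mode)
    fifo = deque()                              # single shared queue (FIFO mode)
    next_flow = 0                               # round-robin pointer
    waits = []
    for t in range(ticks):
        # The bulk flow overdrives the link; the ack flow sends only occasionally.
        arrivals = ["bulk"] * bulk_per_tick
        if t % ack_every == 0:
            arrivals.append("ack")
        for name in arrivals:
            if fair:
                queues[name].append(t)
            else:
                fifo.append((name, t))
        # Serve exactly one packet per tick, if any is waiting.
        served = None
        if fair:
            for i in range(len(flows)):
                idx = (next_flow + i) % len(flows)
                if queues[flows[idx]]:
                    served = (flows[idx], queues[flows[idx]].popleft())
                    next_flow = (idx + 1) % len(flows)
                    break
        elif fifo:
            served = fifo.popleft()
        if served and served[0] == "ack":
            waits.append(t - served[1])
    return sum(waits) / len(waits)


print("FIFO mean ack wait (ticks):", mean_ack_wait(fair=False))
print("Fair mean ack wait (ticks):", mean_ack_wait(fair=True))
```

With the shared FIFO, each ack queues behind an ever-longer run of bulk packets, so its wait keeps growing. With per-flow service, the sparse ack flow is picked up almost immediately regardless of how far the bulk flow has overrun the link.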

This change solved the ‘upload activity overly slows network’ issue and also solved the issue of ever-rising SpeedFusion latency. It seems that SpeedFusion is now able to find the correct packet flow rate and continue forward progress in a more appropriate manner … The latency value graphed in the SpeedFusion performance UI still rises when there is an active SpeedFusion flow, but it stops around 5000-6000 ms and stays there while the flow is active rather than rising towards infinity as it did before …
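As a rough sanity check on that plateau, the latency figure can be turned into an implied amount of queued data. This is a back-of-the-envelope sketch only: it assumes that all latency above the ~630 ms idle figure is queueing delay in front of the 96 kbps uplink, and that 1 kbps means 1000 bits per second. Neither assumption is confirmed anywhere in this thread, and the function name is hypothetical.

```python
def implied_backlog_bytes(latency_ms, base_ms=630.0, drain_kbps=96.0):
    """Bytes that would have to sit in a queue to explain the extra latency.

    Assumes the excess over base_ms is pure queueing delay at drain_kbps.
    """
    queue_seconds = max(0.0, (latency_ms - base_ms) / 1000.0)
    return queue_seconds * drain_kbps * 1000.0 / 8.0


for observed_ms in (5000, 6000):
    print(observed_ms, "ms ->", round(implied_backlog_bytes(observed_ms)), "bytes queued")
```

Under those assumptions the plateau corresponds to a standing backlog of a few tens of kilobytes, which would be consistent with a bounded buffer somewhere on the path rather than an unbounded one.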

Let me know if you still want me to open a ticket.

TK_Liew · August 3, 2014, 3:57pm

Hi Ben,

Upload bandwidth specifies the data bandwidth in the outbound direction from the LAN through the WAN interface. This value is referenced when the default weight is chosen for outbound traffic and for traffic prioritization. Thus, setting upload bandwidth to 96 kbps affects SpeedFusion upload traffic as well.

Also, the satellite link could be applying TCP/ICMP optimization but not UDP optimization (SpeedFusion uses UDP for data). This could be one of the factors causing the high latency. You may open a ticket if the situation gets worse in the future.