Help with Layer 2 VPN

SotYPL · March 21, 2017, 7:42am

So I have two Balance One routers with SpeedFusion licenses. Each is at a different site and has 2 WAN links (T1 and cable). Right now they are connected through a standard Layer 3 VPN, but I’m wondering if what I’m trying to achieve is even possible.

Balance One (Site 1)
LAN IP: 172.22.60.1
mask: 255.255.255.0
DHCP: 172.22.60.2-172.22.60.254

Balance One (Site 2)
LAN IP: 172.22.59.1
mask: 255.255.255.0
DHCP: 172.22.59.2-172.22.59.254

Most of my servers are located at Site 1 and are VMs (ESXi). I’m also using Zerto for replication, and currently both hosts (production and DR) are located at Site 1 and have IP addresses from the 172.22.60.0/24 subnet. I want to move my DR host to Site 2. In an emergency, when my production host goes down, I need to start the VMs from the DR host, but they will all start up with IP addresses from the 172.22.60.0/24 subnet and the gateway set to 172.22.60.1. If I understand correctly, a Layer 2 VPN would be the right choice, but I would have to either change the IP addresses of all devices at Site 2 to 172.22.60.0/24 as well, or change the subnet on both sites to /23, but then all internet traffic from Site 2 would have to go through the VPN and then the gateway at Site 1. Is that correct? I need to avoid that because I don’t have enough bandwidth available (50/10 and 10/10 for Site 1, and 16/2 and 1.5/1.5 for Site 2). Is there any other way to achieve what I need? Thanks.

jmjones · March 21, 2017, 8:08am

I have heard that there is a function in Zerto to change the IP address of the VM when it is recovered at another site.

From my experience: make a separate VM for testing. Set up your recovery site and add the testing VM to it. Run a complete restoration process. Zerto requires several ports to be open to completely restore. I had a VM get crapped out by a rogue SCCM server, and Zerto would restore to 99% and then fail because of a connectivity problem on one port used for finalizing the restore. I ended up having to rebuild the server.

SotYPL · March 21, 2017, 8:19am

The link that you provided is about changing the IP address of the VRA (Virtual Replication Appliance), not of the actual guest OS of the VM that you are replicating. As far as I know, Zerto can only change the IP address of a Windows VM and can’t do it for a Linux VM, though that could be achieved with some scripting inside the guest OS. But changing the guest’s IP address is only part of the problem. I would also have to change my DNS entries and then clear the DNS cache on every device that needs to access the VMs I start from the DR host. So I need to avoid that if I want to keep the really low RTO that Zerto gives me.

jmjones · March 21, 2017, 8:03pm

Layer 2 should work for you. You might want to create a tunnel specific to layer 2 and exclude the smaller pipes. In theory, it should work; but what HA/fault tolerance can’t you achieve within VMware and Zerto? Seems to me, the disaster that you hope to recover from would most likely take out one end of the tunnel, right?

You might want to look at a recovery site in a cloud service and possibly extending your VMware infrastructure to the second site. Then, you could run active/active across the link (separate work by location) with a centrally managed solution with vMotion as your primary protector. Then the cloud would be a recovery site that is accessible to both ends. Just throwing out some ideas. I love clustering across geographic boundaries. Usually, network and storage are your biggest hurdles. Storage just takes mirroring, but network requires the same IP addresses to be routable from multiple locations. The PepVPN layer 2 tunnel may be the answer for that. Good luck bud. I am curious how it works out for you.

SotYPL · March 22, 2017, 6:39am

But all of my pipes are small. They are definitely not big enough to handle outside-world traffic from both sites for all of my devices (I’m talking about the upstream that would be needed). Anyway, I did some testing and it’s working correctly. I changed the site 1 gateway LAN IP to 172.22.60.1/21 and site 2 to 172.22.59.1/21 (/21 is the longest prefix that covers both of my subnets) and changed the PepVPN tunnel to L2. After that I can give a device that is currently at site 2 an address from site 1 (172.22.60.x), and even set its gateway address to 172.22.60.1, and it’s accessible from both networks. The only downside is that outside-world traffic from this device will go through the VPN and then the site 1 gateway, but that’s maybe even better, because I can keep using port forwarding on the site 1 gateway for my servers and don’t have to deal with changing DNS entries for public services.
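The /21 figure can be double-checked with a minimal Python sketch using the standard `ipaddress` module. The function name `smallest_common_supernet` is made up for this illustration; the two subnets are the ones from the thread.

```python
import ipaddress


def smallest_common_supernet(*cidrs):
    """Return the longest-prefix network that contains every given network."""
    # strict=False lets us pass gateway-style values such as 172.22.60.1/24.
    nets = [ipaddress.ip_network(c, strict=False) for c in cidrs]
    candidate = nets[0]
    # Widen the candidate one bit at a time until it covers all networks.
    while not all(n.subnet_of(candidate) for n in nets):
        candidate = candidate.supernet()
    return candidate


site1 = "172.22.60.0/24"
site2 = "172.22.59.0/24"
common = smallest_common_supernet(site1, site2)
print(common)                # 172.22.56.0/21
print(common.num_addresses)  # 2048
```

Note that a /23 would not be enough here: 172.22.59.0/24 sits in 172.22.58.0/23 while 172.22.60.0/24 sits in 172.22.60.0/23, so the two only meet at 172.22.56.0/21.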

And about your question: I’m trying to prepare myself for every possibility, including a fire in my server room. If all my hosts and network devices at site 1 are gone, the L2 VPN obviously makes no sense any more, because the site 1 gateway will be gone too. But if I have a problem with the power source for my hosts at site 1, I want to be able to start all of my VMs at site 2 as fast as possible with no or minimal configuration changes, and that’s what the L2 VPN will give me, I guess.

I can’t use vMotion effectively because I don’t have a cluster or fast shared storage in my VMware infrastructure. I have 2 vCenter servers (1 for each site) and hosts with local storage.

jmjones · March 22, 2017, 7:18am

Gotcha. My work drank the KoolAid and we have a screaming infrastructure with 70-80 physical servers with SAN storage into ESX. Then, there is the special ESX environment that my application lives in (BizTalk). We use it for our integration engine at the hospital I work for.

Inside of that VMware environment, we have vMotion enabled AND we run Microsoft Failover Clusters inside there. The clusters use iSCSI for their shared storage (RDM sucks). We also have Zerto, but we aren’t using it for BizTalk stuff. Just a couple of source repository servers that have no internal HA.

I will caution you in trying to “plan for all possible failures”. The advice we were given by a consulting group was to identify Tier1 services first. It should be a very small subset of services compared to production. After that is identified, define 1 scenario and create a plan. Test the plan and adjust until it works. Then, create a second scenario, and repeat. After about 3 or 4 scenarios, you should be able to handle about 90% of DR scenarios just from those 3 or 4 defined scenarios. It is unrealistic to try to plan for every single problem.

Our first example scenario was a plane falling out of the sky and destroying our primary data center to the point where it would need to be rebuilt. Tier1 stuff was required to come online in the alternate site to provide services for other Tier1 apps within an hour. That scenario helped identify non-Tier1 stuff that had to be included since it was a dependency for Tier1 apps. It was a very boring exercise for sure. But now we have organization-wide DR plans for every application and service tier. The biggest part of the plan is communication and staffing. The technology part is the easy bit. Good luck to you buddy.

SotYPL · March 22, 2017, 8:00am

We are not that big; actually we are very small compared to you. We are an electrical material distributor with 2 physical locations and a total of 5 ESXi hosts plus 2 physical servers for our PBX. If a plane hits the main location I will probably be dead, so getting anything running at the second site will not be my problem any more. But if something like a power failure happens at the main location and I have a DR host running at the second branch, ready to start all of my VMs without changing any configuration, that’s more than enough for me. If my whole server room at the main branch burns or loses power (we have a generator, but a UPS can fail, so there will be some downtime anyway while everything switches over), the L2 VPN will be useless and I will have to go to the second branch and manually change configs if I want the VMs to be accessible at least to branch 2 devices. I know that to be 100% bulletproof I would have to spend thousands upgrading my hardware, but the L2 VPN gives me some more peace of mind for free, so I’m happy that Peplink allows that.

EDIT: I just found one problem with that config: conflicting DHCP servers. I have DHCP enabled on both gateways, and right now a device connected to the network at site 1 can get an IP address from site 2. Now I need to find a way to block DHCP requests over the VPN. Is that even possible when it’s set up as Layer 2?

jmjones · March 22, 2017, 8:59am

It is possible to define DHCP reservations for addresses outside of the DHCP range. You could set up reservations so devices grab the correct IP address from either DHCP server, but the default gateway assigned will come from whichever DHCP server responds first. With a big layer 2 running, this may cause inefficiencies. One of the Peplink folks may have some kind of solution. In a perfect world, I guess you would need a layer 2 network that is independent of your real networks and only put devices on that segment that need the “failover” functionality. That would keep your clients from pulling wrong addresses, while allowing your services to live at either location.

SotYPL · March 22, 2017, 9:04am

DHCP uses UDP ports 67 and 68 for requests, so I think I should be able to block it through internal network firewall rules. I’m gonna try that and see what happens.
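To make the idea concrete, here is a minimal Python sketch of the match such a rule would perform. It is only a model of the logic, not Peplink's configuration syntax; the `Packet` class and both function names are hypothetical.

```python
from dataclasses import dataclass

# DHCP servers listen on UDP 67, clients on UDP 68.
DHCP_PORTS = {67, 68}


@dataclass(frozen=True)
class Packet:
    """Hypothetical, stripped-down view of a packet crossing the tunnel."""
    protocol: str
    src_port: int
    dst_port: int


def is_dhcp(pkt: Packet) -> bool:
    """True for UDP traffic to or from the DHCP server/client ports."""
    return pkt.protocol == "udp" and (
        pkt.src_port in DHCP_PORTS or pkt.dst_port in DHCP_PORTS
    )


def should_drop_on_tunnel(pkt: Packet) -> bool:
    """Model of the rule: drop DHCP between sites, pass everything else."""
    return is_dhcp(pkt)


discover = Packet("udp", src_port=68, dst_port=67)   # client -> server
offer = Packet("udp", src_port=67, dst_port=68)      # server -> client
dns_query = Packet("udp", src_port=40000, dst_port=53)

print(should_drop_on_tunnel(discover))   # True
print(should_drop_on_tunnel(offer))      # True
print(should_drop_on_tunnel(dns_query))  # False
```

Matching both directions matters: blocking only the client request still lets a server reply slip through if the request reached it some other way.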

SotYPL · March 22, 2017, 9:39am

Do you mean creating 2 VLANs at both sites and 2 tunnels: one Layer 3, and one Layer 2 for the VLAN that holds only my replicated hosts/VMs? But then I would need to enable inter-VLAN routing, so the DHCP conflict would still exist, and on top of that the Peplink would have to handle the LAN traffic between my servers and other devices, which gets fairly heavy overnight when backups are taken. Veeam backs up my VMs to a NAS, and workstations back up to a CrashPlan server, which is also one of the VMs, so it needs to stay in this L2-connected VLAN.

jmjones · March 22, 2017, 11:10am

This is fun. I hope you don’t rush out and start making changes based on the ideas expressed here. You sound very knowledgeable about your environment, and that is comforting. I am certain you will do your due diligence with research, design, and analysis before jumping in. Since you have this in-depth knowledge, what about adding another NIC to your servers? You can have your replicated resources ONLY on the layer 2 segment spanning both LANs (via the second NIC and a dedicated VLAN). All of your clients stay local to their respective LAN segments and would ALWAYS consume resources from the addresses on the dedicated VLAN, so they don’t care where a service really lives, only that they have a route to it. No client redirection needed, yay. The DHCP conflicts are easy enough to avoid: don’t run a DHCP server on the new dedicated VLAN and force static address assignments, OR make your range just big enough and use reservations, OR make your range only one address big, apply static IPs to the devices, and use reservations to help “manage” the static addresses. By manage, I mean have a list of what you have configured that is easy to get to. You could probably do something with an external DHCP server and a relay as well, but I would go all static since you want to streamline and manage this spanning layer 2. The only thing I can think of that would bite you is if you require broadcast or multicast for your services inside the new VLAN. I assume you aren’t running streaming media services, are you?

SotYPL · March 23, 2017, 9:48am

I think I’m not knowledgeable enough about networking, and that’s the problem. Anyway, adding a second NIC to the servers would not be a problem; actually all of my ESXi hosts have at least 2 NICs already, and the main host that runs my ERP guest and one of my AD controllers has 4 NICs. But I’m not sure if I want to complicate my infrastructure that much. Right now I only use DHCP for WiFi clients and SIP phones, but I had a plan to switch all of my devices to DHCP just to make managing them a little bit easier. It seems that the internal network firewall rules are working well enough. I can see in the logs that some of the devices still see the DHCP server at the second site (the firewall is layer 3 and a DHCP broadcast is layer 2, so it bypasses any firewall?), but the actual UDP requests on ports 67-68 are blocked, so a device at site 1 should never get an IP and gateway address from site 2, and vice versa. I already have a list of all static IPs saved on the Peplink routers, and I also use Axence nVision software for network mapping. Anyway, thank you so much for all the suggestions; they will help me a lot if I decide to make this more bulletproof.

jmjones · March 23, 2017, 3:12pm

I started picking up the networking stuff when the team I depended on proved “baffled” at requests for firewall changes. They were simple tcp sockets. I provided all possible source IPs and destination IP and I wanted traceroute and ping opened up. You would be amazed at what I saw was implemented. The rule only has to match the first packet from the origin. Everything else goes to the state table and is allowed.
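The "first packet matches the rule, the rest ride the state table" behaviour can be shown with a toy Python model. This is a simplified sketch, not how any particular firewall is implemented; `Flow`, `StatefulFirewall`, and the addresses are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """One direction of a connection (hypothetical, simplified)."""
    src: str
    sport: int
    dst: str
    dport: int

    def reversed(self) -> "Flow":
        # The reply direction of the same connection.
        return Flow(self.dst, self.dport, self.src, self.sport)


class StatefulFirewall:
    def __init__(self, allowed):
        # Rules are (source IP, destination IP, destination port) tuples.
        self.allowed = set(allowed)
        self.state = set()

    def permit(self, flow: Flow) -> bool:
        # Known connection: allowed without consulting the rules again.
        if flow in self.state:
            return True
        # New connection: only the first packet has to match a rule.
        if (flow.src, flow.dst, flow.dport) in self.allowed:
            self.state.add(flow)
            self.state.add(flow.reversed())
            return True
        return False


fw = StatefulFirewall({("10.0.0.5", "10.0.1.9", 443)})
first = Flow("10.0.0.5", 50000, "10.0.1.9", 443)
print(fw.permit(first))             # True: matches the rule
print(fw.permit(first.reversed()))  # True: reply is in the state table
print(fw.permit(Flow("10.0.1.9", 443, "10.0.0.7", 50000)))  # False: unsolicited
```

No separate rule is needed for the return traffic, which is why a rule written for both directions is usually a sign the request was misunderstood.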

My point is - you know more than you think. Saying stuff like layer 2 layer 3 and UDP tells the real story. Yes, layer 2 broadcasts are not controlled at a layer 3 firewall. You basically turned the internet into a big crossover cable between your two networks. Only way to stop broadcast traffic is to cut the wire.

Good luck and clever fix for catching the DHCP requests/responses going back. Very clever.

SotYPL · March 24, 2017, 7:33am

I’ve got another problem to solve, but it’s a little off topic. To make my network more bulletproof I need to think about the switches too. Right now all my switches are daisy-chained, so if one switch goes down, everything behind it goes down, and if the switch that is connected to the Peplink goes down, my whole network is basically useless. My ESXi hosts are connected to the two main 1 Gbps switches using different NICs, so they are already protected, but I need to think about the rest of the network. I plan to link the 2 main switches using 2 SFP links with LACP enabled, then connect each remaining switch to both main switches with STP enabled. I also need to connect the Peplink to both main switches, but I’m not sure whether the Peplink’s built-in switch has STP enabled by default or not. I can only find an option to enable STP for the L2 VPN, not for the switch.

I created a diagram of how I want my network to look, but now the question is: is it gonna work? I’m worried about broadcast storms with that kind of configuration, and I’m still not sure if the Peplink’s switch has STP enabled by default.

jmjones · March 24, 2017, 9:12am

Wait for an expert, but I am pretty sure the switch does a “slow spanning tree” - which means the port is blocked until STP finishes.

If broadcast on layer 2 is a concern, then it is time for some VLANs. Can you add the network spaces to your diagram?

SotYPL · March 24, 2017, 9:27am

Right now I only have two /24 subnets: 172.22.60.0/24 for BR1 and 172.22.59.0/24 for BR2. But for testing the L2 VPN I have changed the gateways to /21 to cover both subnets without changing device addressing yet. The L2 VPN is working fine with the switches daisy-chained, and there is not much broadcast traffic going through the VPN yet. I’m not sure how that will change after I change the mask on all of my devices to /21 and create this star topology for the switches.

jmjones · March 24, 2017, 1:00pm

I would figure at most you would be doubling your layer2 broadcast traffic. Just ARP, DHCP discovery, and any routing protocols you have running.

You planning on doing a Private address VLan? Or planning on doing a second nic in the same subnet/supernet? There really isn’t much reason to have a public IP on an internal network is there?

I think your access switches are just hanging out according to your diagram. Are these the devices that users connect to?

SotYPL · March 24, 2017, 1:54pm

I would prefer to avoid any additional VLANs if it’s possible just to keep things simple. Right now I have only one small VLAN on one of the switches for my public IPs from one WAN and use it only to get one public IP for my PBX’s second NIC. So basically I only use this VLAN to isolate my WAN provider’s router from my LAN. Everything else is untagged.

Access switches are the switches used for workstations and printers. Right now I have them daisy-chained with the other switches, but I plan to connect them as in the diagram, so that each access switch is connected to both of the main server switches. I think that if I do it the way my diagram shows, the failure of any one switch in my LAN won’t take the rest of the network down. Even when one of the server switches goes down, my ESXi hosts are still connected to the second one, every other switch is connected to the second one, and the Peplink router is still connected to the working one as well.
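The "no single switch takes the network down" claim can be sanity-checked on paper with a small Python sketch. The device names and link lists below are assumptions reconstructed from the description (the actual diagram is not reproduced here), and the function names are hypothetical.

```python
from collections import deque


def is_connected(links, removed):
    """True if all remaining devices can still reach each other."""
    nodes = {n for link in links for n in link} - {removed}
    if not nodes:
        return True
    adj = {n: set() for n in nodes}
    for a, b in links:
        if a in nodes and b in nodes:
            adj[a].add(b)
            adj[b].add(a)
    # Breadth-first search from an arbitrary surviving device.
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        for nxt in adj[cur] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen == nodes


def single_points_of_failure(links, switches):
    """Switches whose loss splits the rest of the network."""
    return sorted(s for s in switches if not is_connected(links, s))


# Planned layout: every device is dual-homed to both main switches.
planned = [("main1", "main2"), ("peplink", "main1"), ("peplink", "main2"),
           ("esxi", "main1"), ("esxi", "main2"),
           ("access1", "main1"), ("access1", "main2"),
           ("access2", "main1"), ("access2", "main2")]
# Current layout: switches chained one after another.
daisy = [("peplink", "sw1"), ("sw1", "sw2"), ("sw2", "sw3")]

print(single_points_of_failure(planned, ["main1", "main2", "access1", "access2"]))  # []
print(single_points_of_failure(daisy, ["sw1", "sw2", "sw3"]))  # ['sw1', 'sw2']
```

The check only covers switch failures; the redundant links it relies on form loops, which is exactly why STP has to be active on every switch in the path.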

SotYPL · April 3, 2017, 2:38pm

So I have my switches connected as I planned, and (almost) everything is working fine with STP enabled, including the Peplink’s built-in switch connected to both main server switches. I can unplug any of the individual links and not even one ICMP reply is lost. But one thing is weird. It seems that the Internal Network Firewall Rules that I created for DHCP (UDP ports 67 and 68) have stopped working. My devices are getting an IP from the second branch about 50% of the time. I don’t know why. I have opened an SR with Peplink and will see if they can help me with that.
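While waiting on the SR, spotting affected devices is easy to script, since the two DHCP pools from the first post do not overlap. A minimal Python sketch, assuming the pools are still the original 172.22.60.2-254 and 172.22.59.2-254 ranges; the function names and site labels are made up.

```python
import ipaddress

# DHCP pools as given at the start of the thread.
POOLS = {
    "site1": (ipaddress.ip_address("172.22.60.2"),
              ipaddress.ip_address("172.22.60.254")),
    "site2": (ipaddress.ip_address("172.22.59.2"),
              ipaddress.ip_address("172.22.59.254")),
}


def lease_origin(ip):
    """Return the site whose DHCP pool contains this address, or None."""
    addr = ipaddress.ip_address(ip)
    for site, (low, high) in POOLS.items():
        if low <= addr <= high:
            return site
    return None


def is_foreign_lease(local_site, ip):
    """True if a device at local_site holds an address from the other pool."""
    origin = lease_origin(ip)
    return origin is not None and origin != local_site


print(lease_origin("172.22.59.37"))               # site2
print(is_foreign_lease("site1", "172.22.59.37"))  # True
print(is_foreign_lease("site1", "172.22.60.37"))  # False
```

Running a check like this against a list of client addresses from each site would show how often the wrong server wins, which is useful evidence to attach to the support request.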