L2 network head scratcher, losing pings to Management VLAN
-
Hi all, long time no post. Got a head scratcher of an L2 networking problem, whose solution seems like it should be obvious, yet it eludes me.
Situation:
4 remote WAN sites connected from main campus via point to point wireless radios (Ubiquiti AirFiber/NanoBeam operating in bridge mode).
Managed L2 access switch at each remote site (HPE Aruba 2930F's)
Managed L2/L3 "core" switch at head end (HPE Aruba 5406R).
Trunking a handful of VLANs (tagged) to each site plus a management VLAN (untagged).
This has worked great forever. But we recently installed a new direct buried fiber circuit to each building, so we're moving off the PTP wireless links. I provisioned new SFP uplink ports on each end, mirrored the VLAN tagging, then powered off the wireless radio and disabled the old Ethernet uplink interface on the remote switch. Then I lit the new fiber uplink interface -- it came online fine, no physical link trouble, a clean 1 Gbps full-duplex link with no errors.
Traffic-wise, everything worked at first -- all VLANs, including management, immediately began passing traffic on the new interface. But maybe half an hour later, the link died. Or I should say, only traffic on the management VLAN died. The physical link remained online and the other tagged VLANs continued passing traffic, but I cannot talk to the switch directly via its management IP, nor can I reach any other remote IPs in the management VLAN (e.g. WAPs -- they still serve their tagged VLANs, I just can't manage them).
Running continuous pings to a remote management IP shows intermittent results: pings work for a few seconds, then die for a few seconds, on and off erratically. Sometimes it comes back for a long stretch, sometimes not.
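(For the curious, a quick sketch like the one below -- one ping a second, logging only the up/down transitions with timestamps -- makes the pattern easy to line up against switch logs later. The target IP is just a placeholder, and the ping flags assume Linux.)

```python
#!/usr/bin/env python3
# Rough sketch of timestamped ping logging -- target IP is a placeholder,
# and the ping flags below assume a Linux-style ping.
import subprocess
import time
from datetime import datetime

TARGET = "10.0.99.12"   # hypothetical remote management IP

last_state = None
while True:
    # -c 1: send one echo request, -W 1: wait at most 1 second for a reply
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", TARGET],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    state = "UP" if result.returncode == 0 else "DOWN"
    if state != last_state:
        # Only log transitions so the output is easy to correlate later.
        print(f"{datetime.now().isoformat(timespec='seconds')}  {TARGET} is {state}")
        last_state = state
    time.sleep(1)
```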
So after this happened at one site, I did a lot of troubleshooting, and finally switched it back to the wireless link and everything went back to normal 100%. So I moved on to a second site to try my luck, and the exact same problem happened. In this case the site ran for 4+ hours before the management traffic puked. I left it in that broken state overnight, and the next morning it was back to normal, with no loss of pings. Then about 6 hours later it started dying again.
It feels like an ARP/MAC issue, but I have tried every way I know of to clear the cache & tables on both ends with no luck. When I do a "show arp" on the core switch, the IP address corresponding to the remote switch's management IP shows the correct MAC, but it does not have an associated port entry.
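(If anyone wants to watch that ARP entry without babysitting a console session, something like this rough sketch works -- it just polls "show arp" on the core every 30 seconds and timestamps whatever line mentions the remote switch's management IP. The credentials, IPs, and netmiko device_type below are placeholders/assumptions, and the output format may differ on your platform.)

```python
#!/usr/bin/env python3
# Sketch: poll 'show arp' on the core switch and log the line for the remote
# switch's management IP, so you can see when the port entry comes and goes.
# Credentials, IPs, and device_type are assumptions -- adjust as needed.
import time
from datetime import datetime
from netmiko import ConnectHandler   # pip install netmiko

CORE = {
    "device_type": "hp_procurve",    # netmiko type used for ArubaOS-Switch gear
    "host": "10.0.99.1",             # hypothetical core switch IP
    "username": "admin",
    "password": "changeme",
}
REMOTE_MGMT_IP = "10.0.99.12"        # hypothetical remote switch mgmt IP

conn = ConnectHandler(**CORE)
try:
    while True:
        output = conn.send_command("show arp")
        match = [l for l in output.splitlines() if REMOTE_MGMT_IP in l]
        stamp = datetime.now().isoformat(timespec="seconds")
        # An entry with no port shows up here as a noticeably shorter line.
        print(f"{stamp}  {match[0].strip() if match else 'no ARP entry'}")
        time.sleep(30)
finally:
    conn.disconnect()
```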
The last little piece of the puzzle... Our old "core" switch (Cisco 3750) is still in use to an extent. It's largely used to route legacy VLANs that we're still in the process of retiring. However the management VLAN doesn't live there. The management VLAN is defined/directly connected on the 5406R core switch. The old 3750 is set to route any management VLAN traffic to the "new" 5406R core.
That said, the existing remote wireless links are served off the old 3750 core. So I'm wondering if there's some kind of situation that is causing traffic destined for the same remote MAC to be unsure of which direction to go (old core/new core).
For clarity, the routing setup is like this:
Current VLANs, including the management VLAN, are defined on the 5406.
For any other traffic (legacy VLANs or web), the 5406 default routes to the old 3750.
The 3750 has static routes back to the 5406 for any VLANs that live there.
I have a feeling one of you gurus will intuit the solution right away, but let me know if anything is unclear from my description.
Thanks for the time
-
Given that you're seeing intermittent service, you probably have a dual-routing issue. If you can use pathping or traceroute to determine the paths your traffic is taking, that might give you some visibility into how it's flowing. If you haven't already pruned them, I would remove all of the routes off the 3750 if possible.
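Something like this rough sketch would tell you right away whether the path is flapping between the two cores -- the target IP is a placeholder, and it assumes a Unix-style traceroute is available on the box you run it from:

```python
#!/usr/bin/env python3
# Run traceroute on a loop and only print when the hop list changes, so a
# path flapping between two cores stands out immediately.
# Target IP is a placeholder; flags assume Linux traceroute.
import re
import subprocess
import time
from datetime import datetime

TARGET = "10.0.99.12"   # hypothetical remote management IP

def current_path(target):
    out = subprocess.run(
        ["traceroute", "-n", "-w", "2", target],   # numeric output, 2s wait per probe
        capture_output=True, text=True,
    ).stdout
    hops = []
    # Skip the header line, grab the first IP on each hop line ('*' if none).
    for line in out.splitlines()[1:]:
        m = re.search(r"\d+\.\d+\.\d+\.\d+", line)
        hops.append(m.group(0) if m else "*")
    return hops

last = None
while True:
    path = current_path(TARGET)
    if path != last:
        stamp = datetime.now().isoformat(timespec="seconds")
        print(f"{stamp}  path: {' -> '.join(path)}")
        last = path
    time.sleep(60)
```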
-
@Kelly
Thanks. Traceroute functions as expected; when testing from the 5406 core it simply times out, no hops (which there shouldn't be since it's directly connected). It never attempts to route via another path.
-
@crustachio said in L2 network head scratcher, losing pings to Management VLAN:
...legacy VLANs that we're still in the process of retiring. However the management VLAN doesn't live there. The management VLAN is...
Is the VLAN ID the same for the old and the new Management VLAN?
-
@crustachio said in L2 network head scratcher, losing pings to Management VLAN:
The last little piece of the puzzle... Our old "core" switch (Cisco 3750) is still in use to an extent. It's largely used to route legacy VLANs that we're still in the process of retiring. However the management VLAN doesn't live there. The management VLAN is defined/directly connected on the 5406R core switch. The old 3750 is set to route any management VLAN traffic to the "new" 5406R core. That said, the existing remote wireless links are served off the old 3750 core. So I'm wondering if there's some kind of situation that is causing traffic destined for the same remote MAC to be unsure of which direction to go (old core/new core).
Are you sure you don't have asymmetric routing going on? You can make asymmetric routing work if there is a valid reason for it, but things like firewalls and routers have to be set up for it; otherwise the packets will be dropped.
-
A lot of good things to consider here so far. Keep spanning tree in mind as soon as you're dealing with topology changes and intermittent issues. It can come up and bite you in the a$$ if you've got a static config somewhere or a new vlan that isn't part of the config.
-
@Dashrender said in L2 network head scratcher, losing pings to Management VLAN:
@crustachio said in L2 network head scratcher, losing pings to Management VLAN:
...legacy VLANs that we're still in the process of retiring. However the management VLAN doesn't live there. The management VLAN is...
Is the VLAN ID the same for the old and the new Management VLAN?
There was no "old" management VLAN (I know right). Mgmt was done in the default VLAN (1) on the 3750. Hence the creation of a dedicated mgmt VLAN on the 5406 when we started migrating off the 3750.
-
So you made a new VLAN specifically for management, alright. And what are you pinging, and from where, on this new management VLAN?
I.e. are you pinging the switch connected to the fiber on the far side? Are you pinging from the switch connected to that same fiber on your side? Or from your PC?
-
I need to clarify something I said erroneously:
"The old 3750 is set to route any management VLAN traffic to the "new" 5406R core. That said, the existing remote wireless links are served off the old 3750 core. So I'm wondering if there's some kind of situation that is causing traffic destined for the same remote MAC to be unsure of which direction to go (old core/new core)."
The second sentence there (the remote wireless links being served off the old 3750 core) is actually untrue. I don't know why I was thinking the wireless link was still served off the 3750. We moved it to the 5406 a while back and it has been working fine.
So to clarify, the "working" link (wireless bridge) actually terminates in an L2 access switch on the roof, which trunks back to the 5406 core. The "new" fiber link terminates directly on the 5406. The mgmt VLAN only lives on the 5406. There should be no way any traffic is trying to go out to the 3750. Traceroute confirms this -- when the fiber link is working (intermittently), traceroute shows a hop from my PC to the 5406, then to the remote switch. When the fiber link is down, traceroute hops to the 5406 then dies.
-
@Dashrender said in L2 network head scratcher, losing pings to Management VLAN:
So you made a new VLAN specifically for management, alright. And what are you pinging, and from where, on this new management VLAN?
I.e. are you pinging the switch connected to the fiber on the far side? Are you pinging from the switch connected to that same fiber on your side? Or from your PC?
Pinging fails FROM any host in the mgmt VLAN on the local side TO any host in the mgmt VLAN on the far side. That includes the remote switch, a UPS, and a WAP.
On the local side, I've tried pinging from my PC (which is not ACL restricted from talking to the mgmt VLAN or anything), the core switch itself, and other switches in the mgmt VLAN. And of course our NMS.
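Here's roughly the kind of sweep I mean -- hitting the far-side switch, UPS, and WAP together so I can see whether they all drop at the same moment (which would point at the uplink rather than any one device). The IPs below are placeholders:

```python
#!/usr/bin/env python3
# Ping several far-side management hosts on a loop and print one status line
# per pass, so it's obvious whether they all drop and recover together.
# IPs are placeholders; ping flags assume Linux.
import subprocess
import time
from datetime import datetime

FAR_SIDE = {
    "remote-switch": "10.0.99.12",
    "ups":           "10.0.99.13",
    "wap":           "10.0.99.14",
}

def alive(ip):
    # One echo request, 1 second timeout.
    return subprocess.run(
        ["ping", "-c", "1", "-W", "1", ip],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ).returncode == 0

while True:
    stamp = datetime.now().isoformat(timespec="seconds")
    status = "  ".join(f"{name}={'up' if alive(ip) else 'DOWN'}"
                       for name, ip in FAR_SIDE.items())
    print(f"{stamp}  {status}")
    time.sleep(10)
```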
I need to go back onsite and console into the remote switch to see if pings work the other way.
-
@crustachio said in L2 network head scratcher, losing pings to Management VLAN:
I need to go back onsite and console into the remote switch to see if pings work the other way.
If you have a PC at that remote site - since you said normal data VLANs are working, you could remote into one of them and then access a switch and see if pinging on that side is working.
-
@crustachio said in L2 network head scratcher, losing pings to Management VLAN:
...we recently installed a new direct buried fiber circuit to each building
this is fiber you own, it doesn't go through a carrier like AT&T/Cox/Comcast/etc?
-
@Dashrender said in L2 network head scratcher, losing pings to Management VLAN:
@crustachio said in L2 network head scratcher, losing pings to Management VLAN:
I need to go back onsite and console into the remote switch to see if pings work the other way.
If you have a PC at that remote site - since you said normal data VLANs are working, you could remote into one of them and then access a switch and see if pinging on that side is working.
Nice suggestion, but the remote PC VLAN is not authorized to SSH into the management VLAN of the switch.
@Dashrender said in L2 network head scratcher, losing pings to Management VLAN:
@crustachio said in L2 network head scratcher, losing pings to Management VLAN:
...we recently installed a new direct buried fiber circuit to each building
this is fiber you own, it doesn't go through a carrier like AT&T/Cox/Comcast/etc?
We own it, it's a simple PTP SMF span.
-
@notverypunny said in L2 network head scratcher, losing pings to Management VLAN:
A lot of good things to consider here so far. Keep spanning tree in mind as soon as you're dealing with topology changes and intermittent issues. It can come up and bite you in the a$$ if you've got a static config somewhere or a new vlan that isn't part of the config.
I think you are on to something. I had ruled STP out at first because we're really not doing anything complicated with it -- no PVST or anything. I checked right away to confirm the 5406 was the root and that the remote switch's priority was set appropriately so it shouldn't win the root election, and everything looked normal. But digging into the STP topology change history logs on the switches does in fact show numerous topology change notifications (TCNs), and in the last 15 minutes I've correlated the intermittent responsiveness of the remote switch with TCNs coming from a completely different L2 access switch on the LAN.
That switch is generating a "CIST starved for a BPDU Rx on port 1" error (port 1 is its uplink) and is therefore promoting itself to root, forcing topology changes across the tree.
If I manually set the STP priority on that switch and let STP reconverge, things go back to normal for a short while and the fiber "problem" switch comes back. That lasts until the "CIST starved for a BPDU Rx" error recurs on the other switch, and then things go haywire again.
OK, now we're getting somewhere. Not sure why that port is no longer receiving BPDU packets... filtering is not enabled, there's no root-guard in place. I'll keep digging, but now I'm on the trail.
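In case it helps anyone chasing something similar, this is roughly the kind of correlation I mean -- watching the ping state of the remote switch next to any topology-change / BPDU-starvation messages hitting our syslog collector. The log path, message keywords, and IP below are assumptions; adjust for your environment:

```python
#!/usr/bin/env python3
# Tail the syslog collector for STP topology-change / BPDU-starvation messages
# while pinging the remote switch, printing both streams with timestamps so
# the overlap between TCN storms and ping drops is obvious.
# Log path, keywords, and IP are assumptions.
import subprocess
import threading
import time
from datetime import datetime

SYSLOG = "/var/log/network.log"      # hypothetical syslog collector file
REMOTE = "10.0.99.12"                # hypothetical remote switch mgmt IP
KEYWORDS = ("topology change", "starved for a BPDU")

def stamp():
    return datetime.now().isoformat(timespec="seconds")

def watch_pings():
    last = None
    while True:
        up = subprocess.run(
            ["ping", "-c", "1", "-W", "1", REMOTE],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        ).returncode == 0
        state = "UP" if up else "DOWN"
        if state != last:
            print(f"{stamp()}  PING   {REMOTE} {state}")
            last = state
        time.sleep(1)

def watch_syslog():
    # 'tail -F' keeps following the file across log rotation.
    tail = subprocess.Popen(["tail", "-F", "-n", "0", SYSLOG],
                            stdout=subprocess.PIPE, text=True)
    for line in tail.stdout:
        if any(k.lower() in line.lower() for k in KEYWORDS):
            print(f"{stamp()}  SYSLOG {line.strip()}")

threading.Thread(target=watch_pings, daemon=True).start()
watch_syslog()
```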
THANKS
-
OK, still not sure why that "other" access switch on the LAN is getting starved for BPDUs, but as a band-aid I enabled "tcn-guard" on its upstream port, to prevent its topology change notifications from flooding the network and goofing up the remote fiber switch. So far, so good.
I wonder if this is some odd interop issue from the fact that our old 3750 is still on the LAN running its default flavor of PVST. Our Aruba is doing MSTP and has been interop'ing fine alongside the 3750 until now. The plot thickens!
If nothing else this will motivate me to finish pulling the plug on that old 3750. Got some work to do yet...
-
Welp, got it figured out, and it had nothing to do with any of my theories
The "other" access switch that was generating all the BPDU starvation errors was also a remote switch at a completely different site (unrelated to this fiber replacement), connected via PTP Ubiquiti NanoBeam radio. The head-end radio, even though it was set for simple bridge mode, had STP toggled on for some [mistaken] reason. Of course Ubiquiti NanoBeams don't speak HPE MSTP, so it was borking the BPDUs to that remote switch. Since that switch was getting starved for BPDUs, it was self-promoting to root bridge. Of course on the upstream switch I had root-guard enabled to prevent the remote switch from actually becoming root, but the TCNs still propagated out and somehow kept crippling the original problem switch on the new fiber. I'm not sure why it was only causing problems on these remote switches on the new fiber, and no other switches/links, but hey.
Final solution: disable STP on the Ubiquiti radio. The BPDU starvation resolved immediately, and management VLAN connectivity to the remote fiber switches was restored as well. Problem solved.
Thanks very much to all for being a sounding board and the great suggestions. Special thanks to @notverypunny for pointing me in the right direction with STP. Teaches me to step back and look at the patterns.
-
Post Script:
Immediately following my last "solution" update, I drove over to the remote site to button things up. En route I noticed a work crew standing around a concrete bridge over a small canal, which our fiber conduit happens to run alongside. The bridge had just collapsed (nobody injured, thankfully). The conduit is torn apart pretty good but the fiber is still intact. Not sure it will stay that way; I can't see how they'll get the bridge removed without disturbing or removing that conduit entirely. There's also a gas line that runs alongside, which complicates things further.
There's never a good time for something like that, but this was just plain uncanny.
-
@crustachio said in L2 network head scratcher, losing pings to Management VLAN:
Post Script:
Immediately following my last "solution" update, I drove over to the remote site to button things up. En route I noticed a work crew standing around a concrete bridge over a small canal, which our fiber conduit happens to run alongside. The bridge had just collapsed (nobody injured, thankfully). The conduit is torn apart pretty good but the fiber is still intact. Not sure it will stay that way; I can't see how they'll get the bridge removed without disturbing or removing that conduit entirely. There's also a gas line that runs alongside, which complicates things further.
There's never a good time for something like that, but this was just plain uncanny.
Oh man - at least you still have the wifi beam connection option.