Hi all, long time no post. Got a head-scratcher of an L2 networking problem whose solution seems like it should be obvious, yet it eludes me.
Situation:
4 remote WAN sites connected to the main campus via point-to-point wireless radios (Ubiquiti AirFiber/NanoBeam operating in bridge mode).
Managed L2 access switch at each remote site (HPE Aruba 2930Fs)
Managed L2/L3 "core" switch at head end (HPE Aruba 5406R)
Trunking a handful of VLANs (tagged) to each site plus a management VLAN (untagged).
This has worked great forever. But we recently installed a new direct-buried fiber circuit to each building, so we're moving off the PTP wireless links. I provisioned new SFP uplink ports on each end, mirrored the VLAN tagging, then powered off the wireless radio and disabled the old Ethernet uplink interface on the remote switch. Then I lit the new fiber uplink interface -- it came online fine, no physical link trouble, a clean 1 Gbps full-duplex link with no errors.
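For concreteness, here's roughly what I mean by "mirrored the VLAN tagging" -- a simplified sketch in ArubaOS-Switch VLAN-context syntax, with made-up VLAN IDs and port names (99 = management, A1/49 = the new SFP uplink ports), not our actual numbering:

   Head-end side:
      vlan 99
         untagged A1
      vlan 10
         tagged A1
      vlan 20
         tagged A1

   Remote 2930F side:
      vlan 99
         untagged 49
      vlan 10
         tagged 49
      vlan 20
         tagged 49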
Traffic-wise, everything worked at first -- all VLANs, including management, immediately began passing traffic over the new interface. But maybe half an hour later, the link died. Or rather, only the management VLAN died. The physical link stayed up and the other tagged VLANs kept passing traffic, but I cannot talk to the remote switch directly via its management IP, nor can I reach any other remote IPs in the management VLAN (e.g. WAPs -- they still serve their tagged VLANs, I just can't manage them).
Running a continuous ping to a remote management IP shows intermittent results: pings work for a few seconds, then die for a few seconds, on and off erratically. Sometimes it comes back for a long stretch, sometimes not.
After this happened at the first site, I did a lot of troubleshooting and finally switched back to the wireless link, at which point everything went back to normal 100%. I then moved on to a second site to try my luck, and the exact same problem happened. In this case the site ran for 4+ hours before the management traffic puked. I left it in that broken state overnight, and the next morning it was back to normal with no loss of pings. Then about 6 hours later it started dying again.
It feels like an ARP/MAC issue, but I have tried every way I know of to clear the caches and tables on both ends, with no luck. When I do a "show arp" on the core switch, the entry for the remote switch's management IP shows the correct MAC, but it has no associated port.
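For anyone wanting to follow along, the cross-check on the ArubaOS side would be something like this (MAC below is a placeholder, in Aruba's format):

   show arp
   show mac-address aabbcc-ddeeff

The idea being to compare the port shown on the ARP entry against the port the MAC address table has actually learned that MAC on.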
The last little piece of the puzzle: our old "core" switch (Cisco 3750) is still in use to an extent. It's largely used to route legacy VLANs that we're still in the process of retiring. The management VLAN doesn't live there, though; it is defined/directly connected on the 5406R core switch, and the 3750 is set to route any management VLAN traffic over to the "new" 5406R core. That said, the existing remote wireless links are served off the old 3750 core. So I'm wondering if there's some situation where traffic destined for the same remote MAC is unsure which direction to go (old core vs. new core).
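If that theory holds, I'd expect the remote switch's MAC to bounce between the port facing the old 3750 (the wireless path) and the new fiber uplink. A simple way to watch for that on the 5406R would be something like (placeholder MAC again):

   show mac-address aabbcc-ddeeff
   show spanning-tree

repeating the first command every few seconds to see whether the learned port changes, and checking spanning-tree for unexpected topology changes or a blocked port.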
For clarity, the routing setup is like this:
Current VLANs, including the management VLAN, are defined on the 5406.
For any other traffic (legacy VLANs or web), the 5406 default routes to the old 3750.
The 3750 has static routes back to the 5406 for any VLANs that live there.
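In config terms that boils down to roughly this (next-hop addresses and the management subnet below are placeholders, not our real ones):

   On the 5406:  ip route 0.0.0.0 0.0.0.0 10.0.0.1
   On the 3750:  ip route 10.99.0.0 255.255.255.0 10.0.0.2

i.e. a default route on the 5406 pointing at the 3750, and static routes on the 3750 for the 5406-hosted subnets (including management) pointing back at the 5406.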
I have a feeling one of you gurus will intuit the solution right away, but let me know if anything is unclear from my description.
Thanks for the time