Hi all, long time no post. Got a head-scratcher of an L2 networking problem whose solution seems like it should be obvious, yet it eludes me.
Situation:
4 remote WAN sites connected to the main campus via point-to-point wireless radios (Ubiquiti AirFiber/NanoBeam operating in bridge mode).
Managed L2 access switch at each remote site (HPE Aruba 2930Fs)
Managed L2/L3 "core" switch at head end (HPE Aruba 5406R)
Trunking a handful of VLANs (tagged) to each site plus a management VLAN (untagged).
This has worked great forever. But we recently installed a new direct-buried fiber circuit to each building, so we're moving off the PTP wireless links. I provisioned new SFP uplink ports on each end, mirrored the VLAN tagging, then powered off the wireless radio and disabled the old Ethernet uplink interface on the remote switch. Then I lit the new fiber uplink interface -- it came online fine, no physical link trouble, a clean 1 Gbps full-duplex link with no errors.
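For concreteness, here's roughly what I mean by "mirrored the VLAN tagging" -- a simplified sketch in ArubaOS-Switch VLAN-context syntax, with made-up VLAN IDs and port names (99 = management, A1/49 = the new SFP uplink ports), not our actual numbering:

   Head-end side:
      vlan 99
         untagged A1
      vlan 10
         tagged A1
      vlan 20
         tagged A1

   Remote 2930F side:
      vlan 99
         untagged 49
      vlan 10
         tagged 49
      vlan 20
         tagged 49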
Traffic-wise, everything worked at first -- all VLANs, including management, immediately began passing traffic over the new interface. But maybe half an hour later, the link died. Or rather, only the management VLAN died. The physical link stayed up and the other tagged VLANs kept passing traffic, but I cannot talk to the remote switch directly via its management IP, nor can I reach any other remote IPs in the management VLAN (e.g. WAPs -- they still serve their tagged VLANs, I just can't manage them).
Running a continuous ping to a remote management IP shows intermittent results: pings work for a few seconds, then die for a few seconds, on and off erratically. Sometimes it comes back for a long stretch, sometimes not.
After this happened at the first site, I did a lot of troubleshooting and finally switched back to the wireless link, at which point everything went back to normal 100%. I then moved on to a second site to try my luck, and the exact same problem happened. In this case the site ran for 4+ hours before the management traffic puked. I left it in that broken state overnight, and the next morning it was back to normal with no loss of pings. Then about 6 hours later it started dying again.
It feels like an ARP/MAC issue, but I have tried every way I know of to clear the caches and tables on both ends, with no luck. When I do a "show arp" on the core switch, the entry for the remote switch's management IP shows the correct MAC, but it has no associated port.
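For anyone wanting to follow along, the cross-check on the ArubaOS side would be something like this (MAC below is a placeholder, in Aruba's format):

   show arp
   show mac-address aabbcc-ddeeff

The idea being to compare the port shown on the ARP entry against the port the MAC address table has actually learned that MAC on.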
The last little piece of the puzzle: our old "core" switch (Cisco 3750) is still in use to an extent. It's largely used to route legacy VLANs that we're still in the process of retiring. The management VLAN doesn't live there, though; it is defined/directly connected on the 5406R core switch, and the 3750 is set to route any management VLAN traffic over to the "new" 5406R core. That said, the existing remote wireless links are served off the old 3750 core. So I'm wondering if there's some situation where traffic destined for the same remote MAC is unsure which direction to go (old core vs. new core).
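If that theory holds, I'd expect the remote switch's MAC to bounce between the port facing the old 3750 (the wireless path) and the new fiber uplink. A simple way to watch for that on the 5406R would be something like (placeholder MAC again):

   show mac-address aabbcc-ddeeff
   show spanning-tree

repeating the first command every few seconds to see whether the learned port changes, and checking spanning-tree for unexpected topology changes or a blocked port.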
For clarity, the routing setup is like this:
Current VLANs, including the management VLAN, are defined on the 5406.
For any other traffic (legacy VLANs or web), the 5406 default routes to the old 3750.
The 3750 has static routes back to the 5406 for any VLANs that live there.
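In config terms that boils down to roughly this (next-hop addresses and the management subnet below are placeholders, not our real ones):

   On the 5406:  ip route 0.0.0.0 0.0.0.0 10.0.0.1
   On the 3750:  ip route 10.99.0.0 255.255.255.0 10.0.0.2

i.e. a default route on the 5406 pointing at the 3750, and static routes on the 3750 for the 5406-hosted subnets (including management) pointing back at the 5406.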
I have a feeling one of you gurus will intuit the solution right away, but let me know if anything is unclear from my description.
Thanks for the time