ZeroTier network blip

scottalanmiller

Both the website and the ZT network were down.

I suppose it could be a DNS issue.

I'm assuming that this is a hosted ZT instance and not one that you run yourself, hence the question?

Dashrender

Correct. I am using a free account currently. I now have 4 devices connected, soon to be 6.

I doubt it was a DNS issue, but it's possible.

stacksofplates

I noticed the site was down last night for a couple minutes, but I still had access to the other devices on the network.

The site (controller) should be able to go down and everything should still be able to communicate. You just can't add/remove/change devices until the controller comes back.

Dashrender

I'm guessing the user in my case had the computer off; when they turned it on, the controller was offline, so they couldn't register with the network and were down.

adam.ierymenko

We caught a network glitch on the web site, but this should not have affected actual virtual networks. If it did then please explain what you saw -- the system should not be vulnerable to this.

FYI network controllers issue config and certificates to network members but are not (by design) a point of failure for actual network communications. If a network controller goes down the network continues to work, but it just isn't possible to change the network (add new devices, de-authorize devices, change IP assignment settings, etc.).

We're doing a round of infrastructure upgrades in the next few weeks anyway. Web will go to redundant bare metal servers and the root infrastructure (which is critical) is getting even more robust and geo-distributed. (It's already spread across three providers on four continents and all nodes are independent.)

coliver

@adam.ierymenko said:

We caught a network glitch on the web site, but this should not have affected actual virtual networks. If it did then please explain what you saw -- the system should not be vulnerable to this.

FYI network controllers issue config and certificates to network members but are not (by design) a point of failure for actual network communications. If a network controller goes down the network continues to work, but it just isn't possible to change the network (add new devices, de-authorize devices, change IP assignment settings, etc.).

We're doing a round of infrastructure upgrades in the next few weeks anyway. Web will go to redundant bare metal servers and the root infrastructure (which is critical) is getting even more robust and geo-distributed. (It's already spread across three providers on four continents and all nodes are independent.)

What happens if a machine goes down and then comes back up when the network controller is down?

adam.ierymenko

It should use its cached network config and certs -- see the networks.d/<nwid>.conf files, etc.
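For anyone who wants to check whether a node actually has a cached config to fall back on, here is a minimal sketch that lists the `<nwid>.conf` files under `networks.d/`. The function name `cached_network_ids` and the `zt_home` parameter are made up for illustration; the only things taken from the post above are the `networks.d/<nwid>.conf` layout and the fact that a network ID is 16 hex digits.

```python
import re
from pathlib import Path

# A ZeroTier network ID is 16 hexadecimal digits.
NWID_RE = re.compile(r"^[0-9a-f]{16}$")


def cached_network_ids(zt_home):
    """Return network IDs that have a cached <nwid>.conf under networks.d/.

    `zt_home` is the ZeroTier working directory (hypothetical parameter;
    pass whatever path your install uses).
    """
    conf_dir = Path(zt_home) / "networks.d"
    if not conf_dir.is_dir():
        return []
    ids = []
    for path in conf_dir.glob("*.conf"):
        # path.stem drops only the final ".conf", so any file whose
        # remaining name is not exactly a 16-hex-digit ID is skipped.
        if NWID_RE.match(path.stem):
            ids.append(path.stem)
    return sorted(ids)
```

On a typical Linux install the working directory is `/var/lib/zerotier-one` (adjust for your platform, and expect to need root to read it). An empty list for a network the machine should be on would suggest there is no cached config to reuse while the controller is unreachable.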

Dashrender

@Dashrender said:

I'm guessing the user in my case had the computer off; when they turned it on, the controller was offline, so they couldn't register with the network and were down.

I'll have to confirm, but I believe this is the situation in question. The laptop was turned off at the time of the outage. They turned it on during the outage and tried to connect, and it didn't. The user did no troubleshooting, and before I could do much, the problem was over.

adam.ierymenko

There can be issues if a network controller is down for a long time because certs have (effective) TTLs, so an old node that's been offline could be unable to communicate. But it would have to be down for a while. Since ZT addresses are portable, if a controller goes down it can be brought up elsewhere with the same identity (failover).

We're adding multi-homing soon, which will make this even more robust:

https://github.com/zerotier/ZeroTierOne/blob/adamierymenko-dev/node/Cluster.hpp#L71

Multi-homing will also be useful for nodes within networks. For example, you could create a global Cassandra cluster behind a single IP on your virtual LAN. Next version should contain an alpha version of cluster/multi-homing capability.
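The cert TTL point earlier in this post can be pictured with a small toy model. This is only a sketch of the reasoning, not ZeroTier's actual validation logic: the function `can_communicate` is hypothetical, and the thread does not give a real TTL value, so the one-day figure in the usage note is a placeholder.

```python
from datetime import datetime, timedelta


def can_communicate(controller_up, cert_issued_at, now, ttl):
    """Toy model: can a node that just came back online talk on the network?

    - If the controller is reachable, the node can fetch a fresh cert.
    - Otherwise it depends on the cached cert still being inside its TTL.
    `ttl` is an assumed value; the real effective TTL is not stated here.
    """
    if controller_up:
        return True
    return now - cert_issued_at <= ttl
```

With a made-up TTL of one day, `can_communicate(False, now - timedelta(hours=1), now, timedelta(days=1))` is true, while a node whose cert is two days old gets false until the controller comes back. That is the "unlucky moment" case: a long-offline node and a controller outage at the same time.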

adam.ierymenko

@Dashrender How long was the laptop asleep? If it was a while it's possible that its cert was no longer valid and it couldn't get a new one.

Unlucky moment... multi-homing/cluster of network controllers should make that orders of magnitude less likely. We're doing a lot of robustness work right now (not that it's bad as-is).

Dashrender

@adam-ierymenko I'm guessing the laptop was off two+ days. The user only uses it two days a week at most.