Cloudatcost

scottalanmiller

I see the SBC Global link has started responding again, sounds like that is the start of repairs getting done.

I see the SBC Global link has started responding again, sounds like that is the start of repairs getting done.

With an outage this widespread, I'm wondering if it was an attack on the backbones rather than someone cutting fiber.

scottalanmiller

@thecreativeone91 said:

@scottalanmiller said:

I see the SBC Global link has started responding again, sounds like that is the start of repairs getting done.

With an outage this widespread, I'm wondering if it was an attack on the backbones rather than someone cutting fiber.

That certainly seems possible.

mlnews

Rogers is stating that the issues are power related.

mlnews

But it appears that the outage is spreading east. Maybe the outages are not related?

Dunno, I'd be surprised if they didn't have both UPS and large Natural Gas (connected to underground lines so little limit on run time) Generators on the backbone.

scottalanmiller

@thecreativeone91 said:

Dunno, I'd be surprised if they didn't have both UPS and large Natural Gas Generators on the backbone.

One would sure hope. Although if they had an issue post-UPS, like a short or a fire, it could cause a power-based outage even with those things in place. But that would be pretty dramatic.

But I agree, an extended outage like this suggests that this is a cover-up and not what actually happened.

mlnews

@Reid-Cooper said:

I was wondering about feedback on why a single ISP failure took down the entire cloud rather than failing over to another ISP. I would expect a few seconds disruption and slower performance, not an outage from just Rogers being offline.

Looks as though they tried to go to a backup this morning. No updates in a very long time.

PSX_Defector

@thecreativeone91 said:

@scottalanmiller said:

I see the SBC Global link has started responding again, sounds like that is the start of repairs getting done.

With an outage this widespread, I'm wondering if it was an attack on the backbones rather than someone cutting fiber.

Probably what happened was that the BGP route to C@C was dropped, hence the rest of the network gave up on trying to get it to the destination. This happens when a published route path goes down, e.g. fiber cut into the locale. That's why we were seeing it drop within our own ISP's network, because there was no published route to them. Once the circuit was back up, the BGP routing was fixed and it sent the traffic onto the backbones.

Although I'm seeing the same thing right now on TWC, haven't checked AT&T.
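To illustrate the route-withdrawal idea above, here is a minimal sketch of a routing table doing longest-prefix lookup. It is a toy model, not how any real router or BGP daemon is implemented. The `RoutingTable` class, the prefix `168.235.144.0/24`, and the next-hop name `upstream-a` are all assumptions made up for the example; only the address 168.235.144.189 comes from this thread.

```python
import ipaddress


class RoutingTable:
    """Toy routing table: maps announced prefixes to a next-hop label."""

    def __init__(self):
        self.routes = {}  # ip_network -> next-hop label

    def announce(self, prefix, next_hop):
        # A neighbour publishes a path to this prefix.
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix):
        # The published path went away (for example, the circuit dropped).
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        # Longest-prefix match; None means "no route", so traffic is dropped here.
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = RoutingTable()
table.announce("168.235.144.0/24", "upstream-a")  # hypothetical prefix and next hop
before = table.lookup("168.235.144.189")

table.withdraw("168.235.144.0/24")
after = table.lookup("168.235.144.189")

print(before, after)
```

With no default route in the table, a withdrawn prefix leaves the router with nowhere to send the packet, which would be consistent with traces dying inside each ISP's own network.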

scottalanmiller

@thecreativeone91 How the heck could Rogers be down all freaking day and only now get around to effing dispatching someone?

scottalanmiller

@PSX_Defector said:

Although I'm seeing the same thing right now on TWC, haven't checked AT&T.

On SBC / AT&T here.

@scottalanmiller said:

@thecreativeone91 How the heck could Rogers be down all freaking day and only now get around to effing dispatching someone?

Yeah, it just doesn't make sense. If it was only a hardware issue, they would have fixed it long ago. Heck, they probably have alerts on it and know before anyone else does that it's down.

scottalanmiller

@thecreativeone91 said:

@scottalanmiller said:

@thecreativeone91 How the heck could Rogers be down all freaking day and only now get around to effing dispatching someone?

Yeah, it just doesn't make sense. If it was only a hardware issue, they would have fixed it long ago. Heck, they probably have alerts on it and know before anyone else does that it's down.

One would sure hope.

PSX_Defector

@scottalanmiller said:

@PSX_Defector said:

Although I'm seeing the same thing right now on TWC, haven't checked AT&T.

On SBC / AT&T here.

There are actually two different routes with them. There's ATTIS, which is usually the DSL pipes and some hi-cap, then there is the U-Verse platform, which takes a different route. As far as I can see, it's a BGP route issue since it affects multiple providers.

I see this from TWC:

C:\Users\v436525>tracert jump.ntg.co

Tracing route to jump.ntg.co [168.235.144.189]
over a maximum of 30 hops:

1 <1 ms <1 ms <1 ms agrer003-ip002001.noa.vmotion.tmrk.eu [172.16.2.1]
2 15 ms 26 ms 17 ms cpe-76-186-176-1.tx.res.rr.com [76.186.176.1]
3 10 ms 11 ms 10 ms tge7-2.allntx3901h.texas.rr.com [24.164.210.241]
4 14 ms 15 ms 15 ms tge0-8-0-7.plantxmp01r.texas.rr.com [24.175.37.212]
5 13 ms 15 ms 15 ms agg27.crtntxjt01r.texas.rr.com [24.175.36.177]
6 * * * Request timed out.
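If anyone wants to compare traces from different ISPs, a rough sketch like the one below pulls out the last hop that answered. It assumes Windows `tracert` text output like the trace above; the function name `last_responding_hop` and the `SAMPLE` variable are just names made up for this example.

```python
import re

# Hop lines copied from the trace above.
SAMPLE = """\
1 <1 ms <1 ms <1 ms agrer003-ip002001.noa.vmotion.tmrk.eu [172.16.2.1]
2 15 ms 26 ms 17 ms cpe-76-186-176-1.tx.res.rr.com [76.186.176.1]
3 10 ms 11 ms 10 ms tge7-2.allntx3901h.texas.rr.com [24.164.210.241]
4 14 ms 15 ms 15 ms tge0-8-0-7.plantxmp01r.texas.rr.com [24.175.37.212]
5 13 ms 15 ms 15 ms agg27.crtntxjt01r.texas.rr.com [24.175.36.177]
6 * * * Request timed out.
"""

# A hop line starts with the hop number.
HOP_RE = re.compile(r"^\s*(\d+)\s+(.*)$")
# The responding address is the last thing on the line, bracketed or bare.
ADDR_RE = re.compile(r"\[?(\d{1,3}(?:\.\d{1,3}){3})\]?\s*$")


def last_responding_hop(text):
    """Return (hop_number, ip) of the last hop that replied, or None."""
    last = None
    for line in text.splitlines():
        hop = HOP_RE.match(line)
        if not hop:
            continue  # banner lines, blank lines
        addr = ADDR_RE.search(hop.group(2))
        if addr:  # timed-out hops have no address
            last = (int(hop.group(1)), addr.group(1))
    return last


result = last_responding_hop(SAMPLE)
print(result)
```

If the last responding hop is still inside your own provider's network, as it is here, that would fit the idea that the route is missing rather than the far end simply being slow.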

My route from AT&T doesn't want to come through on pfSense right now. I can certainly generate it once I get off online and just hard drop the AT&T line. Still technically have a few minutes to go at the big red V. Although I am super drunk right now.

People are now reporting cell phone outages.

http://www.akamai.com/html/technology/dataviz1.html

Interesting. They are reporting 53% above average attack traffic in North America today. No idea of the relevance.

scottalanmiller

@thecreativeone91 said:

People are now reporting cell phone outages.

Television too.

Well, Hurricane Electric's routing tables have been completely flushed of all routes there. I tried a traceroute from one of their routers and it just fails with no route on the first hop.

AmanBhogal

Hey Everybody!
Late night update for you all. It's almost 9pm and we are all scrambling to make sure everything gets resolved ASAP.

Rogers is telling us that we should be up and running by midnight EST tonight.

As you can all understand, this was a pretty heavy hit for us today, and any time something like this happens, it's always unfortunate. No company is too big or too small to avoid these failures, and seeing Rogers experience massive network issues is a prime example of that.

For me personally, this could not have come at a worse time. I really enjoy interacting with each and every one of you, reading your posts about cloudatcost, seeing comments about how you are planning on using your servers, then all of a sudden... BOOM!

Regardless, I am still here to answer any questions you may have and I will make sure that once the dust has settled on this, we do something awesome for ML users.

Compared to many of the names you knew before you heard of us, Cloudatcost is small... but we are determined, and we WILL learn from this, and we WILL find a way to make it so that next time this happens, we are able to get everybody back up faster than today.