
Except when they do go down, it can be a catastrophic, business-interrupting event. Which is why the cloud model of assuming everything is going to break and all your hardware is disposable works much better (IMO).


Mainframes share a lot of similarities with cloud data centres. Redundant hot swappable components (even the CPUs in some models). Virtualised operating systems (VMs were invented for mainframes). These days it wouldn’t be unusual for some mainframes to be running mostly Linux instances.

You could almost think of a mainframe as a cloud in a box, and if one isn't reliable enough, you can always run two or more.


AWS, Google Cloud, and Azure break more often than our mainframes. In this case, the cloud model is a con, essentially the self checkout lane at Walmart. You're paying more for the same ability you had before.

"Well, you want redundancy, right? You're supposed to be redundant across AZs, and then regions, and then you're going to have to have disparate vendors to mitigate sole-vendor risks." And then we're right back to hosting our own mainframes in our datacenters.
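The redundancy ladder above can be sketched with a back-of-envelope availability calculation, assuming (optimistically) that zones fail independently. The 99.9% figure is a hypothetical placeholder, not any vendor's SLA:

```python
# Composite availability of a service replicated across independent zones.
# Assumes failures are independent, which real correlated outages violate.

def combined_availability(single_az: float, copies: int) -> float:
    """Probability at least one of `copies` independent zones is up."""
    return 1 - (1 - single_az) ** copies

az = 0.999  # hypothetical single-AZ availability

print(f"1 AZ:  {combined_availability(az, 1):.6f}")  # 0.999000
print(f"2 AZs: {combined_availability(az, 2):.6f}")  # 0.999999
```

The same arithmetic applies whether the "zones" are cloud AZs or a pair of self-hosted mainframes, which is the point being made.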


Clouds reinvent mainframes. They're still cheaper, run more FOSS, and have more talent available. They're a better form of lock-in than mainframes.


> They're still cheaper

Are they, though? Commodity hardware certainly is, but it's not as if cloud providers are charging a small margin on top of that. They're charging a multiple, potentially as large as 10x.

Combined with the parent's proposed need of multi-cloud, that could turn what might otherwise be a few hundred $k of commodity servers into a few $M of cloud costs, which I understand is the OOM the cost of a mainframe.
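The arithmetic behind that claim can be made explicit. Every number below is a hypothetical placeholder chosen to match the figures in the comment (a few hundred $k of hardware, up to a 10x cloud markup, two clouds for vendor redundancy), not a real quote:

```python
# Illustrative cost comparison: commodity hardware vs. multi-cloud.
# All figures are assumptions for the sake of the order-of-magnitude point.

commodity_capex = 300_000   # "a few hundred $k of commodity servers"
cloud_multiple = 10         # claimed worst-case cloud markup over hardware
clouds_needed = 2           # multi-cloud to mitigate sole-vendor risk

cloud_total = commodity_capex * cloud_multiple * clouds_needed
print(f"${cloud_total:,}")  # $6,000,000 -- "a few $M", the mainframe OOM
```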


The problem with IaaS clouds is that they are strictly less reliable than having your own infrastructure.

With IaaS, the typical reliability issue is that a whole location/datacenter/AZ goes down; with your own infrastructure, the typical issue is that the colo facility/datacenter goes down. These would be essentially identical, save that with IaaS there is a significantly larger probability that the cause is some byzantine failure of orchestration automation, which in the self-hosted case either isn't there or is under your control.

One fact of running your own infrastructure is that you should plan for hardware failures but not stress about them too much, because even entry-level enterprise-grade hardware just does not break (and if it does, you will get signs that it is going to break well in advance).


It is only strictly less reliable than having your own infrastructure if you assume the same level of organizational competence at running your own infrastructure as the IaaS has at running theirs.

That is certainly possible for an organization to achieve, but it isn't easy and it isn't cheap. It certainly can't be taken for granted.


> if you assume the same level of organizational competence at running your own infrastructure

I'm not sure that the competence required is organizational (e.g. people managing) so much as operational (e.g. best practices) and even technical, at least at sub-FAANMG scale.

> it isn't easy and it isn't cheap. It certainly can't be taken for granted.

I agree that it can't be taken for granted, but I disagree that it isn't easy and cheap. Rather, combining the two: I don't believe it's necessarily hard nor necessarily expensive.

It just requires finding someone who still has the competence, is willing to use it, and is willing to train others. Running your own infrastructure isn't actually difficult or complicated, but it's certainly not "sexy" and can be a bit tedious at times. That means it's possible to hire inexpensive, less (overall) experienced staff and have them handle that portion. Unfortunately, the "unsexy" part means finding someone to do the training, as well as the actual work when necessary, can be challenging, even though we're out there.

Even then, that's only necessary at substantial scale. In <1000 server environments, I've never had the hardware-specific [1] part of the work take up more than a quarter of one senior FTE (usually me).

What can get astronomically expensive is outsourcing the wrong things, though that ends up being a form of not actually running your own infrastructure (yourself).

Anecdote: I recently had a phone interview with a startup that moved from "hardware" to the cloud and the main reason cited was the inability to ramp capacity up fast enough (nor predictably fast enough), which seemed odd to me. One example of unpredictability of lead times involved a new server underperforming due to mis-applied thermal compound between the CPU and cooler, which I have never experienced [2]. I didn't ask the rhetorical question, "how could you have picked such a horrible VAR?!" Carefully re-reading the blog post about their transition gave me my "aha" moment: even though it's a company in the SFBA, their datacenter was out of state (maybe not even in a tech hub city, but it didn't specify). They were outsourcing the actual installation, running, and maintenance of their hardware to someone else, far away.

[1] for lack of a better term; i.e., anything that an IaaS cloud provider would eliminate, including purchasing and vendor negotiations, colo space, network hardware and providers, hardware monitoring, and data destruction

[2] well, OK, that's a lie, since I've experienced it when I've personally done CPU moves/swaps/upgrades in exceptional circumstances, when I was out of practice, but I knew to test my work and caught the problem immediately. I've never had it happen with professionally-assembled systems, presumably because CPU coolers tend to arrive with the thermal compound pre-applied.


Are you replying to the right sub-thread? I was asking about cost.

That said..

> Problem with IaaS clouds is that it is strictly less reliable than having your own infrastructure.

Although I'm a fan of running ones own hardware, I'm not sure I could make this claim. However, since I'd like to, do you have public data to back it up?

> orchestration automation, which in the self-hosted case either isn't there or is under your control.

I'm not sure how that would be different. Although a provider like AWS has portions of the automation toolchain not under your control, it's not obvious they're any more likely to fail (even due to some bizarre interoperability bug) than, say, BMC firmware, which is also not under your (full) control.

> One fact of running your own infrastructure is that you should plan for hardware failures, but not stress about it too much, because even entry-level enterprise-grade hardware just does not break (and if it does you will get signs that it is going to break well in advance).

This is something that I routinely have to point out to cloud proponents when they complain about having to "worry" about hardware failing: modern, commodity server hardware just doesn't fail often enough for it to be a significant consideration. Usually, it's just selection bias in that they remember every "nightmare" scenario from their past where hardware failed (possibly even as long as 20 years ago) but don't account for the overwhelming majority of times when it didn't.

Of course, there are notable exceptions, such as high-density "blade" or half-U servers, which often suffer from thermal design failures, but I argue that those are a departure from commodity, even if they appear identical if one squints.

Most importantly, though, it's not as if an IaaS cloud provider can somehow magically shield you from the consequences of such a failure: your VM will still go down. Sure, they have an arbitrarily large supply of spares to replace it, but you only ever need exactly 1 of those spares, and N+1 redundancy when self-hosted is very easy, if implemented merely as warm spares.
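A rough sketch of why warm-spare N+1 is "very easy": at realistic annual failure rates, the chance of a second failure landing inside the repair window of the first is tiny. Both the 3% AFR and the one-week repair window below are assumed figures, not measurements:

```python
# Probability that a warm spare fails while it is actually needed,
# i.e. during the repair window of a failed primary. Assumed figures.

afr = 0.03          # assumed annual failure rate per server
repair_days = 7     # assumed time to replace the failed primary

# P(primary fails this year) * P(spare fails within the repair window)
p_double = afr * (afr * repair_days / 365)
print(f"P(double failure) ~ {p_double:.2e}")
```

Under these assumptions the double-failure probability is on the order of one in tens of thousands per year, which is why a single warm spare usually suffices.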


I started with some comparison of cost of big IaaS instances vs. mainframe vs. commodity HW and then got sidetracked on the reliability and ended up deleting the first paragraph :)

That said, I believe that when your workload really necessitates such big systems, and if you can use all of the mainframe's capabilities and have the capability to manage it (which requires an Ops team with a totally different skillset and, mainly, the willingness to do such a thing), the cost of a mainframe will be comparable to IaaS, with self-hosted commodity HW being somewhat cheaper.

It is anecdotal, but when I used to work in more of an Ops role, I cannot remember a single time when server hardware failed in production without an external environmental cause (there were flooded servers and servers that were DoA from the manufacturer). Somewhat surprisingly, this experience even extends to spinning-rust hard drives, where the most common causes of failure I've seen were flaky SATA/SAS connectors, followed by simply bad series (e.g. Constellation ES.2), and then by extreme overheating.


> if you can use all of the mainframes capability

> cost of mainframe will be comparable to IaaS

I'd be very interested in seeing even a rough cost comparison, since I have no experience with mainframes.

I also have, essentially, no experience with that first "if", which I'd say is a big one. Workloads that are (already) suited to that particular system design may be rare (and obviously getting rarer).

> self-hosted commodity HW being somewhat cheaper.

I'm pretty sure "somewhat" grossly understates it.

However, I realized that one of the problems here is that we're talking about these costs as if they're single numbers, rather than ranges.

For mainframes, it may as well be a single number, because there's only one vendor (for the latest hardware).

For IaaS and self-hosted, the ranges can be very broad, because it's very easy to pay a multiple of the minimum cost with merely a naive implementation. Trivial examples would be not leveraging "reserved instances" on AWS or not getting competitive quotes for self-hosted. In fact, if one removes the "commodity" constraint from self-hosted and allows "enterprise" hardware (especially storage), the top end of the range can easily balloon above the top of the IaaS range.

I've been assuming a comparison of the bottom ends of the ranges, but including the cost of the expert labor for each. What's difficult to know, of course, is how scarce the experts (capable of keeping costs near the bottom of the range) in each category actually are.

> I cannot remember single time when server hardware failed in production without external environmental cause

I think that's just a matter of too small a sample size.

> servers that were DoA from manufacturer

I wouldn't even count that, since it's not in production yet.

> it is somewhat surprising that this experience even extends to spinning rust harddrives

That's definitely too small a sample size, then. If you're not seeing at least a 1% AFR (realistically closer to 3%), you don't have enough of them or haven't been running them long enough yet.
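The "too small a sample size" point can be made concrete: even at an assumed 3% AFR, a modest fleet has a decent chance of a failure-free year, so seeing zero failures tells you little. The fleet sizes and AFR below are illustrative:

```python
# Probability of a failure-free year for a drive fleet, assuming
# independent failures at a fixed annual failure rate (AFR).

afr = 0.03  # assumed 3% annual failure rate per drive

for n in (10, 50, 200):
    p_none = (1 - afr) ** n
    print(f"{n:4d} drives: P(no failures this year) = {p_none:.3f}")
```

With ten drives, a failure-free year happens roughly three times out of four; only at a couple hundred drive-years does "we never saw one fail" become genuinely surprising.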

> simply bad series (eg. Constellation ES.2)

That's not an external environmental cause, though, and counts the same as any other failure due to (presumably) a manufacturing defect (defined broadly), including RAM bit errors. It's merely something that can be engineered around with best practices.

None of this is to say that any of these inevitable failures are actually frequent or voluminous enough (even on, e.g., 5 year old hardware, which is ancient by most standards) to require outsized worry or effort/cost to mitigate/repair them.


It should be noted, though, that from a physics perspective it makes sense to cram as much computing power as possible into as small a space as possible (although in the case of MF, which unlike supercomputers do relatively simple calculations on high volumes of data, it rather means I/O throughput).

So interconnected commodity servers will always be marginally more expensive to run than mainframe boxes the size of a fridge, which have dedicated hardware for interconnecting internal components (for example, on-book CPU caches and shared RAM).


You can get very, very redundant mainframes: IBM Parallel Sysplex, for example, and Tandem was the same.


It's catastrophic for your business if either your cloud model or your mainframe breaks; ergo, it's worth choosing the more reliable one, where that happens less frequently.




