With hard drive sizes increasing so quickly but hard drive transfer speeds basically flat, I wonder if there are long-term implications for them with respect to recovery from backup and downtime. For example, if a whole rack goes down and they are on 32TB drives in the future, could it take a week or more for their data to come back online?
No one really expects that rotational rust will get much faster, and in fact history shows that, compared to the increase in density, the increase in transfer rate is laughable at best. Between 1990 and today you are probably looking at a 20,000x increase in density, yet transfer rates only increased by a factor of around 150-200. [In fact, from the early 1960s to today it's only a factor of about 1000. There are quite possibly few performance metrics that have increased as slowly as disk transfer rate.]
Why is that?
Increasing density only marginally increases transfer speed: most density increases are achieved by packing more tracks onto the platter, while storing more sectors per track plays a minor role. But a single R/W head can only read a single track, not parallel tracks, so speed only increases if you pack more sectors into each track, not if you add more tracks. That's why performance in desktop or server drives differs only a little between today and 10 years ago, compared to the capacity increase to 8+ TB in a 3.5" drive.
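A rough sanity check on the numbers upthread, assuming (as a simplification) that areal density gains split about evenly between track density and along-track bit density - only the latter helps transfer rate:

```python
import math

# Areal density = tracks-per-inch x bits-per-inch-along-track.
# Sequential transfer rate only scales with the linear (along-track)
# bit density, so if density gains split roughly evenly between the
# two axes, transfer rate grows as the square root of areal density.
areal_density_gain = 20_000  # ~1990 -> today, per the comment above
transfer_gain_estimate = math.sqrt(areal_density_gain)
print(f"predicted transfer-rate gain: ~{transfer_gain_estimate:.0f}x")
```

sqrt(20,000) is about 141, which lands right next to the observed 150-200x range.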
More platters also don't help in transfer rate, because the alignment of all heads on the actuator is fixed: at any given time only one platter and one R/W head is used (locked to the track). [More platters can help reduce seek time in certain scenarios though]
Higher density means more data per track, not just more tracks per disk. You get an entire track per revolution, so a track with more data means more MB/s. So linear reads on a higher-density drive are faster, and semi-linear accesses (i.e., reading two files that are next to each other) do get faster.
I remember reading a story about a guy who built a drive array with high capacity 7200 RPM drives that got within 20% of the performance of the 10K RPM setup they had, by partitioning the drives at the same capacity as the 10K equivalent. The head only had half as many tracks to traverse, so worst case access time was better, and the higher density made up for the lower RPMs.
Short stroking helps get more performance from the disk, at least in terms of latency. The bandwidth change occurs because the linear density is fixed and outer tracks hold more blocks than inner ones, so in one revolution you can read more data, and you don't need to change tracks as often.
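A toy model of that effect: zoned bit recording keeps linear density roughly constant, so sectors per track scale with track radius. All numbers below are ballpark assumptions, not from any datasheet:

```python
import math

rpm = 7200
revs_per_sec = rpm / 60.0                      # 120 revolutions per second
bytes_per_mm_of_track = 6_000                  # hypothetical linear density
outer_radius_mm, inner_radius_mm = 46.0, 22.0  # rough 3.5" platter zone radii

def track_throughput_mb_s(radius_mm):
    # One full track streams past the head per revolution.
    track_bytes = 2 * math.pi * radius_mm * bytes_per_mm_of_track
    return track_bytes * revs_per_sec / 1e6

outer = track_throughput_mb_s(outer_radius_mm)
inner = track_throughput_mb_s(inner_radius_mm)
print(f"outer: {outer:.0f} MB/s, inner: {inner:.0f} MB/s, ratio {outer / inner:.2f}")
```

Short stroking a partition onto just the outer zone keeps you on the fast, high-sectors-per-track side of that ratio while also shortening seeks.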
Your parent comment is right though: there have only been small changes in bit density along the track in recent years, so bandwidth is not improving by much.
How about the stupid idea of making the minimum block size a multiple of the number of platters? The block is divided evenly amongst all platters, always at the same parallel track/sector. That way read/write speed becomes a function of the number of platters.
Instead of 50-100MB/s, you could get 4-8x the speed in large linear transfers, which would help get dead racks back up faster and would work quite well in Backblaze's backup model.
Your block sizes will be huge, but I think in Backblaze's case that doesn't really matter so much.
You cannot read from multiple platters in parallel: when you are aligned to read a certain track on a certain side of one platter, you are not necessarily aligned on all the other platters. The head mechanism is not accurate enough for that.
Google's paper on datacenter hard drives addresses this point -- it's easier to do this and other tricks with a set of disks than attempting it within the confines of a single drive.
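For illustration, the "set of disks" version of the parent's idea is just RAID-0 style striping, which a controller or filesystem can do without any head-alignment problem. A minimal in-memory sketch, with BytesIO objects standing in for drives:

```python
import io

STRIPE = 4096  # bytes per stripe unit (arbitrary choice)

def stripe_write(data, drives):
    """Deal out fixed-size chunks round-robin across the drives."""
    for i in range(0, len(data), STRIPE):
        drives[(i // STRIPE) % len(drives)].write(data[i:i + STRIPE])

def stripe_read(total_len, drives):
    """Read chunks back in the same round-robin order and reassemble."""
    for d in drives:
        d.seek(0)
    out = bytearray()
    i = 0
    while len(out) < total_len:
        out += drives[i % len(drives)].read(min(STRIPE, total_len - len(out)))
        i += 1
    return bytes(out)

drives = [io.BytesIO() for _ in range(4)]  # four stand-in "drives"
payload = bytes(range(256)) * 100
stripe_write(payload, drives)
assert stripe_read(len(payload), drives) == payload
```

On real hardware the four writes can proceed in parallel, which is where the roughly Nx linear throughput comes from.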
You could probably do it if the platter diameter were reduced, which would then accordingly reduce capacity as well. At that point you could just as well use two drives and also get lower overall failure probability. Or you use SSD caches, or memory caches, ...
I was thinking of a Frankendrive with 5 actuators, which can do 5 parallel reads/writes, as long as they don't overlap. But the new dimensions would probably be the roadblock.
I was going to ask this too, but start with a simple 2nd voice coil on the opposite side of the spindle, keep the 3.5" form factor and just make it longer. My guess is the motion of one actuator would disrupt the airflow of the other, since we're talking about micrometer-ish (?) mechanical tolerances. This might be able to be mitigated by interleaving which side each arm wrote to. So the left arm's heads address the tops of the platters, the right side addresses the bottoms.
Perhaps another option is to go back to the 1980's hdd designs where the arm moves straight down the radius of the platter. This design might permit multiple heads on the same arm. I'm sure all this stuff has been researched thoroughly.
Either way, this doubles/triples the probability of mechanical failure
> Perhaps another option is to go back to the 1980's hdd designs where the arm moves straight down the radius of the platter. This design might permit multiple heads on the same arm.
At my first job, doing sysadmin work and programming on a PDP-11/44, a couple of fascinating Winchester drives were procured for it. They had clear plastic covers, and you could see everything: the disk and the actuator, which had two heads and was by and large square. I'm almost positive it was rotary, not solenoid-based straight-stroke like many '70s-'80s drives, CDC's famous line in particular.
The Backblaze Vault design mitigates that, as the "raid array" is scattered across 20 Storage Pods in 20 different racks. You'd need more than three racks to go down before you would be offline, and the backup systems in place make that highly unlikely. Andy at Backblaze.
AFAIK, if you physically destroy the contents of a drive with a strong magnetic field, you also destroy the servo tracks, rendering the drive useless. Modern hard drives can't "low-level format" themselves; they're not mechanically precise enough.
I don't know how that would be more awesome - essentially spending effort on upper-middle-class hobbyists. I would rather see them sell the drives directly to the highest bidder and focus the effort on improving their business, or give the drives to charity.
They were reusing their 1TB drives to test new pods with. But I think they would only need a hundred or so for that, unless they die faster in that workload.
I guess my question is more about the long-term implications of transfer speeds increasing much more slowly than hard drive capacity - whether there's a hidden risk to downtime because, all of a sudden, a full rack of 32 or even 64TB drives will take days to transfer, as opposed to an hour or so, because transfer speeds are so slow.
There are implications. ZFS' raid-z2 (which uses 2 disks for redundancy data) was pitched - in 2008 or so - with the idea that disks are now large enough that recovery can take so long that it's likely another disk in the raid set breaks before the redundancy is restored.
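Back-of-envelope for that rationale, with made-up but plausible numbers (5% AFR, a 12-wide set with one disk already down, a 3-day rebuild):

```python
# Probability that at least one more disk dies while redundancy is
# being rebuilt. AFR and rebuild window are illustration numbers only.
afr = 0.05            # 5% annualized failure rate per disk
surviving_disks = 11  # e.g. a 12-wide raid set after one failure
rebuild_days = 3

p_one_disk = 1 - (1 - afr) ** (rebuild_days / 365)
p_second_failure = 1 - (1 - p_one_disk) ** surviving_disks
print(f"~{p_second_failure:.1%} chance of a second failure mid-rebuild")
```

Roughly half a percent per rebuild doesn't sound like much, but across thousands of arrays, with rebuild windows stretching as drives grow, single parity starts to look uncomfortable - hence the second parity disk.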
I suppose if you had a single 32TB drive that went offline for say a week, and then once it came back online you'd have some type of pent up demand and a "slow" transfer speed. Storage systems in general spread the load across multiple devices so the effect of slow transfer speed is near zero in most backup and archiving applications and maybe more problematic in transactional applications.
That takes ~11 hours to fill. At 400x the density it would take 20x as long, or about 9 days. I don't think HDDs are hitting 400x the density any time soon, but if they did it would be a problem.
However, in an array you could take a month to fill a drive to 75% without causing too much trouble, assuming you had enough drives. That puts the limit at around an ~80PB drive. IMO, the real issue is that it would take another month to download all that data, relegating HDDs firmly to archival storage.
PS: I don't think rust is going to reach those densities, making this far less of an issue.
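The arithmetic behind those figures, assuming roughly 8TB at ~200 MB/s today, and transfer rate growing only with the square root of the density gain (per the density-vs-transfer-rate argument upthread):

```python
import math

capacity_bytes = 8e12  # ~8 TB drive
rate_bytes_s = 200e6   # ~200 MB/s sequential
fill_hours = capacity_bytes / rate_bytes_s / 3600

density_gain = 400
rate_gain = math.sqrt(density_gain)     # transfer rate up ~20x
fill_growth = density_gain / rate_gain  # capacity up 400x, rate only 20x
print(f"today: ~{fill_hours:.0f} h to fill; "
      f"at 400x density: ~{fill_hours * fill_growth / 24:.0f} days")
```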
Heck, just a raid rebuild takes days nowadays if you dare use RAID5 still or, less risky, RAID6.
The smart move is to keep several smaller arrays instead of one big one, which lowers risk as well. I don't put anything bigger than 7 disks into production; past that I'm just asking for trouble. It's better to have four 7-disk arrays than one 28-disk array: a drive failure means a quick rebuild, and a restore is going to take 1/4 the time.
Stupid question, but why does it take more time to rebuild a 10-disk array than a 6-disk array? I mean, modern PCIe can do several GB/s and the disks work in parallel, so it should take the same time to rebuild irrespective of how many disks there are.
We have about 20TB in an AWS S3 bucket we'd like to backup somewhere separate from Amazon. Is there any chance of Backblaze offering ingestion from an Amazon Snowball export (https://aws.amazon.com/snowball/)?
I did, I picked HGST because of their reports. No problems so far.
When you buy hard disks, ONLY buy from Amazon or Newegg directly - never buy from a 3rd-party seller on their site. For hard disks there is too much fraud, and the risk of data loss makes it just too risky (unlike other items).
Just bought some drives that were fulfilled via Amazon (but not sold by Amazon.com LLC themselves). You've given me something to check when they arrive.
I don't always buy the exact models, but I switched to HGST a few years ago after studying Backblaze reports, and so far no regrets (I'm running a heavily used postgres on these drives at home).
I chose to get 4 ST4000DM000 drives based on previous reports. Sure HGST drives never die, but it's cheaper to RMA or buy a single new drive if one fails, than the added cost of 4 reliable drives. Assuming only one fails, which is a risk I'm willing to take with my very non-mission-critical data.
I don't read this as 'which drive to buy' but more as 'which drive not to buy'.
Disclaimer: I work at Backblaze. I know you didn't mean that as an absolute, but I just want to point out 100% of drives fail. It's my OCD that makes me point this out. We have NEVER found a drive that lasted forever. There are two types of drives: 1) those that have already failed, and 2) those that are about to fail. For any data you would be annoyed to lose, you need three copies in three locations with three separate vendors (three different pieces of software that don't share any lines of code).
18 disk zpool with three raidz2 vdevs, with 2 disks from 3 different vendors per vdev, bought in two batches per vendor and evenly spread across the case. That's about as paranoid as I consider practical for the homelab/online part of my storage needs.
Also, not sure there are three pieces of storage software that I trust and are readily available to me.
To be somewhat similarly OCD, if your software is controlling storage to each of those drives, then there are many shared LOC involved; indeed, this is the reason why async replication into separate clusters is always recommended for data which can't be lost.
Note to self: stay the hell away from Western Digital. Good lord, an 8.2% annualized failure rate? That's unbelievable. They have the first, second, and third worst failure rates on that chart.
A few years ago, you could have said the same for Seagate. Even worse, actually: certain models had over a 10% failure rate. And some WD models almost had HGST levels of reliability.
Hmmm. Curious. I wonder if Seagate has improved their production somehow to avoid those failure rates, because even with those previously very high rates, they fare much better on the 2013-2016 total chart.
They might have. But I also wonder if those drives were among the first after the 2011 Thailand floods, and if Seagate's factories were still a bit dirty after resuming production.
You should really consider moving this webpage and report to something like AWS S3 when you first release it. Then move back to your usual servers when traffic has fallen off. Your poor servers must melt down when this shows up on Hacker News and Slashdot.
Internally we're blaming our SEO people for putting too much crap on the blog itself ;) But yea, it's worth exploring - though we have our own servers that should be able to handle the load. We haven't had blog loading trouble in a while, so it'll be neat to debug this later :D
From the outside it looks like you're running a fairly intensive Wordpress install on an Apache webserver with no page caching.
Also seems there's no minification or combining of stylesheets/js and there are query strings on those static assets which is going to discourage caching.
No wonder you need a datacenter to handle that kind of resource punishment!
There are plenty of reasons to stick with Wordpress in a decent sized corporation but if not switching to a static site at least stick W3TC on there so you're minimising your server load and serving out static html and minified/combined resources.
You could then consider using Varnish in front of Apache or maybe nginx with a FastCGI cache.
I'm sure you've got some folks on the team who could whip up a W3TC install in 10 minutes.
Because if it is, it's from the team that currently can't keep a blog post online when a few thousand concurrent visitors show up - so you might keep yourself open to suggestions and perhaps undertake the BASIC best practices of keeping a Wordpress site up under load.
If nothing else it shows a basic lack of planning for what you know to be a massively popular post, so turn a little of that judgement back on yourselves.
It's possible to easily handle tens of millions of hits a day on a tiny VPS if you do even some of the basics right[1], and that was without any particularly extensive optimisation.
EDIT: I may not be allowed to reply to the comment below due to HackerNews restrictions, so in case the option doesn't become available in the next while, I'll just say I accept the answer below gracefully, withdraw my daggers and take a calming beer at the end of a long day :-)
I wish you continued success and look forward to the next post.
No, he was agreeing. We have a lot of projects on our map to shore up some of these types of issues, but our admins are in high demand, so some of the lower-priority tasks slip on occasion. Since we rarely have issues with the blog (today was an exception) it tends to be a "we know what we'd like to change, but we'll do it when we have time" type of silo on our website.
*Edit -> to your above edit -> I think if you expand the comment by hitting the "time submitted" link you can leave a reply, thus subverting HN :P
I'd suggest switching to a static site in general. I have no idea what kind of traffic they're sustaining at the moment but NGINX serving up static html (or s3) is a lot more efficient than Wordpress or another blog engine consuming cpu cycles.
Agreed. I'm always interested in the results, even if their physical usage of the drives is miles above anything I'm ever going to do with a hard drive.
IIRC 3TB drives were among the newer drives commercially available at retail when the flooding happened in Thailand and all retail drives were impacted negatively. Furthermore, aggregate capacity is only one factor in the design of a hard drive and most 3TB drives were made in a manner that reduces reliability (more platters of same capacity or fewer platters with greater individual platter capacity, can't quite remember which unfortunately). I don't see why a manufacturer would put the newest technology into product lines that are older so I'd presume that 3TB drives are among the greatest in number of platters and the extra components contributes to the failure rate.
Among the least reliable drives I saw in previous reports were Seagate 3TB drives (supposedly they had worse reliability than the legendary IBM Deathstars) and after reading about how 3TB drives were designed across manufacturers years ago during the flooding crisis I decided to avoid 3TB drives entirely. Seems like my decision is finally getting some data to back it up now in hindsight (no pun intended).
The smaller sample size is accounted for by the width of the confidence interval, which is 5.2% - 7.1%; even the low end of that still looks pretty bad.
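One common way to derive such an interval is to treat failures as a Poisson count over the accumulated drive-days. A sketch with made-up counts (not Backblaze's actual data):

```python
import math

failures = 95          # hypothetical failure count
drive_days = 565_000   # hypothetical accumulated drive-days

afr = failures / drive_days * 365  # annualized failure rate
# Normal approximation to the Poisson: stddev of the count is sqrt(count).
stddev_afr = math.sqrt(failures) / drive_days * 365
lo, hi = afr - 1.96 * stddev_afr, afr + 1.96 * stddev_afr
print(f"AFR {afr:.1%}, 95% CI {lo:.1%} - {hi:.1%}")
```

The interval width shrinks with sqrt(failures), which is why the small drive populations get the widest bands.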
Yev from Backblaze here -> have you checked out B2? We likely won't have a Linux client for our backup service any time soon, but our B2 service has a lot of integrators (like Cloudberry and Duplicity, HashBackup, etc..) that can back up Linux machines, a lot of folks have been going that route.
It's a mixture of a couple things. One is that we tend to run pretty lean and our engineers are all booked up for the foreseeable future. Linux users are a passionate community, but we can't quite justify the development time for a market segment that is not very large. Additionally, because we run an unlimited model, a lot of people would immediately sign up and back up their Linux servers for $5/month and we'd sail out of business. We could address that by putting in limits for those types of devices and only allowing certain Linux builds, but that adds complexity and we want to keep the backup side of the service very simple - which works for the vast majority of folks. So it's a combination of a bunch of factors. We hoped that developing B2 APIs and CLIs would give Linux folks something that they could use if they needed to have offsite backups or archives and wanted to use our infrastructure cause we're pretty neat. Long-winded answer, but TL:DR - small market segment, development time/cost, possible abuse.
It's worth considering though that the Linux compatibility is worth more than the actual market share. Our company (with roughly 90% Mac / 5% Linux / 5% Windows users) went with Crashplan to have a single backup solution for all of the employees.
Absolutely, and that makes perfect sense. Having one system in place definitely beats out multiples. We know we can't be all things for all folks, but Crashplan is great, no hard feelings ;)
Disclaimer: I work at Backblaze. The underlying base of original client backup software was originally written from scratch on three platforms simultaneously: 1) Windows, 2) Macintosh, and 3) Linux. It was designed that way from the beginning. This code continues to compile every time we do a client release, simply as part of the process. However, it is entirely lacking a GUI layer and an installer - those were never written. The underlying backup engine runs even when the user is logged out or the GUI has stopped working.
So it is technically possible, but along the way we released Backblaze B2 (storage API), which not only supports Linux - we assume Linux is the primary customer! We're seeing if that can satisfy the Linux community. Backblaze B2 is a large ongoing effort consuming a lot of our software developers' time.
A note about limited resources: Backblaze never really raised any funding, there are no deep pockets, so we can ONLY hire an additional programmer when the products we sell throw off enough money to pay that salary. We run on really tight margins (thus our obsession with failure rates of drives) which is fabulous for our customers, but not so great for hiring lots of extra help to do projects like a Linux GUI. :-)
Their MacOS client is amazing, so I doubt it's a technical constraint. My guess is that because they don't have size limits on their backups, if all of their customers are backing up terabytes, Backblaze bleeds money, so they depend on most people only backing up a couple hundred gigabytes. Linux users may have a wildly higher amount of stuff to back up such that it's not profitable. And changing their branding to have mostly unlimited except for some people is probably not too appealing either.
Probably because the market is too small to justify the developer time when other developers are willing to spend the time integrating their B2 cloud offering as a driver.
CrashPlan made a different decision. If I was to offer an unlimited backup service, I would offer it to novice users, not a small minority of power users and professionals who were going to break the business model with TBs of data.
Let's be honest: when people see unlimited, most think "I don't have to worry about how much I'm storing" but a small group thinks "How can I take advantage of this?"
> Let's be honest: when people see unlimited, most think "I don't have to worry about how much I'm storing" but a small group thinks "How can I take advantage of this?"
I don't want to take advantage of it. I just happen to have 8TB of data to back up...
But backing up that much data over the internet isn't practical in any case.
CrashPlan built their system using Java and offers enterprise versions of their clients and servers. They have a very different technology stack and business model.
Yeah, not being able to backup the Linux machine at my house means that I really don't want to bother with buying it on my Macs/Windows machines either - having two solutions is a pain :/
Just use the backblaze API in a script with a cronjob, that's what I do. It's linux, you're going to end up writing some code to get what you want, heh.
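A minimal version of that: wrap the B2 CLI's sync command in a script you point cron at. The bucket name and paths below are placeholders, and the exact invocation may differ between b2 CLI versions, so check `b2 --help` on your install:

```python
# b2_backup.py - hypothetical cron-driven offsite backup sketch.
SOURCE = "/home/me/important"
DEST = "b2://my-backup-bucket/important"

def build_sync_command(source, dest):
    # b2 sync mirrors new/changed files up; deliberately no delete
    # flag, so files removed locally are kept remotely (safer for
    # a backup as opposed to a mirror).
    return ["b2", "sync", source, dest]

cmd = build_sync_command(SOURCE, DEST)
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
# crontab entry for a nightly 03:00 run (hypothetical path):
#   0 3 * * * /usr/bin/python3 /home/me/bin/b2_backup.py
```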
It's fantastic to see the confidence interval quoted in those latter tables (I assume 95%?) - that's far more informative than just the mean failure rate.
Yev from Backblaze here -> Tape tends to have much higher read times, and can even be more expensive than hard drives in some cases. Since we have Backblaze B2 the data needs to be highly available.