This appears to be Amazon Athena / Presto, embedded into Redshift. The syntax for CREATE EXTERNAL TABLE is exactly the same, and the supported file formats and compression encodings are a subset of Athena's. The big advantage of an approach like this versus running Redshift and Athena separately is that you can write a single query that joins data stored in Redshift with data stored in Athena.
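For the curious, a minimal sketch of what that DDL looks like in Redshift Spectrum. The schema name, IAM role ARN, bucket path, and columns here are all made up for illustration:

```sql
-- Register an external schema backed by the data catalog
-- (all identifiers and the role ARN are hypothetical).
CREATE EXTERNAL SCHEMA myspectrum
FROM DATA CATALOG
DATABASE 'spectrumdb'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Point an external table at Parquet files sitting in S3.
CREATE EXTERNAL TABLE myspectrum.sales (
    sale_id   BIGINT,
    item_id   INT,
    price     DECIMAL(12,2),
    sale_date DATE
)
STORED AS PARQUET
LOCATION 's3://my-bucket/sales/';
```

Nothing is loaded at CREATE time; the table is just metadata over the files.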
And you can actually materialize a query in Redshift, since Athena doesn't have CREATE TABLE AS SELECT. This is neat, though I wonder whether opening up Athena to connect to Postgres (Presto can) and doing the join inside Athena would have been better.
I'd say I have more confidence in the Redshift query planner and execution engine than in Presto's. And having the option to join that with data stored in memory on Redshift nodes is very attractive.
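That kind of mixed query is just a plain join across schemas. A sketch, assuming a local Redshift dimension table `items` and an external S3-backed table `myspectrum.sales` (both names hypothetical):

```sql
-- Join an S3-backed external table with a local Redshift table
-- in a single query; table and column names are hypothetical.
SELECT i.category,
       SUM(s.price) AS revenue
FROM myspectrum.sales AS s
JOIN items AS i ON i.item_id = s.item_id
GROUP BY i.category;
```

Spectrum scans and filters the S3 side; the join and final aggregation happen on the Redshift cluster.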
When I think of exabyte-scale queries on a columnar datastore, I think of aggregations, but then I have this question: why do we need to do exabyte-scale queries in the first place? Wouldn't statistical inference via random sampling be faster and accurate enough?
(Granted, aggregations often happen after some filtering, at which point the relation being aggregated might be considerably smaller than exabyte scale.)
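The sampling idea can be expressed directly in SQL. A sketch with a hypothetical `sales` table, using a ~0.1% Bernoulli sample:

```sql
-- Estimate an average from a ~0.1% random sample rather than
-- aggregating every row (table/column names are hypothetical).
SELECT AVG(price) AS approx_avg_price
FROM sales
WHERE RANDOM() < 0.001;
```

One caveat: `RANDOM()` is evaluated per row, so this still scans the table; it mostly saves aggregation and shuffle work, not I/O, unless you materialize the sample once and query it repeatedly.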
Redshift is designed to fill the classic accounting data warehouse role in an organisation. Whilst I'm sure there aren't too many companies with account ledgers that large (or any), I doubt too many accountants would be happy with statistical inference of their books... ;)
This new model of processing directly on S3 is pretty much aimed specifically at eliminating the "Load" part of the ETL process. Just dump to CSV from whatever sources you originally had, and don't worry about the schema conversion/loading into a DB. The fact that it happens to scale to exabytes is just good marketing fluff.
Yeah, I think filtering is a big part of it. If you want to answer a statistical question about the entire dataset, then a random sample is probably good enough. If you want to drill down and do an analysis that only looks at a particular narrow slice of the data, then it's likely that the corresponding subset of your sample isn't big enough to be meaningful.
(You can pre-filter or pre-aggregate before sampling, but that assumes you know a priori what types of queries you'll want to do.)
It really depends on what you're doing. A large dataset shouldn't be limited to longitudinal analysis. If you're storing every log record or every stock bid/ask, there may be times when you need to understand the specifics of what exactly was going on. There may be a lot of filtering on the underlying corpus for these sorts of exact-match queries, but dataset sizes continue to grow.
That said, I agree that approximate functions should be part of a modern database system. Redshift has approximate count distinct (based on HyperLogLog) and approximate percentiles (based on quantile summaries).
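Both are exposed with an `APPROXIMATE` keyword in Redshift SQL. A sketch against a hypothetical `requests` table:

```sql
-- Redshift's approximate aggregates: HyperLogLog-based distinct
-- count and a quantile-summary percentile (names hypothetical).
SELECT APPROXIMATE COUNT(DISTINCT user_id) AS approx_users,
       APPROXIMATE PERCENTILE_DISC(0.5)
           WITHIN GROUP (ORDER BY latency_ms) AS approx_p50_latency
FROM requests;
```

These trade a small, bounded error for dramatically less memory and shuffle than their exact counterparts.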
This seems really slow for queries especially when taking into account all the computing power being thrown at it:
Over 6 billion rows (not huge by modern standards), a relatively common aggregation query with 4 basic aggregates (2 sum, 2 avg), one where, and two group by clauses, over 1 table (no joins) takes about 4.25 minutes (254.650 seconds).
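For concreteness, a query of roughly that shape (2 sums, 2 averages, one WHERE, two GROUP BY columns, no joins), loosely modeled on TPC-H `lineitem` since the screenshot table is named `lineitem_1t_parq`; the exact columns and predicate are assumptions:

```sql
-- A sketch of the described aggregation shape over one table
-- (column names and the date cutoff are hypothetical).
SELECT l_returnflag,
       l_linestatus,
       SUM(l_quantity)      AS sum_qty,
       SUM(l_extendedprice) AS sum_price,
       AVG(l_quantity)      AS avg_qty,
       AVG(l_extendedprice) AS avg_price
FROM lineitem
WHERE l_shipdate <= DATE '1998-09-01'
GROUP BY l_returnflag, l_linestatus;
```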
On some column databases on good hardware, a single machine can probably get this down to a couple of seconds, possibly faster.
So this is a good question - the answer is that you're comparing apples to oranges. This is all about in-situ data processing, which basically means that you can run reasonably efficient, optimized queries on just regular old files that you have lying around, without having to do anything special. This is as opposed to a situation where you've spent a ton of time, effort, and expense actually ingesting the data into whatever data system you're running (in this case Redshift). So these systems are usually said to be good for ad-hoc queries: one-off queries on files and datasets that you want to explore, where it doesn't make sense to invest hours upon hours of waiting and storage/CPU resources to bring them into your database just to run a few queries.
So these ingested files are JSON or something semi-structured? I saw how the table was created, so maybe it was doing the schema-less key-value-pair thing on table creation and running queries against that? I'm sure this is all in the docs somewhere, but while I have a lot of big-DB experience, I have about 2 hours of AWS.
So in this example the image says "lineitem_1t_parq" which makes me think that's probably parquet files (optimized and perhaps compressed columnar file format).
But it does the same stuff with CSVs too, perhaps a bit less efficiently. JSON is also definitely possible, but a bit trickier because of the inherently nested structure and more complex format.
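The only difference in the DDL is the storage clause: delimited text instead of Parquet. A sketch with hypothetical names:

```sql
-- Same external-table idea, but pointing at CSV files;
-- schema, columns, and bucket path are hypothetical.
CREATE EXTERNAL TABLE myspectrum.sales_csv (
    sale_id BIGINT,
    item_id INT,
    price   DECIMAL(12,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-bucket/sales-csv/';
```

With CSV, every query reads whole rows; with Parquet, Spectrum can read only the referenced columns, which is where most of the efficiency gap comes from.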
I think it's pretty common - it's open source, has a nice, modular codebase, and a reasonably advanced execution engine with features like MVCC. But Redshift itself, Citus, Hadapt, and Greenplum are also examples that come to mind.
Just for comparison's sake, I recreated Jeff's example queries on my in-house table that's ~8B rows. COUNT(*) takes 6s and the SELECT with 2 groups, 2 sums, and 2 averages takes 52s. The downside is that the table takes 600GB of disk space compressed which is a major chunk of our cluster. But here's the kicker - I would never query this table for data I needed in under 5 min so my current performance is meaningless.
Given our access patterns for this table, I'm going to investigate using Redshift Spectrum for it. It seems like a huge win for us.
This is an interesting problem space. More and more companies are dealing with much more data than they've ever had to process. Off the top of my head, I know a handful of people whose companies are dealing with 10s of trillions of rows of data. But, they don't have the staff on hand to help them analyze and develop custom indices to improve analysts' query time.
So, something like this is much better (and cheaper) than nothing (and, yes, aws makes a killing). There's gonna be a lot of money to be made by the smart people who help crack this problem.
I wonder where the bottleneck was. Right below that result they have the performance plot that shows how it used up almost no CPU or network bandwidth. I'd expect S3 to be faster than this?
FWIW, I've been playing with parquet and Kudu lately and on a reasonably sized single machine you can run similar queries on similar scale data in <5 seconds.
The constraint for Redshift Spectrum queries tends to be between S3 and Spectrum. The graphs Jeff shows are in the Redshift cluster itself which, for the sample query he is running, is lightly loaded. We are working on upgrading our performance dashboards...
The main innovation in BigQuery was the ability to store and query nested data.
Amazon Redshift doesn't support querying nested data. It only has some convenience functions for loading flat data from nested JSON files hosted on S3.
And what I assume Spectrum does is just perform that loading step behind the scenes.
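The loading convenience referred to is Redshift's `COPY` with a JSON format option. A sketch, with a hypothetical table, bucket, and role ARN:

```sql
-- Flatten JSON objects into a relational table at load time;
-- 'auto' matches top-level JSON keys to column names
-- (table name, path, and role ARN are hypothetical).
COPY events
FROM 's3://my-bucket/events/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyCopyRole'
FORMAT AS JSON 'auto';
```

A jsonpaths file can be supplied instead of `'auto'` to pluck values out of nested structures, but the destination table is still flat; there's no native nested-type querying as in BigQuery.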
At a first read, it sounded like you didn't need to run a cluster ("Spectrum scales to thousands of instances"), which is BigQuery's big advantage. Later, it says that you still need a cluster, but it's only used in the final processing stages.
So it looks like it's 3/4 of the way to BQ.
If you are already invested in Redshift this is a really nice feature to have.
I used to work in an environment where only data for the last few months was stored in Redshift (to save costs, since storage is expensive). Whenever someone needed old data, we had to make room for it by unloading other tables to S3. Now this is no longer needed, and it is awesome.
Another nice reason is that some BI tools don't work with Athena (and PrestoDB) yet (Metabase and Pentaho, for instance), so it is now viable to use Redshift to expose data inside S3 to these tools.
Fun fact: Using any other public US AWS region would save 3 million dollars per month!
Aside from that: Using infrequent access storage for the parts of the data which don't get frequently accessed would save a lot and I'm pretty sure at that scale AWS would be happy to discuss possible discounts as well.
Don't think that's disingenuous, if that's what you're implying. Companies that have an exabyte of data usually know how much it's costing them to have it.
Nothing. Any query processing on sufficiently large amounts of data is going to be expensive in time, space, energy, and money. And since Amazon doesn't buy custom hardware for this purpose and intends to make a profit doing it, it's going to be even more expensive.
It's not, but it resembles BQ a lot in the central part of the query execution, where it transparently performs work on your behalf on thousands of other machines that you don't have to be aware of. Is that correct?
1. No, you still need a Redshift cluster, which performs the work. A better example would be something like Athena.
2. That's not really a big similarity. Having many machines coordinate for data processing is a really common thing. So if BQ and Redshift Spectrum are similar, then so are Athena, Presto, any MapReduce-based system, Spark, etc.
My main point is that there is work behind the scenes happening that you don't have to worry about, on an undefined number of machines that you don't have to care about.
Yes, there is still a cluster which does some kind of planning and post-processing now, plus there is obviously a lot of coordination, but it's not like Hadoop or classic Redshift, where you are constrained only to the hardware that you paid for and set up. It's the provisioning in the central phase that resembles BQ the most.
In that case Athena is more what you're talking about. Redshift Spectrum is weirder. You still have to have a Redshift cluster you provision and pay for, but for some S3 processing it calls out to other, independent, servers. I guess I'd consider that a hybrid approach.
Mainly the development cost. We trust ORC and have good experience with it; it's stable, fast, and compact. We don't have any experience with Parquet, and it's just too hard/risky to make this kind of conversion on a production system.