This appears to be Amazon Athena / Presto, embedded into Redshift. The syntax for CREATE EXTERNAL TABLE is exactly the same, and the supported file formats and compression encodings are a subset of Athena's. The big advantage of an approach like this versus running Redshift and Athena separately is that you can write a single query that joins data stored in Redshift with data stored in Athena.
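For the curious, a minimal sketch of what that DDL looks like in Redshift Spectrum. The schema name, IAM role ARN, bucket path, and columns here are all made up for illustration:

```sql
-- Register an external schema backed by the data catalog
-- (all identifiers and the role ARN are hypothetical).
CREATE EXTERNAL SCHEMA myspectrum
FROM DATA CATALOG
DATABASE 'spectrumdb'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Point an external table at Parquet files sitting in S3.
CREATE EXTERNAL TABLE myspectrum.sales (
    sale_id   BIGINT,
    item_id   INT,
    price     DECIMAL(12,2),
    sale_date DATE
)
STORED AS PARQUET
LOCATION 's3://my-bucket/sales/';
```

Nothing is loaded at CREATE time; the table is just metadata over the files.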
And you can actually materialize a query in Redshift, since Athena doesn't have CREATE TABLE AS SELECT. This is neat, though I wonder whether opening up Athena to connect to Postgres (Presto can) and doing the join inside Athena would have been better.
I'd say I have more confidence in the Redshift query planner and execution engine than in Presto's. And having the option to join that with data stored in memory on Redshift nodes is very attractive.
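That kind of mixed query is just a plain join across schemas. A sketch, assuming a local Redshift dimension table `items` and an external S3-backed table `myspectrum.sales` (both names hypothetical):

```sql
-- Join an S3-backed external table with a local Redshift table
-- in a single query; table and column names are hypothetical.
SELECT i.category,
       SUM(s.price) AS revenue
FROM myspectrum.sales AS s
JOIN items AS i ON i.item_id = s.item_id
GROUP BY i.category;
```

Spectrum scans and filters the S3 side; the join and final aggregation happen on the Redshift cluster.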
When I think of exabyte-scale queries on a columnar datastore, I think of aggregations, but then I have this question: why do we need to do exabyte-scale queries in the first place? Wouldn't statistical inference via random sampling be faster and accurate enough?
(Granted, aggregations often happen after some filtering, at which point the relation being aggregated might be considerably smaller than exabyte scale.)
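The sampling idea can be expressed directly in SQL. A sketch with a hypothetical `sales` table, using a ~0.1% Bernoulli sample:

```sql
-- Estimate an average from a ~0.1% random sample rather than
-- aggregating every row (table/column names are hypothetical).
SELECT AVG(price) AS approx_avg_price
FROM sales
WHERE RANDOM() < 0.001;
```

One caveat: `RANDOM()` is evaluated per row, so this still scans the table; it mostly saves aggregation and shuffle work, not I/O, unless you materialize the sample once and query it repeatedly.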
Redshift is designed to fill the classic accounting data warehouse role in an organisation. Whilst I'm sure there aren't too many companies with account ledgers that large (or any), I doubt too many accountants would be happy with statistical inference of their books... ;)
This new model of processing directly on S3 is pretty much aimed specifically at eliminating the "Load" part of the ETL process. Just dump to CSV from whatever sources you originally had, and don't worry about the schema conversion/loading into a DB. The fact that it happens to scale to exabytes is just good marketing fluff.
Yeah, I think filtering is a big part of it. If you want to answer a statistical question about the entire dataset, then a random sample is probably good enough. If you want to drill down and do an analysis that only looks at a particular narrow slice of the data, then it's likely that the corresponding subset of your sample isn't big enough to be meaningful.
(You can pre-filter or pre-aggregate before sampling, but that assumes you know a priori what types of queries you'll want to do.)
It really depends on what you're doing. A large dataset shouldn't be limited to longitudinal analysis. If you're storing every log record or every stock bid/ask, there may be times when you need to understand the specifics of what exactly was going on. There may be a lot of filtering on the underlying corpus for these sorts of exact-match queries, but dataset sizes continue to grow.
That said, I agree that approximate functions should be part of a modern database system. Redshift has approximate count distinct (based on HyperLogLog) and approximate percentiles (based on quantile summaries).
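Both are exposed with an `APPROXIMATE` keyword in Redshift SQL. A sketch against a hypothetical `requests` table:

```sql
-- Redshift's approximate aggregates: HyperLogLog-based distinct
-- count and a quantile-summary percentile (names hypothetical).
SELECT APPROXIMATE COUNT(DISTINCT user_id) AS approx_users,
       APPROXIMATE PERCENTILE_DISC(0.5)
           WITHIN GROUP (ORDER BY latency_ms) AS approx_p50_latency
FROM requests;
```

These trade a small, bounded error for dramatically less memory and shuffle than their exact counterparts.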
This seems really slow for queries especially when taking into account all the computing power being thrown at it:
Over 6 billion rows (not huge by modern standards), a relatively common aggregation query with 4 basic aggregates (2 sum, 2 avg), one where, and two group by clauses, over 1 table (no joins) takes about 4.25 minutes (254.650 seconds).
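For concreteness, a query of roughly that shape (2 sums, 2 averages, one WHERE, two GROUP BY columns, no joins), loosely modeled on TPC-H `lineitem` since the screenshot table is named `lineitem_1t_parq`; the exact columns and predicate are assumptions:

```sql
-- A sketch of the described aggregation shape over one table
-- (column names and the date cutoff are hypothetical).
SELECT l_returnflag,
       l_linestatus,
       SUM(l_quantity)      AS sum_qty,
       SUM(l_extendedprice) AS sum_price,
       AVG(l_quantity)      AS avg_qty,
       AVG(l_extendedprice) AS avg_price
FROM lineitem
WHERE l_shipdate <= DATE '1998-09-01'
GROUP BY l_returnflag, l_linestatus;
```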
On some column databases on good hardware, a single machine can probably get this down to a couple of seconds, possibly faster.
So this is a good question - the answer is that you're comparing apples to oranges. This is all about in-situ data processing, which basically means that you can run reasonably efficient, optimized queries on just regular old files that you have lying around, without having to do anything special. This is as opposed to a situation where you've spent a ton of time, effort, and expense actually ingesting the data into whatever data system you're running (in this case Redshift). So these systems are usually said to be good for ad-hoc queries: one-off queries on files and datasets that you want to explore, where it doesn't make sense to invest hours upon hours of waiting and storage/CPU resources to bring them into your database just to run a few queries.
So these ingested files are JSON or something semi-structured? I saw how the table was created, so maybe it was doing the schema-less key-value-pair thing on table creation and running queries against that? I'm sure this is all in the docs somewhere, but while I have a lot of big-DB experience, I have about 2 hours of AWS.
So in this example the image says "lineitem_1t_parq" which makes me think that's probably parquet files (optimized and perhaps compressed columnar file format).
But it does the same stuff with CSVs too, perhaps a bit less efficiently. JSON is also definitely possible, but a bit trickier because of the inherently nested structure and more complex format.
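The only difference in the DDL is the storage clause: delimited text instead of Parquet. A sketch with hypothetical names:

```sql
-- Same external-table idea, but pointing at CSV files;
-- schema, columns, and bucket path are hypothetical.
CREATE EXTERNAL TABLE myspectrum.sales_csv (
    sale_id BIGINT,
    item_id INT,
    price   DECIMAL(12,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-bucket/sales-csv/';
```

With CSV, every query reads whole rows; with Parquet, Spectrum can read only the referenced columns, which is where most of the efficiency gap comes from.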
I think it's pretty common - it's open source, has a nice, modular codebase, and a reasonably advanced execution engine with features like MVCC. But Redshift itself, Citus, Hadapt, and Greenplum are also examples that come to mind.
Just for comparison's sake, I recreated Jeff's example queries on my in-house table that's ~8B rows. COUNT(*) takes 6s and the SELECT with 2 groups, 2 sums, and 2 averages takes 52s. The downside is that the table takes 600GB of disk space compressed which is a major chunk of our cluster. But here's the kicker - I would never query this table for data I needed in under 5 min so my current performance is meaningless.
Given our access patterns for this table, I'm going to investigate using Redshift Spectrum for it. It seems like a huge win for us.
This is an interesting problem space. More and more companies are dealing with much more data than they've ever had to process. Off the top of my head, I know a handful of people whose companies are dealing with 10s of trillions of rows of data. But, they don't have the staff on hand to help them analyze and develop custom indices to improve analysts' query time.
So, something like this is much better (and cheaper) than nothing (and, yes, aws makes a killing). There's gonna be a lot of money to be made by the smart people who help crack this problem.
I wonder where the bottleneck was. Right below that result they have the performance plot that shows how it used up almost no CPU or network bandwidth. I'd expect S3 to be faster than this?
FWIW, I've been playing with parquet and Kudu lately and on a reasonably sized single machine you can run similar queries on similar scale data in <5 seconds.
The constraint for Redshift Spectrum queries tends to be between S3 and Spectrum. The graphs Jeff shows are in the Redshift cluster itself which, for the sample query he is running, is lightly loaded. We are working on upgrading our performance dashboards...
The main innovation in BigQuery was the ability to store and query nested data.
Amazon Redshift doesn't support querying nested data. It only has some convenience functions for loading flat data from nested JSON files hosted on S3.
And what I assume Spectrum does is just perform that loading step behind the scenes.
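The loading convenience referred to is Redshift's `COPY` with a JSON format option. A sketch, with a hypothetical table, bucket, and role ARN:

```sql
-- Flatten JSON objects into a relational table at load time;
-- 'auto' matches top-level JSON keys to column names
-- (table name, path, and role ARN are hypothetical).
COPY events
FROM 's3://my-bucket/events/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyCopyRole'
FORMAT AS JSON 'auto';
```

A jsonpaths file can be supplied instead of `'auto'` to pluck values out of nested structures, but the destination table is still flat; there's no native nested-type querying as in BigQuery.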
At a first read, it sounded like you didn't need to run a cluster ("Spectrum scales to thousands of instances"), which is BigQuery's big advantage. Later, it says that you still need a cluster, but it's only used in the final processing stages.
So it looks like it's 3/4 of the way to BQ.
If you are already invested in Redshift this is a really nice feature to have.
I used to work in an environment where only data for the last few months was stored in Redshift (to save costs, since storage is expensive). Whenever someone needed old data, we had to make room for it by unloading other tables to S3. Now this is no longer needed, and it is awesome.
Another nice reason is that some BI tools don't work with Athena (and PrestoDB) yet (Metabase and Pentaho, for instance), so it is now viable to use Redshift to expose data inside S3 to these tools.
Fun fact: Using any other public US AWS region would save 3 million dollars per month!
Aside from that: Using infrequent access storage for the parts of the data which don't get frequently accessed would save a lot and I'm pretty sure at that scale AWS would be happy to discuss possible discounts as well.
Don't think that's disingenuous, if that's what you're implying. Companies that have an exabyte of data usually know how much it's costing them to have it.
Nothing. Any query processing on sufficiently large amounts of data is going to be expensive in time, space, energy, and money. And since Amazon doesn't buy custom hardware for this purpose and intends to make a profit doing it, it's going to be even more expensive.
It's not, but it resembles BQ a lot in the central part of the query execution, where it transparently performs work on your behalf on thousands of other machines that you don't have to be aware of. Is that correct?
1. No, you still need a Redshift cluster, which performs the work. A better example would be something like Athena.
2. That's not really a big similarity. Having many machines coordinate for data processing is a really common thing. So if BQ and Redshift Spectrum are similar, then so are Athena, Presto, any MapReduce-based system, Spark, etc.
My main point is that there is work behind the scenes happening that you don't have to worry about, on an undefined number of machines that you don't have to care about.
Yes, there is still a cluster which does some kind of planning and post-processing now, plus there is obviously a lot of coordination, but it's not like Hadoop or classic Redshift, where you are constrained only to the hardware that you paid for and set up. It's the provisioning in the central phase that resembles BQ the most.
In that case Athena is more what you're talking about. Redshift Spectrum is weirder. You still have to have a Redshift cluster you provision and pay for, but for some S3 processing it calls out to other, independent, servers. I guess I'd consider that a hybrid approach.
Mainly the development cost. We trust ORC and have good experience with it; it's stable, fast, and compact. We don't have any experience with Parquet, and it's just too hard/risky to make this kind of conversion on a production system.