Saturday, January 22, 2022

Serverless at re:Invent: Where should Amazon Redshift go?


A key highlight from last week's re:Invent was the extension of serverless compute to a swath of AWS analytics services, including Amazon EMR, Kinesis Data Streams, MSK (Managed Streaming for Apache Kafka), and Redshift. AWS was not the first to bring serverless to cloud analytics: Google Cloud BigQuery and Azure Synapse Analytics have long offered serverless options (by contrast, Snowflake's is still in preview).

Serverless wasn't the only new feature announced last week. AWS also announced the preview of automated materialized views, which treat the creation of these views much as a cost-based query optimizer would: the views are generated automatically based on data hot spots. Nonetheless, serverless grabbed the limelight.

While AWS's serverless announcements could be viewed as keeping up with the Joneses, regarding Amazon Redshift, it is part of a larger narrative of the data warehousing service not only catching up, but getting in position to potentially bypass its rivals.

To recap, Amazon Redshift has long been known more as a market leader than as a technology leader.

When AWS launched Redshift back in 2013, it was one of the first cloud data warehousing services. Starting with technology acquired from ParAccel, AWS profited but also paid the price for being among the first to market. Its early entry, along with the portfolio of other AWS analytics services, enabled Redshift to build a large client roster, with tens of thousands of customers today.

AWS forked the acquired ParAccel technology. But from the get-go, it followed a conventional data warehousing architecture with locally attached storage. By contrast, Google Cloud BigQuery, launched back in 2010, pioneered the cloud-native data warehouse. Nonetheless, it was the launch of Snowflake in 2014 that really put the elastic cloud data warehouse on the map.

The key enabler for last week's serverless announcement was the launch of RA3 instances back in 2019. They provided the long-sought elasticity by separating compute from storage, and that paved the way for serverless. As it turns out, RA3 is the transformation that also allowed Redshift to do far more. Earlier this year, AWS released Advanced Query Accelerator (AQUA) for Amazon Redshift, which we characterized at the time as a “generational shift” that leveraged the elasticity of the RA3 instances. It was aimed at workloads for “near-line” data sitting remotely on Amazon Redshift Managed Storage, storing hot data in SSD while using the Nitro hypervisor and FPGAs to accelerate processing of cooler data sitting on S3.

Incidentally, in our post last spring, we put serverless on our wish list for what we wanted to see next. Once in a blue moon, we get it right.

But there's more. Because RA3 instances pool much of the data in S3, they cleared the way for data sharing, which was initially released back in the spring for customers with multiple AWS accounts. At re:Invent last week, that capability was extended across multiple regions. Again, AWS wasn't first to market. For instance, Snowflake has been promoting various forms of data sharing since it started talking about the Data Sharehouse back in 2017 (it no longer uses that term). AWS did launch a data marketplace (called AWS Data Exchange) several years ago, but only just extended it to Redshift.

Let's make a couple of disclaimers. First of all, don't confuse data sharing with federated query. Redshift can remotely query data sitting in RDS and Aurora databases for MySQL and PostgreSQL and, via Redshift Spectrum, data in EMR and S3. But that's quite similar to what Google already offers with BigQuery. Secondly, don't believe that AWS is abandoning provisioned instances: it will keep offering them for Redshift because there are customers who prefer level billing. Google learned that lesson too, which is why it later introduced flat-rate slots for BigQuery.

With a cloud-native architecture and serverless support, AWS has opportunities to score some firsts. One of them would be moving more analytic and AI processing in-database.

But in-database machine learning has already become table stakes for cloud data warehouses. AWS already does so with Redshift ML, where you can use SQL commands to trigger the development of models in SageMaker, then bring the models in-database as a form of user-defined function (UDF) to run inference workloads. In turn, Google also provides in-database ML for BigQuery, but it is limited to specific, curated models, while Microsoft allows running of ML models within Azure Synapse Spark pools. And with Snowpark, you can use non-SQL languages to push down processing, such as ML models, as UDFs directly into the Snowflake database.
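To make the Redshift ML flow concrete, here is a minimal sketch in Python that simply assembles the two SQL statements involved: a CREATE MODEL statement that hands training off to SageMaker, and a SELECT that calls the resulting function for inference. The model, table, column, and function names, the IAM role ARN, and the bucket name are all hypothetical placeholders of ours, and the helper functions are illustrative rather than part of any AWS SDK. The sketch does not connect to a cluster, and it assumes the inputs are trusted identifiers rather than user-supplied strings.

```python
def create_model_sql(model, training_query, target, function, iam_role, s3_bucket):
    """Assemble a Redshift ML CREATE MODEL statement.

    Training is delegated to SageMaker; the FUNCTION clause names the
    SQL function that Redshift exposes once the model is ready.
    """
    return (
        f"CREATE MODEL {model}\n"
        f"FROM ({training_query})\n"
        f"TARGET {target}\n"
        f"FUNCTION {function}\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"SETTINGS (S3_BUCKET '{s3_bucket}');"
    )


def inference_sql(function, feature_columns, table):
    """Assemble a query that runs the trained model in-database as a function."""
    columns = ", ".join(feature_columns)
    return f"SELECT {function}({columns}) AS prediction FROM {table};"


# All names below are hypothetical placeholders.
train_stmt = create_model_sql(
    model="customer_churn_model",
    training_query="SELECT age, plan, monthly_spend, churned FROM customer_history",
    target="churned",
    function="predict_churn",
    iam_role="arn:aws:iam::123456789012:role/RedshiftMLRole",
    s3_bucket="example-redshift-ml-bucket",
)
predict_stmt = inference_sql(
    function="predict_churn",
    feature_columns=["age", "plan", "monthly_spend"],
    table="current_customers",
)

print(train_stmt)
print(predict_stmt)
```

The point of the sketch is the division of labor: the first statement is where model development is triggered, and the second shows that scoring afterward looks like any other SQL function call.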

Next on our wish list is bringing Spark directly into Redshift. Today, you'd have to fire up a separate EMR cluster to run Spark (but at least now, it can be triggered serverless as well). Of course, there's nothing preventing AWS from breaking out Spark as a separate serverless service, just as Google Cloud recently did. But today, Azure Synapse Analytics lets you run a curated (subset) version of Spark in-database without firing up a separate cluster; we'd like to see AWS follow suit.

But let's not stop there. Serverless also provides the opportunity to fire up workloads with third-party tools, especially with BI reporting and visualization. Redshift currently has integrations with its own QuickSight and with popular tools like Tableau, but you have to move data and process it in separate clusters.

So let's cut to the chase. We'd love to see AWS add a “Redshift-native” mode for third parties willing to run capabilities like ELT or visualization as containerized microservices that run directly inside Redshift RA3 compute nodes, or whatever next-generation nodes come out in future years. By comparison, Snowflake provides common APIs for third parties to access Snowflake data, but the data is processed in separate clusters. Imagine running an ELT service from Informatica or Fivetran as a microservice in a Redshift compute node. AWS could then promote Redshift as the cheapest, fastest data warehouse in the cloud.
