Introduction

With the advances in performance and accessibility that the cloud has seen in recent years, it is no longer necessary to build data warehouses on large monolithic databases and rigid schemas. It is now possible to simplify your operations, consolidate your data storage, and use one solution as your single source of truth. Amazon S3, AWS Glue, and Amazon Athena are AWS services that can be combined to implement such a solution. In this series of articles, we give a brief overview of what a modern data warehouse can look like, what its key components are, and how each of these services can be used to create a basic implementation.

Data Warehouse vs. Data Lake vs. Data Lakehouse

The term data warehouse can mean different things depending on the use case, but in general it is a location for structured, reliable data that can be queried to gain insights. From Databricks, an industry leader in managed cloud data analytics services:

A data warehouse is a central repository of information that provides users with current and historical decision support information which is difficult to access or present in the traditional operational data store.

(https://databricks.com/glossary/data-warehouse)

A data lake, on the other hand, is a sort of dumping ground for any data that may be of use to the company, with no expectations about its format, its quality, or how it will be accessed. From Databricks:

A data lake is a central location, that holds a large amount of data in its native, raw format, as well as a way to organize large volumes of highly diverse data. Compared to a hierarchical data warehouse which stores data in files or folders, a data lake uses a different approach; it uses a flat architecture to store the data.

(https://databricks.com/glossary/data-lake)

Traditionally these would be separate systems, but with the cloud computing services available today it is possible to consolidate them into a single solution called a data lakehouse. From Databricks:

Data lakehouses are enabled by a new system design: implementing similar data structures and data management features to those in a data warehouse, directly on the kind of low-cost storage used for data lakes.

(https://databricks.com/glossary/data-lakehouse)

The shared architecture of a lakehouse offers flexibility and compatibility between data lakes and data warehouses, and can reduce operational cost and complexity.

Data lakehouse benefits

Here is a brief overview of some of the key benefits the lakehouse approach offers.

Simplicity

While more complexity arises as you scale up, the core of a lakehouse is a simple object store, such as Amazon S3. Object stores serve many different use cases, so you can avoid adopting complex niche technologies.

Flexibility

Because they are built on simple object stores, lakehouses give you the flexibility to choose your own rules, from free-for-all unstructured data to tightly regulated schemas with ACID transactions.

Scalability

If you are a small company that is just starting out, the object store foundation of a lakehouse and its decoupled compute mean it can be a very affordable solution, because you pay only for the storage and queries you actually use. On the other hand, if you are a large company looking to scale to terabytes or petabytes of data, object stores such as S3 offer virtually unlimited storage. Additionally, the decoupled compute can scale independently of storage, which offers both elasticity and cost savings.

Reliability and durability

Serverless object stores such as S3 can be extremely reliable and durable. S3 Standard is designed for 99.99% availability and, remarkably, for 99.999999999% durability (eleven nines). The S3 FAQ page puts that into perspective:

For example, if you store 10,000,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000 years.

(https://aws.amazon.com/s3/faqs/)

If protecting against data loss is a crucial requirement for a storage system, then using a cloud-based object store can be a great choice.
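The arithmetic behind the FAQ figure can be checked in a few lines. This is a minimal sketch that assumes the durability figure is an annual, per-object probability of survival and that object losses are independent; the function names are illustrative and not part of any AWS SDK.

```python
from fractions import Fraction

# Designed annual durability quoted for S3: 99.999999999% (eleven nines).
# Fraction keeps the arithmetic exact instead of relying on floats.
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)


def expected_annual_losses(object_count: int, durability: Fraction = DURABILITY) -> Fraction:
    """Expected number of objects lost per year (assumes independent losses)."""
    return object_count * (1 - durability)


def years_per_lost_object(object_count: int, durability: Fraction = DURABILITY) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_annual_losses(object_count, durability)


if __name__ == "__main__":
    objects = 10_000_000  # object count used in the S3 FAQ example
    print(float(expected_annual_losses(objects)))  # 0.0001 objects per year
    print(int(years_per_lost_object(objects)))     # 10000 years per lost object
```

With 10,000,000 objects, the expected loss is one ten-thousandth of an object per year, which matches the "once every 10,000 years" wording in the FAQ.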

Decoupling compute from storage

One of the key motivations for pursuing a modern cloud-based data warehouse architecture is the ability to decouple compute from storage. With traditional solutions, storage and compute are tied together, so adding capacity for more data also means paying for more compute. This makes sense for standard use cases, but for data lakes and data warehouses, the amount of data stored is often vastly disproportionate to the frequency at which it is retrieved. By storing data in cheap object storage and paying for compute only when you query, you can avoid paying for large servers that are idle most of the time.
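To make the cost argument concrete, here is a rough sketch of the two pricing models. Every number in it is a hypothetical placeholder chosen for easy arithmetic, not a real AWS price, and the `Prices` class and both cost functions are made up for illustration. The coupled model assumes each always-on server can hold a fixed amount of data, so storage growth forces more servers.

```python
import math
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # common approximation for monthly billing


@dataclass(frozen=True)
class Prices:
    """Hypothetical unit prices for illustration only (not real AWS rates)."""
    server_per_hour: float       # always-on warehouse server
    tb_per_server: float         # data each server can hold
    storage_per_tb_month: float  # object storage
    query_per_tb_scanned: float  # pay-per-query compute


def coupled_monthly_cost(stored_tb: float, prices: Prices) -> float:
    """Coupled model: server count follows data volume, and servers run all month."""
    servers = math.ceil(stored_tb / prices.tb_per_server)
    return servers * prices.server_per_hour * HOURS_PER_MONTH


def decoupled_monthly_cost(stored_tb: float, scanned_tb: float, prices: Prices) -> float:
    """Decoupled model: pay for storage, plus compute only for data actually scanned."""
    return stored_tb * prices.storage_per_tb_month + scanned_tb * prices.query_per_tb_scanned


if __name__ == "__main__":
    prices = Prices(server_per_hour=2.0, tb_per_server=25.0,
                    storage_per_tb_month=25.0, query_per_tb_scanned=5.0)
    # 100 TB stored, but only 20 TB scanned by queries during the month.
    print(coupled_monthly_cost(100, prices))        # 5840.0
    print(decoupled_monthly_cost(100, 20, prices))  # 2600.0
```

The gap between the two figures depends entirely on the placeholder prices and on how much data is scanned. The point of the sketch is only that, in the decoupled model, the bill tracks query volume rather than the amount of data at rest.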

Conclusion

The lakehouse approach can be a great way to design your cloud-native data lake and data warehouse storage solution. By taking advantage of the new technologies and serverless offerings from cloud providers, you can create a simple, scalable, and reliable data storage solution that can act as your single source of truth. In the next article, we will look at which AWS services can be used to implement a lakehouse in the cloud.

View Part 2 here: https://www.xerris.com/insights/building-modern-data-warehouses-with-s3-glue-and-athena-part-2/