Today, many organisations have adopted data lakes to manage increasingly large volumes of data. Although data lakes are useful for storing data before analysis, they also tend to bring a number of challenges.
Challenges associated with harnessing data
As a recent whitepaper from Databricks observes, query performance can become a problem when using data lakes. "The required ETL processes can add significant latency such that it may take hours before incoming data manifests in a query response so the users do not benefit from the latest data", it reads. Moreover, as data volumes grow, the resulting longer query run times can prove "unacceptably long" for users. In addition, complex data pipelines are prone to error and can therefore be unreliable. It is also challenging to build flexible data engineering pipelines that combine streaming and batch analytics, as these require complex, low-level code. Meanwhile, "interventions during stream processing with batch correction or programming multiple streams from the same sources or to the same destinations are restricted."
A simpler analytics architecture
"Practitioners usually organise their pipelines using a multi-hop architecture", according to Databricks. Nevertheless, many data engineering professionals encounter a number of issues through the pipeline stages. However, Databricks Delta addresses these challenges by introducing a much simpler analytics architecture. In effect, this system tackles both batch and stream use cases alongside "high query performance and high data reliability." The Delta architecture also provides an efficient and transactional method of handling large data sets stored as files on S3. In order to do so, Delta employs an ordered log (the Delta Log) of atomic collections of actions such as "AddFile" or "RemoveFile."
Addressing data challenges
Delta also uses a number of techniques to address query performance, data reliability, and system complexity. This matters because query performance is a "major driver of user satisfaction." In addition, Delta employs various methods to achieve data reliability, such as its "all or nothing" ACID transaction approach; reliable datasets are integral to successful data analytics and data usage. Finally, Delta reduces complexity by handling both batch and streaming data in a single pipeline. As Databricks notes, system complexity is a "key determinant not only of reliability and cost-effectiveness but very importantly, also of responsiveness."
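As a rough illustration of that unified batch-and-streaming point, the PySpark sketch below writes to a Delta table in batch mode and then reads the same table as a stream. It assumes a Spark environment with the delta-spark package already configured; the path /tmp/events_delta and the column name event_id are placeholders chosen for this example, not details from the whitepaper.

```python
from pyspark.sql import SparkSession

# Assumes the session is configured for Delta (delta-spark installed,
# Delta SQL extensions enabled); details vary by environment.
spark = SparkSession.builder.appName("delta-batch-and-stream").getOrCreate()

# Batch write: one atomic, all-or-nothing commit to the Delta table.
spark.range(0, 100).toDF("event_id") \
    .write.format("delta").mode("append").save("/tmp/events_delta")

# Batch read: sees only fully committed data.
batch_df = spark.read.format("delta").load("/tmp/events_delta")
print(batch_df.count())

# Streaming read of the very same table: new commits arrive as micro-batches.
stream_df = spark.readStream.format("delta").load("/tmp/events_delta")
query = (stream_df.writeStream
         .format("console")
         .outputMode("append")
         .start())
query.awaitTermination(timeout=30)  # run briefly for the sketch
query.stop()
```

The point of the sketch is simply that both access patterns target one table and one transaction log, rather than separate batch and streaming pipelines.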
Looking to enhance your data strategy? Listen to our podcast with renowned data coach and strategist Lillian Pierson for some invaluable insights.