A glimpse into the future of open data architecture

(Evannovstro / Shutterstock)

Hadoop may have faded as a data platform, but it laid the foundation for an open data architecture that continues to grow and evolve today, primarily in the cloud. The recent Subsurface conference gave us a glimpse into the future of this open data architecture. The conference featured the creators of several promising technologies for data lakes and data lakehouses.

Much of the exciting work in data architecture today is being done in the cloud. The availability of effectively endless object storage (such as S3) and on-demand compute (thanks to Docker and Kubernetes) largely eliminates the physical limitations of collecting, storing, and processing large amounts of data. (It has also introduced new cost concerns, but that is a topic for another day.)

When one problem is resolved, a new problem usually arises. In this case, storage and compute have been “resolved,” and the focus has shifted to the best way to make this data accessible and usable by the largest group of users in the most impactful way. For a variety of reasons, this is not a solved problem, especially in fast-growing big data environments. Attempts to shoehorn legacy data management technologies and techniques into this new cloud data paradigm have met with mixed success.

In short, now that the new cloud era of data has arrived, the thinking goes, we need new tools and technologies to take advantage of it. This is exactly what a new generation of engineers advocating open data tools that work within an open data architecture hope to deliver. It was also the focus of cloud analytics vendor Dremio at its Subsurface LIVE conference, held virtually in late July.

In a Subsurface panel on the future of open data architecture, Gartner analyst Sanjeev Mohan discussed that future with four people who are creating these technologies: Wes McKinney, creator of pandas and co-creator of Apache Arrow; Ryan Blue, creator of the Iceberg table format; Julien Le Dem, co-creator of Parquet; and Ryan Murray, co-creator of Nessie.

“It’s very exciting to see that the journey we started in open source decades ago seems to be coming together,” Mohan said. “We finally seem to be in a place where, in an open data architecture, we have a set of open source projects that complement each other and help us build an end-to-end solution.”

Take Apache Iceberg, for example. The technology was originally developed by engineers at Netflix and Apple to address the performance and usability challenges of Apache Hive tables. While Hive is just one of many SQL analytics engines, the Hive metastore has lived on as the de facto glue connecting data stored in HDFS and S3 with modern SQL engines such as Dremio, Presto, and Spark.

Unfortunately, the Hive metastore doesn’t work well in a dynamic big data environment. Changes to the data need to be coordinated, which can be a complex and error-prone process. If it is not done correctly, data can be corrupted. As an alternative to Hive tables, Iceberg provides support for atomic transactions, which gives users a guarantee of correctness.
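How a table format can offer atomic commits is easier to see in miniature. The sketch below is a toy, not Iceberg’s API: it assumes a table’s entire state is one immutable snapshot, and that a writer may publish a new snapshot only if the one it started from is still current. The names `Snapshot`, `ToyTable`, and `append_with_retry` are hypothetical.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """One immutable table version: a version number plus its data files."""
    version: int
    files: tuple


class ToyTable:
    """Hypothetical table whose whole state is a pointer to one snapshot."""

    def __init__(self):
        self._lock = threading.Lock()
        self._current = Snapshot(version=0, files=())

    def current(self):
        return self._current

    def commit(self, base, new_files):
        """Swap in base + new_files, but only if base is still current.

        Returns the new snapshot, or None when another writer got there first.
        """
        with self._lock:
            if self._current is not base:
                return None
            self._current = Snapshot(base.version + 1,
                                     base.files + tuple(new_files))
            return self._current


def append_with_retry(table, new_files, max_attempts=5):
    """Re-read the latest snapshot and retry when a commit conflicts."""
    for _ in range(max_attempts):
        snapshot = table.commit(table.current(), new_files)
        if snapshot is not None:
            return snapshot
    raise RuntimeError("too many conflicting commits")
```

Under these assumptions a reader sees either the old snapshot or the new one, never a half-applied change, and a writer working from stale state is rejected rather than silently overwriting someone else’s commit.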

But that wasn’t enough. As we have learned, when one problem is resolved, another tends to pop up. In the case of Project Nessie, the need was to provide version control for data stored in table formats such as Iceberg.

“When we started thinking about Project Nessie, we really started thinking about the progression of the data lake platform over the last 10 or 15 years,” said Murray, a Dremio engineer. “We have seen people [slowly] ... build up abstractions, whether that’s an abstraction that helps with compute or an abstraction for things like tables and data files. We started thinking, what is the next abstraction? What makes the most sense?”

For Murray, the next abstraction needed was a catalog placed on top of a table format to facilitate better interaction with downstream components.

“Just as Ryan Blue felt that Apache Hive wasn’t well suited to the table format, with its single point of failure, the huge number of API calls to the metastore, and even the Thrift endpoint, it is very difficult to scale and really hard to use effectively, especially in a cloud-native way,” Murray said. “So we were looking for something that would be cloud native and work with modern table formats, and that we could start thinking about extending to all the other great things my fellow panelists are building.”
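The idea of version control for a catalog can be sketched in a few lines. This is a toy model, not Nessie’s API: it assumes a catalog is a set of named branches, each pointing at a commit, and that each commit maps table names to snapshot identifiers. `ToyCatalog` and all of its methods are hypothetical names.

```python
class ToyCatalog:
    """Hypothetical catalog: branches point at commits, commits map tables to snapshots."""

    def __init__(self):
        # Each commit is (parent commit index, {table name: snapshot id}).
        self._commits = [(None, {})]
        self._branches = {"main": 0}

    def create_branch(self, name, from_branch="main"):
        self._branches[name] = self._branches[from_branch]

    def commit(self, branch, table, snapshot_id):
        """Record a new snapshot for one table on one branch."""
        parent = self._branches[branch]
        state = dict(self._commits[parent][1])
        state[table] = snapshot_id
        self._commits.append((parent, state))
        self._branches[branch] = len(self._commits) - 1

    def read(self, branch, table):
        return self._commits[self._branches[branch]][1].get(table)

    def _is_ancestor(self, ancestor, commit):
        while commit is not None:
            if commit == ancestor:
                return True
            commit = self._commits[commit][0]
        return False

    def fast_forward(self, source, target):
        """Move target to source's commit if target has no commits of its own."""
        src, tgt = self._branches[source], self._branches[target]
        if not self._is_ancestor(tgt, src):
            raise ValueError("branches have diverged; a real merge is needed")
        self._branches[target] = src
```

In this model, a pipeline could write several tables on a side branch while readers of `main` keep seeing the old state, then publish all of the changes at once by fast-forwarding `main`.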

One of the most popular big data formats, Parquet is another technology that was originally developed for Hadoop but, because it works in cloud object stores, has continued to see wide adoption even as Hadoop adoption has faded. Its columnar format lets users run demanding analytic queries in the style of Teradata, while its compression and native support for distributed file systems let it work in modern big data clusters.
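Why a columnar layout suits analytic queries and compression can be shown with a small sketch. It uses plain Python lists rather than the Parquet format itself, and the sample rows and the helper names `to_columns` and `run_length_encode` are made up for illustration.

```python
def to_columns(rows):
    """Pivot a list of row dicts (all with the same keys) into column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse runs of repeated values into (value, count) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1] + 1)
        else:
            runs.append((value, 1))
    return runs


rows = [
    {"user": "ann", "country": "US", "spend": 10.0},
    {"user": "bob", "country": "US", "spend": 20.0},
    {"user": "cho", "country": "KR", "spend": 5.5},
]
columns = to_columns(rows)

# An aggregate over one field reads one list, not every row.
total_spend = sum(columns["spend"])

# Values of the same kind sit together, so repeats compress well.
country_runs = run_length_encode(columns["country"])
```

Run-length encoding is only one simple stand-in for the kinds of encoding a columnar file format can apply; the point is that grouping values by column is what makes such encodings effective.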

Le Dem co-developed Parquet while at Twitter, where analytics ran on either Hadoop or Vertica. Hadoop could scale to handle big data sets, but it lacked performance for demanding queries. Vertica could handle ad hoc queries with excellent performance, but it could not handle big data.

“We were always in the middle of the two options,” Le Dem said. “And I think part of it was making Hadoop more like a warehouse: starting from the bottom up, starting with the columnar representation, and following the tracks of those columnar databases to improve performance.”

Parquet has seen tremendous adoption, but there are still fundamental limits to what it can do. “Parquet is just a file format,” Le Dem said. “It improves the performance of the query engine, but it doesn’t deal with how to create a table, how to do all of those things. So we needed a layer on top. It has been great to see this happening in the community.”

This brings us to Apache Arrow, which was co-created by McKinney and which Le Dem also helps develop. Arrow’s contribution to the open data architecture is a very fast in-memory columnar format for sharing data among a large collection of systems and query engines. That heterogeneity is characteristic of open data architectures, Le Dem said.

“One of the driving forces behind this open storage architecture is that people don’t just use one tool,” Le Dem said. “They [use] things like Spark, they use things like pandas. They use warehouses, or SQL-on-Hadoop type things like Dremio and Presto, as well as other proprietary warehouses. So there is a lot of fragmentation, but they still want to be able to use all these tools and machine learning on the same data. So having this common storage layer [Arrow] makes a lot of sense, standardizing it so that you can create and transform data from a variety of sources.”

The need for Arrow arose in the midst of Hadoop’s hype cycle. “About six years ago, we recognized that the community had developed Parquet as an open standard for data storage and data warehousing in data lakes and the Hadoop ecosystem,” McKinney said.

“But there is increasing heterogeneity among applications and programming languages, and applications are increasingly bottlenecked on moving large amounts of data between programming languages and between application processes. Going through a more expensive intermediary like Parquet to move data between two different steps in an application pipeline is very expensive,” he continued.
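The cost McKinney describes can be illustrated with standard-library stand-ins. This sketch does not use Arrow: it assumes a typed `array` as the shared column layout, JSON as the serialized intermediary, and a `memoryview` as the consumer’s in-place view, and it stays within one process, whereas Arrow targets sharing across processes and languages.

```python
import json
from array import array

# A column of 64-bit floats held in one contiguous, typed buffer.
spend = array("d", [10.0, 20.0, 5.5])

# Route 1: hand-off through a serialized intermediary. Every value is
# encoded to text and decoded again, and the consumer ends up with a copy.
payload = json.dumps(spend.tolist())
decoded = array("d", json.loads(payload))

# Route 2: both sides agree on the layout, so the consumer reads the
# producer's buffer in place. Nothing is encoded, decoded, or copied.
shared = memoryview(spend)

spend[1] = 21.0           # the producer updates a value
copy_sees = decoded[1]    # the copy is now stale
view_sees = shared[1]     # the view sees the update
```

The second route skips the encode and decode steps entirely, which is the kind of saving a common in-memory format is meant to provide when data moves between steps of a pipeline.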

McKinney, who recently folded Ursa Computing into his new startup, Voltron Data, is today working on Arrow Flight, a high-speed data transfer framework that sits on top of gRPC, a remote procedure call (RPC) technology that uses Protocol Buffers for distributed applications. According to McKinney, an extension of Arrow Flight could eventually replace JDBC and ODBC, enabling high-speed data conversion across the board.

In the future, technologies such as Arrow, Iceberg, Nessie, and Parquet will be woven into the data ecosystem, enabling a new generation of productivity among the developers and engineers responsible for building data-driven applications, Murray said.

“A lot of the data engineers I deal with are thinking about how to make sure Parquet files are the right size, which directory they belong in to take advantage of partitioning, whether they have the right schema, and all that kind of stuff,” he said. “And I think we’re ready to stop talking about that, so engineers can just start writing SQL and applications on top of these things.”

Tomer Shiran, CTO of Dremio, said in his Subsurface keynote that freedom of choice is a hallmark of open data lake architectures.

“You can choose the best engine for your particular workload,” Shiran said. “Not only that, but in the future, as new engines are created, you will be able to choose those too. It’s very easy to start up a new engine, point it at your data, whether that’s open source Parquet files or open source Iceberg tables, and start querying and modifying that data.”

Open data lakes and lakehouses are gaining a lot of attention in the market, and Dremio CEO Billy Bosworth predicts they will become the dominant architecture in the future.

“When you see architectural shifts like the one we are seeing today, from classic relational database structures to these open data lake architectures, these types of changes tend to last for decades,” Bosworth said during his Subsurface session. “Our engineers and architects are building that future for all of us. It’s a future where things are more easily accessible, where data arrives faster and the value of that data increases faster, and it does so in a way that lets people have the best choice in the types of services they want to use against that data.”

Related items:

Apache Iceberg: A new data services ecosystem hub?

Do Customers Want an Open Data Platform?

Comparison of open source value for the future of big data
