Understand when to use a database, data lake, or data warehouse


The “data” part of the terms “data lake,” “data warehouse,” and “database” is easy enough to understand. Data is everywhere, and the bits need to be kept somewhere. But should they be stored in a data warehouse, a data lake, or an old-fashioned database? It all depends on how the data will be used.

It is difficult to define the names precisely, because developers toss them around colloquially whenever they find the next best way to save data and answer questions about it. All three forms share the goal of squirreling bits away so that the right questions can be answered quickly.

Nevertheless, these terms have evolved to take on relatively standard meanings.

What is a database?

A database has come to mean both the software that stores and manages information and the information stored within it. Developers use the word database with some precision to mean a particular collection of data, because the software needs to know that orders are held on one machine and addresses on another.

Users rarely know where the values are stored and may just call the entire system a database. And that’s fine; much of software development is about hiding that level of detail. Among databases, the relational database has become the workhorse of much of enterprise computing. The traditional format places data in the columns and rows that make up tables, and it keeps those tables simple by splitting the data into as many tables and subtables as needed. A good relational database adds indexes to speed up table lookups. It can employ SQL and use sophisticated query planning to simplify repeated elements and produce concise reports as quickly as possible.
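As a minimal sketch of that relational pattern, the Python snippet below uses the standard library's sqlite3 module. The table names, column names, and sample rows are invented for illustration. It splits the data into two tables, adds an index, and asks SQL for a short report.

```python
import sqlite3

# In-memory database; every table, column, and row here is illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        address TEXT NOT NULL
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total_cents INTEGER NOT NULL
    );
    -- The index speeds up lookups of orders by customer.
    CREATE INDEX idx_orders_customer ON orders(customer_id);
""")

conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", "1 Main St"), (2, "Grace", "2 Oak Ave")],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, 2000), (2, 1, 500), (3, 2, 1250)],
)

# A concise report: order count and total spend per customer.
report = conn.execute("""
    SELECT c.name, COUNT(o.id), SUM(o.total_cents)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id
    ORDER BY c.name
""").fetchall()

for name, order_count, cents in report:
    print(name, order_count, cents)
```

Addresses live only in the customers table, so each order row stays small and an address change touches a single row.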

More recently, non-relational databases have been attracting attention. These so-called NoSQL databases do not store data in relational tables. They are often chosen when developers want the flexibility to add new fields or elements to some entries and not others.
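That flexibility can be sketched in a few lines of Python. The code is a toy stand-in for a document store, not the API of any real NoSQL product. The names collection, insert, and find are hypothetical.

```python
import json

# A hypothetical "collection": a list of documents that share no fixed schema.
collection = []

def insert(doc: dict) -> None:
    # Round-trip through JSON to mimic how a document store serializes entries.
    collection.append(json.loads(json.dumps(doc)))

def find(field: str, value) -> list:
    # Documents that lack the field are simply skipped, not treated as errors.
    return [d for d in collection if d.get(field) == value]

insert({"sku": "A1", "name": "Widget"})
# This entry carries extra fields that the first one does not have.
insert({"sku": "B2", "name": "Gadget", "color": "red", "tags": ["sale"]})

red_items = find("color", "red")
print(red_items)
```

No table definition had to change to accept the second entry, which is the property that draws developers to this style of storage.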

However, there are use cases where the database is not sufficient.

What is a data warehouse?

A data warehouse is a collection of databases, although some use less structured formats for raw log files. The idea of the data warehouse evolved as companies established long-term storage for the information that accumulates every day and needed a way to report on and analyze that data.

Building a data warehouse takes more than choosing a database and a table structure, because it also requires creating retention policies. Data warehouses often include sophisticated analytics that generate statistics for studying changes over time. They are often tightly integrated with dashboards that quickly display changes in the data and with graphics routines that generate infographics.
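A retention policy and a change-over-time report can be sketched with Python's built-in sqlite3 module. The sales table, the cutoff date, and the helper apply_retention are assumptions made up for this example. A production warehouse would use its own tooling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("2015-03-01", "east", 100),
    ("2021-01-10", "east", 200),
    ("2021-01-20", "west", 50),
    ("2021-02-05", "east", 300),
])

def apply_retention(conn: sqlite3.Connection, cutoff_day: str) -> int:
    """Delete rows older than cutoff_day (an ISO date) and return how many were removed."""
    # ISO dates sort correctly as plain strings, so a text comparison is enough.
    cur = conn.execute("DELETE FROM sales WHERE day < ?", (cutoff_day,))
    return cur.rowcount

removed = apply_retention(conn, "2016-01-01")

# Statistics over time: total sales per month.
monthly = conn.execute("""
    SELECT strftime('%Y-%m', day) AS month, SUM(amount)
    FROM sales
    GROUP BY month
    ORDER BY month
""").fetchall()

print(removed, monthly)
```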

In general, the term data warehouse has come to refer to a relatively sophisticated integrated system that often imposes some order on information before it is stored.

What is a data lake?

Data lakes take a different approach from data warehouses to building long-term storage. In modern data processing, a data lake stores more raw data for future modeling and analysis, while a data warehouse typically applies a relational schema to the information before it is stored. The data in a lake may not merit that additional processing, so a lake may not even use a database to store it. The data is simply saved in flat files or logs.

Lakes are a better choice for storing large numbers of records in case someone wants to access some of them in the future. Regulatory compliance is a common use case.

Some use both metaphors for the same system. The incoming raw data is stored in a data lake, and after analysis and aggregation, the information often finds a home in a data warehouse.
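The flat-file approach can be illustrated with a short Python sketch. The directory, the file name events.log, and the helpers store_raw and read_events are hypothetical. The point is only that nothing is validated on the way in, and structure is imposed when someone finally reads the data.

```python
import json
import tempfile
from pathlib import Path

# A temporary directory stands in for cheap bulk storage.
lake_dir = Path(tempfile.mkdtemp())
log_file = lake_dir / "events.log"

def store_raw(line: str) -> None:
    # No parsing and no schema on write: the line is appended as-is.
    with log_file.open("a", encoding="utf-8") as f:
        f.write(line.rstrip("\n") + "\n")

def read_events(event_type: str) -> list:
    """Interpret the raw lines only at read time, skipping any that do not parse."""
    found = []
    for line in log_file.read_text(encoding="utf-8").splitlines():
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict) and record.get("type") == event_type:
            found.append(record)
    return found

store_raw('{"type": "login", "user": "ada"}')
store_raw("not even JSON, kept anyway")
store_raw('{"type": "purchase", "user": "ada", "cents": 999}')

purchases = read_events("purchase")
print(purchases)
```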
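A rough Python sketch of that flow appears below, with invented sample data. Raw lines such as those a lake might hold are aggregated and then loaded into a structured table that plays the part of the warehouse. The names raw_lake_lines and region_totals are assumptions for illustration.

```python
import json
import sqlite3
from collections import Counter

# Raw records as they might sit in a lake: one JSON document per line.
raw_lake_lines = [
    '{"region": "east", "cents": 500}',
    '{"region": "west", "cents": 250}',
    '{"region": "east", "cents": 100}',
]

# Analysis and aggregation step: total sales per region.
totals = Counter()
for line in raw_lake_lines:
    event = json.loads(line)
    totals[event["region"]] += event["cents"]

# The aggregated result finds a home in a table with a fixed schema.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE region_totals (region TEXT PRIMARY KEY, cents INTEGER NOT NULL)"
)
warehouse.executemany(
    "INSERT INTO region_totals VALUES (?, ?)", sorted(totals.items())
)

rows = warehouse.execute(
    "SELECT region, cents FROM region_totals ORDER BY region"
).fetchall()
print(rows)
```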

What are some examples?

Databases, warehouses, and lakes come in many forms because companies have different needs for keeping records of the past. The choices they make about those records affect architecture and structure. Here are some fictitious examples:

  • Drop shipping company. They sell gadgets online and outsource fulfillment to others. They use a basic database to track orders and often destroy records shortly after the order is delivered. They feel that historical data is not needed because their products change frequently.
  • Clinic. The medical industry has elaborate regulations to protect patient privacy. The clinic can use a special service that keeps patient records and supports long-term searches for queries that may come years later. This service acts like a lake because the doctors and patients are not engaged in research that might involve comparing and contrasting treatment results. The service only needs to save and retrieve, not analyze.
  • Manufacturing company. The company has a dominant position in a stable industry and needs to make wise decisions about long-term trends in sales and prices. They need to compare regional sales over time before committing to opening and refurbishing factories and physical warehouses. Managing this supply chain is much easier with an advanced data warehouse that can execute complex queries.
  • Network security group. Routers and switches collect large amounts of raw data about packets moving over the network in case someone wants to analyze an anomaly. These raw values are stored in a big data lake for weeks until they are no longer needed. If no unusual events occur, the data is discarded without being analyzed.
  • Pharmaceutical company. The company collects raw data on clinical trials and compiles aggregated reports for regulators. The company probably wants to retain the data indefinitely to support future researchers and answer questions from regulators. They use a data lake to collect the initial raw information and a warehouse to store the aggregated reports.

What legacy companies are doing in this area

There are two main themes. Some companies that make traditional databases have added capabilities to support analytics and turned the finished product into a data warehouse. At the same time, they are building large-scale cloud storage with similar capabilities to support companies that want to outsource long-term storage to the cloud.

Microsoft’s Azure is moving data warehouse work to “Synapse Analytics.” It integrates Microsoft cloud storage with various routines that can include some artificial intelligence. The tool is designed to scale to handle petabytes of data using technologies such as Apache Spark, which was developed to transform, analyze, and query big datasets. Microsoft also emphasizes that storage and computing are billed separately, so users can save money when they can turn off instances dedicated to analytics.

Microsoft also offers some of the same storage and analytics options under the name Data Lake. It includes both SQL-based options and more general object storage, and its marketing materials target “data of all sizes, shapes, and speeds.”

Oracle also offers an Autonomous Data Warehouse for the cloud and on-premises, which integrates its autonomous database with a number of tools with enhanced analytics routines. The service hides all the work of patching, scaling, and protecting your data. It also includes some data lake features, including traditional big data tools such as Apache Spark, under the “Big Data” product name.

IBM Db2 users can also choose IBM’s cloud services to build their data warehouse. Also available as a Docker container for on-premises hosting, this tool bundles machine learning, statistics, and parallel processing analysis routines with several migration tools for integrating data sources.

What start-ups are doing in this area

Many data warehouses and data lakes are built on-premises by in-house development teams. These teams use the company’s existing databases to create a custom infrastructure for answering larger and more complex queries. They stitch together data sources and add applications that answer the most important questions. In general, warehouses and lakes are designed to build strong historical records for long-term analysis.

Cloud companies offer two different kinds of solutions. First, they want to help store the data. For example, Amazon offers a variety of storage solutions at different prices, trading speed for savings. Some tiers cost less than $1 per terabyte per month for storage alone, but retrieval may incur additional charges. Some of the slower tiers, called Glacier, can also use a basic SQL subset, a useful feature that turns long-term storage into something like a database for finding specific data elements. Amazon also offers a wide range of analytics tools, including the Redshift cloud data warehouse, which works with all of its storage options.

Second, cloud companies are integrating analytics tools with storage to turn storage racks into data warehouses or data lakes. Google’s BigQuery database, for example, is also integrated with some of Google’s machine learning tools, so you can explore using AI with data already stored on disk.

Some start-ups offer some of these services and not others. Backblaze, for example, stores data at rock-bottom prices that can be 60%, 70%, or even more below those of the major clouds. Its API is designed to work like Amazon’s S3, making it easy to switch.

Others are designed to work with any data source. Teradata and Snowflake, for example, are two companies that offer advanced tools for adding analytics to your data. They emphasize a multi-cloud strategy so that users can build warehouses from many storage options.

What the database cannot do

Can a database do the work of a data warehouse or data lake?

The terminology is neither precise nor consistent, but databases are generally more limited in size. Data warehouses and data lakes refer to collections of databases that may be contained in one integrated product, but often they are collections built from the products of different vendors. The metaphor is flexible enough to support many different approaches.

This article is part of a series on enterprise database technology trends.

Venture Beat
