How do data engineering and ETL fit together?

Away with ETL!

In this article, I'll explain how you can significantly reduce the time and money involved in ETL. I am going to introduce you to the MarkLogic approach. In fact, with MarkLogic, the goal is not really to "get rid of ETL"; it is to take a new, modern approach in which you can load data in any format and then search, harmonize and enrich it. We call this the "E-L-T" approach: extract, load and then transform. Our customers regularly choose this MarkLogic approach because it is faster and more flexible. That means it only takes weeks or months to go live instead of years.

But I'm getting ahead of myself. Let us first consider how data integration has been done so far: extract, transform, load.

  • Extract: the process of fetching data from the source databases. This step also typically verifies that the extraction performed as expected.
  • Transform: the process of changing the data schema to be consistent with other data sources and to conform to the target schema before data is loaded into the target database. This can be simple reformatting, such as changing a date format or selecting which columns to include. However, the process is often more complex and includes pivoting, aggregation, disaggregation, translation, encryption and much more.
  • Load: the process of loading or including data in the final target database. Strategic decisions have to be made about the timing and frequency of the load processes, and about whether existing data should be overwritten or supplemented by the newly loaded data.
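
To make these three steps concrete, here is a deliberately simplified sketch of a traditional ETL job, written as plain JavaScript. It is purely illustrative: real jobs read from and write to databases, whereas in-memory arrays stand in for the source and target here, and all field names are invented.

// Source rows as a stand-in for an extract from the source database.
const sourceRows = [
  { id: 1001, first: "Paul", last: "Jackson", postal_code: "94111" }
];

// Extract: fetch the rows from the source system.
function extract() {
  return sourceRows;
}

// Transform: reshape every row to the agreed target schema before loading.
function transform(rows) {
  return rows.map(r => ({
    customerId: r.id,
    name: r.first + " " + r.last,
    zip: r.postal_code // renamed to the canonical column name
  }));
}

// Load: write the already-conformed rows into the target (here just an array).
const target = [];
function load(rows) {
  target.push(...rows);
}

load(transform(extract()));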

The ETL process may seem like a simple 3-step sequence at first. In reality, however, it is very expensive and time consuming.

ETL software is expensive

A variety of tools are offered to help manage this process. Examples of ETL software vendors include Informatica (PowerCenter, PowerExchange, Data Services and 15 other products), IBM (InfoSphere product suite), SAP (Data Services, Process Orchestration, HANA Cloud Integration, etc.), Oracle (Data Integrator, GoldenGate, Data Service Integrator, etc.), SAS (Data Management, Federation Server, SAS/ACCESS, Data Loader for Hadoop, Event Stream Processing) and the open-source provider Talend (Open Studio, Data Fabric, etc.). Of course, there are many vendors who would like a piece of the ETL pie, and I'm always surprised at how many ETL tools companies can afford.

In total, companies spend $3.5 billion each year on ETL software for data integration (Gartner, "Forecast: Global Markets for Enterprise Software, 2013-2020, Update 1Q16," March 17, 2016). All of these tools are used to prepare data for loading into an operational database or data warehouse. However, the ETL work takes up a disproportionate share of project costs and budget. In fact, 60 to 80 percent of the total cost of a data warehouse project can be attributed to ETL software and processes (TDWI, "Evaluating ETL and Data Integration Platforms").

ETL takes a lot of time and effort

Procuring the tools is already expensive - but that's not all. Setting up, using and managing the ETL tools and database connections requires a lot of work. After setting up the tool, the data sources must be compiled and the data models developed. In addition, the changes that occur in the course of the process must be managed. The numbers show that large data integration projects at Fortune 500 companies take two to five years - measured from the start of the project until the application goes live! The total cost of ETL and data integration projects adds up to approximately $600 billion per year (MuleSoft estimates). That estimate seems very high to me, but I think we can all agree that ETL is a really expensive endeavor.


Reduce the cost and time required for ETL

MarkLogic takes a different approach to data integration. But how does this approach differ?

First, MarkLogic can manage multiple schemas in the same database, meaning that you can load data in any format without having to define and agree on a single "over-schema" in advance. So you can get started faster. It also eliminates a lot of the tedious rework that relational databases typically require when a change is made or a new data source is added.

Second, MarkLogic lets you harmonize and enrich the data within MarkLogic itself. So after loading, you can do all of the transformation work (the "T" in ETL) right in MarkLogic - just as reliably as with a transactional database. And you don't need a separate tool for this.

Would you like to know how much time and money this will save?

A typical MarkLogic data integration project is four times faster than traditional data integration with a relational database and ETL tools. Customers using MarkLogic have demonstrated this in many large projects across a variety of industries. And it's not just about speed - better data quality is also achieved. We'll go into that in more detail later. Let us first consider data integration using the traditional approach.


Challenges in relational databases and ETL processes

When integrating data from data silos, most companies work with multiple relational databases. The challenge is to efficiently transfer the data from these existing databases into another database.

The most common reasons for integrating data are:

  • Introduction of a new application that should also access existing data
  • Development of a uniform 360-degree view of data in an analytical data warehouse
  • Development of a new system for master data management (MDM)
  • Migration of data between operational systems for internal or cross-company purposes
  • Support of governance, audits or monitoring tasks
  • Mergers and acquisitions
  • Permanent storage of data that should be kept

The process of integrating data differs slightly in each scenario, depending on the scope of the project, the type of data sources and the tools used. Regardless of the use case or pattern, however, the general approach always requires extensive ETL work to convert heterogeneous data into homogeneous data that fits a single schema. In practice, the process usually looks something like this:

1. Gather all the schemas

First, you need to know what data you have and how it is organized. To do this, you have to bring together all the schemas that describe how your data sets are stored into one "superset". This information is usually not contained in the database itself. Schemas are usually represented in the form of Entity Relationship Diagrams (ERDs), which may be printed out and posted or stored somewhere in a folder. In my consulting work for other companies, I have found that this folder can usually be found with the database administrator who has been with the company the longest.

2. Analyze the schemas

When you have all the schemas in hand, it is time to take a look and figure out what they mean, how they work, and what they have in common. You may need to dig deeper into your data (and in some cases your code) to fully understand it. For example, if you have a "customer" entity, you may need to look at rows in different tables to find out what "customer" means in that database.

3. Determine what can be left out

If you try to incorporate all of your schemas AND keep all of your data, you will never get done. So you decide which data is not that important and only take over the data that is really needed for the project you are working on. Let's say you are facing a data integration project with 30 schemas. This is where you will definitely be wondering if you can leave out some low priority data so that you don't have to do the modeling 30 times. Perhaps the transaction status from the first model is not strictly necessary for the application you are building with the combined data. Maybe you could do without the delivery address and could simplify the model a lot by just taking one phone number per customer ... you know what I mean.

4. Develop the "over-schema"

Now you have to develop a schema that "rules them all" - that is, a schema with a place for all the data that you want to take over from the various source systems. In addition to planning the new schema, you also need to decide how to handle errors (for example, invalid data in an input field), how to keep metadata about the source of each data element, and which transformations were made to bring it into the new schema. And if you want to capture changes to your data and schemas, you need to think about versioning too.

5. Write ETL jobs and load data

Let's say you made it this far. You can now turn to moving the data. To do this, you must first ensure that all required transformations are carried out. You can then begin bulk loading the transformed data into your new database.

6. If there are changes, start over

If a new data source is added, or another change is made along the way that affects the schema, you'll have to start over from step 1 - unless you're lucky and the new data source fits into the existing over-schema. Often, however, the new data source introduces a new many-to-many relationship or a different concept of the data, which requires a change that in turn affects the existing applications. You are probably familiar with cases in which a person suddenly has many addresses when you were expecting only one, or in which the VARCHARs in a table become a problem.


MarkLogic's Approach to "ETL"

With MarkLogic, data integration is much easier and faster:

1. Load data in any format

You first load data sources in any format into MarkLogic and use the built-in MarkLogic search function (the universal "Ask Anything" index) from day one to get to the bottom of the data. You don't need to create an overarching schema first. You can load further data at any time, even if it follows a different schema. MarkLogic can easily work with multiple schemas in the same database, and all data is instantly searchable.

2. Harmonize data

This roughly corresponds to the "T" in ETL, but differs from the traditional ETL process in two important ways. First, you don't need to leave out any data. Second, you don't need to transform ALL of the data. If, for example, one schema stores the ZIP code under "Zip" and another under "PostalCode", you only need to transform the affected elements so that the ZIP code can be queried in a uniform way. You carry out the harmonization in such a way that the original value in the source data is not lost. We'll go into more detail about how this is handled in MarkLogic later.

3. Enrich data

In addition to harmonizing the data, you can also enrich it. You can give the data additional context so that it is easier to understand what it means for your business. You expand the raw data from your silos with new or extended data to enable new applications, functions or analyses. Suppose you have a record with an address: you can then enrich it with GPS coordinates, for example - a schema change that is easy to make with MarkLogic's document model.

4. Load more data

Since data integration with MarkLogic is iterative and not in the form of a “big bang”, you can always load new data and adapt your data model as required.

One benefit for customers is that they do not need any ETL tools in this data integration process. All of the necessary work can be done from within MarkLogic. This simplifies the process and saves customers a lot of money.

In addition, everything goes much faster - about 4 times faster. A head-to-head comparison between MarkLogic and the traditional relational database + ETL approach at a number of Fortune 500 companies showed that large MarkLogic projects were typically up and running in 6 months instead of 2 years. The diagram below shows where the most time is typically saved:

* Based on a comparison of the actual data at a Fortune 500 company and the average project completion times for other customers from different industries.

In addition to faster project processing, a higher quality result is also achieved. Instead of creating another data silo, you can now manage your data centrally and in a schema-agnostic manner, relate it to one another and store it in a reliable and transaction-oriented way. Unlike other NoSQL databases, MarkLogic can execute cross-document transactions that are 100% ACID-compliant.

You can use MarkLogic for both operations and analysis without having to deal with endless ETL cycles. Your data can keep growing and you don't need to spend time or energy moving it from silo to silo. The transfer of data between silos does not add value and should therefore be reduced to a minimum.


A closer look

Let's take a closer look at each step in the process to see how MarkLogic achieves these results when it comes to integrating data from silos.

Step 1: Load data in any format

MarkLogic is schema-agnostic and uses a flexible data model that stores data in the form of JSON and XML documents. (It is also possible to store data as RDF triples, but we will concentrate on documents first.) Documents are flexible, hierarchical data structures and thus differ from the strict row-and-column format. MarkLogic can manage documents of various formats. In other words, MarkLogic can work with multiple schemas in the same database, and you can easily load your data in any format - a small example follows the list below. That in turn means:

  • You don't need to create an “over-schema” that requires you to transform all of the data before you can load it into the database.
  • You don't have to do too much at once and can load some of your data immediately and others only later.
  • You don't need an ETL tool to prepare and transform the data to be loaded.
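
For instance, two customer records from different source systems might look like this and can still sit side by side in the same MarkLogic database; the field names and values are invented for illustration:

{ "ID": 1001, "Fname": "Paul", "Lname": "Jackson", "Zip": 94111 }

{ "cust_no": "A-7734", "name": "Dana Wong", "PostalCode": "02139" }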

Customers typically use a tool called MarkLogic Content Pump (MLCP) to load large amounts of data. MLCP is an efficient and powerful tool that can, if required, transform the data as it is ingested. But you also have the option of transforming and harmonizing the data in MarkLogic later. (There is a short on-demand tutorial video on this from MarkLogic University.) If you prefer not to use MLCP, you can also load the data using other methods, such as the REST API, the Java Client API, the Node.js Client API or native Java/.NET connectors.
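
As a rough idea of what loading "as is" can look like without any ETL tooling, here is a minimal sketch in MarkLogic server-side JavaScript. The URI, the sample record and the "transform" collection name are illustrative choices, not a prescribed layout:

declareUpdate();

// Insert one raw source record exactly as it arrived, tagging it with a
// collection so it can easily be found and harmonized later.
xdmp.documentInsert(
  "/customers/finance/1001.json",
  { "ID": 1001, "Fname": "Paul", "Lname": "Jackson", "Zip": 94111 },
  { collections: ["transform"] }
);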

Once the data has been loaded, you can search it right away as it is instantly indexed with MarkLogic's universal “Ask Anything” index. This works in a similar way to a Google search in the database - you search for a keyword and see which documents contain it.

The integrated search function is extremely important for data integration. Consider the example above again: suppose you have loaded customer records that contain postal code information. In some records it is stored under "Zip", in others under "PostalCode". You can use MarkLogic to search for "94111" and find all records with the corresponding ZIP code. In a relational data integration project, by contrast, you would need to harmonize all of the records before loading - or omit the ZIP code and not keep it at all. You would spend a lot of time harmonizing all of the attributes of all of the entities in all of the source schemas, regardless of whether they are needed for your queries; the only other option would be to leave them behind.
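
To give an impression of what such an "ask anything" query might look like in MarkLogic server-side JavaScript, here is a small sketch; the search term is the ZIP code from the example, and this is just one of several possible query styles:

// Find every document that mentions "94111", regardless of its schema,
// and log the URI of each match.
for (const doc of cts.search(cts.wordQuery("94111"))) {
  console.log(xdmp.nodeUri(doc));
}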

If you are interested in the "load data in any format" phase, read the blog article "What 'Load As Is' Really Means".

Step 2: Harmonize the data

Once you have loaded the data and checked the result, the question arises: which part of your data do you really need to harmonize in order to enable standardized queries? You take your idiomatic (i.e. native, raw) data as input - even the quirky parts - and produce clear, canonical representations as output.

The zip code is probably an important element to harmonize as you will need it for your application.

Once you have decided which data you want to harmonize, you can use the envelope pattern to build up a growing canonical model (i.e. a simple pattern for managing agreed data definitions). The envelope pattern is a best-practice approach to data integration in MarkLogic. An envelope is created that contains only the canonical elements that are to be queried uniformly across all data sources. So if you know you need the ZIP code, you put this element in the envelope. If you then want to add "City", "State", "Favorite soccer team" or another element later, that's no problem.

Here is an example of the envelope pattern, in which the original source data of each JSON document has been nested under a new property called "source":
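
A minimal sketch of what such a wrapped record might look like; the field names and values are borrowed from the customer example used later in this article, with the harmonized ZIP code kept alongside the untouched original:

{
  "canonical": {
    "Zip": [94111]
  },
  "source": {
    "ID": 1001,
    "Fname": "Paul",
    "Lname": "Jackson",
    "Addr": "123 Avenue Road",
    "City": "San Francisco",
    "State": "CA",
    "Zip": 94111
  }
}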

In this example, it might be important to harmonize by ZIP code because you are developing an application that shows how many customers you have in each region and what they have bought. In addition, it should be possible to drill down to product categories and types and to the product description texts. Your regions are defined by ZIP code, so you must be able to query "Zip" consistently.

In fact, it's pretty easy to add an envelope in MarkLogic. The following example is intended to illustrate this. It's an update query that you would run to add the envelope as a JSON property:

declareUpdate();

// get all documents from the "transform" collection
var docs = fn.collection("transform");

for (var doc of docs) {
  var transformed = {};

  // add an "envelope" property as an empty object to hold further data about the original document
  transformed.envelope = {};

  // save the original document in a new property called "source"
  transformed.source = doc;

  xdmp.nodeReplace(doc, transformed);
}
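
A possible follow-up pass - shown here only as a sketch - could then fill in the canonical part of each wrapped document, taking the ZIP code from whichever property the source system happened to use. The top-level "canonical" property matches the documents shown later in this article, and "PostalCode" is an assumed alternative field name:

declareUpdate();

for (var doc of fn.collection("transform")) {
  var obj = doc.toObject();
  var src = obj.source;

  // copy the ZIP code into the canonical section, whatever the source called it
  obj.canonical = { "Zip": [ src.Zip || src.PostalCode ] };

  xdmp.nodeReplace(doc, obj);
}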

The envelope pattern gives you the flexibility to build your canonical data model from the bottom up. Your model can meet the combined needs of all applications and services that use this data. The individual applications only need to understand the parts of the model they are using. And because every data set contains a consistent canonical format, the data from MarkLogic can be used by everyone, even without understanding the source schemas from the upstream systems.

Step 3: Enrich the data

What if you not only want to transform and harmonize your data, but also supplement and enhance it? Then you can enrich the data. This optional step is extremely useful because it ensures that important information is linked directly to the data it relates to. In other databases, such information is often missing or hard to find.

In the example below, a "metadata" property is added to the document. It contains information about the source of the data, the date it was ingested and its lineage.

{
  "metadata": {
    "Source": "Finance",
    "Date": "2016-04-17",
    "Lineage": "v01 transform"
  },
  "canonical": {
    "Zip": [94111]
  },
  "source": {
    "ID": 1001,
    "Fname": "Paul",
    "Lname": "Jackson",
    "Phone": "415-555-1212|415-555-1234",
    "SSN": "123-45-6789",
    "Addr": "123 Avenue Road",
    "City": "San Francisco",
    "State": "CA",
    "Zip": 94111
  }
}

You may also want to add important information such as geographic coordinates. This type of schema change is quite difficult in a relational database, but is extremely easy with the document model:

{"metadata": {"Source": "Finance", "Date": "2016-04-17", "Lineage": "v01 transform"

“Location”: “37.52 -122.25”,}

, "canonical": {"Zip": [94111]}, "source": {"ID": 1001, "Fname": "Paul", "Lname": "Jackson", "Phone": "415-555 -1212 | 415-555-1234 "," SSN ":" 123-45-6789 "," Addr ":" 123 Avenue Road "," City ":" San Francisco "," State ":" CA "," Zip ": 94111}}
Another optional step: adding semantics

Earlier I described how you can take advantage of the MarkLogic document model. But that's not all: in addition to the simple transformation already described, MarkLogic's multi-model capabilities give you many more options, such as adding semantic triples. For example, you can add triples as metadata to create relationships between documents, which makes joins easier to work with and makes it easier to build queries across the multitude of relationships in your data. An example: customer 1001 has placed order 8001, and order 8001 contains product 7001. Using triples, the database can infer from this that customer 1001 has ordered product 7001. These are three simple facts that describe relevant relationships, and triples are ideal for modeling them. And because documents and triples can be combined in MarkLogic, you can include both in the same query. This is a unique capability not found in other databases.
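
To make the order example a little more concrete, here is a sketch of inserting the three facts as triples and then asking the combined question with SPARQL in MarkLogic server-side JavaScript. The IRIs are invented example identifiers, not a recommended vocabulary:

declareUpdate();

// Three facts: customer 1001 placed order 8001, and order 8001 contains product 7001.
sem.rdfInsert([
  sem.triple(sem.iri("http://example.org/customer/1001"),
             sem.iri("http://example.org/placed"),
             sem.iri("http://example.org/order/8001")),
  sem.triple(sem.iri("http://example.org/order/8001"),
             sem.iri("http://example.org/contains"),
             sem.iri("http://example.org/product/7001"))
]);

// In a later request (the triples above are not visible within the same
// transaction), ask which products customer 1001 ended up ordering:
const products = sem.sparql(
  'SELECT ?product WHERE { ' +
  '  <http://example.org/customer/1001> <http://example.org/placed> ?order . ' +
  '  ?order <http://example.org/contains> ?product ' +
  '}'
);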

In a relational database, the relationships are inherently weak. Semantics, on the other hand, gives you a meaningful graph that you can take advantage of. Much more could be said about adding semantics for data integration, but I will save the details for a separate article on the topic.

If you'd like to learn more about harmonizing and using semantics, check out our new on-demand training course, Progressive Transformation Using Semantics, offered by MarkLogic University.

Step 4: Load more data

Instead of defining your schema in advance and loading all data according to that fixed schema, MarkLogic allows you to load new data sources at any time - regardless of whether they match the canonical schema you are already using.

So for each new data source, you simply start by loading the data in any format. Then think about which elements of the new data you want to put in your envelope. If the source schema does not contain anything that maps to your canonical schema (e.g. no ZIP code), then nothing goes into the envelope. If the data contains something new that belongs in the canonical model, just add it. You don't need to rebuild the entire database, because a change to the canonical model resulting from the incorporation of a new schema will not affect your existing data or the applications and services that use your data.


The result with MarkLogic

One of the great things about MarkLogic is that it can accelerate the data integration process by up to four times. The fact that a project can be completed in 6 months instead of 2 years is an important factor. MarkLogic makes it easier to get data out of silos, and the transformation can be done right within MarkLogic. This allows you to develop your operational data hub with a flexible process that is better and faster than the process with relational databases and conventional ETL tools. Let's summarize why the "E-L-T" approach with MarkLogic - loading the data in any format and harmonizing it afterwards - is better and easier than the conventional ETL method:

  • Load source data in any format
  • Preserve lineage and other metadata
  • Harmonize only what is necessary
  • Update the model later without having to re-ingest the data
  • New schemas do not affect existing data or applications

Another important benefit of the MarkLogic approach is that it also provides better data governance. This is important because, in most cases, data integration is not about just one application. You might start with a specific use case, but ultimately the cycle of creating ever new silos should be broken. It should be possible to centralize and standardize your data and to structure and organize it so that it can serve many different use cases.

To avoid creating new silos over and over again, you need to think about the data governance aspects of your new data hub. These include security, data protection, lifecycle, retention, compliance and other guidelines and rules that are independent of the respective applications.

What role do lineage and provenance play in data governance? With MarkLogic, you know where each attribute came from, when it was extracted, what transformations it underwent, and what it looked like before. Nothing is lost along the way. MarkLogic's schema-agnostic concept keeps your source data, the canonical data, and all metadata in the same record, and they can all be queried together. With this approach, you spend less time moving data around the company and debugging transformation code, and more time developing applications and getting value from your data. There are similar advantages for security, confidentiality and compliance when all data is available and traceable in a unified, centralized hub.


Next steps

There is much to learn about this topic, and this article was intended to give you detailed examples of why MarkLogic is so well suited for data integration. If you want to dig deeper, I recommend the following links: