What it is, why you need it, and best practices. This guide provides definitions and practical advice to help you understand and establish modern data lineage.
Data lineage refers to the process of understanding and visualizing data flows from source to current location and tracking any alterations made to the data on its journey. This lets you know where any specific piece of data comes from, when and where it separated and merged with other data, and what transformations that have been applied to the field, from initial input to final application.
A modern data lineage tool gives you instant visibility into the source and journey of your data. You can see how the data lineage example below would give you confidence in your data and help you trace any errors back to the root cause.
Data lineage creates a data mapping framework by collecting and managing metadata from each step, and storing it in a metadata repository that can be used for lineage analysis. (Metadata is defined as “data describing other sets of data”.) For each process applied to data in its journey, the metadata is updated as shown in the simplified data lineage diagram below.
Your organization is likely flooded by large and complex datasets from many sources—financial systems, web analytics, ad platforms, CRM systems, marketing automation, partner data, and maybe even real time sources and IoT. So, knowing where your data is coming from and knowing you can trust it can be a major challenge.
The primary benefits of a robust data lineage process are that it allows you to do the following:
Data lineage also provides “explainable BI” which is one of the top 10 BI and data trends this year.
Modern data lineage tools should make all your data transparent, trustworthy, and ready for analysis. Below are the key capabilities of the best data lineage tools.
Visualization. You should be able to easily visualize how the data travels throughout its full journey, from the data source to the end-user application.
Data Catalog. The best tools allow you to search and explore all your data with the help of an integrated data catalog.
Reports. Make sure your data is structured according to your guidelines with formal reports.
Automated documentation. Your tool should generate system documentation automatically, collecting all the node comments, metadata, tables, fields, related files and database statements for the chosen application into one single document.
Simple install. Modern data lineage tools are read-only and don't interfere with any of your company data upon installation.
Customize and connect. You should be able to easily customize your environment to suit your business and its data. Plus, easily connect your lineage tool with your visualization tools, data warehouses and cloud services.
Here are the main techniques used to perform data lineage:
Modern data integration delivers real-time, analytics-ready and actionable data to any analytics environment, from Qlik to Tableau, Power BI and beyond.