Data Craze Weekly #8

This message was first sent to subscribers of the Data Craze Weekly newsletter.


    Week in Data

    Data Lineage, what is it and is it needed?

    What is Data Lineage?

In plain terms, it is nothing more than documenting the flow of data: from a column in a specific table to the place in a report where it is used.

    Documenting such a flow is an interesting problem because we have a multitude of tools for data processing and visualization.

Standards for writing data pipelines differ between teams (and even between people on the same team) - e.g. SELECT abc FROM xyz vs SELECT * FROM xyz (assuming the xyz table has only one column, abc, the result will be the same - but will a lineage tool know what * stands for?).
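
To make it concrete, a minimal sketch (the table and view names are made up for illustration):

    -- Explicit column list: a lineage parser can map report.revenue
    -- back to orders.amount just from the query text.
    CREATE VIEW report AS
    SELECT o.amount AS revenue
    FROM orders o;

    -- Wildcard: the parser has to resolve * against the current schema
    -- of orders, and the mapping silently changes when columns are added.
    CREATE VIEW report_wildcard AS
    SELECT *
    FROM orders;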

But don’t worry - as with everything, there is a panacea for data lineage too. In addition to paid commercial solutions, it is worth paying attention to Open Source projects.

    The author mentions, among others:

    • Apache Atlas
    • Airflow Lineage Backend
    • Marquez
    • OpenLineage

The reason I single out OpenLineage is that other products (e.g. Marquez) send lineage information based on its standard.

    OpenLineage is an open source framework for sending lineage metadata between services. This is the standard that is used by Marquez and many other systems such as Apache Atlas, Amundsen and Egeria. If you read the documentation, OpenLineage seems very tightly connected to Marquez. But don’t worry, OpenLineage is supposed to be a generalized framework for anyone to use.

Now that we know what data lineage is, it is worth asking whether it is needed at all.

The answer is up to you: do you want to dig through X files until you find where a given attribute comes from, or would you rather look it up and get the entire flow right away?

    Link: https://medium.com/bliblidotcom-techblog/data-lineage-state-of-the-art-and-implementation-challenges-1ea8dccde9de

CTE, not always safe and efficient

CTE, i.e. Common Table Expression, better known as the popular WITH.

Praised by everyone, a fixture of elegant, well-thought-out SQL queries. A bringer of order, a sower of the DRY (Don’t Repeat Yourself) idea - what could possibly be wrong with it?
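
For anyone who has not met it yet, a minimal sketch (table and column names are made up):

    -- A CTE gives a subquery a name that the main query can reuse.
    WITH recent_orders AS (
        SELECT customer_id, amount
        FROM orders
        WHERE order_date >= '2022-01-01'
    )
    SELECT customer_id, sum(amount) AS total
    FROM recent_orders
    GROUP BY customer_id;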

In his article, Adrian Bednarz shows, using Snowflake as an example, how a CTE can negatively affect performance (in a very specific case) and how to deal with it.

Surprise, surprise! This code executes in 6 seconds. One CTE and 2x performance degradation. But why? The way Snowflake optimizer tackles this query is to gather all conditions from all the subqueries, join them with OR operator and apply to imported table and then forward the elements to intermediate CTEs. This essentially means that we need one extra scan over the data.
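
The article’s exact query is not reproduced here, but the problematic shape looks roughly like this (hypothetical table and filters):

    -- One CTE consumed by two branches with different filters.
    WITH imported AS (
        SELECT id, status, amount
        FROM raw_events          -- hypothetical source table
    )
    SELECT count(*) FROM imported WHERE status = 'NEW'
    UNION ALL
    SELECT count(*) FROM imported WHERE status = 'DONE';
    -- Per the quote above, the optimizer first scans raw_events with the
    -- OR-ed predicate (status = 'NEW' OR status = 'DONE') and then
    -- re-applies each branch's filter, costing an extra pass over the data.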

And if you use a PostgreSQL database, it is worth reading about the MATERIALIZED modifier for CTEs HERE.
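
For reference, the PostgreSQL 12+ syntax (a minimal sketch; some_table is hypothetical):

    -- Force the CTE to be computed once into its own result set
    -- (the pre-12 behavior; acts as an optimization fence):
    WITH big AS MATERIALIZED (
        SELECT * FROM some_table
    )
    SELECT * FROM big WHERE id = 1;

    -- Allow the planner to inline the CTE into the outer query,
    -- so the WHERE clause can use an index on some_table.id:
    WITH big AS NOT MATERIALIZED (
        SELECT * FROM some_table
    )
    SELECT * FROM big WHERE id = 1;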

    Link: https://techwithadrian.medium.com/the-hidden-risk-of-using-ctes-53b241e256b2

A library for Data Quality, from an ML (Machine Learning) perspective

Folks at DoorDash have created a useful Python library that helps gather information about the quality of datasets.

A hugely important issue in general, but especially from a machine learning perspective: garbage in, garbage out.

    Data quality issues come in a variety of forms, from obviously missing and extreme values to biases hiding in duplication and defaulting. Detecting these issues quickly and diagnosing likely causes will help prioritize which problems need to be solved, which can be handled through ML techniques, and ultimately lead to better performance and success in ML modeling projects.

Among other things, the library makes it easy to:

• identify missing values
• identify incorrect values
• find anomalies in the data distribution (so-called outliers)
• find invalid data types

    Link: https://doordash.engineering/2022/09/27/five-common-data-quality-gotchas-in-machine-learning-and-how-to-detect-them-quickly/

A Modern Data Stack example from GetInData in a project for Volt.io

I love success/failure stories; they let me understand the problem someone faced and the path (the decisions made) to solving it.

In this particular case, we get to see the approach to a new data architecture (based on a Modern Data Platform) that the GetInData team built for Volt.io.

    System elements:

    • data ingestion: Airbyte (EC2 instance, external Aurora DB)
    • data transformation: dbt (as binary on MWAA machines)
    • data presentation: Looker (EC2 instance)
    • data orchestration: MWAA (the managed version of Apache Airflow on AWS)
    • data governance POC: Datahub (EC2)
    • data warehouse: Snowflake (private link)

In the article, apart from plenty of substance (diagrams, decisions), you will also find tidbits such as using sqlfluff to standardize the code.

    Link: https://getindata.com/blog/how-we-built-modern-data-platform-for-fintech-scale-up/

    Tools

BetterBrain – what if we simplified writing SQL down to plain English sentences – SELECT * FROM products = Give me all products – sounds good?

This is what the BetterBrain company/product wants to do. Subjectively, it is a product that is both cool and not cool.

    Not cool for two reasons:

• The creator’s rhetoric is simply off: “Everybody hates writing SQL.” – hmm, probably not. Subjectively, it would be better to point to repetitive, simple SQL tasks (e.g. counting unique values in a table - commonly done in testing).
• Much like the ORMs built into frameworks such as Django (in Python) or Spring (in Java), it lets us avoid writing SQL. But do such layers always produce good/optimal queries? Unfortunately, no.

    Cool because:

• It will allow people who know the business domain very well but do not know SQL (or are just learning it) to enter the world of databases. An investor given access to a database may not know how to join the tables, but can see that tables A and B exist and are worth using.
    • It will relieve you from writing boring, repetitive queries - for example, counting unique rows
    • It will give more work to people who work with SQL on a daily basis - someone will have to optimize and maintain these wonderful automatically created queries 😀

Unfortunately, the tool is not available at the moment; you can only sign up for the waiting list.

    Link: https://twitter.com/abhargava20/status/1577713811150311444

    Check Your Skills

    #admin #sql

This time the task involves digging through documentation.

    Problem to solve: Check whether there are duplicate indexes in the database (data warehouse) you use. Duplicate index = more than one index (of the same type) on the same field in the same table in the same schema in the database.

To make the task easier, here are two links to the documentation (PostgreSQL):

    • pg_index
    • IndexMaintenance
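
If you want to check your answer afterwards, here is a sketch of one possible query, adapted from the kind of query shown on the PostgreSQL wiki’s Index Maintenance page (linked above) - it groups indexes by their definition fields in pg_index and reports any key that appears more than once:

    SELECT pg_size_pretty(sum(pg_relation_size(idx))::bigint) AS wasted_size,
           array_agg(idx) AS duplicate_indexes
    FROM (
        SELECT indexrelid::regclass AS idx,
               -- same table, columns, operator classes, expressions, predicate
               (indrelid::text || E'\n' || indclass::text || E'\n' ||
                indkey::text   || E'\n' ||
                coalesce(indexprs::text, '') || E'\n' ||
                coalesce(indpred::text, '')) AS key
        FROM pg_index
    ) sub
    GROUP BY key
    HAVING count(*) > 1
    ORDER BY sum(pg_relation_size(idx)) DESC;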

You can find more SQL-related questions at SQL - Q&A.

    Data Jobs

Skills sought: GCP (BigQuery), Python, SQL, dbt, Airflow

    Skills sought: SQL, ETL Tools (ex. SSIS), ADF (Azure Data Factory), Data Warehouse concepts