Data Craze Weekly #5

This message was sent first to subscribers of the Data Craze Weekly newsletter.


    Week in Data

    PostgreSQL optimization

    The slides in the link were created in 2017 and I came across them by accident last week.

    They are so solid (and still valid) that I couldn’t help but share them with you.

    The author focuses on how to optimize SQL queries that may seem common and trivial but unfortunately cost us (at least in PostgreSQL) a lot of computing power and time.

    Not using PostgreSQL? It doesn’t matter: check the queries and see whether they also apply to your database engine.

    The slides starting at number 38, titled DISTINCT, floored me.

    And another light quote from the author:


    – Efficient execution of some popular queries requires the implementation of the alternative procedural algorithm
    – Implementation of custom algorithms is usually easier when using PL/PgSQL
    – The same algorithm implemented on SQL runs faster
    Process:
    – Implement and debug algorithm on PL/PgSQL
    – Convert to SQL

    Link: https://www.slideshare.net/pgdayasia/how-to-teach-an-elephant-to-rocknroll
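
    To give a feel for the kind of rewrite the DISTINCT slides are about, here is a minimal sketch of one well-known trick: emulating a "loose index scan" with a recursive CTE, so the engine jumps from one distinct value to the next instead of reading every row. Whether this matches the slides exactly is an assumption on my part. The table `events`, its columns, and the helper names are hypothetical, and the sketch uses Python's built-in sqlite3 as a stand-in so it runs without a PostgreSQL server.

```python
import sqlite3


def build_demo_db(n_rows: int = 10_000) -> sqlite3.Connection:
    """Create a hypothetical in-memory table with a low-cardinality column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (id INTEGER PRIMARY KEY, category TEXT NOT NULL)"
    )
    categories = ["click", "purchase", "view"]
    conn.executemany(
        "INSERT INTO events (category) VALUES (?)",
        ((categories[i % 3],) for i in range(n_rows)),
    )
    # The rewrite only pays off when the column is indexed.
    conn.execute("CREATE INDEX events_category_idx ON events (category)")
    return conn


# Straightforward version: the engine may have to visit every row.
PLAIN_DISTINCT = "SELECT DISTINCT category FROM events ORDER BY category"

# Rewrite: start at the smallest value, then repeatedly look up the next
# value strictly greater than the previous one, one index probe per step.
LOOSE_SCAN = """
WITH RECURSIVE seen(category) AS (
    SELECT MIN(category) FROM events
    UNION ALL
    SELECT (SELECT MIN(e.category)
            FROM events AS e
            WHERE e.category > seen.category)
    FROM seen
    WHERE seen.category IS NOT NULL
)
SELECT category FROM seen WHERE category IS NOT NULL ORDER BY category
"""


def distinct_plain(conn: sqlite3.Connection) -> list:
    return [row[0] for row in conn.execute(PLAIN_DISTINCT)]


def distinct_loose(conn: sqlite3.Connection) -> list:
    return [row[0] for row in conn.execute(LOOSE_SCAN)]
```

    Both functions return the same values. The difference is only in how much work the engine does, so compare the query plans on your own data before adopting the rewrite.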

    What is Kafka and do you need it?

    A great article for when you are deciding whether to use Apache Kafka.

    What is this tool (technology), when is it worth using it, and when is it better to refrain from using it?

    Below is a short quote from the Conclusion section, but it is really worth reading the whole thing.

    Kafka is a highly scalable and durable message processing platform with great real-time data processing features. It will be a good fit in use cases like IoT, Click Stream Analytics, Real-Time Data Integration, Event Sourcing, Log Aggregation, etc. But it is not a solution that can be used in any data processing requirement. Kafka should not be used as an ETL tool or as a database even though its feature set may seem similar.

    Link: https://memphis.dev/blog/apache-kafka-use-cases-when-to-use-it-when-not-to/

    Data Trends 2022 (Tableau)

    A PDF divided into thematic parts. Each part ends with recommendations.

    It is worth reviewing, at least from the perspective of the directions in which large companies are heading.

    Two quotes below:

    Companies that are developing AI will increasingly spin up their own Ethics as a Service (EaaS) offerings within their professional service organizations. We will see a race to hire AI ethicists to become compliant with the new regulations, making AI ethicists in even greater demand than AI developers.
    — KATHY BAXTER, PRINCIPAL ARCHITECT, SALESFORCE ETHICAL AI PRACTICE

    Data quality and data-driven decision-making go hand in hand. An organization-wide commitment to data governance mitigates risk and drives future success for everyone in the business.
    —SCOTT TEAL, PRODUCT MARKETING MANAGER, SNOWFLAKE

    Link: https://www.tableau.com/sites/default/files/2022-02/Data_Trends_2022.pdf

    Tools

    TablePlus - “a native application which helps you easily edit database contents and structure in a clean, fluent manner.”

    Do you use any IDE to work with the database, e.g. DBeaver? Maybe it’s worth testing something else?

    If so, TablePlus comes to the rescue: a “nice” (a matter of taste), theoretically native (natively supporting specific databases) tool.

    In theory, you can use it for free (at least that’s what the repository says), but in practice the free version is severely limited:

    “The free trial is limited to 2 opened tabs, 2 opened windows, 2 advanced filters (filters are not available on the free TablePlus Windows) at a time. We can change the limitations without any notifications in the future releases.”

    It is worth considering as an alternative to e.g. DataGrip.

    Link: https://github.com/TablePlus/TablePlus

    Check Your Skills

    #SQL

    No assignment today, but please open the slides from the first link.

    Link (as a reminder): https://www.slideshare.net/pgdayasia/how-to-teach-an-elephant-to-rocknroll

    Take a look at example 02 “IOS for large data offsets”.

    The most common pagination on websites is OFFSET + LIMIT.

    Read what consequences it may have with a large offset. Check if such situations apply to you.
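
    To make the problem concrete, here is a minimal sketch comparing OFFSET + LIMIT with keyset ("seek") pagination, a common alternative that is not necessarily the approach the slides take. With OFFSET the engine still has to walk past all the skipped rows, while the keyset version starts directly after the last row the user saw. The table `articles` and the function names are hypothetical, and sqlite3 is used only so the example runs without a PostgreSQL server.

```python
import sqlite3


def make_db(n_rows: int = 1000) -> sqlite3.Connection:
    """Create a hypothetical in-memory table with ids 1..n_rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany(
        "INSERT INTO articles (id, title) VALUES (?, ?)",
        [(i, f"Article {i}") for i in range(1, n_rows + 1)],
    )
    return conn


def page_by_offset(conn: sqlite3.Connection, page: int, page_size: int) -> list:
    """Classic pagination: the skipped rows are still read, then discarded."""
    offset = (page - 1) * page_size
    return conn.execute(
        "SELECT id, title FROM articles ORDER BY id LIMIT ? OFFSET ?",
        (page_size, offset),
    ).fetchall()


def page_by_keyset(conn: sqlite3.Connection, last_seen_id: int, page_size: int) -> list:
    """Keyset pagination: seek via the primary key to the first unseen row."""
    return conn.execute(
        "SELECT id, title FROM articles WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen_id, page_size),
    ).fetchall()


if __name__ == "__main__":
    conn = make_db()
    # Page 91 with 10 rows per page starts right after id 900.
    print(page_by_offset(conn, page=91, page_size=10)[0])
    print(page_by_keyset(conn, last_seen_id=900, page_size=10)[0])
```

    The trade-off: keyset pagination needs a stable, indexed sort key and does not support jumping to an arbitrary page number, so it fits "next page" and infinite-scroll interfaces best.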

    You can find more SQL-related questions at SQL - Q&A

    Data Jobs

    Skills sought: SQL, Python, Spark