Data Craze Weekly #9

This message was first sent to subscribers of the Data Craze Weekly newsletter.

Table of Contents

  1. Week In Data
    1. 5 Emerging Data Job Trends in 2024
    2. 30+ Data Engineering Projects for Beginners in 2025
    3. What should Data Engineers focus on? — Fundamentals Vs Tools!
    4. 3CX warns customers to disable SQL database integrations
    5. Amazon SageMaker Lakehouse and Amazon Redshift supports zero-ETL integrations from applications
    6. Introducing queryable object metadata for Amazon S3 buckets
    7. New Amazon S3 Tables: Storage optimized for analytics workloads
  2. Tools
  3. Check Your Skills
  4. Data Jobs

Week in Data

The article “5 Emerging Data Job Trends in 2024” outlines key shifts in the data job market, highlighting the following insights:

  • Stable Job Market: Despite a 15% decrease in data job postings since 2022, the market has stabilized, indicating resilience amid tech layoffs.
  • Increased Competition: The job market for data professionals has become more competitive, with companies prioritizing efficiency and cost reduction in their hiring processes.
  • Dominance of Technical Skills: Python emerges as the primary programming language for data scientists, with SQL also maintaining a strong presence in job postings.
  • Rise of AI Engineering: New roles for AI engineers are developing rapidly, focusing on the implementation of AI applications rather than traditional data science research.
  • Emerging Roles: New positions such as quality assurance business analysts are emerging, focusing on evaluating AI outputs and ensuring product effectiveness.
  • Freelancing Growth: There is an increasing trend towards freelancing within the data sector, providing opportunities for flexible work and skill development.
  • Networking for Freelancers: Building a network and leveraging personal connections is crucial for freelancers to secure their first clients and establish a portfolio.
  • Low-Code and No-Code Tools: The rise of low-code and no-code platforms is democratizing data analytics, allowing individuals without coding expertise to engage in data tasks.
  • Specialization in Data Jobs: As automation tools take over entry-level tasks, data roles may become more specialized, requiring professionals to adapt and focus on niche areas.
  • Continuous Learning and Adaptability: Staying relevant in the evolving job market necessitates ongoing education and adaptability to new tools and technologies.

LINK

30+ Data Engineering Projects for Beginners in 2025

  • A curated list of 30+ data engineering projects designed for beginners, aimed at building practical skills and real-world experience in data engineering.
  • Key skills sought after in data engineering include cloud computing, proficiency in SQL, Python, and familiarity with big data technologies such as Apache Spark and Hadoop.
  • The projects cover a wide range of applications, from building analytics dashboards for Uber and Amazon to analyzing music trends on Spotify and sports performance in cricket and football.
  • Each project includes a detailed description of the tech stack, skills developed, and methodologies, making it easier for beginners to understand the scope and requirements.
  • A variety of tools and platforms are utilized in the projects, including AWS, Google Cloud Platform, and various data visualization tools like Tableau and Power BI.
  • The document encourages beginners to explore intermediate-level projects to further enhance their portfolios and showcase their skills in data engineering.
  • It provides links to source code and additional resources, making it accessible for learners to implement the projects and gain practical experience.

LINK

What should Data Engineers focus on? — Fundamentals Vs Tools!

Author: Shashwath Shenoy

In his article, Shashwath identifies the critical focus areas for Data Engineers, weighing the importance of foundational knowledge against mastery of specific tools.

  • Fundamentals vs. Tools: Data Engineers must choose between deepening their understanding of core principles (like DSA, SQL, and OOP) and learning the latest tools (such as dbt and Apache Airflow).
  • Adaptability: The field of Data Engineering is rapidly changing, making adaptability essential for career growth and relevance.
  • Long-Term Investment: Mastering foundational skills is presented as a long-term investment that provides stability and resilience in one’s career.
  • Building a Strong Foundation: Just as a skyscraper needs a solid base, a successful career in Data Engineering relies on strong fundamental knowledge.
  • Career Impact: The choice between focusing on fundamentals or tools directly influences job prospects and overall value in the industry.
  • Industry Evolution: The author shares personal experiences from over 14 years in the field, highlighting significant shifts in technology and methodologies.
  • Skills Development: Emphasizing core skills can lead to better problem-solving and critical thinking abilities, which are crucial in complex data environments.
  • Tool Proficiency: While tools are important, they should complement, not replace, a strong understanding of fundamental concepts.
  • Future-Proofing: A solid grasp of the fundamentals equips Data Engineers to adapt to new tools and technologies as they emerge.
  • Career Strategy: Engineers should assess their career goals and industry demands to determine the right balance between foundational knowledge and tool expertise.

LINK

3CX warns customers to disable SQL database integrations

Another lovely SQL injection example, this time from 3CX.

The company has issued a warning urging customers to disable SQL database integrations due to a potential SQL injection vulnerability discovered in the 3CX CRM Integration, tracked as CVE-2023-49954. The flaw affects MsSQL, MySQL, and PostgreSQL integrations, particularly if the 3CX server is internet-accessible without a web application firewall. The vulnerability was reported by security researcher Theo Stein, who experienced delays in getting a response from 3CX. The warning comes after a previous supply chain attack in March 2023 that compromised the 3CXDesktopApp, attributed to the North Korean hacking group UNC4736.
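
For readers who want to see why string-built lookups like this are dangerous, here is a minimal, generic sketch of the bug class. The contacts table and phone column are made up for illustration and have nothing to do with 3CX's actual templates.

    -- Illustrative only: a lookup template built by string concatenation,
    -- e.g.  SELECT id, name FROM contacts WHERE phone = '<caller_number>'

    -- With a benign caller number the query is what you expect:
    SELECT id, name FROM contacts WHERE phone = '123456789';

    -- With a crafted "caller number" of  ' OR '1'='1  the same template becomes
    -- a query that matches every row (and a payload can go much further):
    SELECT id, name FROM contacts WHERE phone = '' OR '1'='1';

    -- The standard fix is a bind parameter instead of concatenation:
    --   SELECT id, name FROM contacts WHERE phone = $1;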

LINK

Amazon SageMaker Lakehouse and Amazon Redshift supports zero-ETL integrations from applications

I love the term "zero-ETL integrations", which in practice tends to require at least a team of engineers to keep the "zero" ETL running.

Take this phrase with a pinch of salt.

  • Zero-ETL Integration: Amazon SageMaker Lakehouse and Amazon Redshift now support zero-ETL integrations, allowing seamless data access without the need for traditional ETL pipelines.
  • Unified Data Access: The integration unifies data across Amazon S3 data lakes and Amazon Redshift data warehouses, enabling powerful analytics and AI/ML applications from a single data copy.
  • Time Efficiency: Zero-ETL integrations from applications like Salesforce and SAP significantly reduce the time and engineering effort required to build data pipelines.
  • Data Fragmentation Challenge: Businesses face challenges with data fragmentation across various digital systems and repositories, necessitating efficient data access and consolidation.
  • Prerequisites for Setup: Users need to configure several components, including AWS Glue Data Catalog, AWS Lake Formation, and IAM roles, to successfully set up zero-ETL integrations.
  • Connection Process: Creating a connection to data sources, such as Salesforce, involves providing necessary credentials and selecting authentication methods (JWT or OAuth).
  • Integration Setup: Users can create zero-ETL integrations by selecting source types, replicating desired objects, and customizing synchronization intervals to manage costs.
  • Phased Data Loading: The integration process involves an initial load of data followed by incremental updates, ensuring that changes in the source data are reflected in the target database.
  • Automatic Synchronization: As new records or changes occur in the source application (like Salesforce), the data automatically synchronizes to the AWS Glue target database, where it can be queried like any other catalog table (see the sketch after this list).
  • Global Availability: The zero-ETL integration feature is available in multiple AWS regions, enhancing accessibility for users worldwide.
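
Once an integration is running, the replicated objects show up as regular tables in the Glue target database, so querying them from Athena or Redshift is plain SQL. A minimal sketch, assuming a hypothetical Glue database salesforce_zetl with a replicated account object (database, table, and column names are illustrative, not taken from the AWS post):

    -- Query Salesforce accounts replicated by the zero-ETL integration
    -- (database, table, and column names are hypothetical).
    SELECT id
         , name
         , industry
         , annualrevenue
      FROM salesforce_zetl.account
     WHERE industry = 'Technology'
     ORDER BY annualrevenue DESC
     LIMIT 10;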

LINK

Introducing queryable object metadata for Amazon S3 buckets

  • Introduction of Metadata: Amazon S3 now offers a preview feature for automatic generation of metadata when objects are added or modified, enhancing the ability to manage large datasets.
  • Rich Metadata Features: The new metadata includes over 20 elements such as bucket name, object key, creation/modification time, storage class, and encryption status, making it easier for users to track and analyze their data.
  • Integration with Apache Iceberg: The metadata is stored in fully managed Apache Iceberg tables, enabling users to query the metadata using various compatible tools like Amazon Athena, Amazon Redshift, and Apache Spark.
  • Efficient Querying: Users can quickly locate objects based on specific criteria (e.g., size, tags) without needing to build complex systems, improving scalability and synchronization with actual object states.
  • Metadata Capture Process: Users can enable metadata capture by specifying a table bucket and table name, with updates recorded in real-time, allowing for efficient historical data retrieval.
  • Practical Implementation: The post provides command-line examples for creating metadata tables and querying them, demonstrating how users can implement the feature in real-world applications (a hedged query sketch follows this list).
  • Console Management: Users can also manage metadata configurations through the Amazon S3 Console, simplifying the setup process for those who prefer a graphical interface.
  • Availability: The Amazon S3 Metadata feature is currently available in preview in specific AWS regions, allowing users to start utilizing it immediately.
  • Integration with AWS Glue: The metadata feature integrates with AWS Glue Data Catalog, enabling users to visualize and query S3 Metadata tables alongside other data sources.
  • Pricing Structure: Pricing for the new metadata feature is based on the number of updates and storage of the metadata table, encouraging users to consider their usage patterns when adopting the service.
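
To give a feel for the "efficient querying" point, here is a sketch of the kind of Athena query you could run against a metadata table. The table name is a placeholder, and column names such as key, size, storage_class, and record_type follow the pattern described in the AWS post, so verify them against the schema generated in your account.

    -- Find the largest objects written to the bucket in the last 7 days
    -- (table and column names are illustrative; check the generated schema).
    SELECT key
         , size
         , storage_class
         , last_modified_date
      FROM my_bucket_metadata
     WHERE record_type = 'CREATE'
       AND last_modified_date > current_timestamp - interval '7' day
     ORDER BY size DESC
     LIMIT 20;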

LINK

New Amazon S3 Tables: Storage optimized for analytics workloads

Another announcement from AWS re:Invent.

  • Introduction of S3 Tables: Amazon S3 Tables are designed specifically for storing tabular data, optimizing analytics workloads for improved performance.
  • Performance Benefits: Using S3 Tables can lead to up to 3x faster query performance and 10x more transactions per second compared to self-managed table storage.
  • Integration with Query Engines: The tables support queries through popular engines like Amazon Athena, Amazon EMR, and Apache Spark, making data analysis more accessible (see the Spark SQL sketch after this list).
  • Table Buckets Concept: S3 Tables introduce table buckets, a new type of S3 bucket that functions as an analytics warehouse for Iceberg tables, providing durability and scalability.
  • Maintenance Automation: The service automates critical maintenance tasks such as compaction, snapshot management, and removal of unreferenced files, reducing the operational burden on users.
  • Command Line and Console Accessibility: Users can create and manage table buckets and tables through both the AWS Command Line Interface (CLI) and the AWS Management Console, offering flexibility in usage.
  • Namespace Functionality: Each table bucket can utilize namespaces to logically group tables, simplifying access management and organization.
  • Automatic Encryption and Security: All objects in table buckets are encrypted by default, and the service enforces block public access to ensure data security.
  • Regional Availability: The new feature is currently available in specific AWS regions, including US East (Ohio, N. Virginia) and US West (Oregon).
  • Pricing Structure: Users are charged based on storage, request fees, and additional costs for operations like compaction, ensuring clarity in cost management for S3 Tables.
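
As a sketch of what this looks like in practice from Spark SQL once a table bucket is attached as an Iceberg catalog (here named s3tablesbucket): the catalog, namespace, table, and column names are placeholders, and the session configuration itself is covered in the AWS post.

    -- Create a namespace and an Iceberg table in the table bucket, then query it
    -- (catalog/namespace/table names are placeholders).
    CREATE NAMESPACE IF NOT EXISTS s3tablesbucket.analytics;

    CREATE TABLE IF NOT EXISTS s3tablesbucket.analytics.daily_sales (
        sale_date        DATE,
        product_category STRING,
        sales_amount     DOUBLE
    ) USING iceberg;

    SELECT product_category
         , SUM(sales_amount) AS total_sales
      FROM s3tablesbucket.analytics.daily_sales
     WHERE sale_date >= DATE '2025-01-01'
     GROUP BY product_category;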

LINK

Tools

No tools this week; I haven't found anything interesting to share.

Check Your Skills

Problem to solve

How do you safely cast a value to a data type without raising an error?

Example

    SELECT CAST('7492bd12-1fff-4d02-9355-da5678d2da' AS UUID) as id
         , 'test_ABC' as col_txt
         , 100 as col_int
      FROM public.casting_tst;
    
    -- [22P02] ERROR: invalid input syntax for type uuid: "7492bd12-1fff-4d02-9355-da5678d2da"

Try to use only plain SQL - SQL Fiddle Playground

Solution

Check the solution with all the details in my post: SQL Safe CAST.
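
If you want a hint before reading the post: PostgreSQL has no built-in TRY_CAST/SAFE_CAST (unlike SQL Server or BigQuery), so one common workaround is a small function that traps the conversion error. This is a sketch of that idea, not necessarily the plain-SQL technique used in the linked solution.

    -- Returns NULL instead of raising error 22P02 when the text is not a valid UUID.
    CREATE OR REPLACE FUNCTION safe_cast_uuid(p_value text)
    RETURNS uuid AS $$
    BEGIN
        RETURN p_value::uuid;
    EXCEPTION WHEN invalid_text_representation THEN
        RETURN NULL;
    END;
    $$ LANGUAGE plpgsql IMMUTABLE;

    SELECT safe_cast_uuid('7492bd12-1fff-4d02-9355-da5678d2da') as id  -- NULL, no error
         , 'test_ABC' as col_txt
         , 100 as col_int;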

Data Jobs

Skills sought: Python, Big Data, Docker, Terraform, Linux, Kubernetes, SQL

Don’t miss next week’s edition. Subscribe!

Data Craze Weekly

A weekly dose of curated information from the data world!
Data engineering, analytics, and case studies straight to your inbox.
