How To Start A Career In The Data World

It would seem that over a decade ago, when I started my adventure with data, it was easier. There were not that many IT positions related to data processing, mainly SQL Developers, Data Warehouse Programmers, and Business Intelligence Developers.

Time passes and technologies change, just like job titles. Today, we often meet Data Engineers, Data Analysts and Business Intelligence specialists on the market.

How do you avoid getting lost in all this, and where should you start? What should we pay attention to when we want to start our adventure with data in the new decade? We will try to answer these and other questions in this article.

TL;DR

Summary in 3 sentences: Start with SQL and the basics of relational and non-relational databases. Watch how the “big” players build their architecture. Think about what your business really wants and try to deliver it in a clear form, using tools that are appropriate for the task at hand.

How to start your adventure with data

Let’s start by dividing the data area into 3 layers:

  • data engineering
  • data analytics and visualization
  • data science + machine learning

Data Role Pyramid

In each of them I will define: basics, materials and things worth following.

DATA ENGINEERING

Data engineering will be useful, to a greater or lesser extent, to anyone who wants to work with data. What is its purpose? Quite simply, it is the heart of your data system. With its help you pump data from various sources, apply transformations, safeguard quality and make sure that everything runs on time. Without solid data engineering, building sensible analytical models, making business decisions based on visualizations, or approaching the topic of data science does not make much sense.

The terminology you will encounter in this part includes: data warehouses, big data, ETL (Extract – Transform – Load) or ELT (Extract – Load – Transform) processes, data streaming, queues, process scheduling, etc.

Note: in job offers you will find a lot of catchy slogans (sometimes sounding like Pokemon names) related to the technologies used in this layer. Don’t let this confuse you. Technologies are important, but not the most important. Like everything in the IT world, they change, but the basics, concepts and certain good architectural practices remain fundamental for a long time.

Therefore, in this and in subsequent layers, I will focus on the key elements: those that will allow you to build a solid basis for further development.

DATA ENGINEERING – BASICS

SQL

Can you hear those grunts, whispers and sighs of disbelief? “SQL is dead”, “everyone knows SQL now”, “NoSQL is the future”, specialists call from various places. Perhaps, but let’s not get ahead of ourselves.

I have been working with data for over 10 years, in various positions in various companies - mainly large foreign corporations. I know one thing for sure - “SQL is the king of data, just like the Lion is the king of the jungle.”

The basics of the language are simple and logical, which makes it accessible to people in various positions, not only those strictly related to IT. It has been with us almost as long as relational databases themselves: it was first standardized in 1986, well over 3 decades ago.

Don’t ignore it. Even if SELECTs are not part of your everyday routine, the number of random (or better yet, non-random) places where it will be helpful is huge, mainly wherever there are relational data sources “underneath” (relational databases: ERP / CRM systems / Data Warehouses).
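To give a feel for how little SQL is needed to answer a typical business question, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The `orders` table, its columns and the sample rows are made up for illustration; treat it as a starting point rather than a recommended design.

```python
import sqlite3

def revenue_by_customer(rows):
    """Return (customer, total) pairs, highest total first."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table used only for this example.
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT customer, SUM(amount) AS total
        FROM orders
        GROUP BY customer
        ORDER BY total DESC
        """
    ).fetchall()
    conn.close()
    return result

sample_orders = [("acme", 120.0), ("globex", 75.5), ("acme", 30.0)]
totals = revenue_by_customer(sample_orders)
print(totals)
```

The same SELECT / GROUP BY / ORDER BY pattern should carry over, with small dialect differences, to most relational databases you will meet.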

Relational Databases – main concepts + architecture of the selected database, e.g. PostgreSQL

The same complaints as with SQL: it doesn’t scale, it is a relic from the 1970s, NoSQL and graph databases are the future. Perhaps, but let’s not get ahead of ourselves.

For almost half a century, IT systems have been created based on relational databases, and they continue to be developed. Is it good or bad? It’s not up to me to judge, but the fact is that you will encounter relational databases more than once on your journey.

Understanding the concept of relationships, the ACID principles (atomicity, consistency, isolation, durability), the architecture (what the database actually looks like inside), and getting familiar with a database from at least one vendor is a reasonable minimum.
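Atomicity, the “A” in ACID, is probably the easiest of these principles to see in action. The sketch below uses Python's sqlite3 module; the `accounts` table, the CHECK constraint and the `transfer` helper are my own assumptions for the example. The second transfer first credits one account and then fails on the debit, and the database rolls back the whole transaction, so neither change survives.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically; return True on success."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # This debit violates the CHECK constraint when src lacks funds.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
    except sqlite3.IntegrityError:
        return False
    return True

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 20)])
conn.commit()

ok = transfer(conn, "alice", "bob", 50)         # succeeds
rejected = transfer(conn, "bob", "alice", 500)  # would overdraw bob, rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, rejected, balances)
```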

Non-relational databases (NoSQL) – main concepts + a few overviews

NoSQL stands for Not Only SQL. That’s right: not only SQL, not a replacement for SQL. This is important, just like the word “complementary”.

This is how I see the role of NoSQL systems, as complementary systems to relational systems. Used where necessary, taking into account their specificity. Is it possible to “rewrite” the system from a relational database to a non-relational database? Sure, you can, but why?

When getting acquainted with NoSQL databases, pay attention to the key concepts and the main categories of non-relational databases: key-value databases, document databases, etc. Pay attention to where they are used and how the “big players” (e.g. Uber, Spotify) use them.
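The difference between the key-value and the document model can be sketched with plain Python structures. This is a toy illustration only, not how any real NoSQL product is implemented; `kv_get`, `find` and the sample data are hypothetical names invented for the example.

```python
import json

# Key-value model: the store only understands opaque values addressed by key.
kv_store = {}
kv_store["session:42"] = json.dumps({"user": "anna", "cart": ["sku-1", "sku-7"]})

def kv_get(key):
    """Fetch by exact key; the store cannot look inside the value."""
    return kv_store.get(key)

# Document model: the store understands the structure and can filter on fields.
documents = [
    {"_id": 1, "user": "anna", "city": "Krakow", "tags": ["sql", "etl"]},
    {"_id": 2, "user": "piotr", "city": "Warsaw", "tags": ["nosql"]},
    {"_id": 3, "user": "ola", "city": "Krakow"},  # fields may differ per document
]

def find(collection, **criteria):
    """Return documents whose fields equal every given criterion."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]

session = json.loads(kv_get("session:42"))
krakow_users = [doc["user"] for doc in find(documents, city="Krakow")]
print(session["cart"], krakow_users)
```

In the key-value case the application has to know the key and decode the value itself; in the document case the store can answer questions about the content.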

In the materials below you will find a book that smoothly introduces this topic.

Data Warehouses (Kimball) + Quality and Metadata Management

All this data has to be stored somewhere. Data warehouses and the recently popular data lakes are not going away any time soon, and the problems they face are interesting.

In this context, it is worth taking a look at the basics of data warehouse architecture and the issues described by Ralph Kimball. In his books, but also on the website (https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/), you will find a lot of information and good practices on modeling, management and working with data in general.
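The modeling style most often associated with Kimball is the dimensional model (star schema): a fact table holding measurements, surrounded by dimension tables that describe them. Below is a minimal sketch in Python with sqlite3; all table names, columns and rows are invented for illustration and are far simpler than a real warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables describe the "who / what / when" of each event.
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, iso_date TEXT, year INTEGER);

    -- The fact table stores measurements plus foreign keys to the dimensions.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product (product_key),
        date_key INTEGER REFERENCES dim_date (date_key),
        quantity INTEGER,
        amount REAL
    );

    INSERT INTO dim_product VALUES (1, 'Keyboard', 'Hardware'), (2, 'IDE licence', 'Software');
    INSERT INTO dim_date VALUES (20230105, '2023-01-05', 2023), (20240210, '2024-02-10', 2024);
    INSERT INTO fact_sales VALUES
        (1, 20230105, 2, 100.0), (2, 20230105, 1, 250.0), (1, 20240210, 1, 50.0);
    """
)

# A typical analytical query: join facts to dimensions, then aggregate.
sales_by_category_year = conn.execute(
    """
    SELECT p.category, d.year, SUM(f.amount) AS revenue
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_date AS d ON d.date_key = f.date_key
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
    """
).fetchall()
print(sales_by_category_year)
```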

Remember that technologies are changing, but the topic of metadata, data quality, and roles in data processing teams will always be relevant.

ETL processes / “Streaming” data + events (see how others do it – Uber / Spotify / Netflix etc.)

Now we get down to the details, i.e. the very essence of data processing. And I don’t have one great recipe for everything here.

The approach and methods of solving problems will vary depending on your workplace - at least that’s what I assume. From my own experience, there are always some differences 😉

However, what is definitely worth doing is monitoring how such topics are addressed by companies that base a significant part of their business on data: Uber, Spotify, Airbnb, Netflix. On the Internet you will find the blogs of these (and other) companies, as well as posts by individual employees, often describing in detail their approach to architecture or to solving specific problems.

This is usually a large dose of knowledge, but also good material for considering how a given approach works for others and whether you need it at all.
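Before reading how the big players do it, it may help to see the three ETL steps in their smallest possible form. The sketch below uses only the Python standard library; the CSV content, the validation rule and the `orders` table are assumptions made up for the example, and a production pipeline would add logging, scheduling and error handling on top.

```python
import csv
import io
import sqlite3

# Hypothetical raw input with typical problems: stray spaces, bad values.
RAW_CSV = """order_id,customer,amount
1, Acme ,120.50
2,globex,not-a-number
3,ACME,30
"""

def extract(text):
    """Extract: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalise names, cast amounts, drop rows that fail validation."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real pipeline would log or quarantine the bad row
        clean.append((int(row["order_id"]), row["customer"].strip().lower(), amount))
    return clean

def load(conn, rows):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
loaded = conn.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```

In an ELT variant the raw rows would be loaded first and the cleaning would happen inside the target database, typically in SQL.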

Big Data (Lambda Architecture)

A popular topic: everyone has big data, few people know what to do with it… You can choose a certain technology, and a month later it turns out that there is something else on the market.

In this and in the previous sections, I would focus on the general approach: getting to know the problems that Big Data has to face and the potential solutions. As you dig deeper, you’ll realize that the problems may be logically simple, but when it comes to implementation, it’s not all that trivial. Specific solutions/problems may become your specialization, but “let’s not get ahead of ourselves.”
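The Lambda Architecture mentioned in the heading is a good example of a logically simple idea: a batch layer periodically recomputes views from the full master dataset, a speed layer covers the events that arrived since the last batch run, and queries merge the two. A toy sketch in plain Python follows; the page-view events and function names are hypothetical, and real systems distribute each layer across many machines.

```python
from collections import Counter

def batch_view(master_events):
    """Batch layer: recompute the full view from the immutable master dataset."""
    return Counter(event["page"] for event in master_events)

def realtime_view(recent_events):
    """Speed layer: count only events the last batch run has not seen yet."""
    return Counter(event["page"] for event in recent_events)

def query(page, batch, realtime):
    """Serving side: answer by merging the batch view with the real-time view."""
    return batch[page] + realtime[page]

# Events already processed by the last batch run.
master_events = [{"page": "/home"}, {"page": "/pricing"}, {"page": "/home"}]
# Events that arrived after that run.
recent_events = [{"page": "/home"}, {"page": "/docs"}]

batch = batch_view(master_events)
realtime = realtime_view(recent_events)
home_views = query("/home", batch, realtime)
print(home_views)
```

The hard parts in practice are exactly what this sketch hides: keeping the two layers consistent and deciding when recent events move into the master dataset.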

DATA ENGINEERING – Materials

  • My Newsletter: datacraze.io
  • Book: Bill Karwin – SQL Antipatterns: Avoiding the Pitfalls of Database Programming
  • Book: Markus Winand – SQL Performance Explained
  • Book: Alex Petrov – Database Internals: A Deep Dive into How Distributed Data Systems Work
  • Book: Ralph Kimball, Joe Caserta – The Data Warehouse ETL Toolkit
  • Book: Ralph Kimball, Margy Ross, Warren Thornthwaite, Joy Mundy, Bob Becker – The Data Warehouse Lifecycle Toolkit
  • Book: Martin Kleppmann – Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
  • Book: Luc Perkins, Eric Redmond – Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement
  • Book: Nathan Marz, James Warren – Big Data
  • Book: Mark Needham, Amy E. Hodler – Graph Algorithms: Practical Examples in Apache Spark and Neo4j
  • Blog / X: Airbnb Data – https://twitter.com/AirbnbData
  • Blog / X: Twitter Eng – https://twitter.com/TwitterEng
  • Blog / X: Netflix Engineering – https://twitter.com/NetflixEng
  • Blog / X: Uber Engineering – https://twitter.com/UberEng
  • Blog / X: Spotify Engineering – https://twitter.com/SpotifyEng
  • Blog / X: SQL Daily – https://twitter.com/sqldaily
  • Podcast: Software Engineering Daily – Data – https://podcasts.apple.com/ca/podcast/data-software-engineering-daily/id1232093653

DATA ENGINEERING – Who To Follow

The list below could be several times longer. I decided to focus on the people/companies/products that I visit most often and from whom I get the most information.

DATA ANALYTICS AND VISUALIZATION

Here, the complication is that you can fall into the “sect” of a single product, e.g. Power BI / Qlik, and do well in that niche. However, once again I would advise you to focus on what is common across the field, i.e. business problems and the options for solving them in various technologies.

Data Analytics and Visualization – Basics

Business Financial Metrics

Topics related to: Balance Sheets, Income Statements, Rolling Orders, Sales etc.

Business Metrics Warehousing/Logistics

Topics related to: On Time Delivery, Manufacturing Planning, Freight Costs + Shortest Paths etc.

For the topics above, I didn’t find great resources to help you get started, just general material. However, remember that in their basic form, these topics will be common between companies. They may be implemented, maintained and enforced differently, but conceptually the goals of these metrics will be the same.
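As an illustration of how simple such a metric is at its core, here is one possible way to compute an On Time Delivery rate in Python. The definition used (delivered on or before the promised date) and the sample dates are my assumptions; your company may count partial deliveries or grace periods differently.

```python
from datetime import date

def on_time_delivery_rate(deliveries):
    """Share of deliveries that arrived on or before the promised date."""
    if not deliveries:
        return None  # the metric is undefined without any deliveries
    on_time = sum(1 for promised, delivered in deliveries if delivered <= promised)
    return on_time / len(deliveries)

# (promised date, actual delivery date) pairs, invented for the example.
deliveries = [
    (date(2024, 3, 1), date(2024, 2, 28)),    # early
    (date(2024, 3, 5), date(2024, 3, 5)),     # exactly on time
    (date(2024, 3, 10), date(2024, 3, 12)),   # late
    (date(2024, 3, 15), date(2024, 3, 14)),   # early
]
otd = on_time_delivery_rate(deliveries)
print(f"{otd:.0%}")
```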

Search for specific topics on industry blogs.

Appropriate presentation of results (design of the visualization layer)

Good visualization is an art in itself. It is not difficult to pull data from various sources, create X tabs in Excel, send the file to the business and let them deal with it 😉

Good design is also often a matter of convention, and the definition of “good design” may be different for everyone. However, there are some basic principles of data visualization that are worth knowing.

They are often described by vendors of various Analytical / Business Intelligence solutions - see Qlik / Tableau or Power BI.

You can also take a look at the document prepared by me, which you will receive after subscribing to the Newsletter.

Available tools on the market

All kinds of lists can be helpful here, e.g.

etc.

Data Analytics and Visualization – Materials

Data Analytics and Visualization – Who to follow

DATA SCIENCE | MACHINE LEARNING

I have the least experience in the area of data science / machine learning, so this is where I would seek advice from other specialists. The collection below is purely my subjective choice: workshops that I have heard a lot of positive opinions about or have participated in myself, and people who are active in the community or popularize knowledge in this field. There’s definitely a lot that could be added to this section, but I’ll leave it to your curiosity.

DATA SCIENCE | MACHINE LEARNING – WHAT IS WORTH PAYING ATTENTION TO

  • SQL
  • Python
  • Notebooks (example: Jupyter)
  • Data Cleaning
  • Classifying Data
  • Feature Selection (Feature Engineering)
  • A/B testing
  • Neural networks
  • Asking good questions 🙂

DATA SCIENCE | MACHINE LEARNING – MATERIALS

DATA SCIENCE | MACHINE LEARNING – WHO TO FOLLOW

The above points do not exhaust the topic, of course, but they are certainly a solid basis for starting your adventure with data.

Thanks,
Krzysztof
