My thoughts on Microsoft’s ETL course
This course is the third course in Microsoft’s Power BI specialization. According to the course information:
What you’ll learn
How to set up a data source and explain and configure storage modes in Power BI.
How to prepare for data modeling by cleaning and transforming data.
How to use profiling tools to identify data anomalies.
How to reference queries and dataflows and use the Advanced Editor to modify code.
Visit my LinkedIn to view my other certificates.
Week 1:
First, we go over the objectives of this course:
- Identify, explain, and configure multiple data sources in Power BI
- Clean and transform data using Power Query
- Inspect and analyze ingested data to ensure data integrity
What challenges do businesses usually face with their data? A central one is how to interpret it.
We sometimes have flat files such as CSV, text, or Excel, and other times we have databases.
For data storage, we have three modes:
1. Import: appropriate for smaller datasets; the data is stored in memory
2. DirectQuery: used for larger datasets that stay at the source
(you can switch between these two modes)
3. Dual: you let Power BI decide which mode is best
We also have three types of data:
1. Structured: aka relational data
2. Unstructured: images, video, audio
3. Semi-structured: data with tags, hierarchy, and so on
Connectors: they link data sources to applications, which lets us bring data from multiple sources together.
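As a rough illustration of connectors in M (Power Query’s formula language), here is a minimal sketch that pulls a flat file and a relational source; the file path, server, and database names are hypothetical:

```
let
    // Flat file through the Text/CSV connector (hypothetical path)
    SalesCsv = Csv.Document(File.Contents("C:\data\sales.csv"), [Delimiter = ",", Encoding = 65001]),
    SalesTable = Table.PromoteHeaders(SalesCsv, [PromoteAllScalars = true]),

    // Relational source through the SQL Server connector (hypothetical server/database)
    SalesDb = Sql.Database("my-server", "SalesDW")
in
    [FlatFile = SalesTable, Database = SalesDb]
```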
Now we explore triggers and actions: for example, we can schedule a data refresh and then have the sales report emailed to us. With triggers, we can initiate a workflow and run it at set time intervals. (You can adjust the refresh schedule in Power BI’s settings.)
Week 2:
We clean and transform data to ensure accuracy and transparency. Structured data tends to have fewer inconsistencies, while unstructured data is more likely to need cleaning.
Power Query is a data transformation tool that ensures our data is ready to be used. Its features include data connectivity, data extraction, and transformation. Power Query has a GUI (graphical user interface) in which you can modify, undo, and reorder the applied steps.
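Behind that GUI, each applied step becomes one line of M. A minimal sketch of what a few typical steps look like (the file path and column names are hypothetical):

```
let
    Source = Csv.Document(File.Contents("C:\data\orders.csv"), [Delimiter = ","]),
    PromotedHeaders = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
    // Each GUI step (change type, remove column, ...) appears as one named step here
    ChangedTypes = Table.TransformColumnTypes(PromotedHeaders, {{"OrderDate", type date}, {"Amount", type number}}),
    RemovedColumns = Table.RemoveColumns(ChangedTypes, {"Notes"})
in
    RemovedColumns
```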
We have three approaches to dealing with missing values (a sketch follows this list):
- remove the records
- replace the missing value with a default value
- replace the missing value with a value from another column
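Here is a minimal M sketch of all three approaches, assuming hypothetical column names (Quantity, Discount, ShipDate, OrderDate) and an earlier step named PreviousStep:

```
let
    Source = PreviousStep,

    // 1. Remove records whose Quantity is missing
    RemovedNulls = Table.SelectRows(Source, each [Quantity] <> null),

    // 2. Replace missing discounts with a default value of 0
    DefaultedDiscount = Table.ReplaceValue(RemovedNulls, null, 0, Replacer.ReplaceValue, {"Discount"}),

    // 3. Fill a missing ship date from another column (the order date)
    FilledShipDate = Table.AddColumn(DefaultedDiscount, "ShipDateClean",
        each if [ShipDate] = null then [OrderDate] else [ShipDate])
in
    FilledShipDate
```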
Now, we want to change the shape of our tables (see the sketch after this list):
- unpivot: turn a wide table into a narrow one
- pivot: turn a narrow table into a wide one
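A minimal sketch of both reshaping operations, using a small hypothetical table built inline with #table:

```
let
    // Hypothetical wide table: one column per month
    Wide = #table({"Product", "Jan", "Feb"}, {{"Bike", 10, 12}, {"Helmet", 5, 7}}),

    // Unpivot: wide -> narrow (one row per Product/Month pair)
    Narrow = Table.UnpivotOtherColumns(Wide, {"Product"}, "Month", "Sales"),

    // Pivot: narrow -> wide again, summing Sales per month
    WideAgain = Table.Pivot(Narrow, List.Distinct(Narrow[Month]), "Month", "Sales", List.Sum)
in
    WideAgain
```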
Then we can summarize and aggregate data to make it clearer (see the aggregation sketch after the list below). Furthermore, we can combine data to enhance data analysis. Why would we combine our data?
- consolidate data from various sources into a single table
- create relationships
- enrich data by adding additional features
- broader scope for analysis
- simplify data management
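For the summarizing and aggregating part, here is a minimal Table.Group sketch with a hypothetical inline table:

```
let
    Sales = #table({"Product", "Amount"}, {{"Bike", 100}, {"Bike", 250}, {"Helmet", 40}}),

    // Summarize: one row per product with the total amount and the number of orders
    Summary = Table.Group(Sales, {"Product"},
        {
            {"TotalAmount", each List.Sum([Amount]), type number},
            {"Orders", each Table.RowCount(_), Int64.Type}
        })
in
    Summary
```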
There are two ways to combine our tables:
- append: increase rows
- merge: increase columns
And we can join our tables. This is straightforward when both tables share the same key columns, but when the columns differ we must specify the join kind (left outer, inner, and so on). A sketch of append and merge follows.
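A minimal sketch of both combine operations, with small hypothetical tables:

```
let
    Sales2023 = #table({"OrderID", "Amount"}, {{1, 100}, {2, 250}}),
    Sales2024 = #table({"OrderID", "Amount"}, {{3, 80}}),
    Products  = #table({"OrderID", "Product"}, {{1, "Bike"}, {2, "Helmet"}, {3, "Pump"}}),

    // Append: stack tables that share the same columns (more rows)
    AllSales = Table.Combine({Sales2023, Sales2024}),

    // Merge: join on a key column (more columns); the join kind must be specified
    Merged = Table.NestedJoin(AllSales, {"OrderID"}, Products, {"OrderID"}, "ProductInfo", JoinKind.LeftOuter),
    Expanded = Table.ExpandTableColumn(Merged, "ProductInfo", {"Product"})
in
    Expanded
```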
Week 3:
Loading makes transformed data available for reports. There are two main ways to load data: load it directly, or transform it before loading.
Intermediate storage is temporary storage between the source and the data warehouse. Basically, it’s like preparing ingredients for a meal: the ingredients are the raw data, and the preparation is the ETL process.
The column profile provides the min, max, average, frequency of values, and standard deviation. We plan and document our desired output, identify our data sources, outline the transformations, and document and store the data source credentials.
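These statistics come from the column profiling pane in the Power Query editor, but you can also compute them in M with Table.Profile; a minimal sketch over a hypothetical table:

```
let
    Sales = #table({"Product", "Amount"}, {{"Bike", 100}, {"Helmet", null}, {"Pump", 40}}),

    // Returns one row per column with Min, Max, Average, StandardDeviation,
    // Count, NullCount, and DistinctCount
    Profile = Table.Profile(Sales)
in
    Profile
```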
We also have two error-handling techniques: conditional logic and error messages.
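In M, conditional error handling is typically written with try ... otherwise; a minimal sketch with a hypothetical text column that may fail to convert:

```
let
    Source = #table({"Amount"}, {{"100"}, {"not a number"}}),

    // try/otherwise: fall back to null instead of propagating a conversion error;
    // a custom message could instead be raised with the error keyword
    SafeAmounts = Table.AddColumn(Source, "AmountNumber",
        each try Number.FromText([Amount]) otherwise null)
in
    SafeAmounts
```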
Dataflows simplify data transformation and offer a visual interface. Advantages of dataflows include reusability, centralization, collaboration, and scalability. Their limitations involve data refresh, dataflow management, and advanced transformations.
A reference query reuses an existing query as the starting point for a new one, establishing a relationship between the existing and the new query. This keeps the transformation logic cohesive.
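In M, a referenced query simply starts from the name of the query it references; a sketch, assuming a hypothetical existing query named BaseSales with a Region column:

```
// Query "EuropeSales", created by referencing the existing query "BaseSales"
let
    Source = BaseSales,
    // Steps added here apply only to the referencing query
    EuropeOnly = Table.SelectRows(Source, each [Region] = "Europe")
in
    EuropeOnly
```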
Dataflow vs. reference query: the right choice depends on the scenario. Reference queries can slow down data refresh and overburden data sources, but they can optimize the design and reduce the number of query layers.
Query parameters allow users to input values that can change dynamically and be used in various operations, such as filters or file paths.
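A parameter is itself a small query that returns a value; the sketch below assumes a hypothetical numeric parameter named MinAmount and shows it filtering another query:

```
// Parameter query "MinAmount" (as generated by Manage Parameters):
// 50 meta [IsParameterQuery = true, Type = "Number", IsParameterQueryRequired = true]

// A query that uses the parameter to filter rows dynamically
let
    Source = #table({"Product", "Amount"}, {{"Bike", 100}, {"Helmet", 40}}),
    Filtered = Table.SelectRows(Source, each [Amount] >= MinAmount)
in
    Filtered
```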
In Power BI’s Advanced Editor, we can write, modify, and debug M code to perform data transformations, calculations, and aggregations.
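For example, a calculation written directly as M in the Advanced Editor might look like this sketch (column names are hypothetical):

```
let
    Source = #table({"Quantity", "UnitPrice"}, {{2, 10.0}, {5, 3.5}}),

    // A calculation added as a custom column
    WithTotal = Table.AddColumn(Source, "LineTotal", each [Quantity] * [UnitPrice], type number)
in
    WithTotal
```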
Week 4:
The final course project’s data comes from a hypothetical company called “Adventure Works” and it gives you a chance to apply what you’ve learned so far.
In this exercise, you will apply your knowledge in an end-to-end scenario by using Power Query to clean and transform multiple data sources and join and merge them. You’ll also examine the valid, error, empty, min, max, unique, and distinct values in the rows. This will allow you to identify the anomalies in the data. Finally, you will remove the data sources with anomalies. This exercise will help you understand how to clean, transform, join, and merge data sources in Power Query, and identify potential data anomalies by using data profiling tools.