Chapter 2: Data Engineering Workflow & Core Concepts
This chapter outlines the core workflow of data engineering and highlights the key roles, concepts, and processes that underpin the modern data lifecycle.
🧱 Three Main Pillars of Data Engineering
The data engineering workflow can be conceptualized through three main pillars:
1. Data Production (Ingestion)
2. Data Transformation
3. Data Serving
1️⃣ Data Production (Ingestion)
The entry point of raw data into an organization's ecosystem.
Sources:
- User interactions (e.g., apps, web, IoT)
- APIs and external systems
- Application logs, relational databases
Characteristics: Messy, incomplete, unstructured or semi-structured.
graph TD
    app["Mobile App Usage"] --> sources["Raw Data Sources"]
    ecom["E-commerce Transactions"] --> sources
    iot["IoT Sensors"] --> sources
    apis["External APIs"] --> sources
    oltp["Databases (OLTP)"] --> sources
    logs["Application Logs"] --> sources
    sources --> ingestion["Data Ingestion"]
    ingestion --> lake["Raw Data Lake"]
    style lake fill:#f9f,stroke:#333,stroke-width:2px
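As a minimal sketch of the ingestion step, the Python snippet below lands records from a few sources in a raw zone without altering them. It only wraps each record in an envelope that records the source and the ingestion time. The names `ingest` and `RAW_ZONE` and the sample payloads are hypothetical, and an in-memory list stands in for a real data lake.

```python
import json
from datetime import datetime, timezone

# Hypothetical in-memory stand-in for a raw data lake.
RAW_ZONE = []

def ingest(source, payload):
    """Land one record as-is, wrapped in a small metadata envelope."""
    envelope = {
        "source": source,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        # Keep the payload untouched: raw data may be messy or incomplete.
        "payload": payload,
    }
    # Store JSON text, as a file- or object-based lake typically would.
    RAW_ZONE.append(json.dumps(envelope))
    return envelope

# Made-up sample records showing structured, incomplete, and unstructured input.
ingest("mobile_app", {"user_id": 17, "event": "login"})
ingest("iot_sensor", {"device": "t-01", "temp_c": None})  # incomplete reading
ingest("app_log", "ERROR payment timeout")                # unstructured text
```

Nothing is cleaned or rejected at this point. Deferring that work to the transformation stage keeps the raw copy available for reprocessing.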
2️⃣ Data Transformation
Often consumes 70–80% of a data engineer's effort.
Processes:
- Cleaning, validating, deduplication
- Standardizing and formatting
- Applying business logic
- Aggregation and filtering
graph TD
    lake["Raw Data Lake"] --> engine
    engine --> warehouse["Curated Data Warehouse"]
    subgraph engine["ETL/ELT Engine"]
        clean["Clean Data"]
        validate["Validate Data"]
        standardize["Standardize Formats"]
        dedupe["Remove Duplicates"]
        aggregate["Aggregate & Filter"]
        logic["Apply Business Logic"]
    end
    style warehouse fill:#bbf,stroke:#333,stroke-width:2px
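The transformation steps listed above can be sketched in plain Python. This is an illustration under stated assumptions, not a production pipeline: the order records, field names, and validation rules are all made up, and business rules are left out for brevity.

```python
from collections import defaultdict

# Hypothetical raw order records; fields and values are invented for illustration.
raw_orders = [
    {"order_id": "A1", "country": " us ", "amount": "19.99"},
    {"order_id": "A1", "country": "US", "amount": "19.99"},  # duplicate
    {"order_id": "A2", "country": "de", "amount": "5.00"},
    {"order_id": "A3", "country": "US", "amount": "oops"},   # invalid amount
    {"order_id": None, "country": "FR", "amount": "7.50"},   # missing key
]

def clean(record):
    """Standardize formats: trim and upper-case country, parse amount."""
    try:
        amount = float(record["amount"])
    except (TypeError, ValueError):
        amount = None
    return {
        "order_id": record["order_id"],
        "country": (record["country"] or "").strip().upper(),
        "amount": amount,
    }

def is_valid(record):
    """Validate: require an order id and a non-negative amount."""
    return (
        record["order_id"] is not None
        and record["amount"] is not None
        and record["amount"] >= 0
    )

def transform(records):
    """Clean, validate, and deduplicate on order_id (first occurrence wins)."""
    seen = set()
    curated = []
    for record in map(clean, records):
        if not is_valid(record) or record["order_id"] in seen:
            continue
        seen.add(record["order_id"])
        curated.append(record)
    return curated

def revenue_by_country(curated):
    """Aggregate: total amount per country."""
    totals = defaultdict(float)
    for record in curated:
        totals[record["country"]] += record["amount"]
    return dict(totals)

curated_orders = transform(raw_orders)
totals = revenue_by_country(curated_orders)
```

Here `curated_orders` keeps only orders `A1` and `A2`. The duplicate, the unparseable amount, and the record without an id are dropped before aggregation.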
3️⃣ Data Serving
Delivering refined, trusted data to downstream consumers.
Consumers:
- Data Analysts
- Data Scientists
- ML Engineers
- Business Leaders
graph TD
    warehouse["Curated Data Warehouse"] --> serving
    serving --> consumers
    subgraph serving["Data Serving Layer"]
        models["Custom Models"]
        api["APIs"]
        reports["Reporting Tools"]
    end
    subgraph consumers["Data Consumers"]
        analysts["Analysts"]
        scientists["Scientists"]
        ml["ML Engineers"]
        leaders["Business Leaders"]
    end
    style consumers fill:#ccf,stroke:#333,stroke-width:2px
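One way to picture the serving layer is as a thin function that exposes curated data in a shape consumers can use directly. The sketch below assumes a hypothetical `CURATED_SALES` table held in memory. The function names and figures are illustrative, and a real serving layer would query a warehouse and sit behind an API or reporting tool.

```python
import json

# Hypothetical curated rows, standing in for the result of a warehouse query.
CURATED_SALES = [
    {"region": "EMEA", "month": "2024-01", "revenue": 1200.0},
    {"region": "EMEA", "month": "2024-02", "revenue": 800.0},
    {"region": "APAC", "month": "2024-01", "revenue": 500.0},
]

def sales_report(region=None):
    """Return rows plus summary figures, optionally filtered by region."""
    rows = [r for r in CURATED_SALES if region is None or r["region"] == region]
    return {
        "region": region or "ALL",
        "row_count": len(rows),
        "total_revenue": sum(r["revenue"] for r in rows),
        "rows": rows,
    }

def sales_report_json(region=None):
    """Same report serialized to JSON, the shape an API endpoint might return."""
    return json.dumps(sales_report(region), sort_keys=True)

emea = sales_report("EMEA")                    # e.g. for an analyst's notebook
everything = json.loads(sales_report_json())   # e.g. for a dashboard client
```

Both consumers read the same curated rows, so the numbers in a notebook and on a dashboard stay consistent.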
👥 Key Roles in the Data Ecosystem
graph TD
    sources["Data Sources"]
    swe["Software Engineers"] --> sources
    dba["DBAs"] --> sources
    sources --> de["Data Engineers"]
    de -- "Enable" --> da["Data Analysts"]
    de -- "Support" --> ds["Data Scientists"]
    ds -- "Feed Models to" --> mle["ML Engineers"]
    mle -- "Integrate into" --> swe
Upstream vs. Downstream Data
Upstream = Data producers
Downstream = Data consumers
Bridge = Data Engineers
graph LR
    users["User Actions"] --> apps["Front-end Apps"]
    apps --> oltp["OLTP Systems"]
    oltp --> de["Data Engineers"]
    de --> store["Data Warehouse / Lake"]
    store --> analysts["Data Analysts"]
    store --> scientists["Data Scientists"]
    store --> leaders["Business Leaders"]
    style oltp fill:#add8e6,stroke:#333,stroke-width:2px
    style de fill:#ffb3ba,stroke:#333,stroke-width:2px
    style store fill:#90ee90,stroke:#333,stroke-width:2px
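The upstream and downstream relationship can also be expressed as a small graph traversal. The sketch below reuses the node names from the flow above as edges. The helper names `reachable`, `upstream`, and `downstream` are hypothetical and chosen only for this example.

```python
from collections import deque

# Edges taken from the flow above: producer -> consumer.
EDGES = [
    ("User Actions", "Front-end Apps"),
    ("Front-end Apps", "OLTP Systems"),
    ("OLTP Systems", "Data Engineers"),
    ("Data Engineers", "Data Warehouse / Lake"),
    ("Data Warehouse / Lake", "Data Analysts"),
    ("Data Warehouse / Lake", "Data Scientists"),
    ("Data Warehouse / Lake", "Business Leaders"),
]

def reachable(start, edges):
    """Breadth-first search returning every node reachable from start."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
    found, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in found:
                found.add(nxt)
                queue.append(nxt)
    return found

def downstream(node):
    """Everything that consumes data flowing out of node."""
    return reachable(node, EDGES)

def upstream(node):
    """Everything that produces data flowing into node (edges reversed)."""
    return reachable(node, [(dst, src) for src, dst in EDGES])

producers = upstream("Data Engineers")
consumers = downstream("Data Engineers")
```

Queried from the `Data Engineers` node, `producers` holds the three source-side nodes and `consumers` holds the warehouse plus the three consumer roles, which matches the bridge role described above.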
🧠 Summary
This chapter emphasized:
- The three main stages of data handling
- The collaborative roles in the data ecosystem
- The flow of data from source to consumption
Understanding these fundamentals helps data engineers design systems that scale, support collaboration across roles, and deliver business impact.