What is Data Engineering?

  • Data Engineering is the practice of taking raw data from a data source and processing it so it’s stored and organized for a downstream use case such as
    • data analytics
    • business intelligence (BI) or
    • machine learning (ML) models

Framework of Data Engineering

  1. Ingest
    • Data ingestion is the process of bringing data from one or more data sources into a data platform.
    • These data sources can be files stored on-premises or on cloud storage services (e.g. Microsoft SharePoint), databases, applications and, increasingly, data streams that produce real-time events.
  2. Transform
    • Data transformation takes raw ingested data and uses a series of steps (referred to as “transformations”) to filter, standardize, clean and finally aggregate it so it’s stored in a usable way.
    • Medallion architecture
      • is a popular pattern that divides the transformation phase into three stages: Bronze, Silver, and Gold
      • Bronze - raw ingestion and history
      • Silver - filtered, cleaned, and augmented data
      • Gold - business-level aggregates
  3. Orchestrate
    • Data orchestration refers to how a data pipeline that performs ingestion and transformation is scheduled and monitored, how its individual steps are controlled, and how failures are handled (e.g. by executing a retry run).
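
Example: the three framework steps can be sketched in a few lines of plain Python. This is only an illustrative sketch, not a production design: the sample CSV data, the column names (order_id, country, amount) and all function names (ingest, to_silver, to_gold, run_with_retry, run_pipeline) are assumptions made up for this example. Real pipelines would normally use a data platform and a dedicated orchestrator rather than in-memory lists.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw input standing in for a file-based data source.
RAW_CSV = """order_id,country,amount
1,us,10.50
2,US,4.50
3,de,
4,DE,20.00
"""


def ingest(raw_text):
    """Ingest / Bronze: load rows exactly as received, with no cleaning."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def to_silver(bronze_rows):
    """Silver: filter out rows with a missing amount and standardize values."""
    silver = []
    for row in bronze_rows:
        if not row["amount"]:
            continue  # filter: drop incomplete records
        silver.append({
            "order_id": int(row["order_id"]),
            "country": row["country"].upper(),  # standardize casing
            "amount": float(row["amount"]),
        })
    return silver


def to_gold(silver_rows):
    """Gold: business-level aggregate, here total amount per country."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


def run_with_retry(step, *args, max_attempts=3):
    """Orchestrate: run one step, retrying on failure up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step(*args)
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the last attempt


def run_pipeline(raw_text):
    """Control the order of the steps: ingest, then Silver, then Gold."""
    bronze = run_with_retry(ingest, raw_text)
    silver = run_with_retry(to_silver, bronze)
    return run_with_retry(to_gold, silver)


gold = run_pipeline(RAW_CSV)
print(gold)
```

Each stage keeps its own output, so the Bronze rows remain available as raw history even after the Silver and Gold steps have filtered and aggregated them.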