If the data in a source file, such as a CSV export of a source table, contains all records, including both updated records and new inserts, then there's a bit more work to do, or at least some considerations to take into account.

For instance, say you have an Orders table in a MySQL database as a source table. You can extract the data either in full or incrementally.

Full Load

If you extract the data in full, there's one small addition to make to your loading script: truncate the destination table, say raw_orders, in Snowflake before you run the COPY command.
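
As a sketch, assuming the extracted CSV has already been uploaded to a Snowflake stage (the stage name orders_stage, the file name, and the file format options here are illustrative, not from the original), the full load might look like this:

-- Full load: empty the destination table so only the latest
-- extract remains after loading
TRUNCATE TABLE raw_orders;

-- Load the staged CSV into the now-empty table
COPY INTO raw_orders
  FROM @orders_stage/orders_extract.csv
  FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = ',' SKIP_HEADER = 1);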

Incremental Load

If you extract data incrementally, you don't want to truncate the destination table. If you did, all you'd have left are the new and updated records from the most recent run of the extraction job. There are different ways to handle data extracted this way, but the best approach is to keep things simple.

With incremental loads, you can simply load the data using the COPY command (no TRUNCATE) and rely on the timestamp recording when each record was last updated to later identify the latest version of a record or to look back at its history.
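
For example, assuming the same illustrative stage and file format as in the full-load sketch above, the incremental load is just the COPY command on its own:

-- Incremental load: append the new extract to the existing
-- table rather than truncating it first
COPY INTO raw_orders
  FROM @orders_stage/orders_extract.csv
  FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = ',' SKIP_HEADER = 1);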

For instance, let’s say that a record in the source table was modified and is thus present in the CSV file being loaded. After loading, you’d see something like Table 5-1 in the Snowflake destination table. Though it may feel uncomfortable to have multiple records for the same OrderID in the destination table, in this case it’s the right thing to do! The goal of data ingestion is to focus on extracting and loading data. What to do with the data is a job for the transform phase of the pipeline.
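
For instance, when the transform phase later needs only the latest version of each order, a windowed query like the following sketch (using Snowflake's QUALIFY clause and assuming OrderID and LastUpdated columns in raw_orders) can pick it out:

-- Transform-phase sketch: keep only the most recent version
-- of each order in the append-only raw_orders table
SELECT *
FROM raw_orders
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY OrderID
    ORDER BY LastUpdated DESC
) = 1;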

Example of incremental extraction:

-- Querying from `Orders` table in a MySQL database, selecting
-- only records modified since the previous extraction run.
-- {{ last_extraction_run }} is a placeholder that the extraction
-- job fills in at runtime with a properly quoted timestamp literal.
SELECT * FROM Orders
WHERE LastUpdated > {{ last_extraction_run }};