ETL Pattern: Staged Refresh
Here is a fairly common challenge: perform a complete refresh of the data in a target table. It seems pretty basic and easy. All of the existing data is replaced with the new load. There is no change detection required; just a simple load.
Yet, more often than not, when I go into a client that is having ETL issues I find that they have tackled this challenge with a poor design. A poor design that ultimately results in errors and ongoing headaches for their support staff. This is so common that I sometimes wonder if I have been following the same shortsighted developer from client to client.
This post is my attempt to jump ahead of this mystery developer and hopefully prevent some future headaches. Essentially, the poorly designed ETL process follows this pattern:
So, what is wrong with this? First, by dropping the existing table it effectively makes the table unavailable until the load completes. This can be particularly troublesome when the load process is lengthy. If the load is scheduled during off hours this might not be a problem. Yet, this downtime can be avoided.
Second, if the load fails you are left with an empty table. In my mind this is the most egregious issue with this pattern. An error leaves you with nothing. This can easy be the difference between an emergency and a simple annoyance. If the process had left the original data in place it would be old and stale but at least it is still accessible.
Here is a modified pattern that helps reduce the issues with the original:
This is a classic pattern for handling a complete refresh. The load itself is completed against a staging table. This leaves the existing target in place and doesn’t interrupt use while the load is running. The target table is only ever dropped after a quality check has been performed. This helps ensure that the load was successful and should prevent the empty table issue that the original process had.
There is nothing complicated about this pattern. It is still simple and straight forward. But, unlike the original, some forethought is put into dealing with failure. I hope this encourages you to take a look at how your ETL deals with problems. Does it handle it gracefully, or does it leave you with an emergency?





Pierre, first of all, thank you for great blog. I am new to BI, so there is a lot to learn. I have a question about ETL pattern I have noticed at one of the client sites. There are 2 databases, DW and DW_Staging. First everything gets truncated in DW_Staging and gets reloaded, then there is a check for successfull load, after which they truncate everything in DW database and load it from DW_Staging. I was wondering what your thought would be on this sort of design. Is it something that I should correct or essentially since it is working, leave it alone? What are implications?
Ultimately, the solution needs to be tailored to the situation and there isn’t going to be a blanket answer. However, there are a few things I would immediately look at when evaluating it. First, I’d try to understand why a full load is being performed. Generally a data warehouse is there for historical data. If the data is historical, why is it being refreshed in full each time?
There are legitimate reasons for doing a complete reload but more often than not I find that it is due to lazy design. It is just easier to reload everything than it is to detect changes. Maybe, if I get some time I will do some posts on common change detection patterns for SSIS.
Back to the design that you describe: I’d worry about the down time and possibility of error resulting in an empty table for end users. In the design described in the blog post the data is fully loaded and verified on the target system before the existing data is dropped. And, finally, the staging table is simply re-named to replace the target table. All of these procedures are very quick and reduce the chance for error. It isn’t perfect, but it is fairly bulletproof.
In your situation, it sounds as though the data is loaded onto one server/database (DW_Staging), verified, and finally moved to the target server/database (DW). Even if the databases co-exist on the same server it is going to take time to move data. And, this moving takes place after you have deleted all of the existing data. So, I would argue that it isn’t as good of a solution.
Anyhow, like I said, the solution is going to depends on the details of your situation. But, this should at least give you something to think about. Thanks for the comment and good luck!
Thank you for your response!