I am currently working on a framework for ETL written in Ruby. For a brief description of why I am doing this, please read the linked post. To see all related content, view everything tagged "ETL Framework".
Now that we have some ETL steps that we can build, we need some logical way of running them. Favoring composition over inheritance, I went with a container class called ETLProcess. The basic UML for the ETLProcess class, along with its relationship to the ETLStep class, is below:
[Figure: ETLProcess class with relationships to ETLStep classes]
A lot can be said about why I chose this design, but I think the best way to talk about it is to actually show how it works. The following Ruby code shows how you would use the ETLProcess and ETLStep classes to build out your ETL:
require 'etl_lib'
etl = ETLProcess.new 'My ETL Process'
etl.add_step(MyETLStep1.new)
etl.add_step(MyETLStep2.new)
etl.start
Let's walk through the parts here to get an idea of what is going on. The first line is a simple require to make sure that we have access to the ETLProcess and ETLStep classes. Next, we create a new object called "etl" that is the ETL process. Then we add the steps to the etl object with the add_step method. A lot of behind-the-scenes work happens when this is done, which I will go into in another post. Notice that each step is created on the fly as it is added to the process. When the add_step method is called, the ETLStep object is added to an array called steps. Why not just append the steps to that array directly? Because we need to do a lot with the step object before it is ready to be run, and add_step is where that happens.
The start method of the process simply loops through the steps array and calls the start method on each step. I wanted to hide a lot of the background work from the actual ETL contained in the ETLStep, which is why each step has two methods, run and start. The start method does some background work and then executes the run method in the step. This keeps the development of new ETL squeaky clean and focused on the actual ETL rather than on backend tasks like bookmarking and database connection handling.
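To make the start/run split concrete, here is a minimal sketch of how it might look. This is not the framework's actual code: the SketchStep and CopyOrders classes, the log array, and the :setup/:teardown markers are all made up for illustration, standing in for the real background work.

```ruby
# Hypothetical sketch of the start/run split; not the real ETLStep.
class SketchStep
  attr_reader :log

  def initialize
    @log = []
  end

  # Called by the process. Does the background work, then hands off to run.
  def start
    @log << :setup       # stand-in for bookmarking, opening connections, etc.
    run
  ensure
    @log << :teardown    # stand-in for cleanup; runs even if run raises
  end

  # Subclasses override this with the actual ETL.
  def run
    raise NotImplementedError, "#{self.class} must implement run"
  end
end

# A step author only writes run and never touches the background work.
class CopyOrders < SketchStep
  def run
    @log << :copy_orders
  end
end

step = CopyOrders.new
step.start
p step.log
```

The point of the split is that a new step only has to define run, while everything in start stays in one place.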
The other interesting thing the ETLProcess class does is track errors in each ETLStep and control whether the step should be rolled back. By default, the process will run the step's rollback method if the step fails. To turn this behavior off, you can specify it as an option when you add the step to the etl object. For example:
etl.add_step(MyStep.new, {:rollback_on_fail => false})
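As a rough sketch of how a process could honor that option, assume each step is stored alongside its options and that a failure surfaces as an exception. The OptionSketchProcess and FailingStep classes are hypothetical, and stopping at the first failed step is an assumption of this sketch, not something the real ETLProcess is known to do.

```ruby
# Hypothetical sketch of per-step rollback handling; not the real ETLProcess.
class OptionSketchProcess
  DEFAULT_OPTIONS = { :rollback_on_fail => true }

  def initialize(name)
    @name = name
    @steps = []
  end

  # Store the step together with its options, filling in the defaults.
  def add_step(step, options = {})
    @steps << { :step => step, :options => DEFAULT_OPTIONS.merge(options) }
  end

  # Run each step; on failure, roll it back unless the option is turned off.
  # Assumption: the process stops at the first failure and returns false.
  def start
    @steps.each do |entry|
      begin
        entry[:step].start
      rescue StandardError
        entry[:step].rollback if entry[:options][:rollback_on_fail]
        return false
      end
    end
    true
  end
end

# A stand-in step that always fails and records whether it was rolled back.
class FailingStep
  attr_reader :rolled_back

  def initialize
    @rolled_back = false
  end

  def start
    raise "load failed"
  end

  def rollback
    @rolled_back = true
  end
end

guarded = FailingStep.new
with_rollback = OptionSketchProcess.new('With rollback')
with_rollback.add_step(guarded)
with_rollback.start

unguarded = FailingStep.new
without_rollback = OptionSketchProcess.new('Without rollback')
without_rollback.add_step(unguarded, { :rollback_on_fail => false })
without_rollback.start

p guarded.rolled_back
p unguarded.rolled_back
```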
Another way the ETLProcess controls the running of ETL steps is by providing a global rollback method. That way, if you wanted, you could roll back the entire ETL process if something didn't work just right. Here's an example:
require 'etl_lib'
etl = ETLProcess.new 'My ETL Process'
etl.add_step(MyETLStep1.new)
etl.add_step(MyETLStep2.new)
etl.start
success = true
# code to check to see if the ETL ran as expected..
# set success to false if it doesn't look good
etl.rollback unless success
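For illustration, a global rollback could be as simple as asking every step to undo itself. The sketch below uses hypothetical RollbackSketchProcess and RecordingStep classes and assumes the steps are undone in reverse order, which seems the natural choice but is not confirmed by the real implementation.

```ruby
# Hypothetical sketch of a process-wide rollback; not the real ETLProcess.
class RollbackSketchProcess
  def initialize(name)
    @name = name
    @steps = []
  end

  def add_step(step)
    @steps << step
  end

  def start
    @steps.each { |step| step.start }
  end

  # Assumption: undo the steps in reverse order, last one first.
  def rollback
    @steps.reverse_each { |step| step.rollback }
  end
end

# A stand-in step that writes what it does to a shared journal.
class RecordingStep
  def initialize(name, journal)
    @name = name
    @journal = journal
  end

  def start
    @journal << "start #{@name}"
  end

  def rollback
    @journal << "rollback #{@name}"
  end
end

journal = []
etl = RollbackSketchProcess.new('Nightly load')
etl.add_step(RecordingStep.new('extract', journal))
etl.add_step(RecordingStep.new('load', journal))
etl.start

success = false   # pretend the post-run check failed
etl.rollback unless success

puts journal
```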
That's a quick introduction to the ETLProcess class. The next big hurdle is how to create unit tests for the ETL steps. I'll show you how I tackled that problem in the next post.