I am currently working on an ETL framework written in Ruby. For a brief description of why I am doing this, please read the linked post. To see all related content, view everything tagged "ETL Framework".

The foundation of my ETL framework is the ETL step. An ETL step is where the developer actually programs a specific task for the ETL to perform. Examples would be downloading files from an FTP site, or loading one or more related tables. The key is that the task sits at the level at which you want to track progress.

Some steps may even be reusable. For example, you may want a step that loads files from a specific location into a processing queue for later steps to parse and load. Rather than writing one step per location, you can write a single step that takes a parameter indicating where the source files live, and then execute it several times.

Since Ruby is, by its nature, object oriented, the ETL should be designed in an object-oriented manner. Here is a simple class diagram of the ETLStep class.

[caption id="attachment_103" align="alignnone" width="265" caption="ETLStep Class Diagram (inherited properties omitted for subclasses)"]
[/caption]

The superclass, or parent class, is where all the shared functionality lives. Here we can control bookmarking and meta information about the ETL step. Notice also that ETLStep is abstract. As in Rails and other Ruby-based frameworks, to write ETL in this framework you create a subclass of ETLStep. The subclasses require only two methods, run and rollback. The run method is where you develop your ETL process, and the rollback method is where you specify how to roll back that step. The other methods and attributes in the parent class provide additional functionality that I will talk about in another post.

This design gives us some very basic, but powerful, functionality when building ETL. Each step is responsible for all the ETL that it performs. If it should fail in any way, the rollback method should clean up after the task, leaving the data environment just as it was before the task ran. Also, each ETL step inherits the ability to perform logging, access database connections, and access global information from the entire ETL process. The ETL developer doesn't have to build any of this; it simply comes through inheritance.

So now all that is needed is a way to execute the run method. This is done by the ETLProcess object. In the next post, I'll talk about the ETLProcess class and how to actually create and run the ETL.
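To make the run/rollback contract described above more concrete, here is a minimal sketch of how an abstract parent class and one reusable, parameterized subclass might look. This is not the framework's actual code: the constructor arguments, the `QueueFilesStep` class, and its `queued` attribute are all hypothetical names chosen for illustration, and the bookmarking, logging, and connection features of the real parent class are left out.

```ruby
require "fileutils"

# Hypothetical, stripped-down stand-in for the abstract parent class.
# The real ETLStep carries more attributes and methods than shown here.
class ETLStep
  attr_reader :name

  def initialize(name)
    @name = name
  end

  # Subclasses must override run with the actual ETL work.
  def run
    raise NotImplementedError, "#{self.class} must implement run"
  end

  # Subclasses must override rollback to undo whatever run did.
  def rollback
    raise NotImplementedError, "#{self.class} must implement rollback"
  end
end

# Hypothetical reusable step: copies every file in source_dir into
# queue_dir so that later steps can parse and load them. The source
# location is a parameter, so the same class can be run several times.
class QueueFilesStep < ETLStep
  attr_reader :queued

  def initialize(name, source_dir, queue_dir)
    super(name)
    @source_dir = source_dir
    @queue_dir = queue_dir
    @queued = []
  end

  def run
    Dir.glob(File.join(@source_dir, "*")).sort.each do |path|
      next unless File.file?(path)

      target = File.join(@queue_dir, File.basename(path))
      # Record the target before copying so a failed copy is still cleaned up.
      @queued << target
      FileUtils.cp(path, target)
    end
  end

  # Removes only the files this step queued. This sketch assumes the
  # queue did not already contain files with the same names.
  def rollback
    @queued.each { |path| FileUtils.rm_f(path) }
    @queued.clear
  end
end
```

A caller would build the step with something like `QueueFilesStep.new("queue_orders", "/data/incoming", "/data/queue")` (both paths are made up), call `run`, and call `rollback` if `run` raises. In the framework itself, that responsibility belongs to the ETLProcess object rather than to hand-written calling code.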