Developing an ETL Framework


I have recently undertaken a project to write an ETL framework from the ground up.  Why would I want to do this?  In my study of the data warehousing world, there are basically two options for building ETL: 1) you can buy a commercial off-the-shelf tool, or 2) you can build your own using a combination of scripting languages and stored procedures.  If you know of anything else, let me know.  Here are some of the pros and cons of these options:

Commercial Tool Pros:
  • Well tested and very powerful
  • Technical Support
  • Usually have a GUI to aid in building complex ETL
  • Scalable
  • Handles very large data movement
  • Uniform ETL Development and Maintenance
  • Prepackaged Transforms
  • Data Lineage
Commercial Tool Cons:
  • VERY VERY Expensive
  • Overkill for many ETL initiatives
  • Heavy architecture requires specialized skills for maintenance
Scripting Pros:
  • Get something up very quickly
  • Very lightweight
  • Extremely flexible (you can write the ETL to do anything you want)
  • Very low cost of ownership (You can run scripted ETL on a VM if you want)
Scripting Cons:
  • Unwieldy maintenance
  • No scalability
  • Very little code reuse
  • No data lineage
Both solutions have their advantages.  If I had my preference, I would always use a commercial tool in combination with stored procedures (Oracle, please).  Unfortunately, not all ETL projects have the budget for this, so we compensate with scripting.  I love scripting, don't get me wrong.  In many ways, scripting was the first ETL tool, long before data warehousing entered the mainstream.  For the project I'm on, scripting was really the only option.  Because I've used an off-the-shelf tool and learned a lot from it, I was really dreading starting from scratch and losing all the built-in functionality.  I decided, however, that I could get back much of the functionality I would miss by building a simple ETL framework first.  Here is some of the functionality I wanted:
  1. Centralized database connection management (something like ODBC, but utilizing native connections)
  2. A simple database wrapper to take some of the pain out of interfacing with the database
  3. Ability to reuse ETL steps
  4. Data bookmarking
  5. Integrated testing
  6. Reusable ETL tools
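
To make a few of these wishes concrete, here is a minimal Ruby sketch of how centralized connections, reusable steps, and data bookmarking might fit together.  This is not the framework's actual code: the class names (ConnectionRegistry, Bookmark, Step, Pipeline) are hypothetical, and an in-memory array stands in for a native database connection.

```ruby
# Hypothetical sketch: none of these names come from the actual framework.

# Central registry: scripts ask for a connection by name instead of
# embedding connection details themselves.
class ConnectionRegistry
  def initialize
    @factories = {}
    @connections = {}
  end

  # Register a block that builds the connection on first use.
  def register(name, &factory)
    @factories[name] = factory
  end

  # Return the cached connection, building it lazily.
  def fetch(name)
    @connections[name] ||= @factories.fetch(name).call
  end
end

# A bookmark remembers the highest key already loaded, so a rerun
# only picks up rows it has not seen yet.
class Bookmark
  attr_reader :position

  def initialize(position = 0)
    @position = position
  end

  def advance(new_position)
    @position = new_position if new_position > @position
  end
end

# A reusable step is a named callable that maps rows to rows.
class Step
  attr_reader :name

  def initialize(name, &block)
    @name = name
    @block = block
  end

  def call(rows)
    @block.call(rows)
  end
end

# A pipeline filters out bookmarked rows, chains the steps, and then
# moves the bookmark forward.
class Pipeline
  def initialize(steps, bookmark, key:)
    @steps = steps
    @bookmark = bookmark
    @key = key
  end

  def run(source_rows)
    fresh = source_rows.select { |row| row[@key] > @bookmark.position }
    result = @steps.reduce(fresh) { |rows, step| step.call(rows) }
    @bookmark.advance(fresh.map { |row| row[@key] }.max) unless fresh.empty?
    result
  end
end

registry = ConnectionRegistry.new
registry.register(:source) do
  # Stand-in for a real database: an in-memory array of rows.
  [{ id: 1, name: " ann " }, { id: 2, name: "bob" }, { id: 3, name: " cy" }]
end

trim   = Step.new("trim names")   { |rows| rows.map { |r| r.merge(name: r[:name].strip) } }
upcase = Step.new("upcase names") { |rows| rows.map { |r| r.merge(name: r[:name].upcase) } }

bookmark = Bookmark.new(1) # pretend row 1 was loaded by an earlier run
pipeline = Pipeline.new([trim, upcase], bookmark, key: :id)
loaded = pipeline.run(registry.fetch(:source))
```

Because the steps are plain objects, the same trim or upcase step can be dropped into any other pipeline, and because the bookmark is advanced only after the steps succeed, a failed run can simply be started again.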
My first inclination was to look for a framework that already exists.  I didn't really want to use Perl because I have never had a chance to learn it.  Under my current timeline, my best option was to write the ETL in Ruby.  The only Ruby-based ETL framework that I could find was a plugin for Ruby on Rails called ActiveWarehouse.  This is a nice start, but it didn't have the functionality I needed, and I didn't want to spend time digging through someone else's code trying to figure out how the thing worked.  So, I finally decided to write my own.  What has come out of this is a very interesting framework for building, testing, and deploying ETL that has all the virtues of scripted ETL with some of the advantages of the big boys.  In the coming weeks, I will be writing about some of the design decisions I've made and about my experience developing the ETL framework.  I hope the tiny community that I have can help me flesh out some issues and provide me with some constructive criticism, so that what comes out of this project is something that can be a real benefit to others in similar ETL-building situations.
