Square pegs and round holes: A reflection on Class::DBI (OSDC 2006)

Presentation (142KB powerpoint)

Stephen Edmonds
Senior Portal Developer / Integrator
Flexible Learning and Teaching Program
Information Technology Services
Monash University

Class::DBI is a database framework for Perl that makes it easy to get a project up and running. Unfortunately it is also a complex beast with a lot of dependencies that cannot always be manipulated into the shape that the user needs.

A reflection on two developments:

These two examples show that adopting an established framework is not always the best path to take as the needs of the organisation must take precedence over the needs of the framework.


Introduction

Class::DBI is a framework for the Perl programming language that enables a database backed project to quickly get off the ground, potentially without writing a single line of SQL.

Even a cursory glance at the documentation for Class::DBI demonstrates its capabilities in regard to supported relationship types, data validation, supported databases, and the ability to extend beyond the base functionality. Unfortunately with this power comes a great complexity that, especially when working in a manner that doesn't quite match, can potentially undermine a project that may not exist without Class::DBI.

Fitting the square peg into the round hole

During 2004 Monash University undertook a trial of a commercial Learning Management System (LMS) for the purposes of delivering online teaching materials. From a trial during semester one we determined that a 'middleware' solution was needed that drew information from the commercial product and existing systems to provide:

At the core of our solution was an Oracle database that replicated data from all relevant systems in order to relate the data in a robust manner. Due to time constraints we chose to use Class::DBI to implement classes that mapped to the database tables. This formed the foundation for a web based interface and a series of batch processes to synchronise data between the systems and the middleware.

Initial development

Clash of the standards

Out of the box Class::DBI assumes a particular set of database conventions. For example when setting up a has_a relationship the name of the column that contains the foreign key becomes the name of the method to return objects of the foreign class. This is demonstrated by an example featured in the Class::DBI documentation:

Music::CD->has_a(artist => 'Music::Artist');

print $cd->artist->name;

Our naming convention for relationships is that the column containing the foreign key is named the same as the primary key of the related table. This works well for us as our standard for naming primary keys is to use the name of the table suffixed with '_id'. The example modified for our standards:

Music::CD->has_a(artist_id => 'Music::Artist');

print $cd->artist_id->name;

The interface that is now produced features the artist_id() method which should be assumed returns an id, not an object of a related class. To compensate for this we follow every has_a() declaration with a modification to the symbol table to alias the Class::DBI provided method with our preferred name:

Music::CD->has_a(artist_id => 'Music::Artist');
*artist = \&artist_id;

print $cd->artist->name;

While this provides the interface we desire, a line of code such as this increases the learning curve of a developer working on the class as it is far from commonplace to modify the symbol table in everyday development.

As we are also using the accessor_name() and mutator_name() methods to separate our accessors from our mutators we actually require two modifications to the symbol table:

Music::CD->has_a(artist_id => 'Music::Artist');
*get_artist = \&get_artist_id;
*set_artist = \&set_artist_id;

print $cd->get_artist->name;

We have since learnt that a much more robust solution is to remove trailing '_id' fragments from column names globally through accessor_name() and mutator_name():

sub accessor_name
{
    my ($class, $column) = @_;
    $column =~ s{ _id \Z }{}smxgi;
    return q{get_} . $column;
}

sub mutator_name
{
    my ($class, $column) = @_;
    $column =~ s{ _id \Z }{}smxgi;
    return q{set_} . $column;
}

Although this assumes that a trailing _id on any column name can be safely removed and increases the disconnect between the declaration and the resulting interface it does allow the following declaration:

Music::CD->has_a(artist_id => 'Music::Artist');

To produce a get_artist() and set_artist() method.

One nice thing that Class::DBI does is to map whatever the primary key of the table is to an id() method on the class. accessor_name() does not apply to this so a method was required in order to continue our consistent interface of get_ methods:

sub get_id
{
    my ($self) = @_;
    return $self->id();
}

Unlike our symbol table modification to remove _id from method names we were able to write this once in the top level class.

Playing nice with friends

my.monash is a web portal developed using Apache, mod_perl and HTML::Mason that provides a personalised experience to the staff and students of the university. It is here that we provide an application through which staff manage the details and access control of their sections within the LMS. As development of this application was nearing completion we begun using the classes in various areas of my.monash to determine whether or not the logged in user would be provided with links or other information. This proceeded with little issue until one day the database was not available.

As we knew that the default error handling behaviour of Class::DBI was to throw an exception we had dutifully wrapped any usage of the classes outside of the application with an eval in order to handle the error gracefully. We had not treated the use statements in a similar manner which meant that instead of a page missing one of twenty elements, the entire page was replaced by an error message when the database was unavailable.

While the solution for this is as simple as another few eval statements, the greater problem for us was regarding expectations. Every Perl module that we currently use, both those available on the CPAN and those we developed in house, to obtain or manipulate data during the generation of a page within my.monash follow an error at runtime paradigm. That is, you can load the module as far in advance as needed but it is only when a function is called or a class instantiated that work begins. It this point an error may be returned or thrown.

Class::DBI is built upon Ima::DBI which follows an error on load paradigm. As soon as Ima::DBI is loaded and configured with database settings it attempts to connect to the database and if there is a problem, for example the database being unavailable, an exception is thrown which prevents further loading of the calling code. We found this paradigm appropriate in our batch processes as it makes no sense for the script to continue to run if there is no database to manipulate.

This deviation from our normal expectations has meant that code to provide our web interface that uses these classes must be of a more defensive nature in order to allow other code to proceed uninterrupted.

Update 22 November 2007: An extended downtime of the database led us to reinvestigate this issue and we found that we were causing it by changing the default date format to include the time. -- A correction to my paper from OSDC2006

I canna do it captain!

One batch process compares a daily extract from our directory service against the database in order to apply any changes in regard to id numbers, names (including uid, given and surname) and user types. The nature of this problem requires at least one of the data sets, from the extract or from the database, to be entirely in memory at one time. Records are removed from memory as they are processed which allows any items remaining at the end of the run to be processed appropriately.

At first this seemed to be the straightforward task of writing a script that loaded the extract into memory and then used retrieve() to obtain the object to compare against. Unfortunately when this was run on a real extract that contained upwards of ninety thousand records the overhead of performing that many separate SELECT queries took its toll with a runtime of over an hour. This was not acceptable.

The second approach continued to use Class::DBI except that all of the objects were first loaded into a hash at the beginning of the run using retrieve_all(). The reasoning behind this was that a single query on the database would be more efficient than thousands of separate ones. There was an improvement in runtime but it was not significant and crucially it varied between runs, even runs that did not make any changes to the database. We had replaced a database bottleneck with a memory allocation problem which was again not acceptable.

Our next approach was to avoid the classes and perform a direct select on the database in order to obtain a large hash of the data, essentially the same data in memory as the second approach but just as hash references, not objects. The script then compares each record in the extract against the information in this hash and only if there is a difference will the appropriate object be retrieved and modified. As the six minute, on average, runtime of this version was acceptable, and a staggering improvement, it became the final approach we tried.

In order to achieve acceptable performance in this case we were required to violate the database abstraction provided by Class::DBI. The purist cringes...

Maintenance and further enhancement

In the two years since initial deployment of the middleware to production a significant number of changes have been applied to correct bugs as they were encountered, to improve performance and to integrate with additional university systems. Some of these changes have been more problematic to implement than expected.

Where is the object?

All objects of classes built upon Class::DBI automatically stringify to the value of their primary key. While the Class::DBI documentation states that this is a significant convenience, in our experience we have found that this has caused unnecessary debugging of a non existent issue.

As our preferred method of accessing the primary key of an object through its get_id() method we did not include in our documentation the behaviour in string context. Over past two years of supporting and extending this product there have been a few situations where developers who where unaware of this behaviour investigated why methods were returning an id instead of an object. Ultimately they realised that it was their method of debugging (for example print or warn statements without using Data::Dumper) that was the issue.

Unforeseen implications

In order to ensure a robust audit trail our database schema was designed with a corresponding history table for each table. Triggers within the database automatically copy records into the history tables upon insert, update and delete actions. To ensure that these actions are attributed to the appropriate user we redefined the create(), update() and delete() methods to require a username argument which is supplied to a PL/SQL procedure for use by the triggers.

In the majority of cases this has been successful with only a few instances where it was awkward to provide an appropriate username. However when we attempted to take advantage of more complicated features of Class::DBI such as the add_to_ methods or the cascading delete ability we found that our username requirement was now a hindrance because when these methods call create(), update() or delete() internally they fail as no username is supplied.

If our knowledge of how Class::DBI operates internally was more complete it may have been possible for us to implement our requirements in a different manner that would not prohibit these functions. As it currently stands, we are in a situation where knowledge of these functions can be frustrating as they cannot be used.

Building a round peg

'Course Finder' is a web based application that enables prospective students of the university to search for, and obtain information on, courses that are available to study. It consists of the following components:

In mid 2005 we were tasked with implementing a number of enhancements to the 'Course Finder' application that had arisen from a series of usability tests that had been performed since the application had gone live the previous year. Unfortunately what had been developed did not lend itself to modification so our first task was to refactor the entire application in order to bring it up to standard.

Class::DBI or not Class::DBI... That is the Question

Once we had engineered a new data model we realised that the nature of the information represented in the database lent itself to the approach of mapping a series of classes directly onto the database tables. At first our inclination was to use Class::DBI however there were two mitigating factors:

It lives!

Shortly after we began implementing the classes, with DBD::Oracle as the only non-standard module, we realised that some form of framework would be required in order to avoid a copy and paste operation for twenty classes.

As with Class::DBI we first inherit from a common class:

use base 'Monash::CourseFinder::Base';

The base class defines the database settings as well as the framework which contains common creation and retrieval methods as well as a helper through which the class attributes and relationships are declared.

First we specify the name of the table:

__PACKAGE__->_set_table_name( 'course' );

As our framework is tailored for our database standards, information such as the name of the primary key and the name of the sequence for primary key values can be determined from the table name. This class method makes all of this information available through closures installed in the appropriate package.

Declare table columns:

__PACKAGE__->_setup_attribtes(
   qw{
      code
      name
      cricos_code
      duration
      ...
   },
);

Unlike Class::DBI only the attributes that are purely data, ie not objects of related classes, are specified in this manner. get_ and set_ closures are installed for each attribute.

Define a has_a relationship with another class:

__PACKAGE__->_setup_has_a_relationship(
   'class' => 'Monash::CourseFinder::Faculty',
   'get_method' => 'get_faculty',
   'set_method' => 'set_faculty',
);

This is very close to has_a() from Class::DBI except that the method names are manually defined which provides us with flexibility in what to call the name which includes omitting the set method entirely. Internally, the implementation of the methods is dependant on the database following our standard of the foreign key being named the same as the primary key of the foreign table.

More complicated is our version of defining a has_many relationship:

__PACKAGE__->_setup_has_many_relationship(
   'class'         => 'Monash::CourseFinder::AreaOfStudy',
   'table'         => 'course_area_of_study',
   'get_method'    => 'get_areas_of_study',
   'add_method'    => 'add_area_of_study',
   'remove_method' => 'remove_area_of_study',
);

This declaration is one side of a many to many relationship which involves an intermediary table (course_area_of_study). The flexibility of manually specifying the method names is evident in this case due the plural form used for the get method.

When declaring a one to many relationship the same class method is used except that the table specified is not an intermediary as can be seen in the declaration on the other side of the has a declaration above:

__PACKAGE__->_setup_has_many_relationship(
   'class'         => 'Monash::CourseFinder::Course',
   'table'         => 'course',
   'get_method'    => 'get_courses',
);

In this case the add and remove methods are not specified as they are not required in the object interface.

With these four utility methods we were able to declare a complete set of classes to represent the entire database schema. Twelve of the current twenty one classes remain in this state with the other nine featuring additional methods to perform specific data manipulations.

It was also interesting to note that as we continued development we were implementing more and more of the features found in Class::DBI with a significant exception, we only implemented the features we needed and we implemented them to fit exactly with our standards.

Was it worth it?

One of the first questions that is asked when people learn that we developed our own simplified version of Class::DBI is whether it was worth the effort. This is a difficult question to answer as a number of the factors, such as ease of use and confidence, are qualitative not quantitative.

We can say that adopting Class::DBI would have incurred a lower up front cost. But what about ongoing costs? It has been demonstrated that a developer that is familiar with our standards and practices can quickly make changes to the 'Course Finder' database and classes. If we had adopted Class::DBI this developer would first need to understand the standards and practices of Class::DBI.

This argument in favour of our framework can be turned around by saying that the cost of each developer understanding Class::DBI has at least two benefits. First they will then be able to work on another Class::DBI based application with minimal up front cost and secondly they will be exposed to alternate practices that should ultimately produce a more efficient developer.

If pushed to give an answer one way or another our answer is that we are in a better position now than we were before due to two factors; we did not have to deploy a copy of Class::DBI and its dependencies and the act of creating a framework has given us experience that we can feed back into our future work.

Another peg, possibly triangular

This paper has so far have referred to two database abstraction layers, the popular Class::DBI that we bent into our desired shape and our home grown framework that is tailor made to our conditions. Another significant alternative to these is DBIx::Class which was released in August 2005 and since that time it has been put forward to be used in a couple of our projects. As these projects were best served by a simple procedural interface over an object-oriented interface we have not yet had a detailed look at DBIx::Class.

The simple answer to why we did not consider DBIx::Class for either of the 'middleware' or 'Course Finder' applications is a temporal one:

Although our Class::DBI solution for the 'middleware' has proven to be robust we are currently in the early stages of a project to redevelop the solution, possibly as far as starting again from scratch. The driver for this is a combination of enhancing our ability to respond to user requests, altering the system model to match with the current user model and to take advantage of features provided by the next version of the commercial product. It is too early to say whether we will stick with Class::DBI (either extend or start over), switch to DBIx::Class or even adopt our own framework. It can be said that each alternative will be considered on its merits.

Conclusion

These examples show that adopting an established framework is not always the best path to take as it is unlikely that any standardised framework will perfectly match the requirements of the application where it is put to use. Conversely they show that building your own solution may end up duplicating a significant amount of the various frameworks that are available.

We also found that the experience of implementing something as potentially complex as a data abstraction layer would be beneficial to the growth of the development team even if it were done purely as an exercise. Experience can teach much more than documentation.

Ultimately the needs of the organisation must take precedence over the needs of the framework. If the solution is time critical then the best option may be to adopt the established framework and work around the differences. On another hand if quality of the solution is critical to the organisation then any path must be considered irrespective of the cost.