Developing the Monash Research Directory (OSDC2004)

Presentation (1.1MB powerpoint)

Stephen Edmonds
Portal Developer / Integrator
Flexible Learning and Teaching Program
Information Technology Services
Monash University

The Monash Research Directory was created to raise the profile of research activities and researchers at Monash University. Written in perl, using readily available modules, the Monash Research Directory queries two disparate data sources, one a commercial application containing research information and the other being a directory of all students or staff members of the University. The two sets of results are combined into a unified whole that is presented through two web based interfaces, one available to the general public and the other restricted to current students or staff members of the University.

Introduction

Monash University produces a large amount of research output every year which encompasses such items as books, journal articles or conference papers. The majority of these have their details entered into a central system in order to qualify for funding.

It was proposed that a publicly available directory of this information, coupled with biographies of the authors, could raise the profile of research at Monash and potentially increase the opportunities for collaborative research. This proposal was realised as the Monash Research Directory.

Research Master

Representatives from each faculty of the University enter the details of research output from their faculty into a commercial product called Research Master, the administration of which is one of the many services provided by the Research Grants and Ethics Branch of the University.

There are two modules within Research Master that are relevant to the Research Directory. The publications module contains details of research output that has been published and the personnel module contains details of the authors of the research output. There are additional modules within Research Master which were excluded from the Research Directory as they contain information regarding ongoing research projects.

The composition of Research Master is that of a Windows client application communicating with an Oracle database. Fortunately this means that the standard perl database module (DBI) can be used in conjunction with the Oracle database driver (DBD::Oracle) to perform searches within this data source.

Monash Directory Service

The Monash Directory Service contains an entry for every current student or member of staff at the University as well as entries for specific roles or external users. The majority of details are updated automatically from systems such as the payroll system or the internal telephone directory which means it is potentially the most up to date centralised source of person information in the University.

Staff members are able to add additional information to their own record. This additional information includes research interests, academic supervisory activities, community involvement, professional associations and a biography. This is exactly the value added information that was envisaged as being included in the Research Directory.

The Monash Directory Service is a standard LDAP service that can be queried using the Net::LDAP perl module.

Requirements

There were a number of requirements identified early in the development process:

Utilise data from existing systems.
Present the most up to date information possible.
Only display research output generated by current Monash staff.

By querying Research Master and the Monash Directory Service, the Research Directory takes advantage of the two centralised systems that contain the most up to date information regarding published research output and personal details.

As the Monash Directory Service only contains entries for active accounts, or those transitioning to or from active accounts, it was determined that the results from Research Master could be correlated with the Monash Directory Service so that only publications with at least one author who is a current staff member of the University are presented.

An additional requirement that arose was that there should be a staged release, with the Research Directory available first to Monash users and later to the general public. This was achieved by developing two interfaces to the Research Directory. The first is within the my.monash portal, which requires authentication and is built upon the web site authoring system HTML::Mason. The second is on the central Monash website, where a variety of technologies are available and a perl CGI solution was chosen.

Modules

The problem thus far was one of how to provide user interfaces in two environments to present data gathered from two independent systems:

[Figure: system diagram]

The solution was to construct a set of object oriented perl modules to perform the bulk of the business logic; in a sense, the glue between the interfaces and the back end systems. A hierarchy of three classes was designed to represent the fundamental aspects of the Research Directory: the directory itself, an author and a publication. As a single author can contribute to more than one publication, and a single publication may have multiple authors, the relationships can be visualised as a triangle:

[Figure: class diagram]
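The many-to-many relationship between authors and publications can be illustrated with a minimal sketch. The Sketch::Author and Sketch::Publication packages below are hypothetical, simplified stand-ins and not the production classes; they only show one plausible way of recording the relationship in both directions, keyed by identifier.

```perl
use strict;
use warnings;

# Hypothetical, simplified stand-ins for the Author and Publication classes.
{
    package Sketch::Author;

    sub new
    {
        my ($class, %args) = @_;
        return bless { id => $args{id}, name => $args{name}, _publications => {} }, $class;
    }

    sub get_id { return $_[0]->{id} }
    sub name   { return $_[0]->{name} }

    # Keyed by id, so adding the same publication twice is harmless.
    sub add_publication
    {
        my ($self, $publication) = @_;
        $self->{_publications}->{ $publication->get_id() } = $publication;
        return;
    }

    sub publications
    {
        my ($self) = @_;
        return map { $self->{_publications}->{$_} } sort keys %{ $self->{_publications} };
    }
}

{
    package Sketch::Publication;

    sub new
    {
        my ($class, %args) = @_;
        return bless { id => $args{id}, title => $args{title}, _authors => {} }, $class;
    }

    sub get_id { return $_[0]->{id} }
    sub title  { return $_[0]->{title} }

    sub add_author
    {
        my ($self, $author) = @_;
        $self->{_authors}->{ $author->get_id() } = $author;
        return;
    }

    sub authors
    {
        my ($self) = @_;
        return map { $self->{_authors}->{$_} } sort keys %{ $self->{_authors} };
    }
}

my $author = Sketch::Author->new(id => 'a1', name => 'A. Researcher');
my $paper  = Sketch::Publication->new(id => 'p1', title => 'An Example Paper');

# Record the relationship in both directions.
$author->add_publication($paper);
$paper->add_author($author);

print $_->title(), "\n" foreach $author->publications();
```

Objects that refer to each other in this way form reference cycles, which perl does not free automatically; this is one reason to remove both sides of a relationship explicitly when an object is discarded.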

The ResearchDirectory class has responsibility for querying both backend systems and then processing the results in order to construct the appropriate Author and Publication objects. Once the search is complete all that is required of the interface is to iterate over the appropriate set of objects and output the formatted information. For example, the code to produce the output for an author name search:

    my $research = Monash::ResearchDirectory->new( ... );

    if ($research->search('name' => $query))
    {
        foreach my $author ($research->authors())
        {
            print $author->name(), "\n";

            foreach my $publication ($author->publications())
            {
                print $publication->title(), "\n";
            }
        }
    }

The code to produce the output of a publication title search is strikingly similar to that for an author name search:

    my $research = Monash::ResearchDirectory->new( ... );

    if ($research->search('title' => $query))
    {
        foreach my $publication ($research->publications())
        {
            print $publication->title(), "\n";

            foreach my $author ($publication->authors())
            {
                if ($author->is_monash())
                {
                    print $author->name(), "\n";
                }
            }
        }
    }

An additional step required when presenting the details of a publication is that each author needs to be tested for whether they are a current staff member of the University. In this example only authors who are a current staff member of the University are printed. In the production version this test determines whether the author name is also a link to the author's profile.

In addition to what could be called the true search functionality, the search() method also gives the interface the ability to look up a specific author or publication by performing a search based upon the unique identifiers used within Research Master for authors and publications: 'cperson' and 'cref'.

An interesting aspect of how Research Master stores publication details is that there are actually more fields to store than there are fields in the database. This is due to the fact that the information required to describe a publication in one category can differ from the information required in another category. The majority of the fields are optional, but there are custom fields whose meaning depends on the category and which hold values such as the ISBN, the name of the conference, or the book editors. The solution was to retrieve the description of each field from the Research Master database along with the publication details and provide access through a generic method:

    foreach my $field ($publication->fields())
    {
        my ($name, $value) = $publication->field($field);

        if ($value)
        {
            print $name, "\t", $value, "\n"; 
        }
    }

    foreach my $author ($publication->authors())
    {
        print $author->name(), "\n";
    }
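As a rough illustration of the generic accessor described above, the sketch below stores a description and a value for each database column. The Sketch::Publication package, the column names custom1 and custom2, and the sample values are all hypothetical; the real class obtains the descriptions from Research Master.

```perl
use strict;
use warnings;

{
    package Sketch::Publication;

    # 'labels' maps a column name to its description for this category;
    # 'values' maps the same column name to the stored value.
    sub new
    {
        my ($class, %args) = @_;
        return bless { _labels => $args{labels}, _values => $args{values} }, $class;
    }

    # The column names that have a description.
    sub fields
    {
        my ($self) = @_;
        return sort keys %{ $self->{_labels} };
    }

    # Returns the (description, value) pair for one column.
    sub field
    {
        my ($self, $field) = @_;
        return ($self->{_labels}->{$field}, $self->{_values}->{$field});
    }
}

my $book = Sketch::Publication->new(
    labels => { custom1 => 'ISBN', custom2 => 'Editors' },
    values => { custom1 => '0-000-00000-0', custom2 => undef },
);

# Collect only the fields that actually hold a value.
my @lines;
foreach my $field ($book->fields())
{
    my ($name, $value) = $book->field($field);
    push @lines, "$name\t$value" if defined $value && length $value;
}

print "$_\n" foreach @lines;
```

Because the interface never names a field directly, the same loop can render a book, a journal article or a conference paper without category-specific code.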

The interface on an author object is even simpler than that on a publication object, as apart from methods for the parts of the name, the Research Master unique person id, and the related publications, all the details are returned in a single hash reference:

    format_ldap_entry(
        'entry' => $author->ldap_entry(),
    );

    foreach my $publication ($author->publications())
    {
        print $publication->title(), "\n";
    }

This was done as there was an existing component within the portal to format a user's LDAP entry with the appropriate look and feel. A function was created in the perl CGI version to duplicate the portal functionality.

It should be noted that the design of the system is actually language agnostic as these classes could be implemented in any object oriented language that has libraries for querying Oracle databases and LDAP services.

Implementation

At first the logic required to query both data sources and process the results appeared to be a daunting prospect. However, once broken down into the component parts, it is fairly straightforward.

For a search against a publication title the first action is to transform the search argument into a pattern suitable for use in an SQL LIKE condition and then perform the query. Due to performance issues at the time it was determined that a single query that gathered all the required information was the best solution. However, this does mean that some information is repeated multiple times. For example, the complete details of a publication are repeated against each author of the publication. To allow for this the Author and Publication objects are only created the first time they are required during processing:

    while (my $row = $sth->fetchrow_hashref('NAME_lc'))
    {
        my $author      = $self->_find_or_create_author($row);
        my $publication = $self->_find_or_create_publication($row);

        $author->add_publication($publication);
        $publication->add_author($author);
    }
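The find-or-create step can be sketched with plain hashes in place of objects. The rows below are hypothetical stand-ins for what fetchrow_hashref might return; 'cperson' and 'cref' are the Research Master identifiers mentioned earlier, while the 'title' and 'surname' columns and the find_or_create() helper are assumptions made for illustration.

```perl
use strict;
use warnings;

# One row per (publication, author) pair, so details repeat across rows.
my @rows = (
    { cref => 'p1', title => 'First Paper',  cperson => 'a1', surname => 'Smith' },
    { cref => 'p1', title => 'First Paper',  cperson => 'a2', surname => 'Jones' },
    { cref => 'p2', title => 'Second Paper', cperson => 'a1', surname => 'Smith' },
);

my (%authors, %publications);
my $created = 0;

# Return the cached record for $id, creating it only on first sight.
sub find_or_create
{
    my ($cache, $id, $record) = @_;

    if (!exists $cache->{$id})
    {
        $cache->{$id} = { %{$record}, related => {} };
        $created++;
    }

    return $cache->{$id};
}

foreach my $row (@rows)
{
    my $author      = find_or_create(\%authors,      $row->{cperson}, { surname => $row->{surname} });
    my $publication = find_or_create(\%publications, $row->{cref},    { title   => $row->{title} });

    # Link the two records in both directions.
    $author->{related}->{ $row->{cref} }         = $publication;
    $publication->{related}->{ $row->{cperson} } = $author;
}

printf "%d rows produced %d authors and %d publications\n",
    scalar @rows, scalar(keys %authors), scalar(keys %publications);
```

Keying the caches on the unique identifiers means that the repeated details in later rows are simply ignored.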

At this point there now exists a set of authors that relates to another set of publications. In order to obtain the up to date information about the authors an LDAP filter is constructed for use in querying the Monash Directory Service:

    my @employeenumbers = map { $_->employeenumber() || () } $self->authors();

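    # The parentheses around the arguments to join are required; without
    # them the closing q{)} would be concatenated to @employeenumbers
    # before the map is evaluated.
    my $ldap_filter
        = q{(|}
        . join( q{}, map { qq{(employeenumber=$_)} } @employeenumbers )
        . q{)}
        ;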
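Where the values placed in a filter cannot be guaranteed to be plain digits, it may be worth escaping them first. The sketch below is not taken from the Research Directory; escape_filter_value() and build_or_filter() are hypothetical helpers that escape the characters reserved by RFC 4515 and build the same style of OR filter.

```perl
use strict;
use warnings;

# Escape the characters that RFC 4515 reserves within a filter value.
sub escape_filter_value
{
    my ($value) = @_;
    $value =~ s/([\\*()\0])/sprintf '\\%02x', ord $1/ge;
    return $value;
}

# Build "(|(attr=v1)(attr=v2)...)"; returns nothing for an empty list.
sub build_or_filter
{
    my ($attribute, @values) = @_;
    return if !@values;

    my $filter = '(|';
    foreach my $value (@values)
    {
        $filter .= sprintf '(%s=%s)', $attribute, escape_filter_value($value);
    }
    $filter .= ')';

    return $filter;
}

my $ldap_filter = build_or_filter('employeenumber', '1001', '1002');
print "$ldap_filter\n";
```

Returning nothing for an empty list lets the caller skip the directory query altogether rather than send an empty OR filter.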

Results from the Monash Directory Service are then attached to the appropriate author object:

    foreach my $author ($self->authors())
    {
        if (my $entry = $self->_get_ldap_entry($author->employeenumber()))
        {
            $author->set_ldap_entry($entry);
        }
    }

It is now possible to remove the publications which do not have at least one current staff member of the University from the results:

    foreach my $publication ($self->publications())
    {
        unless (grep { $_->is_monash() } $publication->authors())
        {
            $self->destroy_publication($publication);
        }
    }

The process of destroying a publication is to remove the relationships it has with authors and then remove the object itself from the internal list:

    foreach my $author ($publication->authors())
    {
        $publication->remove_author($author);
        $author->remove_publication($publication);
    }

    delete $self->{'_publications'}->{ $publication->get_id() };

As with a number of sections of code within the Research Directory, the author related variant is merely a reversal of the publication variant:

    foreach my $publication ($author->publications())
    {
        $author->remove_publication($publication);
        $publication->remove_author($author);
    }

    delete $self->{'_authors'}->{ $author->get_id() };

The final processing step is to remove the authors in the results who no longer have any publications:

    foreach my $author ($self->authors())
    {
        unless ($author->publications())
        {
            $self->destroy_author($author);
        }
    }
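The two pruning passes can be seen end to end in a small sketch that uses plain hashes rather than objects. The data is hypothetical: %publications maps a publication id to its author ids, and %is_current marks the authors who were found in the Monash Directory Service.

```perl
use strict;
use warnings;

my %publications = (
    p1 => [ 'a1', 'a2' ],
    p2 => [ 'a2' ],
    p3 => [ 'a3' ],
);
my %is_current = ( a1 => 1, a2 => 0, a3 => 0 );

# Pass 1: drop publications without at least one current staff member.
foreach my $id (keys %publications)
{
    delete $publications{$id}
        unless grep { $is_current{$_} } @{ $publications{$id} };
}

# Pass 2: keep only the authors who still have a publication.
my %remaining_authors;
foreach my $id (keys %publications)
{
    $remaining_authors{$_} = 1 foreach @{ $publications{$id} };
}

print join(', ', sort keys %publications), "\n";
print join(', ', sort keys %remaining_authors), "\n";
```

Note that author a2 survives even though they are not a current staff member, because they co-authored a publication with one; this is why the interface still tests is_monash() for each author it prints.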

The process for looking up a specific author or publication is the same as that for searching for a publication, except that the search condition is an exact match on the unique Research Master id. However, a search for an author based upon their name first queries the Monash Directory Service and uses the results to build a query against Research Master. This is done because the author information in Research Master is not automatically updated, as it is in the Monash Directory Service, so a name search made directly against Research Master could fail to find the desired results.
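One way to turn the directory results into a Research Master query is to build a parameterised IN condition from the employee numbers. The sketch below is an assumption about how this could be done rather than the production code; build_author_query() and the table and column names are hypothetical.

```perl
use strict;
use warnings;

# Build a query with one placeholder per employee number.
# Returns the SQL followed by the bind values, or nothing for an empty list.
sub build_author_query
{
    my (@employeenumbers) = @_;
    return if !@employeenumbers;

    my $placeholders = join( ', ', ('?') x scalar @employeenumbers );
    my $sql = "SELECT * FROM personnel WHERE employee_number IN ($placeholders)";

    return ($sql, @employeenumbers);
}

my ($sql, @bind) = build_author_query('1001', '1002', '1003');
print "$sql\n";

# With DBI the statement would then be prepared and executed, for example:
#     my $sth = $dbh->prepare($sql);
#     $sth->execute(@bind);
```

Using placeholders keeps the values out of the SQL text, so the employee numbers returned by the directory never need to be quoted by hand.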

At no point does the set of Author and Publication objects in existence represent the complete Research Directory. As well as being impractical due to the sheer number of publications entered into Research Master, this is also unnecessary, as the mechanism described above will instantiate the appropriate objects to produce:

A list of publications and their authors.
A list of authors and their publications.
A publication and its authors.
An author and their publications.

Representing complicated scientific formulae in HTML

Some time after both interfaces had been deployed to production we were made aware of an issue regarding the titles of certain publications. In one case the title that was being displayed was:

2]

Obviously this is not correct. It was brought to our attention that titles may be stored within Research Master as a Rich Text Format (RTF) string if the title cannot be represented by a plain text string. The correct title that should be displayed in this situation is represented by:

{\rtf1\ansi\deff0{\fonttbl{\f0\fswiss Arial;}{\f1\fnil\fcharset2 Symbol;}} \viewkind4\uc1\pard\lang1033\f0\fs24 2] \fs18 Unprecedented \f1\fs24 m-h\up5\fs14 2:\up0\fs24 h\up5\fs14 2\up0\f0\fs18 - pyrazolate coordination in [\{Yb(\f1\fs24 h\up5\f0\fs14 2\up0\fs18 - \f1\fs24\'a6\f0\fs18 Bu\dn5\fs14 2\up0\fs18 pz)(\f1\fs24 m\f0\fs18 -\f1\fs24 h\up5\f0\fs14 2\up0\fs18 :\f1\fs24 h\up5\f0\fs14 2\up0\fs18 -\f1\fs24\'a6\f0\fs18 Bu\dn5\fs14 2\up0\fs18 pz)(thf)\}\dn5\fs14 2\up0\fs18 ] \par }

This translates into:

[Figure: the title as rendered from the RTF]

Representing this title in plain text or even HTML is not a trivial exercise. Fortunately the RTF::HTML::Converter perl module is able to convert the RTF into:

2] Unprecedented m-h2:h2- pyrazolate coordination in [{Yb(h2 - ¦Bu2pz)(m-h2:h2 -¦Bu2pz)(thf)}2]

While this is not the exact title of the publication, it was sufficiently close to be accepted by the client as a satisfactory solution.

In order to preserve the interface to the Publication class, code was added to the class that decodes the RTF version of the title, if one is present in Research Master, and uses it instead of the plain text title.
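The fallback logic can be sketched as follows. This is an illustration under stated assumptions and not the code added to the Publication class: display_title() is a hypothetical helper, and the decoder is passed in as a code reference so that the example does not depend on any module being installed. In production the decoder would wrap RTF::HTML::Converter; the stub used here merely strips control words and braces.

```perl
use strict;
use warnings;

# Prefer the decoded RTF title when one is stored, otherwise fall back
# to the plain text title.
sub display_title
{
    my ($plain_title, $rtf_title, $decode) = @_;

    if (defined $rtf_title && $rtf_title =~ m/\A\s*\{\\rtf/)
    {
        my $decoded = $decode->($rtf_title);
        return $decoded if defined $decoded && length $decoded;
    }

    return $plain_title;
}

# Stand-in decoder for illustration only; it is not a real RTF parser.
my $stub_decoder = sub
{
    my ($rtf) = @_;
    $rtf =~ s/\\[a-z]+-?\d*\s?//g;    # control words such as \fs24
    $rtf =~ s/[{}]//g;                # group delimiters
    $rtf =~ s/\A\s+|\s+\z//g;         # surrounding whitespace
    return $rtf;
};

my $title = display_title('2]', '{\rtf1\ansi Full title\par }', $stub_decoder);
print "$title\n";
```

Because the decision is made inside a single method, callers continue to ask for the title in the same way whether or not an RTF version exists.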

Conclusion

The Monash Research Directory is an example of how perl can be used to draw information from two sources, one a commercial application, and present the information, after appropriate processing, in two similar but distinct environments. This is even more significant when it is taken into account that, apart from standard perl constructs, the only additional modules required were the publicly available DBI, DBD::Oracle, Net::LDAP and RTF::HTML::Converter.

The public interface to the Monash Research Directory can be found at http://monash.edu/research/directory/.