Thursday, 28 June 2007

Mats Helander, whom I have had the pleasure of meeting in person several times, wrote about an O/RM challenge on his blog.

While it is always fun to participate in challenges, I want to criticize the problem Mats describes before I show how you can solve it with Genome.

The challenge only concerns how efficiently an O/RM can read a set of whole tables from the database. This does not make sense for two reasons:

  1. Usually, you don’t want to load all the data from a database into memory (that’s one of the reasons why you use a database).
  2. In the special cases where you do cache whole tables from a database (e.g. some lookup data), the cache is filled very rarely (e.g. once a day), so the efficiency of loading the data is of little importance.

Mats stipulates that the O/RM must not join the related objects when loading them from the database in order to find out about their relationships. Instead, the O/RM should load all objects at once and “discover” the relations between them afterwards on its own (without using a JOIN in the database). This results in three SELECT statements (SELECT * FROM Customers; SELECT * FROM Orders; SELECT * FROM OrderLines).
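
To make the requirement concrete, here is a minimal sketch in plain C# of what “discovering” the relations in memory amounts to. It does not use Genome and is not how any particular O/RM implements it. The Customer, Order and OrderLine classes, their fields and the Stitcher.Stitch helper are hypothetical names chosen for illustration, and the three lists stand in for the results of the three SELECT statements.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical classes; the *Id fields mirror primary and foreign key columns.
public class Customer
{
    public int Id;
    public List<Order> Orders = new List<Order>();
}

public class Order
{
    public int Id;
    public int CustomerId;
    public Customer Customer;
    public List<OrderLine> Lines = new List<OrderLine>();
}

public class OrderLine
{
    public int Id;
    public int OrderId;
    public Order Order;
}

public static class Stitcher
{
    // Wires up parent and child references purely in memory, without a JOIN.
    public static void Stitch(List<Customer> customers, List<Order> orders, List<OrderLine> lines)
    {
        // Primary key maps, built in one pass over each parent table.
        Dictionary<int, Customer> customersById = customers.ToDictionary(c => c.Id);
        Dictionary<int, Order> ordersById = orders.ToDictionary(o => o.Id);

        foreach (Order order in orders)
        {
            Customer customer;
            if (customersById.TryGetValue(order.CustomerId, out customer))
            {
                order.Customer = customer;
                customer.Orders.Add(order);
            }
        }

        foreach (OrderLine line in lines)
        {
            Order order;
            if (ordersById.TryGetValue(line.OrderId, out order))
            {
                line.Order = order;
                order.Lines.Add(line);
            }
        }
    }
}
```

Each table is scanned once and every child is attached with a single dictionary lookup, so the work grows linearly with the number of rows loaded.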

An O/RM usually maintains only an identity map to cache object lookups by key. This allows object references that are mapped through foreign key fields in the database to be followed without extra database roundtrips (provided that the related objects are already loaded into memory). This means that following an Order to its related Customer works in memory if all data has been loaded. To discover the Orders belonging to a Customer, however, the O/RM needs to perform a lookup query.
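
The idea of an identity map can be sketched in a few lines of plain C#. This is a simplified illustration, not Genome’s implementation: the IdentityMap class, its members and the loadFromDatabase delegate are hypothetical names, and the delegate merely stands in for a database roundtrip.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified identity map: at most one instance per key.
public class IdentityMap<TKey, TEntity> where TEntity : class
{
    private readonly Dictionary<TKey, TEntity> entities = new Dictionary<TKey, TEntity>();

    // Counts how often the loader (standing in for a database roundtrip) ran.
    public int Roundtrips { get; private set; }

    // Registers an instance that was materialised by an earlier query.
    public void Register(TKey key, TEntity entity)
    {
        entities[key] = entity;
    }

    // Resolves a key to an instance; the loader is only called on a miss.
    public TEntity Get(TKey key, Func<TKey, TEntity> loadFromDatabase)
    {
        TEntity entity;
        if (!entities.TryGetValue(key, out entity))
        {
            Roundtrips++;
            entity = loadFromDatabase(key);
            entities[key] = entity;
        }
        return entity;
    }
}
```

Following an Order to its Customer is a lookup by the Customer’s key, which such a map can answer directly. The opposite direction asks for all Orders with a given foreign key value, which a map keyed by primary key cannot answer without scanning or an additional index.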

Some O/RMs, including Genome, allow collections to be preloaded in order to avoid unnecessary roundtrips when traversing object graphs deeply. So you can tell the O/RM to preload all the Orders of the Customers retrieved, and to preload all the OrderLines of the Orders retrieved. In this case, the O/RM builds a map for relating the objects in memory while loading up the data.

Usually you only want to load the children related to the parent rows you actually retrieved. It doesn’t make sense to load all orders from the database only to populate the orders of three specific customers. To ensure this, an O/RM typically JOINs the related data to the filtered parent table.

Not filtering the parent table is a very special case. Introducing an optimisation for this case is possible, but would make no sense (for the reasons given above). Besides that, I wonder how large the loaded tables would have to be before the additional JOIN makes a significant difference and the optimisation pays off at all. I guess that with tables of that size, caching the results in memory, which is the premise of the scenario, is out of the question anyway. Another drawback of this optimisation is that it can quickly become less efficient when the parent reference is nullable, because unnecessary data is then loaded as well.

Still, this is a challenge and a lot of people interested in O/RM read it; so, let’s solve it with Genome.

Genome provides two infrastructures for retrieving and caching relations: collections and indexing.

The collection infrastructure provides rich support for handling specialised relation types such as 1:n and n:m relations. Normally I would recommend Genome’s collection mapping feature for a scenario like Mats’, but Genome uses a JOIN to limit the related objects loaded from the database, which the challenge does not allow.

Indexing is a Genome infrastructure that automatically detects even complex relationships, based on the loaded data. It is more complex to configure, use and maintain, but it can support Mats’ exotic scenario. Having mapped the business layer with Genome, the following three lines of code will do the trick:

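using (Context.Push(LocalContext.Create()))
{
    // Load all OrderDetails and fill the index for Order->OrderDetail.
    IndexManager.FillIndex(Context.Current, dd.Extent<OrderDetail>(),
                           IndexManager.GetIndex(dd.Schema, typeof(OrderDetail), "IdxOrder"));

    // Load all Orders and fill the index for Customer->Order.
    IndexManager.FillIndex(Context.Current, dd.Extent<Order>(),
                           IndexManager.GetIndex(dd.Schema, typeof(Order), "IdxCustomer"));

    // Load all Customers.
    Set<Customer> customers = dd.Extent<Customer>().ToArray();

    // Traverse the object graph; relationships are resolved from the indexes.
    Dump(customers);
}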

Inside the using block, the first two statements load all OrderDetails and all Orders. Additionally, they saturate the indexes for the relationships Order->OrderDetail and Customer->Order. The third statement loads all customers. When Dump(customers) traverses the object graph, all relationships are served from memory.

Note that this feature is not limited to simple 1:n and n:m relationships. It works for more complex relationships as well, such as retrieving the pending orders of a customer.
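
To illustrate what an index over a more complex relationship could look like, here is a rough sketch in plain C#. It is not Genome’s indexing infrastructure: the RelationIndex class, its Fill and Lookup methods and the IsPending field are assumptions made purely for this example. The index groups items by a key and only admits items that satisfy a condition, so “pending orders of a customer” becomes a single in-memory lookup.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical order class; IsPending is an assumed field for this example.
public class Order
{
    public int Id;
    public int CustomerId;
    public bool IsPending;
}

// Hypothetical in-memory index: groups items by key, filtered by a condition.
public class RelationIndex<TKey, TItem>
{
    private readonly Func<TItem, TKey> keyOf;
    private readonly Func<TItem, bool> belongs;
    private readonly Dictionary<TKey, List<TItem>> buckets = new Dictionary<TKey, List<TItem>>();

    public RelationIndex(Func<TItem, TKey> keyOf, Func<TItem, bool> belongs)
    {
        this.keyOf = keyOf;
        this.belongs = belongs;
    }

    // Adds every item that satisfies the condition to the bucket of its key.
    public void Fill(IEnumerable<TItem> items)
    {
        foreach (TItem item in items)
        {
            if (!belongs(item)) continue;

            TKey key = keyOf(item);
            List<TItem> bucket;
            if (!buckets.TryGetValue(key, out bucket))
            {
                bucket = new List<TItem>();
                buckets.Add(key, bucket);
            }
            bucket.Add(item);
        }
    }

    // Returns the indexed items for a key, or an empty list if there are none.
    public IList<TItem> Lookup(TKey key)
    {
        List<TItem> bucket;
        return buckets.TryGetValue(key, out bucket) ? bucket : new List<TItem>();
    }
}
```

With this sketch, an index created as new RelationIndex<int, Order>(o => o.CustomerId, o => o.IsPending) and filled once with all loaded orders answers Lookup(customerId) without touching the database.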

Posted by Chris
