TravisSwicegood.com

20 November

Repository Pattern in PHP

When I worked with Josh Eichorn a couple of years ago, we were designing the framework that our company's software was built on. I wanted to move more toward a Repository style architecture instead of interacting directly with the database via streamlined a DAO. Josh argued at the time that dealing directly with the database was a better option. It removed a layer of complexity and he could go to the bookstore down the street and pick one of a few hundred books on SQL off of the shelf and give it to someone new to bring them up-to-speed on how to access the system. As an added bonus, you have all of the power of SQL at your fingertips whenever you need to do something complex. His argument won me over, and until this past week I've viewed that as probably the most sane way to go.

The problem with this method isn't so much with the exposed SQL. If that was the only method of access, I would be fine with it. Programmers are a lazy bunch, however, and one of us always ends up adding a findById($id) method and it's off to the races. Next thing you know, there's helpers for all sorts of queries, with overrides in the form of optional parameters, calls to other helpers that generate an associative array that's used by the method to build a query that's sent to the data access layer and... I could go on. Now, what started out as the simplest solution has morphed into this unrecognizable mishmash of SQL, code and convention.

So, my new motto:

All SQL should be hidden to protect developers from themselves and allow someone else who knows a bit about mildly complex SQL statements to build the queries as needed. - Travis Swicegood, November 2007, motto subject to change...

Here's the problem. You have widgets, each widget has a creator, and N users assigned to it. Your query must return all widgets that the current user is either the creator of, or one of the users assigned to it. If you've ever done an OR query across multiple tables, you've already seen the problem when you imagined the SQL that would normally be generated for that query.

SELECT
    widgets.*
FROM
    widgets
    LEFT JOIN widgets_users USING(widget_id)
WHERE
    widgets.creator_id = :current_user_id
    OR widgets_users.user_id = :current_user_id

Now I've just created a full table scan of widgets_users. A system with a few hundred widgets and a few hundred users and each widget with a few users assigned to it, it's not unreasonable to expect that widgets_users table to end up the ten or twenty thousand records quickly. Add more widgets and more users, the problem keeps getting worse.

In order for that data to be retrieved at a reasonable speed, the best solution I found was to create a derived table from two select statements that are unioned, each containing their half of the or criteria:

SELECT
    widgets.*
FROM
    (
        SELECT * FROM widgets WHERE creator_id = :current_user_id
        UNION
        SELECT 
            widgets.*
        FROM
            widgets
            INNER JOIN widgets_users USING(widget_id)
        WHERE
            widgets_users.user_id = :current_user_id
    ) as widgets

Raise your hand if that was the first query that popped into your head when you thought "any widget that a user created or has been assigned". It wasn't mine either, but that's the one that executed with the least amount of overhead on some work I did last week. In the straight SQL codebase, you change the query and move on, but in the second, more common, helper-oriented codebase that I described earlier, you more often than not end up with code that looks like this:

$model = new Widget();
$data = $model->getAll(
    $current_user, 
    array(
        'extra_tables' => 'LEFT JOIN widgets_users USING (widget_id)',
        'extra_criteria' => 'OR widgets_users.user_id = :current_user_id',
    )
);
echo new GenericListView($data);

That last line was just wishful thinking on my part. ;-) Most of the time it's assigned to something like Smarty to do the templating instead of letting code to it. But I digress, the problem with this half and half method is that now I have to touch multiple areas to get the change in the query that I want. That's if there's enough flex points in the code above to allow me to inject my own implementation as the query is being built. In the above code, so many assumptions are being made that you almost have to scrap the existing behavior and start fresh with a new method to get what you want.

The Repository pattern helps remove this problem. All developer interaction is at the meta level, they could care less where the data is coming from.

$repo = new Repository();
$criteria = new Criteria('Widget');
$criteria->addFilter(
    new Criteria_Filter(Widget::creator, $current_user),
    new Criteria_Filter(Widget_Users::user, $current_user)
);
$widgets = $repo->matching($criteria);

echo new GenericListView($widgets);

Now my Repository object gets to build the entire query and return the data. Even if it doesn't have the flexibility to inject custom query templates directly into the system, my changes are much more localized and hidden behind the firewall of the Repository object. Assuming I create a lot of list views, it could even be decorated to simplify the code:

$criteria = new Criteria_CreatorOrAssignedUser($current_user);
$repo = new WidgetRepository();
$widgets = $repo->matching($criteria);
echo new GenericListView($widgets);

This could be achieved by adding helpers in to the straight SQL model making this approach seem like the logical conclusion to adding helper methods and objects as you move from a straight SQL model to the helper mishmash, but there is a distinct difference between the two. It is subtle, and 100% perception, but there nonetheless. With the helper model developers are thinking in SQL - i.e., I'm hiding the query that's being built. With the Repository model, developers are thinking in code - i.e., I'm working on this dataset, wherever it comes from.

In the above code, there is a specialized, domain-specific criteria and decorated the repository so it automatically assumes we're after a Widet, but now all that layer is expected to do is query whatever data source it can and retrieve the data. All my client code cares about is that it has a list to work with, all my Repository code cares about is that it can create that list.

Update: When I finished this article Sunday night I realized that I had almost argued myself in a circle and decided to hold off the final edit. As a I edited it this, I realized that a properly separated SQL based can still work, but 9 times out of 10, it doesn't because the layers get mixed up. If the Smarty evangelists are to be believed, programmers can't be trusted with themselves. In that case, SQL with helpers can't be done properly and the Repository is the last refuge for those of us who want to make sure there's a proper split between the data/storage layer and the domain layer.

10 comments

I use something similar to this in the framework I use. It's probably not an entire implementation of the solution, but we use PostgreSQL views (which are also available in MySQL5 if I'm not mistaken) to do queries that span multiple tables and put them into Model classes (WidgetsModel extends Model, etc) and access them as class variables with a few methods for finding and adding columns (like COUNT(*)) and where clauses as needed. Then, if the query needs to be changed or optimized, you just update the view on the PostgreSQL server and you're off to the races. Just to keep my sanity, since it makes code portability suck, I put the full query to update the view in comments in the Model so I can see it if I need to.

I just think that having all those methods to construct all the parts of your query makes for unreadable and messy code and can make for some ugly hacks.
You could always profile your SQL and rewrite it to return the same results in a more efficient manner. :)

@Christine: exactly! :-) The average middle of the road developer doesn't know how to do that kind of magic though. If you hide all data mapping behind a Repository, it allows guys who do know a thing or two about the SQL engine you're using to take advantage of its features.

@Adnan: Yes, you could, so long as you have full access to the raw SQL and/or can use raw SQL. That's the problem with the SQL mishmash model though, its rarely designed to drop in completely alternate SQL queries instead of attempting to build them. With the Repository pattern, you hide all SQL generation so optimization can happen. The average developer, at least from my experience, doesn't even know how to begin profiling their queries, much less how to go about optimizing them.
First query that came to mind for me was a UNION (no join / subselect) but that is only because I've been bitten by the OR problem more times than I would like to admit. :)

One more benefit I have been thinking of what you are talking about is the ability to reduce code duplication in your SQL queries. I don't know how many reports we have in our product (loan management software) that report over alot of the same data with a few subtle differences between reports. With a model somewhat similar to yours the core of the reports (mostly joins and always needed columns) can be implemented in one class and individual reports can extend and add the group bys, order bys, havings, additional columns / joins, etc.

I am sure you know the pain I speak of when you have half a dozen large queries that work on much of the same data and you decide to change one small aspect of these reports and wind up forgetting about that OTHER report. :P
I would agree that all the lame SQL helpers make things pretty horrible. Your almost writing sql but not really, its about the worst possible place on the continuum to end up. The queries as slow as possible and hard to profile and rewrite.

Oh and naming wise I don't like repository. Its way to generic.
@Josh: Way too generic is the point of a pattern name ;-) You can the class that implements the pattern whatever you like. If you're into naming classes for what they do, a suitable name would be ProtectMeFromMyselfOlWiseKeeperOfData.

@Mike: The power of the union is that I get a full list of data back so I can sort and paginate it.
oh man, I think I have an idea of where this post originates from, and I agree with the pain in trying to improve the retrieval of widgets in a system.
just as separation of UI and layouts (smarty templates) and php code is more commonplace, so should the php be shielded from complex and optimized sql queries and code. you have good thoughts and are taking good steps towards that path too!
About SQL profiling

MySQL queries can easily be profiled with the built-in query logger, with MySQL 5.1.21+ (or a patched 5.0) even with microseconds granularity.

Just run the application as normal and filter the Slow Query Log afterwards with an easy though customizable script:
php mysql_filter_slow_log.php --exclude-user=root --no-duplicates linux-slow.log > mysql-slow-queries.log

Activate the Slow Query Log in my.ini:
log-slow-queries
# Log only those queries that run 0.3s or more (value given in microseconds)
long_query_time=300000
# Additionally log all queries which do not use indexes
log-queries-not-using-indexes
# MySQL 5.1.6 through 5.1.20 had a default value of log-output=TABLE, so you should force
log-output=FILE


http://code.google.com/p/mysql-log-filter/
To follow up on what was said by Christine in the first post, it is much simpler to use View, Procedures and Triggers (all are now supported with MySQL 5.x). The future is to rely upon the database for the data model and then use simple programming abstracts to generate the code.

When you want to optimize you can use the slow query log (which you should be generating anyway) to find out the SQL which are causing problems.

IMHO using Procedures, Triggers and Views takes away the entire purpose for creating this pattern.
There is a great example of the repository pattern here ( watch part 2 ):

http://blog.wekeroad.com/mvc-storefront/

It's .net MVC but you get the idea, it also uses LINQ which looks very cool. The general idea they give is to move everything down to the most basic interfaces, keeping the interfaces very simple! Also in part3 they go through the pipes and filters pattern which works brilliantly. I just need to convert all that to PHP now hehe :)

Leave a comment


Your email address will not be revealed on this site.

Your URL will be displayed.
(Line breaks become <br />)
(Name, email & website)
(Allow users to contact you through a message form (your email will not be revealed.)