Things I learned implementing an ORM for MySQL

I finally got a break from all of the overtime at my day job so I used some of my new-found free time to implement an ORM for MySQL. This is part of the Tortuga Chain series of ORMs with the same goal of exposing the database’s capabilities rather than being a “lowest common denominator” like more ORMs.

That said, one still needs to implement the LCD features before moving onto the more interesting things.

Databases and Schemas

In most databases that I work with, you get one server per physical machine. Within that machine you have more or less isolated databases, which are then subdivided into schema (think namespaces) and finally objects (tables and views).

MySQL treats “database” and “schema” as synonyms. Which means that you have an important question when implementing a MySQL database: do you sacrifice the concept of namespaces or the concept of isolation?

  • Losing isolation means that you cannot do things like place your DEV and QA databases on the same server.
  • Losing namespaces means that you cram everything together, making it difficult to navigate the database and understand the purpose of tables. I often see this mistake in SQL Server, where everything is dumped into “dbo”.

This matters for the ORM because of its reflection capabilities. When someone asks for a list of all tables in a database, do they mean “All tables for this database/schema” or “All tables for all schema”? The answer depends on which choice, namespaces or isolation, they made when they designed the database.

Ultimately, we decided to go with returning all non-system schema with the idea that the programmer can filter by schema when necessary.

Parameters in Stored procedures

Parameters in MySQL are a little strange in that they don’t have a prefix. Consider this query:

SELECT Name, Birthday FROM Students WHERE State = @State;

The obvious translation to a stored procedure would be:

CREATE PROCEDURE GetBirthdaysByState (@State CHAR(2))

BEGIN

SELECT Name, Birthday FROM Students WHERE State = @State;

END

This doesn’t work because MySQL because parameters don’t have prefixes. So we try this:

CREATE PROCEDURE GetBirthdaysByState (State CHAR(2))

BEGIN

SELECT Name, Birthday FROM Students WHERE Students.State = State;

END

But this doesn’t work either because “ = State” is ambiguous. So a common habit is to use an abbreviation.

CREATE PROCEDURE GetBirthdaysByState (St CHAR(2))

Aside from the obvious inconsistency in naming, this causes problems with the ORM. Consider these call chains.

var studentSearch = new SearchDTO { State = "CA");

var results = dataSource.Sql( sqlStatement, studentSearch )…

var results = dataSource.Procedure( "GetBirthdaysByState",
new {St = studentSearch.State } )…

You have to ‘leak’ the non-standard abbreviations into your public API or map between the filter DTO and the real parameter names. Neither is particularly pleasant.

Another option would be to decide on a standard prefix. For example, in one project I always used “p_” for parameter names.

CREATE PROCEDURE GetBirthdaysByState (p_State CHAR(2))

At first glance, this looks like it is just as bad as the abbreviations. But this gives us a trick.

dataSource = new MySqlDataSource (connectionString,
new MySqlSettings { parameterPrefix = "p_"} );

Now the ORM understands that if a parameter starts with “p_” that it should look for a matching property without the prefix.

Though this does raise the question, “Do we support multiple prefixes?”. Maybe some procedures used “p_” and some used “param_” and some use just “p”. Obviously, no one would set out to do this, but over time databases coding conventions tend to drift.

See ticket 228 on Github for the current status of prefix support.

Random Sampling

I’m sure this seems odd to most of you. Why would an ORM need the ability to randomly select rows? Actually, there are actually several answers for that.

The first is simply stress and smoke testing an application. In these tests, a random selection of keys is generated and then fed into the application to ensure that it can read a variety of records. This sort of test is great at find bugs in larger systems where the code makes incorrect assumptions about the data.

Another use is for database analysis tools. Tortuga Drydock, another work in progress, samples records to get a better understanding about how the tables are being used. This leads to recommendations such as marking a column as non-nullable or unique.

MySQL doesn’t natively support random sampling. You can fake it by using “ORDER BY RNAD()”, but this has an important limitation. The way it works is that every row is given a random number, then the whole table is sorted in memory, after which the top N rows are selected. This makes it unsuitable for very large tables, which is exactly when you would want to use random row selections.

Side note: Using ORDER BY RAND doesn’t work in every database. For example, if you ask SQL Server to generate a random number, it will generate a single random number to be shared by every row.

No OUTPUT/RETURNING Clauses

In you aren’t familiar with the term, the OUTPUT or RETURNING clause is something you can use with INSERT, UPDATE, and DELETE statements. It basically gives you access to the previous and new version of the row with an atomic operation.

Tortuga Chain’s SQL generator heavily leverages this feature to support scenarios such as:

  • Returning a newly created row
  • Returning the previous version of an updated row
  • Returning the new version of an updated row
  • Returning a newly deleted row

Since MySQL doesn’t support this, the SQL generator has to build up multiple statements. This offers poorer performance and if it runs outside of a transaction, it could lead to inconsistencies.

Writing this code wasn’t actually too much of a burden because we already wrote it for SQL Server. Even though SQL Server supports and OUTPUT clause, that fails if the table has a trigger on it. So our SQL Server implementation has both an OUTPUT clause and a multi-statement code path. (Database reflection is used to identify tables with triggers.)

Real Upset Support

No ORM can be considered complete if it doesn’t support the humble upsert or combined insert/update operation. The ability to tell the database to “create this record if it doesn’t exist, otherwise update” is just too useful to ignore.

There are two basic cases for this: identity (a.k.a. auto-number) and non-identity column. For tables with an identity column, the ORM can simply emit an insert or update depending on whether the primary key is null or not. For non-identity primary keys, some database help is needed.

In our first database, SQL Server, this turned out to be rather painful to write. It involved a complicated MERGE statement, which can be problematic because that statement isn’t ATOMIC in SQL Server.

For MySQL, things were much easier. One simply needs to add “ON DUPLICATE KEY UPDATE” to the INSERT statement.

SQL Standards

MySQL gets a lot of flack for not adhering to ANSI SQL standards. And for people who write a lot of stored procedures, I can understand their pain. But as an ORM writer, it really isn’t a big deal. There were some special cases that you saw above, but for the vast majority of the operations we support were easily handled by the generic SQL generator.

MySQL support is scheduled for Tortuga Chain 2.0, but may be pushed back slightly depending on how the testing goes. There is an alternate driver for MySQL that I want to try out. Depending on the results, I may support both it and the official version from Oracle.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s