A couple weeks ago, I read the article, LINQ to SQL Extension: Batch Deletion with Lambda Expression by Jeffrey Zhao. In case you didn't read the article, it discusses the downside of most O/R Mapping frameworks when it comes to multiple updates or deletes. He states the fact that a SQL statement for each row flagged as update/delete in the entity set is created. I went about implementing something similar to what Jeffrey envisioned and I'll explain some of the hurdles I had to overcome to achieve it.
[UPDATE: I've made a code fix and republished the code. To read the reason for the fix, you can read a more recent post here upon completing this post]
Since I recently converted my site to blogengine.net, I'll use its database schema for all examples to follow. In addition to this, all code snippets you'll see were run in LINQPad, which at the moment is by far my favorite development tool that I'm using (now if they'd only get intellisense in there, I'd play with this program all day long). If I were to include a review of LINQPad, it would be glowing but I don't want to consume this post on that, however in case you do not know what LINQPad is, for the purpose of this article, it allows us to write LINQ to SQL statements along with any other arbitrary C# code (basically just a disclaimer that you will not see any DataContext creation/code in my samples since it is implicit in LINQPad).
So what's the problem with DeleteAllOnSubmit()?
As the title of this post implies, this is really only an issue when you want to update and/or delete rows from a table with the same 'rules' applied for every row in an entity set. In Jeffrey's article, he gave an example of 'deleting all rows where creation date was older than a month'. This is a perfect example of when a O/R Mapping framework issues multiple delete statements when a single 'simple' SQL statement could be written. So lets say I had the following code:
1: var posts =
2: from p in be_Posts
3: where p.DateCreated.Value.CompareTo( DateTime.Today.AddMonths( -1 ) ) < 0
4: select p;
5:
6: be_Posts.DeleteAllOnSubmit( posts );
7: SubmitChanges();
Upon calling SubmitChanges(), the following SQL statements would be issued:
1: SELECT [t0].[PostID], [t0].[Title], [t0].[Description], [t0].[PostContent],
2: [t0].[DateCreated], [t0].[DateModified], [t0].[Author], [t0].[IsPublished],
3: [t0].[IsCommentEnabled], [t0].[Raters], [t0].[Rating], [t0].[Slug]
4: FROM [be_Posts] AS [t0]
5: WHERE ([t0].[DateCreated]) < @p0
6: -- @p0: Input DateTime (Size = 0; Prec = 0; Scale = 0) [3/13/2008 12:00:00 AM]
7:
8: DELETE FROM [be_Posts] WHERE [PostID] = @p0
9: -- @p0: Input UniqueIdentifier (Size = 0; Prec = 0; Scale = 0) [0ea7b3b7-1b27-4b5d-a6ba-69e5711e3bb1]
10:
11: DELETE FROM [be_Posts] WHERE [PostID] = @p0
12: -- @p0: Input UniqueIdentifier (Size = 0; Prec = 0; Scale = 0) [0ea7b3b7-1b27-4b5d-a6ba-69e5711e3bb2]
13:
14: -- Inidivdual DELETE statements for each row in the IQueryable<be_Posts> variable.
So as you can see, to simply delete a list of be_Posts rows, LINQ to SQL first queries all the data. The performance impact is directly based upon what your row contains (i.e. binary data, or PostContent in this case) and how many rows the LINQ query returned. Then, LINQ to SQL iterates all rows return, flags them as deleted, then generates a single DELETE statement. This DELETE statement creates a simple WHERE clause using the row's primary key (PostID in my case). In addition to this overhead, Jeffery did not like the need to call SubmitChanges() on the DataContext object to confirm database changes. Now I whole heartedly agree about combining simple DELETE SQL statements into one, but I actually disagree with the desire to eliminate the call to SubmitChanges(). More on that later.
Okay, that's a problem, how do we fix it?
In a LINQ to SQL utopian world, executing the C# code above, the provider would generate only a single DELETE statement as follows:
1: DELETE FROM [be_Posts] WHERE [DateCreated] < @p0
2: -- @p0: Input DateTime (Size = 0; Prec = 0; Scale = 0) [3/13/2008 12:00:00 AM]
So how can we do this? One of the shortcomings of Jeffrey's code (self admittedly) was that he was essentially generating the SQL statement himself by using an ExpressionVisitor and took LINQ to SQL provider 'out of the loop'. Because of this, he did not support all the functionality that LINQ to SQL provided (i.e. post.Title.Length < 10). One of the comments on his post suggested 'wrapping' an INNER JOIN around the SQL generated by the original IQueryable expression and that is exactly what I was thinking (at least I keep telling myself that...nice suggestion Richard). The great thing about this is now, we only have to do a little SQL statement generation to append on to whatever LINQ to SQL gives us. This keeps all the 'complicated' stuff where it belongs: in Microsoft libraries. So, now Utopia would end up with a SQL statement like so:
1: DELETE [be_Posts]
2: FROM [be_Posts] AS j0 INNER JOIN (
3:
4: SELECT [t0].[PostID]
5: FROM [be_Posts] AS [t0]
6: WHERE ([t0].[DateCreated]) < @p0
7:
8: ) AS j1 ON (j0.[PostID] = j1.[PostID])
9:
10: -- @p0: Input DateTime (Size = 0; Prec = 0; Scale = 0) [3/13/2008 12:00:00 AM]
To do this, I decided to create an extension method on the Table<TEntity> with an IQueryable<TEntity> argument. I did this to emulate the LINQ to SQL libary's Table<TEntity>.DeleteAllOnSubmit( IEnumerable<TEntity> entities ) function. In addition to having a similar interface, it also allows you to build up your final IQueryable parameter through multiple Query Syntax queries. Consider the following 'trivial' example. I know this could be written in a single Lambda expression, but I would find it easier to read and maintain as individual queries. Even if they were all combined into one Query Syntax query, to me it is more readable than 'stuffing' it all in on one Lambda parameter.
1: var posts = from p in be_Posts select p;
2:
3: var oldPosts =
4: from p in posts
5: where p.DateCreated.Value.CompareTo( DateTime.Today.AddMonths( -1 ) ) < 0
6: select p;
7:
8: var myPosts =
9: from p in oldPosts
10: where p.Author == "Terry Aney"
11: select p;
12:
13: var first10 = myPosts.Take( 10 );
14:
15: // Do something to first10...
Without drilling into the supporting methods quite yet, essentially, my extension method is defined as follows:
1: public static int DeleteBatch<TEntity>(
2: this Table<TEntity> table, IQueryable<TEntity> entities )
3: where TEntity : class
4: {
5: DbCommand delete = table.GetDeleteBatchCommand<TEntity>( entities );
6:
7: var parameters = from p in delete.Parameters.Cast<DbParameter>()
8: select p.Value;
9:
10: return table.Context.ExecuteCommand( delete.CommandText,
11: parameters.ToArray() );
12: }
So we've got DELETE whipped, what about UPDATE? (take one)
Admittedly, the DeleteBatch() extension method was pretty easy to implement. Batch updating is where I encountered all of my dilemmas. Following the same principle as I described above, my function signature looked like:
1: public static int UpdateBatch<TEntity>( this Table<TEntity> table,
2: IQueryable<TEntity> entities,
3: object evaluator )
4: where TEntity : class
5: {
6: }
Now you should notice the evaluator parameter of type object and begin to get suspicious. The first dilemma to deal with was how powerful was I going to make this extension method. I saw Jeffrey's example of passing a Lambda expression as a parameter that defined how to update each row being processed, but admittedly my expression tree knowledge was zero. Additionally, I naively thought that was an unnecessary feature for me because I figured all my batch updates would be simply setting column values to a fixed value that could be determined up front. So my first implementation, evaluator was passed in as an anonymous type and I used reflection on all the public properties and looked for matching field names on the TEntity. So extending on my 'first10' Query Syntax example variable above my update might have looked like this:
1: be_Posts.UpdateBatch( first10, new { Author = "Chris Cavanagh" } );
Now the immediate downside you'll notice to this, which is basically completely contradictory to O/R Mapping frameworks, is that I will get no intellisense or type checking on the Author property. However, I accepted this because:
- As I mentioned, I had no expression tree knowledge/experience and this appeared to be the best I could do and;
- I figured most of my 'batch statements' were going to be executed in LINQPad anyway, and since it didn't have intellisense yet I wasn't really 'losing' anything ;)
On a bit of a side note, although it does have a purpose, at this point, I was all proud of myself. I had created the extension methods and was ready to write my first technical blog (until this point my website was all personal). So I wanted to make sure that Google was going to be able to find my post about my 'great' extensions once I finished writing about them. However, upon submitting my sitemap to Google, there were errors in the feed. It turns out that my 'modified' date on my posts (URLs) were null and returning an invalid date. This is because I hand populated blogengine.net's database from my previous home grown blog site and I forgot to populate that field. So I needed to simply update all my posts, setting 'modified' equal to 'created'. BatchUpdate, here I come...er, crap I can't update columns based on other columns. Jeffrey's vision was right and I needed help.
All right, how hard can these expression trees really be?
I wouldn't know, I'm still clueless. Just kidding, I understand them slightly now, but only because my end implementation is still fairly simple. To get to that point, I enlisted the help of Chris Cavanagh, a co-worker, since he was worlds beyond me in knowledge of expression trees (actually, the same can probably be said for any technology...but hey, I have him beat in VBScript and Transact SQL!). So the second iteration's function ended up looking like the following:
1: public static int UpdateBatch<TEntity>(
2: this Table<TEntity> table,
3: IQueryable<TEntity> entities,
4: Expression<Func<TEntity, TEntity>> evaluator )
5: where TEntity : class
6: {
7: DbCommand update = table.GetUpdateBatchCommand<TEntity>( entities,
8: evaluator );
9:
10: var parameters = from p in update.Parameters.Cast<DbParameter>()
11: select p.Value;
12:
13: return table.Context.ExecuteCommand( update.CommandText,
14: parameters.ToArray() );
15: }
Given this, to perform the 'modified' equals 'created' update example from above, you could have the following (the Author assignment is for a subsequent sample):
1: var posts = from p in be_Posts select p;
2: be_Posts.UpdateBatch( posts, p => new be_Posts { DateModified = p.DateCreated,
3: Author = "Chris Cavanagh" } );
The Lambda expression parameter contains an object initializer. From this, we extract the field names and values to assign to every row. This expression will never be invoked; it's used merely as a convenient syntax to describe the assignments we want and convert them into the appropriate SQL SET statements. Of course we want the underlying LINQ Provider do all the SQL generation for us and we accomplished that in the following steps:
- Using an ExpressionVisitor, visit the single MemberInitExpression (i.e. new be_Posts { ... }) and process each MemberAssignment in MemberInitExpression.Bindings.
- If the binding uses the argument in the assignment (i.e. DateModified = p.DateCreated),
- We create a MethodCallExpression which represents a 'simple' select of only the assignment part (i.e. p.DateCreated) of the expression
- Given the MethodCallExpression, call IQueryProvider.CreateQuery() to convert it into an IQueryable object.
- Given the IQueryable object, call DataContext.GetCommand(), therefore letting the LINQ Provider generate SQL.
- Given the resulting DbCommand.CommandText (i.e. SELECT DateCreated FROM be_Posts), parse out the required text to generate a SET statement (i.e. DateCreated).
- Given the resulting DbCommand, take any DbParameter objects and merge them into the final update command.
- If the binding does not use the argument in the assignment, but rather just predefined constant values (i.e. Author = "Chris Cavanagh")
- Compile and invoke the assignment expression to obtain the constant value*
- Add the constant value as a parameter to the final update command.
- Given all the individual SET statements, combine them into one SQL UPDATE statement with appropriate parameter names.
* For compiling the expression and invoking it, the following code was used:
1: var constant = Expression.Lambda( assignment.Expression, null )
2: .Compile().DynamicInvoke(); } );
So now, if we previewed the the SQL statement that would be generated, we'd see something like:
1: UPDATE [be_Posts]
2: SET [DateModified] = [DateCreated], [Author] = @p0
3:
4: FROM [be_Posts] AS j0 INNER JOIN (
5:
6: SELECT [t0].[PostID]
7: FROM [be_Posts] AS [t0]
8:
9: ) AS j1 ON (j0.[PostID] = j1.[PostID])
10:
11: -- @p0: Input NVarChar (Size = 14; Prec = 0; Scale = 0) [Chris Cavanagh]
So you only mentioned one dilemma, where are the others?
So there were just a couple issues that threw me for a loop when testing out my extension methods (thankfully all of them were resolved).
- If I used the Take() function in my queries, the generated SQL didn't seem to be processing it. I was missing the TOP N part of the select statement.
- If I used the orderby keyword, the generated SQL was invalid. Since it was wrapped in a sub query, it needed a TOP clause (easy enough to add TOP 100 PERCENT once I figured out the #1).
- During my update command creation, I was adding parameters as @p{fieldName}{0-N} to avoid any @p{0-N} naming conflicts with both the original INNER JOIN sub query and all the separate SET queries. I was 100% positive that my name was correct in both the SQL SET statement I created along with the name used when I added the DbParameter to the final update command. However, upon executing the command, I was getting an error saying that my parameter was not declared.
- What to do about SubmitChanges()? I didn't resolve this one and don't even know if it is possible. Like I said, I disagree with Jeffrey's distaste for the additional call. The simple reason being that it now behaves differently than rest of 'standard LINQ' CRUD operations. If you were going to chain an UpdateBatch() or DeleteBatch() with a bunch of other CRUD actions, the standard ones will not apply until SubmitChanges() is called, yet my *Batch() methods will immediately apply. It just feels wrong to me is about all I can say. Although, if you expose stored procedures in your DataContext, executing them in the middle of standard CRUD operations behaves the same way in that they are immediately applied regardless of any call to SubmitChanges(), so I guess I can 'stomach' the behavior of my *Batch() functions. If anyone knows how to hook into the DataContext ChangeSet, let me know!
#1 and #2 above were both easy. They were due to me incorrectly parsing the DbCommand.CommandText created from the original entities parameter passed in used to create the INNER JOIN sub query. #1 was only slightly tricky because I was trying to be clever and change what the SELECT actually returned. By default, it would return every field on the table, but as I mentioned this could be a lot of data based on the table columns. So I instead wanted to return only the fields required to complete the INNER JOIN - the primary key fields. So when parsing, I mistakenly stripped off too much information and my command was no longer aware of any TOP clause that may have existed. You can see this in detail in the code.
So the tricky one that took me a while to figure out was #3. I just couldn't see how it couldn't be working. I mistakenly sent a forum post over to LINQPad trying to 'blame' LINQPad for some weird behavior. Shame on me. Sorry about that Joseph (creator of LINQPad), I should have never doubted you! It only got to that point because when I previewed my CommandText, it looked something like this (notice the parameter name in line 2 and 11):
1: UPDATE [be_Posts]
2: SET [DateModified] = [DateCreated], [Author] = @pAuthor0
3:
4: FROM [be_Posts] AS j0 INNER JOIN (
5:
6: SELECT [t0].[PostID]
7: FROM [be_Posts] AS [t0]
8:
9: ) AS j1 ON (j0.[PostID] = j1.[PostID])
10:
11: -- @pAuthor0: Input NVarChar (Size = 14; Prec = 0; Scale = 0) [Chris Cavanagh]
The above output came from me 'copying' System.Data.Linq.SqlClient.SqlProvider.LogCommand method. Looking at this, everything looked good. I could even take this to SQL Management Studio and run it, obviously converting the -- @pAuthor0 comment into a valid DECLARE / SET statement, and everything worked.
Well the problem ended up being that when I outputted this log, I was doing it before calling DataContext's public int ExecuteCommand(string command, params object[] parameters). There is a subtle thing going on here when you call ExecuteCommand that I missed entirely. The LINQ provider is recreating a new DbCommand and simply looping every value in the parameters array and creating a new DbParameter with the name @p{0-N}. Since I was only passing it an array of objects, it couldn't possibly know what I named the parameters in my original update command. So to fix this, I just had to massage my query construction to use only the @p0 syntax which was only slightly more tedious the the previously mentioned 'easy' @p{fieldname}0 syntax.
So how powerful is this bad boy?
So there you go. Both UPDATE and DELETE statements can be executed in a single SQL statement and I believe that both methods are as powerful as you'd ever need them to be - but if I'm missing something let me know. Let's look at an exaggerated example of what can be done.
1: var posts = from p in be_Posts select p;
2:
3: var oldPosts =
4: from p in posts
5: where p.DateCreated.Value.CompareTo( DateTime.Today.AddMonths( -1 ) ) < 0
6: select p;
7:
8: var myPosts =
9: from p in oldPosts
10: where p.Author == "Terry Aney"
11: select p;
12:
13: var first10 = myPosts.OrderByDescending( o => o.DateModified ).Take( 10 );
14:
15: var anonymousAuthor = new { Author = "Chris Cavanagh" };
16: var great = "great!";
17:
18: be_Posts.UpdateBatch( first10,
19: p => new be_Posts {
20: DateModified = p.DateCreated.Value.AddDays( 15 ),
21: Author = anonymousAuthor.Author,
22: Description = null,
23: Title = "The " + p.Title + " was " + great,
24: Rating = 500 * 500
25: } )
Anyone care to guess the SQL it creates? ;) Here it is in full Transact SQL glory (if there is such a thing):
1: UPDATE [be_Posts]
2: SET [DateModified] = DATEADD(ms, (CONVERT(BigInt,@p2 * 86400000)) % 86400000, DATEADD(day, (CONVERT(BigInt,@p2 * 86400000)) / 86400000,
3: [DateCreated])), [Author] = @p3, [Description] = null, [Title] = ((@p4 + [Title]) + @p5) + @p6, [Rating] = @p7
4:
5: FROM [be_Posts] AS j0 INNER JOIN (
6:
7: SELECT TOP (10) [t0].[PostID]
8: FROM [be_Posts] AS [t0]
9: WHERE ([t0].[Author] = @p0) AND (([t0].[DateCreated]) < @p1)
10: ORDER BY [t0].[DateModified] DESC
11:
12: ) AS j1 ON (j0.[PostID] = j1.[PostID])
13:
14: -- @p0: Input NVarChar (Size = 10; Prec = 0; Scale = 0) [Terry Aney]
15: -- @p1: Input DateTime (Size = 0; Prec = 0; Scale = 0) [3/14/2008 12:00:00 AM]
16: -- @p2: Input Float (Size = 0; Prec = 0; Scale = 0) [15]
17: -- @p3: Input NVarChar (Size = 14; Prec = 0; Scale = 0) [Chris Cavanagh]
18: -- @p4: Input NVarChar (Size = 4; Prec = 0; Scale = 0) [The ]
19: -- @p5: Input NVarChar (Size = 5; Prec = 0; Scale = 0) [ was ]
20: -- @p6: Input NVarChar (Size = 6; Prec = 0; Scale = 0) [great!]
21: -- @p7: Input Real (Size = 0; Prec = 0; Scale = 0) [250000]
Download the updated code