Archive for the 'Core Data' Category

Merging multiple contexts in Core Data

Last time I was working on Wombat, I was trying to get its Core Data multi-threading use correct. Namely, I stopped sharing one context (NSManagedObjectContext) between all threads, and instead created a new context for each thread. This meant I didn’t have to lock the context each time it was touched, which resulted in better performance. I thought it would be as simple as that: Core Data would take care of merging the multiple contexts, and all would be good.

I was close, but it’s not quite that simple.

The first problem I ran into was the possibility of two different clients, each on its own thread and thus context, posting the same message. Since a context isn’t saved until its client disconnects, that would work for the first client to quit, but the second one would run into trouble. It would most likely succeed in its save, assuming it had no merge conflicts, but it would compromise the integrity of the data store. That is, I shouldn’t have the same post in the database twice. That’s bad.

Now the NNTP protocol specifies that each post should have a unique ID, called a Message-ID. It’s simply a black-box string that is supposed to be globally unique. Typically it is a time stamp concatenated with the local host name, and enclosed in angle brackets (<>). Some NNTP clients attempt to generate the Message-ID themselves, but most are smart and allow the server to generate one on their behalf. That means the odds of receiving duplicate messages from clients are pretty much nil, although the situation still has to be considered.
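As a sketch, a server-generated Message-ID of that typical shape might be built like this (the helper name and exact format are my own illustration, not anything Wombat actually uses):

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: builds a Message-ID from a time stamp and the
// local host name, e.g. <1149876543.8214@news.example.com>.
static NSString *WBGenerateMessageID(void)
{
    NSTimeInterval now = [[NSDate date] timeIntervalSince1970];
    NSString *host = [[NSProcessInfo processInfo] hostName];
    long salt = random();   // reduce collisions within the same second
    return [NSString stringWithFormat:@"<%.0f.%ld@%@>", now, salt, host];
}
```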

It’s far more likely that two peer servers connect to Wombat, and offer it the same message that was posted elsewhere on the network. This is still somewhat unlikely, because direct peers should be somewhat rare, and thus the likelihood of them connecting at the same time would be low.

I bring up the rarity of these events for a reason. If these collisions were frequent, it would mean I should probably go back and consider locking and updating the contexts more often. That way I’d always know which messages I have and which ones I don’t; i.e., I’d be doing preventative maintenance. The downside of the preventative approach is obvious: it’s slower, because I have to bottleneck all threads and update the data store. However, since message collisions are (theoretically) rare, I can just assume the thread’s current context is up-to-date. Then, when the thread goes to merge its changes into the data store, it can handle any conflicts at that point.

Before this point, Wombat never made use of any of Core Data’s validation methods. That’s because everything was serialized through one context and I manually validated the incoming data before I inserted it into the context. For example, before I inserted a message I searched for any messages in the context with the same Message-ID. If I found one, I simply didn’t insert the new message, thus maintaining the integrity of the data store.

Now I needed to be able to catch any duplicates when I went to save the context to the data store. My first thought was: “wouldn’t it be great if Core Data modeling allowed me to specify an attribute as unique?” It would be, but alas, Core Data doesn’t allow it. If I could specify an attribute as being globally unique then I wouldn’t have to write any validation methods, but just let Core Data catch them for me. I also wonder if SQLite would be able to do anything with an attribute if it knew it was unique. For example, create an index on it so searching was quicker.

Anyway, dreams aside, I needed to write a validation method. The first thing I thought of was the validation method generated by Xcode for each attribute, the one of the form:

- (BOOL)validate<Key>:(id *)valueRef error:(NSError **)outError;

where <Key> is the name of the attribute. The problem with this is that Message-ID never changes, and I actually only want to validate when a new message is inserted into the data store. After searching around the documentation some more, I discovered:

- (BOOL)validateForInsert:(NSError **)error;

It’s a pretty easy method to use, and it only gets called on inserted objects. I simply added code there to check for more than one message with the same Message-ID. If I found more than one, I returned NO and put an error in the out parameter. That error is passed back through the call to [NSManagedObjectContext save:&error], so I can do useful things like stuff the offending object into the error, and whoever called save will get it.
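A sketch of what that check might look like in the message’s NSManagedObject subclass (the entity name “Message” and attribute name “messageID” are hypothetical; the error constants are real Core Data ones):

```objc
- (BOOL)validateForInsert:(NSError **)outError
{
    if (![super validateForInsert:outError])
        return NO;

    // Look for any message in this context with the same Message-ID.
    NSFetchRequest *request = [[[NSFetchRequest alloc] init] autorelease];
    [request setEntity:[NSEntityDescription entityForName:@"Message"
                                   inManagedObjectContext:[self managedObjectContext]]];
    [request setPredicate:[NSPredicate predicateWithFormat:@"messageID == %@",
                              [self valueForKey:@"messageID"]]];

    NSError *fetchError = nil;
    NSArray *matches = [[self managedObjectContext] executeFetchRequest:request
                                                                  error:&fetchError];
    if ([matches count] > 1) {
        if (outError != NULL) {
            // Stuff the offending object into the error's user info so
            // the caller of -save: can find and delete the duplicate.
            NSDictionary *info = [NSDictionary dictionaryWithObject:self
                                     forKey:NSValidationObjectErrorKey];
            *outError = [NSError errorWithDomain:NSCocoaErrorDomain
                                            code:NSManagedObjectValidationError
                                        userInfo:info];
        }
        return NO;
    }
    return YES;
}
```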

Now I’d like to take an intermission to rant a bit. The error mechanism here is clunky at best. If you’ll notice, the out error parameter only points to one error, as does the error parameter in save:. So what happens if you have more than one error because, say, you have six messages that are duplicates? You curl up in a corner and cry, that’s what. Then you have to use some convoluted logic to pretend the API was designed to support multiple errors.

First, you have to check to see if the error parameter is nil. If it is, you just jam a pointer to your error in it. If it’s not nil, then you have to check to see if it’s a special error that’s designated as “multiple errors.” If it’s the special “multiple errors” error, then you create an entirely new “multiple errors” error with all the old stuff in it, plus your error added to its array of multiple errors. If the current error is not the special “multiple errors” error, then you have to create one, and jam the old error and your new error into it. Fun stuff.
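Spelled out, that convoluted logic is roughly the following sketch. (The special “multiple errors” error is a real Core Data convention: code NSValidationMultipleErrorsError, with the individual errors stored under NSDetailedErrorsKey in the user info.)

```objc
// Combine a newly generated validation error with whatever is already
// in *outError, following Core Data's multiple-errors convention.
static void WBCombineError(NSError **outError, NSError *newError)
{
    if (outError == NULL)
        return;

    if (*outError == nil) {
        // No previous error: just jam ours in.
        *outError = newError;
        return;
    }

    NSMutableArray *errors = [NSMutableArray array];
    if ([*outError code] == NSValidationMultipleErrorsError) {
        // Already a "multiple errors" error: copy out its array.
        [errors addObjectsFromArray:
            [[*outError userInfo] objectForKey:NSDetailedErrorsKey]];
    } else {
        // A single plain error: it becomes the first of many.
        [errors addObject:*outError];
    }
    [errors addObject:newError];

    NSDictionary *info = [NSDictionary dictionaryWithObject:errors
                                                     forKey:NSDetailedErrorsKey];
    *outError = [NSError errorWithDomain:NSCocoaErrorDomain
                                    code:NSValidationMultipleErrorsError
                                userInfo:info];
}
```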

Hey Apple, you wanna know what else would have worked? An NSMutableArray of NSError’s. Crazy idea, I know.

Anyway, after I figured out how to get errors back to the caller of save, I needed to do something about them. Fortunately, processing duplicates is easy: you just delete them. I thought about doing this inside validateForInsert:, but some of Apple’s documentation advises against mutating the context inside validation routines. Instead, the caller of save walks the list of errors (using its own special logic to produce an array of errors out of a single error), deletes the duplicates, and then attempts the save again.
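In code, the caller’s side might look like this sketch (again using the real NSValidationMultipleErrorsError/NSDetailedErrorsKey convention, and assuming the validation code stashed the offending object under NSValidationObjectErrorKey):

```objc
// Attempt a save; on validation failure, delete the duplicate objects
// named in the error(s) and try once more.
- (BOOL)saveContext:(NSManagedObjectContext *)context
{
    NSError *error = nil;
    if ([context save:&error])
        return YES;

    // Normalize the single-or-multiple error into an array of errors.
    NSArray *errors;
    if ([error code] == NSValidationMultipleErrorsError)
        errors = [[error userInfo] objectForKey:NSDetailedErrorsKey];
    else
        errors = [NSArray arrayWithObject:error];

    NSEnumerator *e = [errors objectEnumerator];
    NSError *oneError;
    while ((oneError = [e nextObject]) != nil) {
        NSManagedObject *duplicate =
            [[oneError userInfo] objectForKey:NSValidationObjectErrorKey];
        if (duplicate != nil)
            [context deleteObject:duplicate];
    }

    error = nil;
    return [context save:&error];   // second attempt
}
```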

At this point, I thought I was home free. But save kept returning merge errors even after I had deleted the duplicates, and I didn’t know why. The answer turned out to be the merge policy on the context. By default, the merge policy is “don’t merge, and report anything that doesn’t merge.” I’m sure that’s a fine policy for applications that never use more than one context per data store, but it doesn’t work so well in Wombat.

There are actually several merge policies described in Apple’s documentation. NSErrorMergePolicy simply returns an error for each merge conflict, and is the default. NSMergeByPropertyStoreTrumpMergePolicy and NSMergeByPropertyObjectTrumpMergePolicy are similar to each other in that they merge on a property by property basis. They only differ when the property has been changed in both the store and context. In that case, NSMergeByPropertyStoreTrumpMergePolicy takes whatever was in the store. Conversely, NSMergeByPropertyObjectTrumpMergePolicy takes whatever was in the object context. NSOverwriteMergePolicy simply forces all the changes in the object context into the data store. Finally, NSRollbackMergePolicy discards any object context changes that conflict with what’s in the data store.

For Wombat, I chose NSMergeByPropertyObjectTrumpMergePolicy because it does finer-grained merging, and because there’s a slim chance what’s in the object context is more up-to-date than what’s in the store.
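Switching the policy is a one-line call on the context (a sketch, assuming `coordinator` is the application’s shared NSPersistentStoreCoordinator):

```objc
NSManagedObjectContext *context = [[NSManagedObjectContext alloc] init];
[context setPersistentStoreCoordinator:coordinator];

// Replace the default NSErrorMergePolicy so conflicting properties
// are resolved in favor of the in-memory object.
[context setMergePolicy:NSMergeByPropertyObjectTrumpMergePolicy];
```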

All in all, merging multiple contexts was harder than I thought, and harder than I thought it needed to be. It would be nice if Core Data could do some more validation automatically (like unique attributes) and if the error handling were better. I also think picking a better default merge policy would help, because when I was getting the seemingly random merge errors, it wasn’t at all obvious to me that the merge policy was the problem.

Core Data and Multi-threading

I’ve been wanting to write this article since Tuesday, but I’ve been distracted by my day job. One of our clients is getting close to shipping, so I have to put in more hours than usual. It’s nowhere near as fun as working on Wombat, but it pays the bills.

Anyway, back when I wrote about Wombat a couple of weeks ago, I mentioned the trouble I was having with Core Data and multiple threads. Basically, I was finding that the entire context (NSManagedObjectContext) had to be locked any time a thread touched the context or any one of its managed objects (NSManagedObject). That included even accessing attributes on an NSManagedObject, as well as mutating them.

Apparently I wasn’t the only one who figured this out. Florian Zschocke, creator of Xnntp, told me that he was running into the same problem of having to lock the entire context each time he touched anything. He was also wondering if there was a better way.

The obvious problem with locking every time is that it defeats the concurrency of threads. The threads end up being serialized any time they touch the data store. This is pretty troublesome for Wombat, because it’s an NNTP server: most of its time is spent doing I/O, either reading/writing the data store or reading/writing sockets. Accessing the data store is already a potential performance hotspot, and the serialization of threads makes it even worse.

Fortunately there’s a better way. Blake Seely left a comment on my previous post, letting me know that the appropriate way to handle multiple threads is to have a separate context for each thread. About this time I also found some Apple documentation pertaining to Core Data and multiple threads, which echoed Blake’s comments. This is as simple as allocating an NSManagedObjectContext each time a thread is spawned, and handing it the solitary NSPersistentStoreCoordinator.
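A sketch of that per-thread setup, assuming a thread entry point that receives the shared coordinator (the method name is mine, not Wombat’s):

```objc
// Thread entry point: each client thread builds its own context around
// the application's single NSPersistentStoreCoordinator.
- (void)handleClientWithCoordinator:(NSPersistentStoreCoordinator *)coordinator
{
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

    NSManagedObjectContext *context = [[NSManagedObjectContext alloc] init];
    [context setPersistentStoreCoordinator:coordinator];

    // ... service the client using this thread-private context,
    //     with no locking required ...

    [context release];
    [pool release];
}
```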

The one gotcha is that NSManagedObjects from one context cannot be used in another context. If you want to send an object from one thread to another, you have to pass the NSManagedObjectID around. This can be obtained via [object objectID] on one thread, then used on the other thread with [context objectWithID:objectID] to get the corresponding object in that context. However, this only works for objects that have been saved. In general, Wombat isn’t going to have to worry about passing objects between threads. That’s because each client is pretty isolated and has no reason to talk directly to another client.
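In code, the handoff looks roughly like this (variable names are illustrative):

```objc
// Thread A: obtain a stable, context-independent identifier.
// (Only the IDs of objects that have been saved are valid across contexts.)
NSManagedObjectID *objectID = [message objectID];

// Thread B: rehydrate the same record in its own context.
NSManagedObject *sameMessage = [otherContext objectWithID:objectID];
```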

That said, having multiple contexts has some implications for Wombat. Currently each client gets its own thread, and thus its own object context. In the future this will probably change, and clients will be pooled together on a few threads that handle multiple clients using something like kqueue to multiplex sockets. The catch is that no client saves its changes until the remote client closes the connection. Currently that means if Wombat has two clients, and Client A posts an article, Client B will not see that article until Client A quits. For performance reasons, NNTP clients often leave the connection open for a specified amount of time after they’ve done their work.

The behavior is acceptable, but it gets a bit weirder when clients start getting multiplexed by a single thread. In that scenario, Client B might see the article immediately if it’s in the same pool, or it might have to wait until Client A quits. In other words, some clients will see articles sooner than other clients. Once again, it’s acceptable behavior, but it’s a little odd.

Meanwhile, I’ve been reading Another Day in the Code Mines, which has a lot to say about threading. One of the thoughts that I came away with is that forking processes in Wombat would probably be better than spawning threads. That is, for each client that connects, instead of spawning a thread for it, spawn a process to handle it. Processes are heavier weight, but they provide a couple advantages. First, they provide separate memory spaces for each client, so one client can’t mess with another. Second, if one client crashes, it doesn’t take down the entire server, thus making Wombat more robust. Forking for each client also happens to be a classic NNTP server design, and for good reason.

Unfortunately, as far as I can tell, Core Data doesn’t support this. Multiple contexts can exist because they all share one NSPersistentStoreCoordinator, which serializes all I/O to the data store file. Since SQLite often updates just parts of the file at a time, I can’t imagine that it would allow multiple processes to have the data store file open at once, especially for write. The only way I see around this is to make the data store its own server. Unfortunately, that reintroduces the single point of failure (if it goes down, all clients go down), and since NNTP is a fairly thin protocol over news, it would just end up being something pretty close to an NNTP server itself. Not a win.

In the end, it looks as though I’m just going to give each thread its own context, and then multiplex several sockets on each thread using kqueue. It may not be as robust as forking processes, but it should be possible to get some good performance out of it.

NSPredicate and regular expressions

I’ve managed to figure out how to implement wildmat patterns inside of Wombat. It turns out NSPredicate does in fact support regular expressions, as documented in Using Predicates. You simply use the MATCHES operator to specify a regular expression, as shown in a couple of examples in Apple’s documentation. It’s implemented using ICU’s Regular Expressions package, which provides much better documentation than Apple’s.

The gotcha to note is that the MATCHES operator does not compile down to SQL, so you can’t give it directly to Core Data. Instead you have to pull out all the entities then post-process the array using NSArray’s filteredArrayUsingPredicate: method. It works, but it’s not as efficient as it would be if it were compiled down to SQL.
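Putting the two steps together looks roughly like this sketch (the “Message” entity, “subject” attribute, and the pattern are hypothetical):

```objc
// Fetch all the messages, then filter in memory with a regular
// expression, since MATCHES doesn't compile down to SQL.
NSFetchRequest *request = [[[NSFetchRequest alloc] init] autorelease];
[request setEntity:[NSEntityDescription entityForName:@"Message"
                               inManagedObjectContext:context]];

NSError *error = nil;
NSArray *allMessages = [context executeFetchRequest:request error:&error];

NSPredicate *regex =
    [NSPredicate predicateWithFormat:@"subject MATCHES %@", @"comp\\.sys\\..*"];
NSArray *matching = [allMessages filteredArrayUsingPredicate:regex];
```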

In related news, Mike Zornek was kind enough to point out a small Core Data mailing list. Hopefully it will be a source of useful information in the future.