Archive for October, 2006

Merging multiple contexts in Core Data

Last time I was working on Wombat, I was trying to get its Core Data multi-threading use correct. Namely, I stopped sharing one context (NSManagedObjectContext) between all threads, and created a new context for each thread. This meant I didn’t have to lock the context each time it was touched, which resulted in better performance. I thought that it would be simple as that. Core Data would take care of the merging of the multiple contexts, and all would be good.

I was close, but it’s not quite that simple.

The first problem I ran into was the possibility of two different clients, each on its own on thread and thus context, posting the same message. Since contexts aren’t saved until a client disconnects, that would work for the first client who quit, but the second one would run into trouble. It would most likely succeed in its save, assuming it had no merge conflicts, but it would compromise the data integrity of the data store. That is, I shouldn’t have the same post in the database twice. That’s bad.

Now the NNTP protocol specifies that each post should have a unique ID, called a Message-ID. Its simply a blackbox string that is supposed to be globally unique. Typically it is a time stamp concatenated with the local host name, and enclosed in angle brackets (<>). Some NNTP clients attempt to generate the Message-ID themselves, but most are smart and allow the server to generate one on their behalf. That means that the odds of receiving duplicate messages from clients are pretty much nil, although the situation still has to be considered.

It’s far more likely that two peer servers connect to Wombat, and offer it the same message that was posted elsewhere on the network. This is still somewhat unlikely, because direct peers should be somewhat rare, and thus the likelihood of them connecting at the same time would be low.

I bring up the rarity of the these events for a reason. If these collisions were frequent it means I should probably go back and consider locking down and updating the contexts more often. That way I always know what messages I have and which ones I don’t. i.e. I should make sure I do preventative maintenance. The downside of preventative is obvious: its slower because I have to bottleneck all threads and update the data store. However, since the message collision events are (theoretically) rare, then I can just assume the thread’s current context is up-to-date. Then, when the thread goes to merge any changes to the context into the data store, it can handle any conflicts then.

Before this point, Wombat never made use of any of Core Data’s validation methods. That’s because everything was serialized through one context and I manually validated the incoming data before I inserted it into the context. For example, before I inserted a message I searched for any messages in the context with the same Message-ID. If I found one, I simply didn’t insert the new message, thus maintaining the integrity of the data store.

Now I needed to be able to catch any duplicates when I went to save the context to the data store. My first thought was: “wouldn’t it be great if Core Data modeling allowed me to specify an attribute as unique?” It would be, but alas, Core Data doesn’t allow it. If I could specify an attribute as being globally unique then I wouldn’t have to write any validation methods, but just let Core Data catch them for me. I also wonder if SQLite would be able to do anything with an attribute if it knew it was unique. For example, create an index on it so searching was quicker.

Anyway, dreams aside, I needed to write a validation method. The first thing I thought of was the validation method generated by Xcode for each attribute. The one of the form:

- (BOOL)validate: (id *)valueRef error:(NSError **)outError;

where is the name of the attribute. The problem with this is that Message-ID never changes, and I actually only want to validate when a new message is inserted into the data store. After searching around the documentation some more, I discovered:

- (BOOL) validateForInsert:(NSError **)error;

It’s a pretty easy to use method, and it only gets called on an inserted object. I simply added code at this point to check for more than one message with the same Message-ID. If I found more than one, I returned NO and put an error in the out parameter. The error parameter is passed back to the call to [NSManagedObjectContext save:&error], so I can do useful things like stuff the offending object into the error and the person who called save will get it.

Now I’d like to take an intermission to rant a bit. The error mechanism here is ghetto at best. If you’ll notice the out error parameter only points to one error, as does the error parameter in save. So what happens if you have more than one error, because, say, you have six messages that are duplicates? You curl up in a corner and cry, that’s what. Then you have to use some convoluted logic to pretend the API was designed to support multiple errors.

First, you have to check to see if the error parameter is nil. If it is, you just jam a pointer to your error in it. If it’s not nil, then you have to check to see if it’s a special error that’s designated as “multiple errors.” If it’s the special “multiple errors” error, then you create a entirely new “multiple errors” error with all the old stuff in it, plus your error added to its array of multiple errors. If the current error is not the special “multiple errors” error, then you have to create one, and jam the old error and your new error in it. Fun stuff.

Hey Apple, you wanna know what else would have worked? An NSMutableArray of NSError’s. Crazy idea, I know.

Anyway, after I figured out how to get errors back to the caller of save, I needed to do something about them. Fortunately processing duplicates are easy. You just deleted them. I thought about doing this inside of validateForInsert, but some of the Apple documentation advises against mutating the context inside of validation routines. Instead, the caller of save just walks the list of errors (which has its own special logic to produce an array of errors out of a single error) and deletes the duplicates, then attempts the save again.

At this point, I thought I was home free. But save kept returning merge errors even after I had deleted the duplicates, and I didn’t know why. The answer turned out to be the merge policy on the context. By default, the merge policy is “don’t merge, and report anything that doesn’t merge.” I’m sure that’s a fine policy for applications who never use more than one context per data store, but it doesn’t work so well in Wombat.

There are actually several merge policies described in Apple’s documentation. NSErrorMergePolicy simply returns an error for each merge conflict, and is the default. NSMergeByPropertyStoreTrumpMergePolicy and NSMergeByPropertyObjectTrumpMergePolicy are similar to each other in that they merge on a property by property basis. They only differ when the property has been changed in both the store and context. In that case, NSMergeByPropertyStoreTrumpMergePolicy takes whatever was in the store. Conversely, NSMergeByPropertyObjectTrumpMergePolicy takes whatever was in the object context. NSOverwriteMergePolicy simply forces all the changes in the object context into the data store. Finally, NSRollbackMergePolicy discards any object context changes that conflict with what’s in the data store.

For Wombat, I chose NSMergeByPropertyObjectTrumpMergePolicy because it does finer grain merging, and because there’s a slim chance what’s in the object context is more up-to-date than what’s in the store.

All in all, merging multiple contexts was harder than I thought, and harder than I thought it needed to be. It would be nice if Core Data could do some more validation automatically (like unique attributes) and if the error handling was better. I also think picking a better default merge policy would help, because when getting the random merge errors, it wasn’t all that obvious to me that the merge policy was the problem.

Core Data and Multi-threading

I’ve been wanting to write this article since Tuesday, but I’ve been distracted by my day job. One of our clients is getting close to shipping so I have to put in more hours than usual. It’s no where near as fun as working on Wombat, but it pays the bills.

Anyway, back when I wrote a couple of weeks ago about Wombat, I mentioned the trouble I was having with Core Data and multiple threads. Basically, I was finding that the entire context (NSManagedObjectContext) had to be locked anytime a thread touched the context or any one of its managed objects (NSManagedObject). That included even accessing attributes on a NSManagedObject as well as mutating them.

Apparently I wasn’t the only one who figured this out. Florian Zschocke, creator of Xnntp, told me that he was running into the same problem of having to lock the entire context each time he touched anything. He was also wondering if there was a better way.

The obvious problem with locking every time is that it defeats the concurrency of threads. The threads end up being serialized anytime they touch the data store. This is pretty troublesome for Wombat, because it’s an NNTP server. Most of its time is spent doing I/O –either reading/writing to the data store or reading/writing to sockets. Accessing the data store is already a potential performance hotspot, and the serialization or threads makes it even worse.

Fortunately there’s a better way. Blake Seely left a comment on my previous post, letting me know that the appropriate way to handle multiple threads is to have a separate context for each thread. About this time I also found some Apple documentation pertaining to Core Data and multiple threads, which echoed Blake’s comments. This is as simple as allocating a NSManagedObjectContext each time a thread is spawned, and handing it the solitary NSPersistentStoreCoordinator.

The one gotcha is that NSManagedObject’s from one context cannot be used in another context. If you want to send an object from one thread to the other, you have to pass the NSManagedObjectID around. This can be obtained by [object objectID] from one thread, then used on the other thread by [context objectWithID:objectID] to get the corresponding object in that context. However, this only works for objects that have been saved. In general, Wombat isn’t going to have to worry about passing objects between threads. That’s because each client is pretty isolated and has no reason to talk directly to another client.

That said, having multiple contexts has some implications for Wombat. Currently each client gets its own thread, and thus its own object context. In the future this will probably change, and clients will be pooled together on a few threads that handle multiple clients using something like kqueue to multiplex sockets. The catch is that no client saves its changes until the remote client closes the connection. Currently that means if Wombat has two clients, and Client A posts an article, Client B will not see that article until Client A quits. For performance reasons, NNTP clients often leave the connection open for a specified amount of time after they’ve done their work.

The behavior is acceptable, but it gets a bit more weird when clients start getting multiplexed by a single thread. In that scenario, Client B might see the article immediately if its in the same pool, or it might have to wait until Client A quits. In other words, some clients will see articles sooner than other clients. Once again, it’s acceptable behavior, but it’s a little odd.

Meanwhile, I’ve been reading Another Day in the Code Mines, which has a lot to say about threading. One of the thoughts that I came away with is that forking processes in Wombat would probably be better than spawning threads. That is, for each client that connects, instead of spawning a thread for it, spawn a process to handle it. Processes are heavier weight, but they provide a couple advantages. First, they provide separate memory spaces for each client, so one client can’t mess with another. Second, if one client crashes, it doesn’t take down the entire server, thus making Wombat more robust. Forking for each client also happens to be a classic NNTP server design, and for good reason.

Unfortunately, as far as I can tell, Core Data doesn’t support this. Multiple contexts can exist because they all share one NSPersistentStoreCoordinator, which serializes all I/O to the data store file. Since SQLite often updates just parts of the file at a time, I can’t imagine that it would allow multiple processes to have the data store file open at once, especially for write. The only way I see around this is to make the data store file its own server. Unfortunately, this reintroduces the single point of failure (if it goes down, all clients go down) and since NNTP is fairly thin protocol over news, it would just end up being something pretty close to an NNTP server itself. Not a win.

In the end, it looks as though I’m just going to give each thread its own context, and then multiplex several sockets on each thread using kqueue. It may not be as robust as forking processes, but it should be possible to get some good performance out of it.

Documented in code is worse than not documented at all

I’m still working through RFC 2980 while implementing my NNTP server, Wombat. I’ve gotten to the point where I’m implementing the LIST SUBSCRIPTIONS command. The problem is the documentation in the RFC on this command is a bit light:

This command is used to get a default subscription list for new users of this server. The order of groups is significant.

When this list is available, it is preceded by the 215 response and followed by a period on a line by itself. When this list is not available, the server returns a 503 response code.

There are a couple of missing pieces of information here. First, the comment “The order of groups is significant” worried me. Why is the order significant? The answer to that question would effect how I implemented the command. Does the order need to match the order returned by the LIST command, which lists all possible groups? Does it need to be in hierarchical order (i.e. alt.startrek comes before alt.startrek.deepspace9)? Does the client assume any new default groups will be added at the end, so it can easily pick them up?

I search the ‘net for an answer, but couldn’t find anything. I had INN installed, so I went searching through its code and man pages. That’s when I learned that order was sometimes important because it would be presented to the user in that order. Therefore, more important groups for new users should be towards the top. That’s the only reason order is “significant.”

The second missing piece of information I didn’t even notice until I started trawling through INN code. I had assumed the returned list of groups would be formatted the same as LIST and LIST ACTIVE. That is, it would be the group name followed by the start and end article number, and if posting is allowed. However, I discovered by looking at INN that it was simply a list of group names, each one on its own line. The format of the list isn’t mentioned anywhere in RFC 2980.

This is what the RFC sometimes refers to as “documented in code.” This term is very misleading. What it really means is “this isn’t documented at all, plus we will actively try to mislead you.”

I’ve had to deal with “documented in code” before. Back when I was working for Macromedia (now Adobe), I implemented the MX interface on the Mac. The MX interface was cross product branding, intended to make the panels used in all the different products (Dreamweaver, Flash, Fireworks, and FreeHand) look and feel identical. The Windows side had been implemented first. When I asked for a specification to implement, I was promptly told to just make it work like Windows.

I wanted to throttle the person that told me this. There were several problems that this caused.

First, I spent a lot more time implementing this feature than I should have. That’s because I had to dig through Windows code, and test the Windows code just to figure out what I was supposed to be implementing. I was further hindered by the fact that the Windows code was sample code, not shipping code. i.e. Not only was it Windows code, it was poorly written Windows code. Contrast that with the effort needed to read and understand a well written specification. By the way, I was assured we “didn’t have time” to write a specification for the feature.

Second, the specification was continually changing. I wasn’t told about any of this, nor was the Windows sample code updated as frequently as it should have been. The designer would simply occasionally come into my office and tell me something didn’t work right. I have no idea how the QA people figured out what was a bug and what wasn’t. I certainly didn’t know. As a result, I managed to faithfully reimplement bugs in the Mac that were in the Windows sample code.

Thirdly, there were some Windows idioms in the sample code that didn’t translate into Mac idioms. I had to question the designers about it, and often got a wishy-washy answer or no answer at all (“we’ll get back to you.”).

Trying to figure out a specification by looking at an implementation is nigh impossible. It’s difficult to determine what’s an implementation detail and what’s really part of the spec. It forces you to know and understand an entire software system, just to figure out how a certain feature is supposed to work. It’s very easy to miss a small detail that changes how the feature works entirely. That’s what I meant by “…plus we will actively try to mislead you.”

It also means that the specification is often incomplete. The implementation only deals with the specification from one perspective. i.e. If it’s a server, it only implements a command, it doesn’t show how a client might deal with the response of the command. As with the MX interface, the Windows implementation said nothing about how Mac specific details should be implemented.

If you trying to define a specification, reference implementations can be very helpful. However, they are not a specification. Until you have a real specification, don’t claim it’s “documented in code.” Simply tell it how it is: “We have no specification.”