Archive for the 'Macintosh' Category

Core Data and Multi-threading

I’ve been wanting to write this article since Tuesday, but I’ve been distracted by my day job. One of our clients is getting close to shipping so I have to put in more hours than usual. It’s nowhere near as fun as working on Wombat, but it pays the bills.

Anyway, back when I wrote a couple of weeks ago about Wombat, I mentioned the trouble I was having with Core Data and multiple threads. Basically, I was finding that the entire context (NSManagedObjectContext) had to be locked anytime a thread touched the context or any one of its managed objects (NSManagedObject). That included even reading attributes on an NSManagedObject, not just mutating them.
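To make that concrete, here’s a minimal sketch of the lock-everything approach (the method and variable names are illustrative, not from Wombat itself). NSManagedObjectContext conforms to NSLocking, so even a simple attribute read means locking the whole context:

```objc
// Sketch of the lock-everything approach with one shared context.
// Even a read of a single attribute requires locking the context.
- (NSString *)subjectOfArticle:(NSManagedObject *)article
{
    [sharedContext lock];
    NSString *subject = [[[article valueForKey:@"subject"] copy] autorelease];
    [sharedContext unlock];
    return subject;
}
```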

Apparently I wasn’t the only one who figured this out. Florian Zschocke, creator of Xnntp, told me that he was running into the same problem of having to lock the entire context each time he touched anything. He was also wondering if there was a better way.

The obvious problem with locking every time is that it defeats the concurrency of threads. The threads end up being serialized anytime they touch the data store. This is pretty troublesome for Wombat, because it’s an NNTP server. Most of its time is spent doing I/O — either reading/writing to the data store or reading/writing to sockets. Accessing the data store is already a potential performance hotspot, and the serialization of threads makes it even worse.

Fortunately there’s a better way. Blake Seely left a comment on my previous post, letting me know that the appropriate way to handle multiple threads is to have a separate context for each thread. About this time I also found some Apple documentation pertaining to Core Data and multiple threads, which echoed Blake’s comments. This is as simple as allocating an NSManagedObjectContext each time a thread is spawned and handing it the solitary NSPersistentStoreCoordinator.
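The per-thread setup looks something like this (the method name and `coordinator` variable are illustrative; `coordinator` stands for the app-wide NSPersistentStoreCoordinator):

```objc
// Each thread gets its own context, all sharing the single coordinator.
- (void)handleClientOnNewThread:(id)connection
{
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

    NSManagedObjectContext *context = [[NSManagedObjectContext alloc] init];
    [context setPersistentStoreCoordinator:coordinator];

    // ... service the client using this thread's private context ...

    [context release];
    [pool release];
}
```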

The one gotcha is that NSManagedObjects from one context cannot be used in another context. If you want to send an object from one thread to another, you have to pass the NSManagedObjectID around. You obtain it with [object objectID] on one thread, then call [context objectWithID:objectID] on the other thread to get the corresponding object in that context. However, this only works for objects that have been saved. In general, Wombat isn’t going to have to worry about passing objects between threads. That’s because each client is pretty isolated and has no reason to talk directly to another client.
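In code, the handoff between threads reduces to a couple of lines (variable names are illustrative):

```objc
// Thread A: grab an ID that is safe to pass between threads.
// The object must have been saved, or the ID will still be temporary.
NSManagedObjectID *articleID = [article objectID];

// Thread B: rehydrate the object in this thread's own context.
NSManagedObject *localArticle = [otherContext objectWithID:articleID];
```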

That said, having multiple contexts has some implications for Wombat. Currently each client gets its own thread, and thus its own object context. In the future this will probably change, and clients will be pooled together on a few threads that handle multiple clients using something like kqueue to multiplex sockets. The catch is that no client saves its changes until the remote client closes the connection. Currently that means if Wombat has two clients, and Client A posts an article, Client B will not see that article until Client A quits. For performance reasons, NNTP clients often leave the connection open for a specified amount of time after they’ve done their work.

The behavior is acceptable, but it gets a bit weirder when clients start getting multiplexed by a single thread. In that scenario, Client B might see the article immediately if it’s in the same pool, or it might have to wait until Client A quits. In other words, some clients will see articles sooner than other clients. Once again, it’s acceptable behavior, but it’s a little odd.

Meanwhile, I’ve been reading Another Day in the Code Mines, which has a lot to say about threading. One of the thoughts that I came away with is that forking processes in Wombat would probably be better than spawning threads. That is, for each client that connects, instead of spawning a thread for it, spawn a process to handle it. Processes are heavier weight, but they provide a couple of advantages. First, they provide separate memory spaces for each client, so one client can’t mess with another. Second, if one client crashes, it doesn’t take down the entire server, thus making Wombat more robust. Forking for each client also happens to be a classic NNTP server design, and for good reason.

Unfortunately, as far as I can tell, Core Data doesn’t support this. Multiple contexts can exist because they all share one NSPersistentStoreCoordinator, which serializes all I/O to the data store file. Since SQLite often updates just parts of the file at a time, I can’t imagine that it would allow multiple processes to have the data store file open at once, especially for writing. The only way I see around this is to make the data store its own server. Unfortunately, this reintroduces the single point of failure (if it goes down, all clients go down), and since NNTP is a fairly thin protocol over news, it would just end up being something pretty close to an NNTP server itself. Not a win.

In the end, it looks as though I’m just going to give each thread its own context, and then multiplex several sockets on each thread using kqueue. It may not be as robust as forking processes, but it should be possible to get some good performance out of it.
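For reference, the per-thread multiplexing loop would be built on kqueue, roughly like this (BSD/Mac OS X only; socket setup, error handling, and the function name are illustrative):

```objc
#include <sys/event.h>

// Minimal kqueue read-multiplexing loop for one worker thread.
void serviceClients(int *clientFDs, int count)
{
    int kq = kqueue();
    struct kevent change, events[16];

    // Register each client socket for read events.
    for (int i = 0; i < count; i++) {
        EV_SET(&change, clientFDs[i], EVFILT_READ, EV_ADD, 0, 0, NULL);
        kevent(kq, &change, 1, NULL, 0, NULL);
    }

    for (;;) {
        // Block until one or more sockets have data to read.
        int n = kevent(kq, NULL, 0, events, 16, NULL);
        for (int i = 0; i < n; i++) {
            int fd = (int)events[i].ident;
            // read the next NNTP command from fd and respond...
        }
    }
}
```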

“Documented in code” is worse than not documented at all

I’m still working through RFC 2980 while implementing my NNTP server, Wombat. I’ve gotten to the point where I’m implementing the LIST SUBSCRIPTIONS command. The problem is the documentation in the RFC on this command is a bit light:

This command is used to get a default subscription list for new users of this server. The order of groups is significant.

When this list is available, it is preceded by the 215 response and followed by a period on a line by itself. When this list is not available, the server returns a 503 response code.

There are a couple of missing pieces of information here. First, the comment “The order of groups is significant” worried me. Why is the order significant? The answer to that question would affect how I implemented the command. Does the order need to match the order returned by the LIST command, which lists all possible groups? Does it need to be in hierarchical order (i.e. alt.startrek comes before alt.startrek.deepspace9)? Does the client assume any new default groups will be added at the end, so it can easily pick them up?

I searched the ‘net for an answer, but couldn’t find anything. I had INN installed, so I went searching through its code and man pages. That’s when I learned that order was sometimes important because it would be presented to the user in that order. Therefore, more important groups for new users should be towards the top. That’s the only reason order is “significant.”

The second missing piece of information I didn’t even notice until I started trawling through INN code. I had assumed the returned list of groups would be formatted the same as LIST and LIST ACTIVE. That is, it would be the group name followed by the start and end article numbers, and whether posting is allowed. However, I discovered by looking at INN that it was simply a list of group names, each one on its own line. The format of the list isn’t mentioned anywhere in RFC 2980.
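So, going by INN’s behavior, the exchange ends up looking something like this (the group names and the text after the 215 code are illustrative):

```
C: LIST SUBSCRIPTIONS
S: 215 Default subscription list follows
S: news.announce.newusers
S: news.newusers.questions
S: alt.startrek
S: .
```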

This is what the RFC sometimes refers to as “documented in code.” This term is very misleading. What it really means is “this isn’t documented at all, plus we will actively try to mislead you.”

I’ve had to deal with “documented in code” before. Back when I was working for Macromedia (now Adobe), I implemented the MX interface on the Mac. The MX interface was cross product branding, intended to make the panels used in all the different products (Dreamweaver, Flash, Fireworks, and FreeHand) look and feel identical. The Windows side had been implemented first. When I asked for a specification to implement, I was promptly told to just make it work like Windows.

I wanted to throttle the person who told me this. This caused several problems.

First, I spent a lot more time implementing this feature than I should have. That’s because I had to dig through the Windows code, and test the Windows code, just to figure out what I was supposed to be implementing. I was further hindered by the fact that the Windows code was sample code, not shipping code. That is, not only was it Windows code, it was poorly written Windows code. Contrast that with the effort needed to read and understand a well-written specification. By the way, I was assured we “didn’t have time” to write a specification for the feature.

Second, the specification was continually changing. I wasn’t told about any of this, nor was the Windows sample code updated as frequently as it should have been. The designer would simply come into my office occasionally and tell me something didn’t work right. I have no idea how the QA people figured out what was a bug and what wasn’t. I certainly didn’t know. As a result, I managed to faithfully reimplement on the Mac bugs that were in the Windows sample code.

Third, there were some Windows idioms in the sample code that didn’t translate into Mac idioms. I had to question the designers about them, and often got a wishy-washy answer or no answer at all (“we’ll get back to you”).

Trying to figure out a specification by looking at an implementation is nigh impossible. It’s difficult to determine what’s an implementation detail and what’s really part of the spec. It forces you to know and understand an entire software system, just to figure out how a certain feature is supposed to work. It’s very easy to miss a small detail that changes how the feature works entirely. That’s what I meant by “…plus we will actively try to mislead you.”

It also means that the specification is often incomplete. The implementation only deals with the specification from one perspective. For example, if it’s a server, it only implements a command; it doesn’t show how a client might deal with the command’s response. As with the MX interface, the Windows implementation said nothing about how Mac-specific details should be implemented.

If you’re trying to define a specification, reference implementations can be very helpful. However, they are not a specification. Until you have a real specification, don’t claim it’s “documented in code.” Simply tell it like it is: “We have no specification.”

NSPredicate and regular expressions

I’ve managed to figure out how to implement wildmat patterns inside of Wombat. It turns out NSPredicate does in fact support regular expressions, as documented in Using Predicates. You simply use the MATCHES operator to specify a regular expression, as shown in a couple of examples in Apple’s documentation. It’s implemented using ICU’s Regular Expressions package, which provides much better documentation than Apple does.
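Here’s a small sketch of the MATCHES operator in isolation (the key name and pattern are illustrative); the pattern uses ICU regex syntax, and MATCHES requires the whole value to match:

```objc
// Regex match via NSPredicate's MATCHES operator.
NSPredicate *predicate =
    [NSPredicate predicateWithFormat:@"name MATCHES %@", @"comp\\.sys\\..*"];

BOOL matched = [predicate evaluateWithObject:
    [NSDictionary dictionaryWithObject:@"comp.sys.mac.hardware"
                                forKey:@"name"]];
// matched is YES: the entire string matches the pattern.
```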

The gotcha to note is that the MATCHES operator does not compile down to SQL, so you can’t give it directly to Core Data. Instead you have to fetch all the entities and then post-process the array using NSArray’s filteredArrayUsingPredicate: method. It works, but it’s not as efficient as it would be if it were compiled down to SQL.
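The fetch-then-filter pattern looks roughly like this (the “Group” entity name and `regexPattern` variable are illustrative, not Wombat’s actual model):

```objc
// MATCHES can't be pushed down to SQL, so fetch everything
// for the entity and filter in memory.
NSFetchRequest *request = [[[NSFetchRequest alloc] init] autorelease];
[request setEntity:[NSEntityDescription entityForName:@"Group"
                               inManagedObjectContext:context]];

NSError *error = nil;
NSArray *allGroups = [context executeFetchRequest:request error:&error];

NSPredicate *wildmat =
    [NSPredicate predicateWithFormat:@"name MATCHES %@", regexPattern];
NSArray *matching = [allGroups filteredArrayUsingPredicate:wildmat];
```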

In related news, Mike Zornek was kind enough to point out a small Core Data mailing list. Hopefully it will be a source of useful information in the future.