BookMooch will be a bit slow

June 2, 2008

[Image: snail cartoon]
I’m experimenting with some changes on BookMooch, in order to deal with some of the growth the site has experienced.

That means BookMooch might be a bit slow for the next 48 hours, while I try a few experiments, enabling and disabling various technologies.

Update on Tuesday, June 3rd: I’m done with my tests, and BookMooch should be fast again now.

An aside: I received an email from the New York Daily News today, and they asked me how many mooches there are per month.

I wrote back:
In the past 30 days, 65,263 books have been swapped. We’ve grown a lot recently: on January 1st, the preceding 30 days had yielded 38,948 swaps, so we’ve grown 67% in the five months since.


Notes for techies…

The rest of this blog entry is mainly for sysadmins, programmers, and anyone else interested in the tech details of what I’m up to…

sorting

Pretty much every page at BookMooch needs to alphabetically sort a list of books. Sometimes, as with an inventory, it’s a small list, but other times, such as a search for “science” or a look at a topic like “fiction”, there can be half a million books to sort.

I have to use a custom sort algorithm, because people want books sorted by “Author’s last name, Author’s first name, then book title”, and it can be tricky to figure out what an author’s last name is. Hence, a custom sort.

To speed up the sorting, I maintain a database table which, for each ISBN, holds the pre-built “sort key” for that book (author’s last name, etc…). I then load that up into memory.
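
Here’s a minimal sketch of that idea in Python, assuming a plain in-memory dict of pre-built keys; the function names and example data are hypothetical, not BookMooch’s actual code:

    # Build the "last name, first name, title" key once, when a book is added,
    # so that page requests never have to parse author names at sort time.
    def make_sort_key(author_last, author_first, title):
        return "\t".join(part.lower() for part in (author_last, author_first, title))

    # sort_keys: ISBN -> pre-built sort key, loaded from the database table into memory
    def sort_books(isbns, sort_keys):
        # Books with no known sort key go to the end ("~" sorts after letters).
        return sorted(isbns, key=lambda isbn: sort_keys.get(isbn, "~"))

    sort_keys = {"0345391802": make_sort_key("Adams", "Douglas", "The Hitchhiker's Guide to the Galaxy")}
    print(sort_books(["0345391802"], sort_keys))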

Nonetheless, sorting a half-million books just for one user’s page request is pretty time-consuming.

So, what I do is save the sorted results in a disk cache, so that if the same unsorted list of books is requested again, presto! it’s already there, all sorted.

However, if that list changes by even one book (which happens frequently for a search like “science”), then the cached, sorted list isn’t of any help.

Also, that’s a lot of writing to the disk, especially if a lot of those cached results aren’t useful.

This “sort cache” gets pretty big, about 10GB in a week’s time. Big enough that using the cache isn’t nearly as fast after a few weeks as when we started, which is why Sundays (when we back up) get slower after a few weeks.

Tomorrow, I will try a relatively small (100MB) in-memory sort cache, to see what its performance characteristics are. I’ll use an MRU cache algorithm, which should speed up common searches and “next/previous” page requests, without causing disk writes or requiring old cache entries to be deleted from disk.
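
Here’s a minimal sketch, in Python, of a small recency-based in-memory sort cache along those lines; the 100MB budget matches the figure above, but the class and its methods are illustrative, not the actual BookMooch code:

    import hashlib
    from collections import OrderedDict

    class SortCache:
        """Size-bounded in-memory cache of sorted results; keeps the most
        recently used entries and evicts the stalest ones first."""

        def __init__(self, max_bytes=100 * 1024 * 1024):    # ~100MB budget
            self.max_bytes = max_bytes
            self.used = 0
            self.entries = OrderedDict()                     # cache key -> sorted ISBN list

        @staticmethod
        def key_for(isbns):
            # Hash the unsorted list, so the same request maps to the same entry.
            return hashlib.sha1("\n".join(isbns).encode()).hexdigest()

        def get(self, key):
            if key not in self.entries:
                return None
            self.entries.move_to_end(key)                    # mark as recently used
            return self.entries[key]

        def put(self, key, sorted_isbns):
            self.entries[key] = sorted_isbns
            self.used += sum(len(i) for i in sorted_isbns)
            while self.used > self.max_bytes and len(self.entries) > 1:
                _, evicted = self.entries.popitem(last=False)    # drop the stalest entry
                self.used -= sum(len(i) for i in evicted)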

slow drive reads

Web site usage of BookMooch is currently at 400,000 page requests per day, of which 75,000 are searches. This is a 50% increase since the new year, and I have no reason to think the growth will stop.

The BookMooch database, at 14GB, is getting a bit large to hold in memory, and I foresee a time, soon, when most of the database won’t fit in memory. Putting the database in memory is what I’ve been doing thus far to increase performance, and that technique has worked well.

Unfortunately, disk read speeds are pathetic, at around 1200KB/second (peaking at 6MB/sec). They’re this slow despite my using the newest, highest-speed server-class hard drives. I believe they’re this slow because the database mostly does small random-access reads, which are notoriously inefficient.

If I restart BookMooch, I generally pre-load the entire database into memory before letting the public at the site. However, this now takes almost an hour, which is a long time to be down. Today, I just put the site up without preloading the database into memory, so that it runs slowly, but it’s available right away, and the database loads into memory over the course of a few hours.

I’m thinking that a possible solution to my performance problems is to use a solid-state drive (SSD). Random-access read times on SSDs are amazingly fast, but random writes are not so fast.

My experiment today was to turn off the on-disk sort cache, in order to lower the number of disk-write operations I need to perform. That’s why BookMooch is running slowly right now. I want to see how much of a performance hit this causes once the memory cache is filled.

I’m looking at the drives from Mtron, and this SSD in particular. This blog by the CTO of Spinn3r has many postings about SSDs in server environments, and the results are encouraging.

Over the next few weeks, I’m going to try tweaking some things, and will likely switch to an SSD as the main BookMooch drive.

Update the next day: disabling the disk cache caused too big a performance hit. So, I tried using a 1GB in-memory cache instead of the on-disk cache, and that seems to work well, so that’s what I’ll stick with for now. Speed is good, though the server load is higher than usual; this may be because I’m now recalculating “recommended books” for each book as people visit them, whereas previously the recommendations never changed. I won’t know for sure for a few days. The web site feels very responsive, though.

20 Responses to “BookMooch will be a bit slow”

  1. Mason said

    Why not just bump up the RAM in your host? The filesystem cache can help out quite a bit there (as long as you remember to leave the RAM free), or you could even use a RAM disk. Probably a cheaper option than SSD.

  2. I already have 24GB of RAM on the BookMooch machine (it is running 64-bit Linux), and the motherboard will max out at 32GB. I will at some point swap the machine out for one that can take more RAM, maybe 64GB, as RAM prices go down.

    However, there are some other issues with adding more RAM:

    1) hard drive read/write times are still an issue. Reading is slow, so getting data into RAM takes a long time (about an hour currently), and

    2) writing to the disk is slow (which RAM doesn’t help with), and we write 36GB a day currently, which takes about 5 seconds out of every minute. Eventually, this will be a problem.

  3. Shauna said

    Thanks for letting us know.

  4. Wanda said

    Thanks for letting us know! I thought that it was my computer.

    Wanda

  5. Lisa said

    I always assume it is my computer. Thanks for the heads-up. Isn’t it great that we are growing so much? Wow, it is an awesome feeling of power that there are enough of us moochers to cause havoc in our little corner of the world.

  6. Heather19 said

    so, for a non-techie who loves learning about this stuff…

    Why is the reading/writing to the disk so slow? Is it something that can be helped with more RAM, or do you need more of something else to make it faster?

  7. Jeff F said

    Heather – disks are inherently limited because they’re mechanical. RAM, both in a computer as memory and on flash thumb-drives or digital camera memory cards (or, say, solid state drives…) is largely electronic in nature: it’s a bunch of electricity floating around in some silicon. Very fast!

    Traditional hard drives, however, are basically a short stack of really, really dense phonograph records, and are read in the same way an old-fashioned record (sorry for anyone over the age of 30 reading this) is, with a read head on the end of a little arm that sticks out over the spinning disc.

    So hard drives are limited by how fast the discs can spin (the faster they spin, the quicker the read head can get to the part of the disk it needs info from) as well as how many discs you can pack into one drive (since the more discs you have in a stack, the more info you can get out of it at one time, there being one read head per disc platter).

    So, if you compare a mechanical device spinning some metal discs around and around, versus something that approaches the speed of light (electricity) it should be obvious who the winner will always be 🙂

    Long story short, SSDs are starting to come into fashion because we really can’t make normal hard drives any faster, and normal hard drives are extremely slow compared to memory-based stuff like system RAM (which is what BookMooch “runs off of” right now) or SSDs (which is what John wants to move to).

    Also, new Moocher here, so thanks heaps to John for not only the hard work but the technical chops to run a site like this *and* the communications skill it takes to blog about it in such an open manner! 🙂

  8. victor said

    Just FYI, the thing that killed it for me yesterday was the search misses – trying to add 10 ISBNs in bulk add, in which 3 ISBNs were wrong or not there, pretty much made the feature unusable – not sure if that’s related to anything here.

  9. Kath said

    I/O operations on physical hard drives are inherently slow due to the physical constraints of how fast the platters can spin and how fast the heads can read data from the platters. As John said, a lot of the data searched for is placed randomly all over the disk, which takes longer than retrieving contiguous data from a hard drive. Solutions to this are to store the data in a way that most queries will be contiguous (not feasible due to the many different search parameters; if searches were always by author, for example, this could be done) or to cache the data in something more quickly accessed, like RAM. However, whatever you cache leaves you with less working RAM. Also, switching from hard drive to flash memory would solve the I/O speed problem, but is cost prohibitive.

  10. Elizabeth ("fullmoonblue") said

    Just had to say… the snail cartoon is very cute. It reminded me of a New Yorker cartoon I love. Here’s a small copy:

    And a funny take-off:

    Have fun with the techie stuff!

  11. John, maybe you’ve already thought about this, but have you considered a RAID disk array? Spreading the data across multiple disks would involve a greater hardware cost, depending upon how many disks you used, although they’re pretty cheap, relatively speaking, these days. Using RAID5, for instance, would give not just improved data throughput but greater data security as well. With RAID5 you can lose a whole disk and all the data will still be accessible. There’s a useful article on RAID on Wikipedia if you want more information. What database are you using?

  12. re: RAID5

    I’ve used RAID in the past and have always had data loss issues, as have other sysadmins I know. Google doesn’t use RAID, and neither does the Internet Archive; I know the Internet Archive tried it, but had RAID reliability issues as well.

    Also, RAID5 is faster, but not 300x faster, as a solid-state drive is versus a traditional fixed-disk drive.

  13. Ryan said

    OK, I love that you’re sharing this info. I have some advice, which you probably already know, but I’m going to share anyway.

    As BookMooch grows, adding more RAM and switching to Solid State Drives are really just temporary solutions that won’t scale very well, or for very long. If you put these in place, before long you’ll be right back in the same position, but eventually you won’t be able to add more RAM or more SSDs. You need a solid architecture redesign.

    You’re getting into the area of needing to think about database sharding and whatnot. Also, memcached is a wonderful tool to aid in caching queries and dynamic pages. There are probably hundreds of areas where a little bit of cache will help speed things up, but as far as the database goes, you’re definitely going to need to start considering some type of partitioning and clustering.

  14. re: sharding & partitioning

    Yes, I know that I will eventually need to use those techniques, and the BookMooch architecture already supports them.

    If you want to read about BerkeleyDB’s replication technology, take a look at http://www.oracle.com/technology/products/berkeley-db/feature-sets.html — it is replicated one-writer, multiple-reader, and will scale well as long as data-writing is kept under control. What I’m currently experimenting with is the effect of not using disk-based caching on BookMooch performance, since writes are not only slow but also perform poorly in a replicated environment.

    This replicated BerkeleyDB is currently in use by Google for their “single sign-on” feature.

    I do know all about memcached, and its persistent variant, called memcachedb, actually uses BerkeleyDB as its engine: http://memcachedb.org/ (a sketch of this style of query caching is below).

    So do not worry, BookMooch can scale to very great heights. The issue for me now is to get the most I can out of one machine, so that costs will be minimized when I need to go to a multi-machine setup.
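
    A minimal Python sketch of the memcached-style query caching Ryan suggests above; memcachedb speaks the same protocol, so the same client code would apply. The python-memcached client, key scheme, and five-minute expiry are illustrative assumptions, not the actual BookMooch setup:

        import hashlib
        import memcache   # python-memcached client

        mc = memcache.Client(["127.0.0.1:11211"], debug=0)

        def cached_search(query, run_search, expiry=300):
            # Hash the query so the cache key contains no spaces or odd characters.
            key = "search:" + hashlib.md5(query.encode()).hexdigest()
            results = mc.get(key)
            if results is None:
                results = run_search(query)    # the expensive part
                mc.set(key, results, expiry)   # cache for five minutes
            return results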

  15. I really love that you’re sharing this technical info. Obviously it provides knowledgeable people the chance to chip in and try to help, but for a moderately technical person like myself, it’s a great learning experience, and, of course, it promotes transparency, which should reduce complaints.

  16. kirsty said

    You can tell the sort of programmer I am because my first thought on your sorting problem was: does it really need sorting?

    For inventories and small lists, yes, I want to see an author’s titles grouped together. If someone searches for a general term, though, are they really after an alphabetised list?

    Alphabetical order is useful for finding a specific thing in a list, but if you want to find a specific thing among that many items then you can search for it using a specific term. A list of half a million items isn’t what a human wants to read even if it’s in order.

  17. re: “does it really need sorting?”

    You make some very good points.

    It’s true that nobody who searches for “science” wants to look through 500,000 books with that word in their title or topic.

    If a search returns only 20 books, sorted results are nice.

    However, I think that search results might be more interesting sorted by “relevancy”, which probably means moving the most-popular books to the top of the list, either by the “number of times mooched” or by “Amazon sales ranking”.

    This kind of sort would probably be more helpful for any list of books that goes on for more than 2 pages.
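
    A minimal Python sketch of that kind of relevancy sort, with hypothetical field names: most-mooched books first, alphabetical key as the tie-breaker.

        # books: list of dicts with "times_mooched" and "sort_key" fields
        def sort_by_relevancy(books):
            return sorted(books, key=lambda b: (-b["times_mooched"], b["sort_key"]))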

  18. Neil said

    It would seem to me this is sort of a ‘solved problem’, in that various places do this type of sorting (libraries, Amazon, etc.). Could you not offload it to Amazon somehow? Perhaps use their cloud computing service?

  19. I actually think that the “does it really need sorting?” question points in an interesting direction: how much would using LIMIT to cut down the results returned improve the situation? I suppose a corollary to this is how much of the server resources are being used for queries which are too large to be digestible by a human?

  20. David wrote:
    how much would using LIMIT to cut down the results returned improve the situation

    Unfortunately, I need to get all the results in order to find out which books are available, since the default search results page shows only available books, while the search index has ALL books, not just available ones. So I need to get all the results anyway (a quick sketch of this is below).

    At any rate, this problem seems to be solved for now: a 1GB in-memory cache has had a great performance impact, and I’ve been running it for a few days; of the various techniques I tried, that one worked best.
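
    A small Python sketch of the point above: availability has to be checked across the whole result set before a page can be sliced off, so a LIMIT on the search itself wouldn’t help (the names here are illustrative):

        def page_of_available(all_isbns, is_available, sort_keys, page=1, per_page=50):
            available = [i for i in all_isbns if is_available(i)]    # must scan every result
            available.sort(key=lambda i: sort_keys.get(i, "~"))
            start = (page - 1) * per_page
            return available[start:start + per_page]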
