BookMooch was down 12h

July 30, 2009

Sorry about that! BookMooch was unavailable for the past 12 hours. The computer it runs on ran out of memory, so I had to restart it.

I have a new server with 3 times the memory ready to go; however, I am currently on vacation (camping in Europe), so I don't want to switch to the new machine until I get back.

BM admin Mark has my mobile telephone number and texted me with the problem, and I worked on it when I woke up. There was no data loss – the machine just needed to be restarted, that’s all.

-john (in a sleazy netcafe outside Frankfurt, Germany)


Literary agent Peter Cox interviewed me for his publishing-industry-news radio show. Peter is a great interviewer, making me feel relaxed and asking just a few pointed questions that steer the conversation and make it move naturally to interesting topics.

BookMooch was down

July 1, 2009

BookMooch was unavailable for 29 hours, going down yesterday at 3am and coming back up this morning around 8am (Pacific time).

My main data drive crashed. Fortunately, there doesn't appear to be any data loss, and BookMooch is now running with a new drive. However, if you think there's a problem with your account, please email the support volunteers and they'll sort you out.

What went wrong

BookMooch is essentially one big database application, and the database server is where most of the work goes in running the site.

I used to have speed problems with the database, but in the fall of 2008 I replaced the hard disk with a solid-state drive, using what was the best at the time: a server-class drive from Mtron that was widely reviewed as very reliable, and benchmarked at about 5X faster than the fastest hard disks. Since the switch, BookMooch has largely been fast, and the increased speed has allowed me to add some database-intense new features, such as sorting the Topic pages so that popular books are shown first, which makes those pages much more useful.

Unfortunately, that Mtron solid-state drive was from the first generation of these server-class all-memory drives, and it's what failed yesterday. It's supposed to have all sorts of fancy features to prevent data loss, and while the drive did become non-writeable, it appears that there wasn't any data loss (that was $800 well spent!).

In BookMooch’s first year, there were some (ahem) technical challenges, and so I had written programs to audit the database for corruption. I ran those same tests this morning:

1) check every user and book to make sure it is not corrupt

2) check every user and book in the most recent backup and make sure it is in the current database file, and if not, copy it (all were there)

3) check the references between users and books, to make sure that if a book says "on these wishlists:" or "available from:", all those people really are listed. My "auditor program" found a few hundred books that were missing the links back from books->users, but these are probably not due to database corruption; rather, they're from old bugs a while ago. Those references have now been fixed, so some books may show up as available where they didn't before because of the missing reference, but again, I think this is limited to maybe a few hundred books.
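To give a flavor of what step 3 does, here's a minimal sketch in Python, with a hypothetical in-memory data model (the real BookMooch database format isn't documented here): a cross-reference audit that finds and repairs missing book->user back-links.

```python
# Hypothetical in-memory stand-ins for the real BookMooch tables
# (the actual database format is not documented here).
users = {
    "alice": {"wishlist": {"b1", "b2"}},
    "bob": {"wishlist": {"b2"}},
}
books = {
    "b1": {"wishlisted_by": {"alice"}},
    "b2": {"wishlisted_by": {"alice"}},  # bob's back-link is missing
}

def audit_and_fix(users, books):
    """Check user->book references both ways; repair and count missing back-links."""
    fixed = 0
    for user_id, user in users.items():
        for book_id in user["wishlist"]:
            refs = books[book_id]["wishlisted_by"]
            if user_id not in refs:   # book doesn't link back to this user
                refs.add(user_id)     # restore the missing reference
                fixed += 1
    return fixed

print(audit_and_fix(users, books))  # -> 1
```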

What about backups and fail-over?

In February I asked people to volunteer a little money to help BookMooch buy a new server. The really good news is that we've raised $15,449 in donations, and received another $6212 from Amazon as commissions on book sales from moochers. That puts BookMooch on a really solid financial footing, so I can afford to buy the hardware that is needed, and the occasional contracted-out software job, like the soon-to-be-released BookMooch iPhone app. Also FYI, I had put $31,900 of my own money into running BookMooch for 2 1/2 years, at which point your donations started covering the costs.

The money you’ve given me allowed me to buy a new server to run BookMooch on.

Currently, BookMooch runs on a server which has:

* 8 CPUs
* 24 gigs of RAM
* one 32 gig Mtron solid-state drive
* one normal drive for backups

The new server has:

* 16 CPUs
* 64 gigs of RAM
* two 60 gig Intel solid-state drives
* one normal drive for backups

I had also purchased one extra 60 gig Intel solid state drive as a spare drive for the new server. That drive is what is now in the current BookMooch server. It was really good to have a spare around!

[Photos of the new server]

Some people have expressed surprise that a server can cost $10,000. I have pictures of the receipts for the machine, and the price breakdown is:

* computer with 16 cpus: $6428
* 64 gigs of memory: $2087
* 3 solid state drives: $2691
* total = $11,206

Moving to “the cloud” doesn’t change the price of these components: you just end up paying for the same thing in monthly fees rather than up front.

I bought the new server about a month ago; unfortunately, I'm having reliability problems with it (it crashes under heavy load), so I'm going back to the manufacturer (Super Micro) to get that fixed.

The new server will run the two drives in a RAID1 array, so that if one drive dies, the other has a complete, up-to-the-second copy of the same data. That *should* prevent the kind of problem that just occurred.
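As a toy illustration of what RAID1 buys (plain Python dicts, not real RAID code): every write goes to both member drives, so either one alone still holds a complete copy.

```python
# Toy model of RAID1 mirroring: two dicts stand in for the two drives.
drive_a, drive_b = {}, {}

def mirrored_write(block, data):
    for drive in (drive_a, drive_b):   # RAID1: identical write to each member
        drive[block] = data

mirrored_write(0, b"database page")
drive_a.clear()                        # simulate one drive dying completely
print(drive_b[0])                      # -> b'database page' (mirror survives)
```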

The new drives are also much larger, at 60 gigs each. Since the BookMooch database is 24 gigs in size, I'll be able to keep a live copy of the database on the solid-state drive itself. This is important because making a backup to a normal drive is slooooowww and causes "locking" problems with the database; BookMooch is heavily used internationally and doesn't have a "slow period of the day". Faster backups mean more frequent backups.
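A sketch of the two-stage backup this makes possible (Python, with temporary files standing in for the real, undocumented paths): hold the database lock only for the fast same-SSD copy, then push the snapshot to the slow backup disk with no lock held.

```python
import os, shutil, tempfile, threading

db_lock = threading.Lock()   # stand-in for the database's write lock

def backup(db_path, ssd_snapshot, slow_backup):
    """Two-stage backup: the lock is held only for the fast same-SSD copy."""
    with db_lock:                               # writers blocked only here
        shutil.copyfile(db_path, ssd_snapshot)  # fast: SSD-to-SSD copy
    shutil.copyfile(ssd_snapshot, slow_backup)  # slow copy, but no lock held

# demo with temporary files standing in for the real paths
tmp = tempfile.mkdtemp()
db = os.path.join(tmp, "bookmooch.db")
with open(db, "wb") as f:
    f.write(b"all the books")
backup(db, os.path.join(tmp, "snapshot.db"), os.path.join(tmp, "backup.db"))
```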

My plan is to keep the old server around as a fail-over, in case the new server itself dies.

Why aren't you using S3 / Google Apps / the Cloud to host BookMooch?

BookMooch is a very database-intensive application: it handles 400,000 queries *per second* on a regular basis. The benchmarks I've seen put Google Apps at about 700 queries per second, and S3 at several thousand queries per second.

The best parallel I can give you is Amazon: the BookMooch database is on a scale similar to Amazon's. Naturally, they have far more traffic than BookMooch, but the BM database is quite large and the operations it performs are quite complicated.

So yes, I could move BookMooch to a service, but it would require lots and lots of machines, as well as much more complexity because there would then be many machines running BookMooch, with some of them having hardware failures of their own.

Also, and this is definitely a major issue, all the “cloud” vendors charge per-database-transaction, and so it’s likely that a very-database-intensive application like BookMooch could be very expensive to host in a cloud service.

My background is in high performance computing, so I guess you'll just have to trust me on this one. If you read the academic computing literature in this field, you'll find that very large memory-based caches are standard. For instance, Facebook runs memcached servers achieving 300,000 queries per second, with 32 to 64 gigs of memory per cache server. Generally, when your application is mostly database-bound, you want lots of memory, and that's a much cheaper way to go (up to a limit of about 64 gigs) than ganging together lots of cheap machines with not much memory.
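The memory-cache pattern mentioned above can be shown with a toy example (Python, with a plain dict standing in for a memcached server and a counter standing in for slow disk reads): check the cache first, and fall through to the database only on a miss.

```python
cache = {}      # stand-in for a memcached server
db_reads = 0    # counts how often we hit the (slow) database

def db_fetch(key):
    """Stand-in for a slow disk-backed database read."""
    global db_reads
    db_reads += 1
    return "row-for-" + key

def cached_fetch(key):
    if key not in cache:            # cache miss: go to the database once
        cache[key] = db_fetch(key)
    return cache[key]               # cache hit: served from memory

for _ in range(1000):
    cached_fetch("popular-book")
print(db_reads)  # -> 1 (999 of 1000 reads came from memory)
```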

BookMooch down

July 1, 2009

The main hard drive crashed at 3am, and I'm working on the problem now.

I don’t yet know when it will be back up. I can’t copy the data off the drive at the moment, so I’m going down to the machine room to take the drive out, swap it for a spare, and see if I can read the data off from home.

BookMooch may be down a few days because of this problem. I'd rather try extra hard to get all the data off, so there isn't any data loss since the previous backup, as I know that causes a lot of hardship for people.

Update at 5pm:

I have managed to copy the data off the drive, but one "drive block" had "gone bad". I have a program I wrote two years ago that will check each row in the database to make sure it's ok, and if it isn't, will use the backup to restore it. That might take a long time (hours and hours), and I will be up overnight watching it, in the hope of getting things back up tomorrow morning.
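The check-and-restore pass could look something like this sketch (Python, with a hypothetical record format: each row stored alongside a CRC32 checksum; the real program's format isn't documented here):

```python
import zlib

def crc(payload: bytes) -> int:
    """Checksum a row's payload so corruption can be detected."""
    return zlib.crc32(payload)

def repair(current, backup):
    """Each table maps key -> (stored_crc, payload); returns rows repaired."""
    repaired = 0
    for key, (stored, payload) in current.items():
        if crc(payload) != stored:      # row damaged (e.g. by the bad block)
            current[key] = backup[key]  # restore that row from the backup
            repaired += 1
    return repaired

good = b"book record"
backup_rows = {"b1": (crc(good), good)}
damaged = {"b1": (crc(good), b"garbled!")}  # payload no longer matches its CRC
print(repair(damaged, backup_rows))  # -> 1
```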

Update at 7am:

BookMooch is now back up, and there doesn’t seem to be any data loss. I’m going to write a separate blog entry about it.