Big BM back-end changes

November 8, 2007

A new version of BookMooch went up today, with some really big back-end changes.

Despite my testing, it’s only me doing all the work at BookMooch, so it’s likely there will be some bugs. Post new bug reports here (as comments) and I’ll get to them quickly.

Here is what is new in today’s version:

– an option to opt out of related-editions emails is now available, either as a click-through URL on the emails themselves, or as an enable/disable setting on your member profile page.

– an API for “pending actions” has been added, allowing access to all the BookMooch workflow steps that are present in the “pending” tab, such as “delay”, “send”, “reject”, “received”, etc… In addition, the API docs have been overhauled.

– the search engine has been rewritten from scratch: this is explained in greater detail below.

  • I also fixed a big bug I found: previously, if you searched for two words that each matched a different part of a book’s record, such as “card game”, the book “Ender’s Game” by Orson Scott Card would *not* be displayed, because no book was found where both words (“card game”) appeared in the same section (ie, both in the title, or both in the author name). This was probably the biggest reason people had trouble with the search results, and it is now fixed (there’s a small sketch of the fix just after this list).
  • I’ve also made it so that if you search for a username (ie, “johnbuckman”) the books in that user’s inventory are displayed in the search results (but only if no books were found with that word in them, so if someone’s username is “john” then the user’s books won’t be displayed, only the search hits for “john” will be).
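To make the “card game” fix concrete, here is a minimal sketch (in Python, with made-up toy data and a placeholder ASIN, not BookMooch’s actual code) of the difference between requiring both words to match in the same section and letting each word match any section:

    # word -> section -> set of ASINs (toy data; "asin_ender" is a placeholder)
    index = {
        "game": {"Title": {"asin_ender"}},   # "Ender's Game" has "game" in its title
        "card": {"Author": {"asin_ender"}},  # ...and "card" in its author name
    }

    def search_same_section(words):
        # old behaviour: every word must match in the SAME section
        hits = set()
        for section in ("Title", "Author", "Topic", "Publisher"):
            per_word = [index.get(w, {}).get(section, set()) for w in words]
            hits |= set.intersection(*per_word)
        return hits

    def search_any_section(words):
        # new behaviour: each word may match in ANY section
        per_word = []
        for w in words:
            sections = index.get(w, {})
            per_word.append(set().union(*sections.values()) if sections else set())
        return set.intersection(*per_word)

    print(search_same_section(["card", "game"]))  # set() -- the old, broken result
    print(search_any_section(["card", "game"]))   # {'asin_ender'} -- the fix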

Geeky details about the search engine rewrite

I’ve been concerned that “writing changes to the hard disk” would be the next limit that BookMooch would confront.

Let me explain: I currently have disk writes cached, writing them out to disk every 60 seconds. What concerns me is that my web server is regularly alerting me that the disk writes are taking a long time:

[screenshot: server alert warning that a disk write took too long]

If a disk write takes 20 seconds to complete, and runs every 60 seconds, that means that every 60 seconds, I “store up” 20 seconds of changes. The worry is that if BookMooch usage goes up, I will “store up” more changes than I can write to disk in 60 seconds, and then I won’t be able to keep up. And then, BookMooch goes boom.
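As a back-of-envelope sketch of that worry (the 20-second and 60-second figures are from above; the growth rate is purely hypothetical), assuming flush time grows in proportion to usage:

    cycle = 60.0           # seconds between flushes
    flush_time = 20.0      # seconds a flush takes today
    growth_per_year = 2.0  # hypothetical: usage (and flush time) doubles yearly

    years = 0
    while flush_time < cycle:
        years += 1
        flush_time *= growth_per_year
    print(f"At {growth_per_year}x yearly growth, flushes stop fitting in the "
          f"{cycle:.0f}-second cycle after about {years} year(s).")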

I needed to find out what most of the disk writes were, and if there was a way to optimize that.

I wrote a function to log all disk writes, the table name, and the number of bytes written. I ran the logging function on BookMooch for 24h, and here is what I found (column 1 is the table name, column 2 is the number of bytes written in 24h):

[screenshot: disk-write log, one row per table, showing bytes written in 24 hours]

You can see that the last entry is “search_asin”, which is the table that is the search index for books. That one table is responsible for 80% of the disk writes on BookMooch. Clearly, I needed to change that.
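For the curious, the write-logging described above could be sketched roughly like this (the logged_put() helper and the report are illustrative, not BookMooch’s actual code):

    from collections import defaultdict

    bytes_written = defaultdict(int)  # table name -> bytes written so far

    def logged_put(table, key, value):
        # tally the size of every write, then do the real write to the table
        bytes_written[table] += len(key) + len(value)
        # ... actual database write would go here ...

    def report():
        # after 24 hours of traffic, print the tally, biggest writer last
        # (as in the log above)
        for table, nbytes in sorted(bytes_written.items(), key=lambda kv: kv[1]):
            print(f"{table}\t{nbytes}")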

Before the rewrite, each row in the search_asin table looked like this:

  • key: photography
  • value: Title {3829029888 1932209638} Topic {1932209638 1560655720 0007143729}
  • Each word that ever occurred in a Title/Topic/Author/Publisher had a row in this table, and that row listed the books that used that word, and where that word was used (ie, in the title, in the author name, etc)

    This is a very space and speed efficient way of storing this kind of information. When a user does a single word search, BookMooch just needs to do one database row fetch, and all the books matching the word are found.

    Unfortunately, this efficiency came at a high price: adding books is very inefficient.

    For example, if I add a book whose title is simply “Photography” and whose ASIN is 0330284711, I first fetch this entire row:

  • old value: Title {3829029888 1932209638} Topic {1932209638 1560655720 0007143729}
  • I add the ASIN to the Title section of the row, and write it back out:

  • new value: Title {3829029888 1932209638 0330284711} Topic {1932209638 1560655720 0007143729}
  • that’s fine for words that don’t occur often. However, words like “book” occur VERY often (over 300,000 books mention it), so every time a new book is added that uses the word “book” in its title or description, I have to read in and write back 300,000 ASINs. At 10 bytes each, that’s 3 megabytes of data that have to be written to disk.

    So that’s why search_asin is responsible for 80% of the disk writes: the old method was very inefficient at adding new books.
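    To make the cost concrete, here is a minimal sketch of the old layout (a toy in-memory dict stands in for the real BerkeleyDB table): adding one book means reading the whole row for a word, appending one ASIN, and writing the whole row back.

        store = {
            "photography": "Title {3829029888 1932209638} Topic {1932209638 1560655720 0007143729}",
        }

        def add_book_old(word, section, asin):
            row = store.get(word, "")                    # read the ENTIRE row
            marker = section + " {"
            if marker in row:
                head, rest = row.split(marker, 1)
                row = head + marker + asin + " " + rest  # insert into that section
            else:
                row = (row + " " if row else "") + marker + asin + "}"
            store[word] = row                            # write the ENTIRE row back

        add_book_old("photography", "Title", "0330284711")
        # For a word like "book" (~300,000 ASINs at ~10 bytes each), that
        # read-and-rewrite is roughly 3 MB of disk traffic per book added.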

    I’ve rewritten the search table to use a new method. Now, each row looks like this:

  • key: {k Photography} Title 1932209638
  • value: 1932209638
  • when you search for the word Photography, the database is asked to return the first row whose key is greater than “{k Photography}”; the value of that row (which is an ASIN) is saved, then all the following rows that still match “{k Photography}” are read, and all the values found are returned. Because the database stores its keys in sorted order (it’s what’s called a B-Tree, or balanced tree), this actually works.

    This greatly improves the efficiency of adding books to the database, because each word for each book in each field (title / publisher / author, etc) now gets its own row, and there is no need to read in and then write back the same data.
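    Here is a minimal sketch of the new layout (a sorted list plus bisect stands in for BerkeleyDB’s B-tree; the real table simply keeps its keys in sorted order the same way): adding a book is a handful of tiny independent inserts, and searching a word is a range scan over keys sharing the “{k word}” prefix.

        import bisect

        rows = []  # sorted list of (key, value) pairs, emulating a B-tree

        def add_book_new(word, section, asin):
            key = f"{{k {word}}} {section} {asin}"
            bisect.insort(rows, (key, asin))     # one small insert, nothing rewritten

        def search_new(word):
            prefix = f"{{k {word}}}"
            i = bisect.bisect_left(rows, (prefix, ""))    # first key >= the prefix
            asins = set()
            while i < len(rows) and rows[i][0].startswith(prefix):
                asins.add(rows[i][1])                     # collect until the prefix ends
                i += 1
            return asins

        add_book_new("photography", "Title", "1932209638")
        add_book_new("photography", "Topic", "1560655720")
        add_book_new("photography", "Title", "0330284711")
        print(search_new("photography"))  # all three ASINs, via one short range scan

    (The closing brace after the word keeps the prefix scan from bleeding into longer words.)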

    However, this change in algorithm does come at a price:

    1) the search_asin table is now about 10x larger (from 339MB to 3.0GB) because of the duplication of key data. This makes the in-memory cache less efficient. However, since most words in the index are probably never searched for, this probably doesn’t matter.

    2) searching is now slightly less efficient, since many rows need to be read, rather than just one row. However, since the database is largely in memory, and I wrote this search function in C, the difference in performance seems minimal.

    On the positive side, the last time I rebuilt the full text index on BookMooch, it took 12 days (with the old system). With the new system, it took only 5 hours on my iMac, so clearly the new method is much more efficient at adding books.


    The “browse books by topic” feature had to be rewritten too. You can see in the “database write log” above that “search_topics” is the 3rd most written-to table. It used to use an algorithm similar to the search_asin table’s, and now they both use the same (new) technique.

    However, this also came at a price, namely that the old browse topics page:

    [screenshot: old browse-topics page, with per-topic book counts]

    used to show how many books each topic had. This actually wasn’t that useful, because the number displayed was the number of books in the database with that topic, and it said nothing about whether any of those books were actually available.

    The new browse topics page just displays the top topics based on books available. I made this list by hand based on the number of books actually available for each topic, so it’s probably a more helpful list:

    [screenshot: new browse-topics page, showing only the top topics with available books]

    Unfortunately, you can no longer view the “top 1000” topics. Sorry: the new “browse topics” database table structure just doesn’t let me generate that information in any efficient way (there are 260 million rows in that table, and it takes hours to go through them).


    Noise Words

    Most full text algorithms try not to index “noise words” such as “the”, “and”, “or”, etc… The thinking is that:

  • it’s very inefficient to store and search for these common words
  • they don’t convey much information

    Instead of just guessing what the most common words are, I wrote a program to count them. Here are the top 30 most common words to appear in a book title, author, or publisher, along with how many books at BookMooch feature that word:

  • books = 392873
  • the = 390160
  • fiction = 298482
  • of = 280274
  • literature = 239599
  • science = 198132
  • history = 197775
  • and = 186755
  • nonfiction = 141293
  • children = 140089
  • reference = 138383
  • world = 128536
  • press = 128413
  • sciences = 109308
  • religion = 103774
  • social = 103568
  • to = 102888
  • in = 101739
  • inside = 97158
  • united = 96758
  • look = 96630
  • trip = 95264
  • states = 94317
  • fantasy = 92763
  • spirituality = 89509
  • health = 87466
  • classics = 86824
  • historical = 84328
  • technical = 83412
  • professional = 81118
  • You can see that some of the most common words, such as “science” and “history”, are hardly “noise”. But if you search for “the science of history” you are asking BookMooch to look through:

  • the = 390160
  • science = 198132
  • of = 280274
  • history = 197775
  • total books for this search = 1,066,341
  • Yes, that’s a million book matches for those words. Ugh. On my iMac, that search takes about 5 seconds of one CPU. That’s pretty slow.

    For now, I’ve decided that because these common words are hardly “noise” (they are actually quite meaningful), I’m leaving them in, so that, for instance, a search for “the classics” does include the word “the” as a required hit in the search results.
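    For anyone curious, the counting program mentioned above could look roughly like this sketch (the two book records here are illustrative; the real count ran over the whole BookMooch catalogue):

        import re
        from collections import Counter

        books = [
            {"title": "The Science of Getting Rich", "author": "Wallace D. Wattles", "publisher": "Example Press"},
            {"title": "A Brief History of Time",     "author": "Stephen Hawking",    "publisher": "Example Press"},
        ]

        counts = Counter()
        for book in books:
            words = set()                        # count each word once per book
            for field in ("title", "author", "publisher"):
                words |= set(re.findall(r"[a-z0-9]+", book[field].lower()))
            counts.update(words)

        for word, n in counts.most_common(30):   # the top-30 list shown above
            print(f"{word} = {n}")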


    Other database changes

    If you look back to the top of this blog entry, at the list of tables that get the most writes, you’ll see that the top 3 are

  • search_asin
  • userids
  • search_topics
  • if you remove the “search_asin” and “search_topics” tables from the stats, since presumably they are much more write-efficient now, that leaves the “userids” table.

    The userids table, like most tables at BookMooch, stores most of its data in a single row, in an alternating key/value structure, like this:

  • key: johnbuckman
  • value: willsend askme country UK postal {John Buckman\n12 Somewhere Street\nWC1E 9JR London} etc..
  • this is very efficient for data reading, since all the information on a member is loaded all at once, in a single row. It’s also very disk space efficient.

    However, if there are fields in the user record which are updated frequently, this means reading in and writing back the entire user record.

    I wrote another database log function to track which fields in the userid table were being changed frequently (note, when several fields are listed below, it’s because they’re being updated simultaneously):

    [screenshot: userid-table write log, by field]

    The top two are the “wishlist” and “lastnow” fields, but there are lots of others as well. The “lastnow” field simply tracks the time you last logged in, and is a fairly minor feature, but it’s responsible for 20% of the writes to the userid table. So, I decided to move that field out to its own table, which should make updating it much more efficient. As to the other fields, well, they’re a lot more work to optimize, as they’re widely used inside BookMooch, so I’m putting off rewriting those for another day.
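    A minimal sketch of the “lastnow” change (toy dicts stand in for the BerkeleyDB tables): the login timestamp gets its own tiny table keyed by username, so the big packed user record is no longer rewritten on every login.

        import time

        userids = {"johnbuckman": "willsend askme country UK postal {...} ..."}
        lastnow = {}   # new, separate table: username -> last-login time

        def record_login_old(username):
            record = userids[username]            # read the whole packed record
            # ... update the lastnow field somewhere inside the value ...
            userids[username] = record            # write the whole record back

        def record_login_new(username):
            lastnow[username] = int(time.time())  # tiny write; record untouched

        record_login_new("johnbuckman")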


    That’s it for this blog entry! I hope some of you found this in-depth technical exposé interesting!

    23 Responses to “Big BM back-end changes”

    1. Bri said

      I don’t think I understood much of that, but thanks for all the hard work you are putting in!

    2. Mikko said

      Yes, it’s definitely interesting, as I’ve done some db programming myself. Thanks!

    3. Joe Bryant said

      The stuff about the backend is fascinating. I’m curious: why do you store things in that key/value style? Wouldn’t it be more efficient to update things if each value had its own column, rather than them all being smooshed together like that? I don’t doubt you’ve considered this, I just wondered what the reasoning was.

    4. re: “Wouldn’t it be more efficient to update things if each value had its own column”

      I’m not using a SQL database, I’m using a very simple key/value database called “BerkeleyDB”, so I don’t have multiple columns as an option. Each key can have one value.

      I’m using BerkeleyDB because it’s MUCH faster than SQL. For instance, with my full text search database, I’m able to read 1 million rows in 2 seconds. I’ve written full text searching programs before with SQL, and the much slower performance of SQL servers was a huge problem.

    5. E— said

      I have noticed two minor things about the wishlist notification system that you may like to look into.

      1) I am receiving the occasional related wishlist notification after I have mooched that particular book. After I received the related notifications, I double-checked my wishlist only to find no listing of the books.

      2) I am receiving related wishlist notifications that are not labeled ‘related wishlist’. It seems to work out to a ratio of: for every 4 related notifications, I receive one that is incorrectly labeled as an exact match.

      Thank you so much for all the hard work you do on this site. We are so grateful for the use of it.

    6. I found it interesting. Thanks.

    7. Trish said

      I don’t understand a lot of this technical stuff (okay, I don’t understand most of it!) either John, but do want to say thanks so much for all your work! (When I want to buy a book from Amazon, I go through BookMooch as a more practical way of saying thanks …)

    8. Mark Williams said

      Thanks for the education on how the search works, very interesting.

      In my first hour of testing, I’m finding the new search much improved, and just as quick on my end. Some of those ‘noise’ terms are indeed useful, so I’m glad you’ve kept them in for now.

    9. Heather19 said

      Wow! I, too, didn’t understand most of the technical stuff, but I understood enough to know that you are AWESOME for doing all of this and working so hard to figure all this stuff out!

    10. Bookbear said

      I don’t know if this is a related bug, but…

      Today I got a Related Wishlist notification for a book I’ve never heard of – Cathouse by Ing.

      There are no books on my Wishlist by (Dean) Ing. There are no books on my Wishlist with the words cat or house in the title.

      The only indirect relationship I can find is that I do have the book _Man-Kzin Wars X: The Wunder War_ by Hal Colebatch on my Wishlist. The relationship being that they’re both books in a ‘shared universe’ series.

    11. re: “Today I got a Related Wishlist notification for a book I’ve never heard of “

      Sometimes, the related editions aren’t obvious, nor all that correct, and I’m sure that the related book mention you got was because the books are in the same series. That’s just a function of the not-so-great data that is out there, and it’s occasionally wrong.

      I’m going to add a feature that shows you in the email what book BM thinks is related to this book we’re telling you about.

      -john

    12. re: 2) I am receiving related wishlist notifications that are not labeled ‘related wishlist’. It seems to work out to a ratio of: for every 4 related notifications, I receive one that is incorrectly labeled as an exact match.

      I’ve gotten a few reports of this, but I haven’t been able to recreate it. So, if it happens to you, email me and tell me your BM username, and the book that is related but not labeled as such.

      -john

    13. dustpuppy said

      Hi John,
      regarding the new API (which is great btw, I can’t wait to try it out in some fun project), I just looked at the ‘userid’ API and I like it but I’m a little uncomfortable with the ‘searchlog’ key.

      I think most people would assume that what they search for on bookmooch or any other site is private information, and would not want it to be available to the world via a public API. Imagine someone searching for medical information on a particular disease, for example.

    14. Maria said

      Thanks for all the hard work you’re putting in to make the site better. Oddly enough, I just tried the search feature about 30 minutes ago and was very disappointed with the results. For example, searching for P.D. James brought up page after page of books for authors named “James” (first or last name). I paged through for a while, but soon got fed up. P was still pretty far off.

      I just tried it again without the periods (in other words PD James) and it worked much better. I guess I need to know the tricks of using search.

    15. Brian said

      Only seeing Topics that actually have books listed is a huge improvement! Now there’s just one page of Topics instead of tons (I don’t think I ever got to the end). Thanks. I didn’t realize before that most Topics were actually empty, and didn’t realize the book count was only rows in the database. I figured BM had zillions of books available! 🙂

      Another Topics improvement might be to have hierarchical topics if possible. I’m interested in colonial American history, but the narrowest filter I can find is just “History”, with page after page after page of books. Without front covers, the long lists of just author and title text are difficult to parse visually.

      Could Topics have sub-topics that allow me to narrow down to a manageable list size of just books on a specific niche topic?

      Your thoughts on reworking the search index are interesting too. I’m contemplating tweaking a shareware search engine on one of my sites to reduce the size of the index. My first thought was to not index common words – but which words are most common, and are they truly unimportant? I’ll try to figure out a way to count the individual words in my index to see if some can be thrown away, or if they’re important and need to be kept. I’ll also have to make sure I ignore the same words in user searches that I ignored when building the index, so that a search like “The History of Science” doesn’t return no results simply because my index doesn’t contain “the” or “of”, even though it does contain “History” and “Science”.

    16. FJBryan said

      The “noise” words you indicate are known as “stop” words by the Library of Congress. Each language has them (der/die/das, mit, von in German; la/le, avec, de in French; the, and, of in English) and LC doesn’t catalogue books using them: you skip past those words to get to the main title words. Rather than having to figure out which words are stop words (a task that will only get worse as more books are posted in more languages), you might want to hit up an LC source for a complete list of stop words in all languages.

    17. Jennifer said

      My first thought was the same. All of my books I no longer need, as we all grow and change and go through phases in life, have all found happy homes and this process continues. I have found fantastic books, many needed for school, some for my own self-education and leisure and even a few gifts. This whole concept is brilliant and the work you put behind it is a beautifully appreciated passion.

      Many, many thanks!

    18. Brian said

      I just updated my site search index to remove common words. My site consists of about 500 prose articles as opposed to book titles, so there are plenty of unimportant common words.

      Unfortunately, the most common words were also the shortest words (the of and to a in for that on is with this as he by it be we his has – etc.)! So each word removed from the index is not a huge savings. For example, I only saved about 61KB by removing every instance of “the”, and about 21KB by removing “of”.

      Even so, my search index is about 30% smaller now.

    19. Zjanette said

      Just one little question. I love to browse topics and I really would love to have a topic with “cook books” or “food”.
      I know you can’t have 100,000 topics, but would it be possible to add just this one?

    20. Taneli T said

      Advanced search has a bug (I think it’s been there even before this update):

      1. Search for anything that has more results than fits on the first page (by default, more than 50 results) AND limit that search with Added recently: 1 days ago.

      2. The first results page shows those 50 books, but when one clicks Next page, the second results page loses the “added recently: 1 days ago” setting and lists all the books in the database that contain the searched word.

      For example: search the word “body” with added recently: 1 days ago. The first page is this:
      http://www.bookmooch.com/m/s/body/-/0/50/0/-/-/-/-/-/a/-/-/-/1

      But when clicking Next page, the second page is this:
      http://www.bookmooch.com/m/s/body/-/1/50/0/-/-/-/-/-

      As you can see straight from the URL, some settings are missing on that second page. At this moment the first book on that second page is this http://www.bookmooch.com/m/detail/0708987281
      and the latest edition of it was added 2007/06/25, so the second page doesn’t honour the “added recently” setting.

      Thanks.

    21. Brad said

      Would it be feasible to make the option to turn off the related edition notices for each book individually? I have some books that I only want a certain edition and would like to turn it off, but I have others that I’ll take any edition I can get, so I’d want it on.

      Also when I remove books from the 2nd page of my wishlist, the first page reloads, so I have to go back to the 2nd page manually.

    22. james said

      Ever think of using something like Hadoop for a distributed key-value architecture?

    23. Tricia Miller said

      Two comments
      1- I often do not receive a notification when a book I have wishlisted is available.
      2- I have a number of books for which I only want one edition of the book. I receive numerous “related edition available” notifications for these. Can I make it stop?
      3- Is it possible to add to the rules that if the edition of the book you post is listed in the database, it is “illegal” to pick a different edition and write in the comments that this book is a different edition from the one listed? Many people do this, and I have ordered the wrong books, and often have to weed through a number of books that are incorrect to find the one that is.
      4-Is it possible to enable the “search” feature on the forums?
      5-Dog gonnit! Just recently I added books to BookMooch and Paperback Swap. In 3 cases, someone from Paperback snatched the book up right away, and I accepted their request. Then, a day goes by in which I have forgotten to remove the book, and I receive a BookMooch request. I decline and get negative feedback. I realize that this is an option so people don’t steal points by posting books they don’t have, but givagirl a break 🙂
      6- Would it be possible to add a feature that lets us remove a number of books from our wishlist at one time? I often add books to my wishlist with the “all related editions” option. If I receive the book, I then have to go delete the wishlist extras. If they’re on the second page of my wishlist, then I have to
      go to wish list
      scroll down
      click the button to move to page 2
      scroll to find section of list with copies of the same book
      select one of these books
      delete
      Return to page one
      scroll down
      etc etc. I am guessing you get the picture.

      So! Thank you for all you do. I love that I can find the books I want so often!!
