Posts tagged search

Over the past few days I’ve been doing some serious brain work about Jerome and how we best build our API layer to make it simultaneously awesomely cool and insanely fast whilst maintaining flexibility and clarity. Here’s the outcome.

To start with, we’re merging a wide variety of individual tables [1] – one for each type of resource offered – into a single table which handles multiple resource types. We’ve opted to use all the fields in the RIS format as our ‘basic information’ fields, although obviously each individual resource type can extend this with its own data if necessary. This has a few benefits. First of all, we can interface with our data more easily than before, without needing to write type-specific code which translates things back to our standardised search set. As a byproduct of this we can optimise our search algorithms even further, making them far more accurate and bringing them in line with generally accepted algorithms for this sort of thing. Of course, you’ll still be able to fine-tune how we search in the Mixing Deck.
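As a rough illustration of the single-table idea (a sketch, not Jerome’s actual schema), the snippet below builds documents for one shared collection from a handful of real RIS tags, with type-specific data kept in a separate key. The `to_resource` and `search_titles` helpers, the `extra` key and the sample records are all made up for the example.

```python
# Hypothetical sketch: helper names, the "extra" key and the sample data are
# illustrative assumptions, not Jerome's real schema.
RIS_FIELDS = {"TY", "TI", "AU", "PY", "AB", "KW", "SN", "UR"}  # a subset of RIS tags


def to_resource(ris_type, basic, extra=None):
    """Build one document for the shared collection.

    `basic` holds RIS-tagged fields common to every resource type;
    `extra` holds whatever a specific resource type adds on top.
    """
    unknown = set(basic) - RIS_FIELDS
    if unknown:
        raise ValueError(f"not RIS fields: {sorted(unknown)}")
    doc = {"TY": ris_type, **basic}
    if extra:
        doc["extra"] = dict(extra)
    return doc


def search_titles(resources, word):
    """One routine covers every resource type, because TI is always the title."""
    return [r["TI"] for r in resources if word.lower() in r.get("TI", "").lower()]


book = to_resource(
    "BOOK",
    {"TI": "Software Design", "AU": ["Bloggs, Joe"]},
    extra={"shelfmark": "example-shelfmark"},
)
article = to_resource("JOUR", {"TI": "18th Century Needlework", "AU": ["Bloggs, Jessie"]})

print(search_titles([book, article], "design"))
```

The point is that the search code never needs to know whether it is looking at a book or a journal article; anything type-specific lives under `extra` and can be ignored.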

To make this even easier to interface with from an admin side, we’ll be strapping some APIs (hooray!) on to this which support the addition, modification and removal of resources programmatically. What this means is that potentially anybody who has a resource collection they want to expose through Jerome can do so. They just need to make sure their collection is registered, which prevents people flooding Jerome with nonsense that isn’t ‘approved’ as a resource. Things like the DIVERSE research project can now not only pull Jerome resource data into their interface, but also push into our discovery tool and harness Jerome’s recommendation tools. Which brings me neatly on to the next point.
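To make the “registered collections only” rule concrete, here is a minimal in-memory sketch of add, modify and remove operations guarded by a registration check. The `ResourceStore` class, its method names and the API-key mechanism are assumptions for illustration; the real API may look nothing like this.

```python
# Hypothetical sketch: class, method names and the API-key check are assumptions.
class ResourceStore:
    """In-memory stand-in for the resource table, guarded by collection registration."""

    def __init__(self):
        self._keys = {}       # collection name -> API key
        self._resources = {}  # (collection, resource id) -> document

    def register(self, collection, api_key):
        self._keys[collection] = api_key

    def _check(self, collection, api_key):
        # Unregistered collections (or wrong keys) are rejected outright.
        if collection not in self._keys or self._keys[collection] != api_key:
            raise PermissionError(f"collection {collection!r} is not registered with that key")

    def add(self, collection, api_key, resource_id, doc):
        self._check(collection, api_key)
        self._resources[(collection, resource_id)] = dict(doc)

    def modify(self, collection, api_key, resource_id, changes):
        self._check(collection, api_key)
        self._resources[(collection, resource_id)].update(changes)

    def remove(self, collection, api_key, resource_id):
        self._check(collection, api_key)
        del self._resources[(collection, resource_id)]

    def get(self, collection, resource_id):
        return self._resources.get((collection, resource_id))


store = ResourceStore()
store.register("diverse", "example-key")
store.add("diverse", "example-key", "r1", {"TI": "A research output"})
store.modify("diverse", "example-key", "r1", {"PY": "2011"})
print(store.get("diverse", "r1"))
```

An attempt to add to a collection that was never registered raises `PermissionError` rather than quietly storing the resource.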

Recommendation is something we want to get absolutely right in Jerome. The amount of information out there is simply staggering. Jerome already handles nearly 300,000 individual items and we want to expand that to way more by using data from more sources such as journal tables of contents. Finding what you’re actually after in this can be like looking for the proverbial needle in a haystack, and straight search can only find so much. To explore a subject further we need some form of recommendation and ‘similar item’ engine. What we’re using is an approach with a variety of angles.

At a basic level Jerome runs term extraction on any available textual content to gather a set of terms which describe the content, very similar to what you’ll know as tags. These are generated automatically from titles, synopses, abstracts and any available full text. We can then use the intersection of terms across multiple works to find and rank similar items based on how many of these terms are shared. This gives us a very simple “items like this” set of results for any item, with the advantage that it’ll work across all our collections. In other words, we can find useful journal articles based on a book, or suggest a paper in the repository which is on a similar subject to an article you’re looking for.
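The term-intersection ranking can be sketched in a few lines. This assumes the terms have already been extracted (the extraction step itself isn’t shown), and the `similar_items` function, the item identifiers and the term sets are invented for the example.

```python
# Hypothetical sketch: function name, item ids and term sets are illustrative.
def similar_items(item_id, terms_by_item, limit=5):
    """Rank other items by how many extracted terms they share with `item_id`."""
    target = terms_by_item[item_id]
    scored = []
    for other, terms in terms_by_item.items():
        if other == item_id:
            continue
        shared = len(target & terms)  # size of the set intersection
        if shared:
            scored.append((other, shared))
    # Most shared terms first; ties broken by id so the order is stable.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:limit]


terms_by_item = {
    "book:software-design": {"software", "design", "patterns", "architecture"},
    "article:agile-design": {"software", "design", "agile"},
    "eprint:testing": {"software", "testing"},
    "book:needlework": {"needlework", "embroidery"},
}

print(similar_items("book:software-design", terms_by_item))
# [('article:agile-design', 2), ('eprint:testing', 1)]
```

Because the comparison only looks at terms, it doesn’t matter that the three matching items come from different collections (a book, an article and a repository paper).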

We then also have a second layer very similar to Amazon’s “people who bought this also bought…”, where we look over the history of users who used a specific resource to find common resources. These are then added to the mix and the rankings are tweaked accordingly, providing a human twist to the similar items. This suppresses results which initially seem similar but which in actuality don’t have much in common at a content level. It also pushes results which are related, but which don’t have enough terms extracted for Jerome to infer this (for example books which only have a title and for which we can’t get a summary), up to where a user will find them more easily.
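Here is one way the “also used” layer could be blended with the term scores, as a sketch only. The `also_used` and `blend` functions, the simple additive weighting and the sample histories are assumptions; in this version ‘suppression’ is only relative, in that items nobody co-uses simply get no extra credit.

```python
# Hypothetical sketch: function names, the additive weighting and the data are assumptions.
from collections import Counter


def also_used(item_id, history_by_user):
    """Count how often each other item appears in histories that include `item_id`."""
    counts = Counter()
    for items in history_by_user.values():
        if item_id in items:
            counts.update(i for i in items if i != item_id)
    return counts


def blend(term_scores, usage_counts, usage_weight=1.0):
    """Combine shared-term scores with co-usage counts into one ranking."""
    ids = set(term_scores) | set(usage_counts)
    combined = {
        i: term_scores.get(i, 0) + usage_weight * usage_counts.get(i, 0) for i in ids
    }
    return sorted(combined.items(), key=lambda pair: (-pair[1], pair[0]))


history_by_user = {
    "user1": {"A", "B"},
    "user2": {"A", "B", "C"},
    "user3": {"C", "D"},
}
term_scores = {"C": 2, "E": 3}  # shared-term counts relative to item A

usage = also_used("A", history_by_user)
print(blend(term_scores, usage, usage_weight=2.0))
# [('B', 4.0), ('C', 4.0), ('E', 3.0)]
```

Item B has no extracted terms in common with A at all, yet it rises to the top because two users of A also used it, while E drops below it despite sharing the most terms.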

The third layer of recommendation is the “people on your course also used” element, which is an attempt to make a third pass at fine-tuning the recommendation using data we have available on which course you’re studying or which department you’re in. This is very similar to the “used this also used” recommendation, but operates at a higher level. We analyse the borrowing patterns of an entire department or course to extract both titles and semantic terms which prove popular, and then boost these titles and terms in any recommendation results set. Because in most cases we only use this as a ‘booster’, it prevents recommendation sets from being populated with every book ever borrowed whilst at the same time providing a more relevant response.
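The ‘booster’ behaviour is the important bit here: course data only re-weights items that are already in the result set and never adds new ones. A small sketch of that, where the `boost_for_course` function, the multiplicative factors and the sample data are all assumptions:

```python
# Hypothetical sketch: function name, boost factors and data are illustrative.
def boost_for_course(scores, terms_by_item, popular_titles, popular_terms,
                     title_boost=1.5, term_boost=1.1):
    """Re-weight existing recommendation scores using course-level popularity.

    Only items already present in `scores` are touched, so the result set
    never fills up with everything the course has ever borrowed.
    """
    boosted = {}
    for item, score in scores.items():
        factor = 1.0
        if item in popular_titles:
            factor *= title_boost
        shared = terms_by_item.get(item, set()) & popular_terms
        factor *= term_boost ** len(shared)
        boosted[item] = score * factor
    return boosted


scores = {"a": 2.0, "b": 2.0}
terms_by_item = {"a": {"software"}, "b": {"needlework"}}

boosted = boost_for_course(
    scores,
    terms_by_item,
    popular_titles={"a", "some-other-popular-book"},
    popular_terms={"software"},
    title_boost=2.0,
    term_boost=1.5,
)
print(boosted)
# {'a': 6.0, 'b': 2.0}
```

Note that `some-other-popular-book` is popular on the course but does not appear in the output, because it was never in the recommendation set to begin with.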

So, that’s how we recommend items. APIs for this will abound, allowing external resource providers to register ‘uses’ of a resource with us for purposes of recommendation. We’re not done yet though: recommendation has another use!

As we have historical usage data for both individuals and courses, we can throw this into the mix for searching by using semantic terms to actively move results up or down (but never remove them) based on the tags which both the current user and similar users have actually found useful in the past. This means that (as an example) a computing student searching for the author name “J Bloggs” would have “Software Design by Joe Bloggs” boosted above “18th Century Needlework by Jessie Bloggs”, despite there being nothing else in the search term to make this distinction. As a final bit of epic coolness, Jerome will sport a “Recommended for You” section where we use all the recommendation systems at our disposal to find items which other similar users have found useful, as well as items which share themes with those borrowed by the individual user.
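The “J Bloggs” example can be sketched like this. The `personalise` function, the weighting formula and the term sets are assumptions made for the example, and this version only boosts matching results rather than also pushing others down; the key property it keeps is that nothing is ever removed.

```python
# Hypothetical sketch: function name, weighting and term sets are illustrative.
def personalise(results, user_terms, weight=0.5):
    """Re-order search results using terms the user has found useful before.

    `results` is a list of (title, base_score, terms). Every result is kept;
    only the ordering changes.
    """
    rescored = []
    for title, base_score, terms in results:
        overlap = len(terms & user_terms)
        rescored.append((title, base_score * (1 + weight * overlap)))
    rescored.sort(key=lambda pair: -pair[1])  # stable sort keeps ties in original order
    return rescored


results = [
    ("18th Century Needlework by Jessie Bloggs", 1.0, {"needlework", "history"}),
    ("Software Design by Joe Bloggs", 1.0, {"software", "design"}),
]
computing_student_terms = {"software", "programming"}

for title, score in personalise(results, computing_student_terms):
    print(score, title)
# 1.5 Software Design by Joe Bloggs
# 1.0 18th Century Needlework by Jessie Bloggs
```

Both books matched the plain search equally well; the student’s history is the only thing separating them.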

  1. Strictly speaking, Mongo calls them Collections, but I’ll stick with tables for clarity.

Hot on the heels of my Staff Directory Search, I decided this lunchtime to make it even more awesome. “But wait Nick!” I hear you cry. “How can you make it even more awesome than it already is?”

Well, how about avatars for everybody (using the awesome Gravatar system), adding tel: links to phone numbers so users with compatible software can dial with one click, kitting out everything with semantic markup using the microdata spec and adding OpenSearch descriptions?
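For the curious, two of those additions are simple enough to sketch. Gravatar image URLs are built from an MD5 hash of the trimmed, lower-cased email address, and a tel: link is just an anchor whose href is the number with the formatting stripped out. The function names, the email address and the phone number below are made up, and this isn’t the directory’s actual code.

```python
# Hypothetical sketch: function names and sample values are illustrative.
import hashlib
import html
import re


def gravatar_url(email, size=80):
    """Build a Gravatar image URL from an email address."""
    digest = hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()
    # s = image size in pixels, d = fallback image when no avatar is registered
    return f"https://www.gravatar.com/avatar/{digest}?s={size}&d=identicon"


def tel_link(number):
    """Wrap a phone number in a tel: link, keeping the readable form as link text."""
    dialable = re.sub(r"[^\d+]", "", number)  # keep digits and a leading +
    return f'<a href="tel:{dialable}">{html.escape(number)}</a>'


print(gravatar_url(" Someone@Example.com "))
print(tel_link("+44 1522 000000"))
# <a href="tel:+441522000000">+44 1522 000000</a>
```

Normalising the address before hashing matters: the same person should get the same avatar whether their email is stored with capitals, stray spaces or neither.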

I’ve had a couple of people ask how my lunchtime project today actually works behind the scenes, so here’s the lowdown in easily-digestible speak. I should point out that I am relying heavily on two frameworks which we’ve already built at Lincoln. These are Nucleus – our heavy-lifting data platform – and the Common Web Design – our web design and application framework. These two gave me a massive head-start by already doing all of the hard work such as extracting data from our directory and making the whole thing look great. Now, on with the technology.

(more…)

Today on my lunch break I enjoyed a creamy chicken and broccoli bake from the Atrium, and then decided to muck about with code and see what I could come up with.

Turning to the power of Nucleus, our one-stop-shop for data, tying it together with the Sphinx search engine we’ve used previously in Jerome, and wrapping it all up in the CWD I was able to come up with a glossy version of our existing Staff Directory.

Give it a whirl at http://phone.labs.lncn.eu.

Staff Directory v2, complete with Bakelite rotary dial phone.

Take a look at this website.

This is our current staff directory search. It’s very simple, does more or less what it says on the tin, and looks like it’s been thrown together in Front Page in about a minute. Whilst normally I’m not a massive fan of websites which are chock full of unnecessary imagery, there are some situations where you just have to go “you know what, why can’t we make this look good and be functional at the same time?” So I duly busted out my CSS-fu and had a go.

It loads just as quickly, it does exactly the same thing, but it looks so much nicer. It also avoids words like “criteria”, which sound a bit over the top to native English speakers and just confuse the hell out of those who have English as a second language. Behind the scenes it also avoids a load of the crap which Microsoft decides is necessary to build a web page (apparently JavaScript is a necessity to submit a POST form), and since it uses the CWD it is mobile-ready without any extra effort.

You may never see the result of this quick hack I did today, but I really hope you do.

Today and yesterday, between various other things, I’ve had a crack at solving Universal Search (mentioned previously), and I’ve currently got a bit of a mockup going at Labs. At the moment it’s only searching a very, very small set of test data but the entire supporting architecture is there; this is a working system, it just needs the content and a bit of polish on the front-end to handle some edge-case errors and IE6 (predictably). Obviously there’ll be some documentation on how to provide search sources and how to actually write some badass queries, but the majority of it is done.

It's the Universal Search!

The next step is to start plugging this in to external services. I’m going to start with the Library catalogue search from Jerome, making use of Sphinx’s distributed indexing model to avoid the need to double-up on data. Over time new services will be added, either sporting their own Sphinx search endpoints (the preferred way) or by outputting XML files for indexing by Universal Search itself.

Oh, and of course it comes with an API endpoint.

During my time as a student I was faced with a great many challenges which involved some form of searching. Lots of it was academic, relying on the Library, Google Scholar and (gasp, horror, revoke his grades) Wikipedia. However, a lot of it was about more mundane stuff. Where exactly is AR1101? Who is John Smith, and what’s his phone number? What’s for lunch today?

The problem was that to find this information, you already had to know where to find it. Maps of the University are available on the Portal… if you know where to look. Phone numbers can be looked up… if you know the address for the service. The weekly menu is on the Portal… if you know where to look.

We’re left with a simply astonishing number of things which people may want to know about, but which are locked away as an image of a screenshot embedded in a Word document stored 14 levels down in Portal behind a page which nobody has access to, unless they happen to have asked for it. Rooms, events, books, journals, the Repository, blogs, people, news posts, lecture notes, the weekly menu and more are all available somewhere within the depths of a system. So, in traditional Nick fashion, I spent a few minutes in the shower this morning working out how to fix it whilst being refreshed by some particularly minty shower gel.

(more…)

Part of my remit as one of the Online Service Team’s tame students is to take time now and then to step back, look at things, and work out how they could be made all-around better. An example of this has been the slow but steady march towards a common, uniform, standards-compliant styling for every web service.

All my rambling aside, I spotted a brilliant post from the BBC Internet Blog on searching the BBC. In short, their new Search+ trawls the entire BBC looking for what you’re after, and then decides what’s most relevant within context. Data representation and organisation is a big area of interest for me (the Cybernetics part of my degree has a huge focus on knowledge representation), and searching is an area in which the University, to put it bluntly, sucks.

Bits and pieces work on their own. For example, the Library Catalogue searches the library fairly well, and the Phone Search tends to find who you’re looking for. Blogs has a search, although it does skim over a few things. There’s also Portal, whose search function alternates between giving you something relevant and picking random, outdated and irrelevant content from 5 years ago.

What’s needed is something a bit like the Awesome Bar in Firefox, simultaneously looking at a myriad of sources to find something relevant and presenting it to the user. In short, a single box in which you could type “Portal” and find the Portal, or “Nick Jackson” and find my directory entry, or “Sommerville” and find his book on software engineering, or “help” and be taken to our support pages. Something which simultaneously scrubs across any data source we care to let it at, returning data as fast as possible.
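To give a flavour of what “simultaneously looking at a myriad of sources” might mean in code, here is a small sketch that fires one query at several sources in parallel and merges whatever comes back. The sources are toy in-memory lists, and `make_source`, `universal_search`, the scores and the entries are all invented for the example; a real version would call actual search endpoints.

```python
# Hypothetical sketch: every name, score and entry here is an illustrative assumption.
from concurrent.futures import ThreadPoolExecutor


def make_source(entries):
    """Build a toy search function over a list of (title, score) pairs."""
    def search(query):
        q = query.lower()
        return [(title, score) for title, score in entries if q in title.lower()]
    return search


def broken_source(query):
    raise RuntimeError("this source is down")


def universal_search(query, sources):
    """Query every source in parallel and merge the hits into one ranked list."""
    results = []
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in sources.items()}
        for name, future in futures.items():
            try:
                hits = future.result()
            except Exception:
                continue  # one broken source should not sink the whole search
            results.extend((name, title, score) for title, score in hits)
    results.sort(key=lambda r: (-r[2], r[0], r[1]))
    return results


sources = {
    "directory": make_source([("Nick Jackson (staff directory)", 1.0)]),
    "library": make_source([("Software Engineering (book)", 0.9)]),
    "services": make_source([("Portal", 1.0), ("Portal help pages", 0.6)]),
    "flaky": broken_source,
}

print(universal_search("portal", sources))
# [('services', 'Portal', 1.0), ('services', 'Portal help pages', 0.6)]
```

Each source only has to answer the question “what do you have for this query, and how confident are you?”, which keeps adding a new one cheap.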

Thoughts? Opinions? Do you want a single ‘search the University’ box with options to narrow your search, or would you prefer to have to start by specifying what you’re after?