Posted by cmotuz on January 20, 2012
So far, the SIMSSA project has focused on developing OMR search systems and testing them on relatively small documents. Even the very large Liber Usualis, with its 2340 pages, however, is very small when compared to an entire collection of digitized music scores. IMSLP has 157,490 scores and counting, and many libraries from around the world are joining the trend of digitizing their collections. Based on the RISM catalogue and logical extensions, I would estimate there to be around 9 million scores in existence, ranging from 1-1000 pages each, for a total of 100-200 million pages. As our ultimate goal is to be able to do such large scale searches, we have spent some time in the past weeks to considering the problems of processing power that these are going to pose.
Most of us are used to search times measured in hundredths of a second—Google just showed me sumgly that my search for “Liber Usualis” found “about 108,000 results (0.31 seconds)”—and, if you’re not reading this from a vintage computer, then you’re probably used to processing times of under a minute. If you search the Liber online, a search will typically last a few seconds. This search time is multiplied not only by the number of pages, but by the complexity of searches. Once we branch out from monophony into polyphony, for instance, simple pitch string searches can increase in complexity at an exponential rate. This means that processes which last seconds now could last hours or even days when applied to collections of polyphonic music. Finally, the ultimate search system would not refer the user to a set of pre-indexed values, but would index each time a search is made, adding complexity but making it possible to come up with searches that involve information for which no index exists.
So what can we do? Alastair and Andrew have been working with a system called Hadoop MapReduce™, which allows searches to be parceled out and processed by many machines working in parallel. In order to simulate a large corpus, they multiplied the pages in the Liber Usualis by 100, so that the computer had to search over 200 000 pages. A complicated search that took 31 seconds on one instance of the Liber now took an unwieldy 53 minutes. The use of parallel processing had the effect of drastically reducing processing time: this same search running over 20 computers took just over 4 minutes. In combination with increasingly clever search algorithms as well as the continuation of Moore’s Law, the use of parallel processing will probably meaningful searches over a large body of works a reasonable task, if not next week, then by the time we’ve developed the OMR technology to make such searches possible in the first place.