To help researchers sift through gigantic scientific databases, computer scientists at the Department of Energy’s Lawrence Berkeley National Laboratory have developed an open-source solution called FastBit that is 10 to 100 times faster than its commercial counterparts, depending on the type of search task.
Jon Bashor of LBL explains:
This kind of analysis calls for an approach fundamentally different from that of an Internet search engine or a typical commercial database.
Google and Yahoo! have developed massive infrastructures to make keyword searches quick and easy. However, their search engine techniques are not appropriate for analysis tasks where most of the data records are represented by numerical values instead of text, and most search operations require full and complete answers instead of some “Top 10” records.
Commercial database systems, meanwhile, are generally designed to manage a relatively small number of searchable attributes. For example, a banking application might only be able to search for accounts based on account number and customer name. In addition, commercial database management systems are normally designed to locate an individual record (or a very small number of records) efficiently, while most scientific data analysis tasks require a much larger number of records. Furthermore, scientific data analysis requires flexibility: researchers will generally wish to examine many different scenarios or combinations of conditions and attributes.
So how did they go about fixing the problem? Bashor explains:
FastBit organizes data into formats known as “Bitmap indices.” Bitmap indices translate variable values into strings of bits, or 1’s and 0’s. Bitmap indices tend to be very efficient because computer processors are optimized to perform so-called logical operations on bits. Typically, however, bitmap indices have been used where variables have what is called low “cardinality”—that is, a limited number of possible values. Examples would include the gender or the state or zip code of the customer in the database; there are only so many genders, states, or even zip codes.
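To make the idea concrete, here is a minimal sketch of a bitmap index in Python. It is not FastBit's code or API. The column names, the sample data, and the helper functions are all made up for illustration, and a Python integer stands in for each bit string.

```python
# Toy bitmap index: one bitmap per distinct value of a low-cardinality column.
# Hypothetical sample data; bit i of a bitmap is 1 when record i holds that value.

def build_bitmap_index(values):
    """Return {value: bitmap}, where each bitmap is a Python int used as a bit string."""
    index = {}
    for row, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row)
    return index

def matching_rows(bitmap):
    """List the record numbers whose bit is set in the bitmap."""
    return [row for row in range(bitmap.bit_length()) if (bitmap >> row) & 1]

state = ["CA", "NY", "CA", "TX", "NY", "CA"]
gender = ["F", "M", "M", "F", "F", "M"]

state_index = build_bitmap_index(state)
gender_index = build_bitmap_index(gender)

# "state is CA AND gender is M" becomes a single bitwise AND of two bitmaps.
ca_and_m = state_index["CA"] & gender_index["M"]

# "state is CA OR state is TX" becomes a single bitwise OR.
ca_or_tx = state_index["CA"] | state_index["TX"]

print(matching_rows(ca_and_m))  # [2, 5]
print(matching_rows(ca_or_tx))  # [0, 2, 3, 5]
```

Each query condition turns into one logical operation over whole bitmaps rather than a record-by-record comparison, which is the property the quoted passage credits for the efficiency.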
Scientific data, by contrast, typically has an enormous range of values, so further techniques for developing the index were needed. These included an alternative method for partitioning or organizing the data; innovative ways of encoding the indices; and a revolutionary patented data compression system.
Typically, commercial database programs partition or split up data by record or groups of records. A record might include the following variables: customer name, address, phone, account balance, and date and amount of last payment. That is known as “horizontal” partitioning. (Imagine the record as a horizontal row in a spreadsheet with customer name followed by the rest of the variables.) By contrast, in the STAR application, there are billions of events (records) stored, each of which has multiple variables. But searches are usually looking for just a few variables. It would be enormously time-consuming to call up billions of whole events or records with all their variables when searching for just a few variables. So rather than partition the data by events or records, FastBit partitions data by variable—so-called “vertical” partitioning. This cuts down enormously on memory overhead and speeds processing.
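The difference between the two layouts can be sketched in a few lines of Python. The event fields and values below are invented for illustration and have nothing to do with the actual STAR data. The point is only that a query on one variable touches one list instead of every whole record.

```python
# Horizontal layout: one complete record per event (hypothetical fields and values).
events = [
    {"energy": 1.5, "momentum": 0.3, "n_tracks": 12},
    {"energy": 3.2, "momentum": 1.1, "n_tracks": 40},
    {"energy": 0.7, "momentum": 0.2, "n_tracks": 5},
    {"energy": 2.4, "momentum": 0.9, "n_tracks": 31},
]

def to_columns(records):
    """Vertical layout: one list per variable, aligned by record number."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columns(events)

# A search on a single variable reads only that variable's column;
# momentum and n_tracks are never loaded.
high_energy = [row for row, e in enumerate(columns["energy"]) if e > 2.0]

print(high_energy)  # [1, 3]
```

In a real system each column would live in its own file or memory region, so the saving is in data read from storage, not just in Python attribute lookups.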
In addition, FastBit provides multiple nested levels of encoding, with the top level providing a relatively coarse index to the data and each successive lower level providing finer detail. In effect, the top level indices provide pre-computed answers to anticipated queries. This enables a rapid narrowing of the search as the software zeros in from a general picture to ever more precise detail.
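One way to picture a coarse-to-fine index is a two-level set of binned bitmaps. The sketch below is an assumption about the general shape of such a scheme, not FastBit's actual encoding. The bin widths, the sample values, and the function names are hypothetical, and it assumes non-negative integer values and thresholds.

```python
# Two-level binned bitmap index (illustrative only, not FastBit's real encoding).
# Coarse bins are 10 units wide, fine bins 1 unit wide.

def binned_index(values, width):
    """Return {bin_number: bitmap}; bit i is set when values[i] falls in that bin."""
    index = {}
    for row, value in enumerate(values):
        bin_number = value // width
        index[bin_number] = index.get(bin_number, 0) | (1 << row)
    return index

def rows_less_than(threshold, coarse, fine, coarse_width, fine_width):
    """Bitmap of rows with value < threshold, using coarse bins wherever possible."""
    result = 0
    full_coarse = threshold // coarse_width  # coarse bins lying entirely below threshold
    for bin_number, bitmap in coarse.items():
        if bin_number < full_coarse:
            result |= bitmap  # answer precomputed at the coarse level
    # Only the one coarse bin straddling the threshold needs finer detail.
    first_fine = full_coarse * (coarse_width // fine_width)
    last_fine = threshold // fine_width
    for bin_number in range(first_fine, last_fine):
        result |= fine.get(bin_number, 0)
    return result

def matching_rows(bitmap):
    return [row for row in range(bitmap.bit_length()) if (bitmap >> row) & 1]

values = [3, 27, 14, 22, 9, 20, 35, 23]
coarse = binned_index(values, 10)
fine = binned_index(values, 1)

print(matching_rows(rows_less_than(23, coarse, fine, 10, 1)))  # [0, 2, 3, 4, 5]
```

For the query "value < 23", the two coarse bins covering 0 to 19 answer most of the question outright, and only three fine bins (20, 21, 22) are consulted for the remainder.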
Finally, FastBit’s authors devised an ingenious, patented method of compressing the bitmap indices that enables rapid performance of logical operations simultaneously on large swaths of data.
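The article does not describe how the patented compression works, so the sketch below uses plain run-length encoding as a much simpler stand-in. It is not FastBit's method. It only illustrates the general idea that a logical AND can be carried out on compressed bitmaps one run at a time, without expanding them first. All function names are hypothetical.

```python
# Run-length encoded bitmaps: a list of (bit, run_length) pairs.
# Plain RLE is used here as a stand-in; it is NOT the patented scheme.

def rle_encode(bits):
    """Compress a list of 0/1 values into (bit, run_length) pairs."""
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1
        else:
            runs.append([bit, 1])
    return [(bit, length) for bit, length in runs]

def rle_decode(runs):
    """Expand (bit, run_length) pairs back into a list of 0/1 values."""
    return [bit for bit, length in runs for _ in range(length)]

def rle_and(a, b):
    """AND two encoded bitmaps of equal total length without decompressing them."""
    result = []
    i = j = 0
    left_a = a[0][1] if a else 0
    left_b = b[0][1] if b else 0
    while i < len(a) and j < len(b):
        step = min(left_a, left_b)      # handle a whole overlapping run at once
        bit = a[i][0] & b[j][0]
        if result and result[-1][0] == bit:
            result[-1] = (bit, result[-1][1] + step)
        else:
            result.append((bit, step))
        left_a -= step
        left_b -= step
        if left_a == 0:
            i += 1
            left_a = a[i][1] if i < len(a) else 0
        if left_b == 0:
            j += 1
            left_b = b[j][1] if j < len(b) else 0
    return result

x = [0] * 8 + [1] * 4 + [0] * 4
y = [0] * 6 + [1] * 6 + [0] * 4

combined = rle_and(rle_encode(x), rle_encode(y))
print(combined)  # [(0, 8), (1, 4), (0, 4)]
```

The loop runs once per overlapping run rather than once per bit, so sparse bitmaps with long stretches of identical bits are combined in far fewer steps than their uncompressed length would suggest.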