
DOE Develops ‘Superfast’ Search Engine

June 6, 2011

To help researchers sift through gigantic scientific databases, computer scientists at the Department of Energy’s Lawrence Berkeley National Laboratory have developed an open source tool called FastBit that is 10 to 100 times faster than its commercial counterparts, depending on the type of search task.

Jon Bashor of Berkeley Lab explains:

This kind of analysis calls for an approach fundamentally different from that of an Internet search engine or a typical commercial database.

Google and Yahoo! have developed massive infrastructures to make keyword searches quick and easy. However, their search engine techniques are not appropriate for analysis tasks where most of the data records are represented by numerical values instead of text, and most search operations require full and complete answers instead of some “Top 10” records.

Commercial database systems, meanwhile, are generally designed to manage a relatively small number of searchable attributes. For example, a banking application might only be able to search for accounts based on account number and customer name. In addition, commercial database management systems are normally designed to locate an individual record (or a very small number of records) efficiently, while most scientific data analysis tasks require a much larger number of records. Furthermore, scientific data analysis requires flexibility: researchers will generally wish to examine many different scenarios or combinations of conditions and attributes.

So how did they go about fixing the problem? Bashor explains:

FastBit organizes data into formats known as “Bitmap indices.” Bitmap indices translate variable values into strings of bits, or 1’s and 0’s. Bitmap indices tend to be very efficient because computer processors are optimized to perform so-called logical operations on bits. Typically, however, bitmap indices have been used where variables have what is called low “cardinality”—that is, a limited number of possible values. Examples would include the gender or the state or zip code of the customer in the database; there are only so many genders, states, or even zip codes.
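To make the idea concrete, here is a minimal Python sketch of a bitmap index over two low-cardinality columns. It is an illustration only, not FastBit code: the function names and sample data are hypothetical, and Python integers stand in for the bit strings.

```python
# Illustrative sketch only: a tiny equality-encoded bitmap index.
# Not FastBit's API; every name and value here is hypothetical.

def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when values[i] == value."""
    index = {}
    for row, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row)
    return index

def rows_of(bitmap):
    """Return the row numbers whose bits are set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Two made-up columns with few distinct values (low cardinality).
state = ["CA", "NY", "CA", "TX", "NY", "CA"]
gender = ["F", "M", "M", "F", "F", "M"]

state_index = build_bitmap_index(state)
gender_index = build_bitmap_index(gender)

# "state == CA AND gender == M" becomes a single bitwise AND.
ca_and_m = state_index["CA"] & gender_index["M"]
# "state == NY OR state == TX" becomes a single bitwise OR.
ny_or_tx = state_index["NY"] | state_index["TX"]

print(rows_of(ca_and_m))
print(rows_of(ny_or_tx))
```

Each query condition turns into one bitwise operation over whole bitmaps, which is why the approach suits processors so well. The cost is one bitmap per distinct value, which is why high-cardinality data needs extra techniques.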

Scientific data, by contrast, typically has an enormous range of values, so further techniques for developing the index were needed. These included an alternative method for partitioning or organizing the data; innovative ways of encoding the indices; and a revolutionary patented data compression system.

Typically, commercial database programs partition or split up data by record or groups of records. A record might include the following variables: customer name, address, phone, account balance, and date and amount of last payment. That is known as “horizontal” partitioning. (Imagine the record as a horizontal row in a spreadsheet with customer name followed by the rest of the variables.) By contrast, in the STAR application, there are billions of events (records) stored, each of which has multiple variables. But searches are usually looking for just a few variables. It would be enormously time-consuming to call up billions of whole events or records with all their variables when searching for just a few variables. So rather than partition the data by events or records, FastBit partitions data by variable—so-called “vertical” partitioning. This cuts down enormously on memory overhead and speeds processing.
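The difference between the two layouts can be sketched in a few lines of Python. This is a simplified illustration under assumed data: the event fields (`energy`, `momentum`, `charge`) and their values are made up, and in-memory lists stand in for files on disk.

```python
# Illustrative sketch only: horizontal (row) versus vertical (column) layout.
# Field names and values are hypothetical, not taken from STAR data.

rows = [
    {"event_id": 0, "energy": 12.5, "momentum": 3.1, "charge": 1},
    {"event_id": 1, "energy": 48.0, "momentum": 7.9, "charge": -1},
    {"event_id": 2, "energy": 51.2, "momentum": 8.4, "charge": 1},
    {"event_id": 3, "energy": 9.7, "momentum": 2.2, "charge": -1},
]

def to_columns(records):
    """Vertical partitioning: one list per variable instead of one dict per record."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columns(rows)

# Horizontal layout: every full record is visited to test a single variable.
hits_horizontal = [r["event_id"] for r in rows if r["energy"] > 40.0]

# Vertical layout: only the "energy" column is read to find matching positions.
hits_vertical = [i for i, e in enumerate(columns["energy"]) if e > 40.0]

print(hits_horizontal, hits_vertical)
```

Both scans return the same events, but the vertical one never touches `momentum` or `charge`. With billions of events and many variables per event, skipping the unused columns is where the savings come from.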

In addition, FastBit provides multiple nested levels of encoding, with the top level providing a relatively coarse index to the data and each successive lower level providing finer detail. In effect, the top level indices provide pre-computed answers to anticipated queries. This enables a rapid narrowing of the search as the software zeros in from a general picture to ever more precise detail.
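One way to picture coarse and fine levels is a two-level binned bitmap index, sketched below in Python. This is an assumption-laden illustration, not FastBit's actual encoding: it assumes non-negative integer values, coarse bins of width 10, fine bins of width 1, and hypothetical function names.

```python
# Illustrative sketch only: a two-level binned bitmap index.
# Assumes non-negative integer values; not FastBit's actual encoding.
from collections import defaultdict

def build_level(values, width):
    """One bitmap per bin: bit i is set when values[i] // width equals the bin number."""
    level = defaultdict(int)
    for row, v in enumerate(values):
        level[v // width] |= 1 << row
    return level

def query_less_than(limit, coarse, fine, coarse_width, fine_width):
    """Answer 'value < limit' with coarse bins where possible, fine bins at the edge.

    Assumes limit is a multiple of fine_width and coarse_width is a
    multiple of fine_width.
    """
    result = 0
    full_coarse = limit // coarse_width  # coarse bins lying entirely below limit
    for b in range(full_coarse):
        result |= coarse.get(b, 0)
    # The boundary coarse bin is only partly covered, so drop to the finer level.
    start = full_coarse * coarse_width // fine_width
    for b in range(start, limit // fine_width):
        result |= fine.get(b, 0)
    return result

def rows_of(bitmap):
    """Return the row numbers whose bits are set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

values = [3, 27, 14, 22, 35, 20, 8, 29]
coarse = build_level(values, 10)
fine = build_level(values, 1)

matches = rows_of(query_less_than(23, coarse, fine, 10, 1))
print(matches)
```

For the query "value < 23", two coarse bitmaps cover everything below 20 in one step each, and only the range 20 to 22 needs the finer level. The coarse bitmaps act as the pre-computed answers: any query whose boundaries fall on coarse bin edges never touches the fine level at all.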

Finally, FastBit’s authors devised an ingenious, patented method of compressing the bitmap indices that enables rapid performance of logical operations simultaneously on large swaths of data.
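The excerpt does not describe how the patented compression method works, so the Python sketch below uses plain run-length encoding as a stand-in. It is not FastBit's scheme. It only illustrates the general idea that a logical AND can run directly on compressed bitmaps, handling a long run of identical bits in a single step. All names are hypothetical.

```python
# Illustrative sketch only: bitwise AND on run-length encoded bitmaps.
# Plain run-length encoding is a stand-in, NOT FastBit's patented method.

def rle_encode(bits):
    """Compress a list of 0/1 values into [bit, run_length] pairs."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1
        else:
            runs.append([b, 1])
    return runs

def rle_decode(runs):
    """Expand [bit, run_length] pairs back into a list of 0/1 values."""
    return [bit for bit, n in runs for _ in range(n)]

def rle_and(a, b):
    """AND two run-length encoded bitmaps of equal total length, run by run."""
    out = []
    i = j = 0
    rem_a = a[0][1] if a else 0  # bits left in the current run of a
    rem_b = b[0][1] if b else 0  # bits left in the current run of b
    while i < len(a) and j < len(b):
        step = min(rem_a, rem_b)  # consume the overlap of both runs at once
        bit = a[i][0] & b[j][0]
        if out and out[-1][0] == bit:
            out[-1][1] += step
        else:
            out.append([bit, step])
        rem_a -= step
        rem_b -= step
        if rem_a == 0:
            i += 1
            rem_a = a[i][1] if i < len(a) else 0
        if rem_b == 0:
            j += 1
            rem_b = b[j][1] if j < len(b) else 0
    return out

# Two sparse 2000-bit bitmaps, each compressing to three runs.
x = [0] * 1000 + [1] * 5 + [0] * 995
y = [0] * 998 + [1] * 4 + [0] * 998

result = rle_and(rle_encode(x), rle_encode(y))
print(result)
```

The AND over 2,000 bits finishes in five loop steps because each step covers a whole overlap between two runs rather than a single bit.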

Read Bashor’s full report and more about the consumer uses of the technology.
