The National Science Foundation is testing a creative mix of machine learning, blockchain technology and data science to tackle a stubborn challenge: how to better evaluate the more than 60,000 grant applications it receives each year.
For NSF CIO Dorothy Aronson, the experiment — which involves giving documents the equivalent of “digital fingerprints” — represents more than just finding a new way to solve an old problem. It also offers a glimpse of how the expertise of data scientists, together with modernized technologies, can potentially accelerate the government’s efforts in identifying and funding innovative ideas.
“I’m fantasizing here, but if you’re a citizen, you could send any proposal for innovation to a single [government] location and it would be automatically distributed, or suggestions would be made, to various federal organizations,” she said. “We’re just starting off with little experiments. But there’s a lot of benefits that could come from this.”
That vision may seem like a pipe dream for the thousands of scientists, engineers, academic institutions and entrepreneurs who apply each year to 26 federal grant-making agencies and more than 1,000 programs nationally. But for NSF, which accounts for roughly 20% of federal support to academic institutions for basic research, the technology foundation to build on that vision is now largely in place, according to Aronson.
Capitalizing on modernized IT
The challenge facing NSF and other grant-making agencies — including the National Institutes of Health, which is working with NSF on the project — is how to share and compare proposals without exposing private information and potentially valuable ideas to the public.
“We often get multiple ideas that have a very high degree of similarity and we don’t want to fund that same idea multiple times,” explained Aronson. “So we have to work hard to figure out where there’s duplication.”
Streamlining the workload of program officers and automating as much of the proposal process as possible is “a perennial objective,” said Aronson. “This is not a new problem. It’s just one that we didn’t have the opportunity to solve before.”
What has changed, she said, is a combination of newer technology capabilities and “the availability of people who know the tools used by data scientists in order to create and apply the concepts [we needed] to compare the proposals.”
NSF already had the technology in place to convert PDF and other types of documents into machine-readable formats. And about five years ago, it had begun investing in an enterprise data warehouse, which provides greater flexibility than traditional transactional database systems.
“That was a very important transition for us because it put the data more closely into the hands of the customers,” said Aronson. “They didn’t have to know tools like SQL, for example, to derive the information they needed. And that allowed us to go to things like dashboards for executives.”
Since then, NSF has also gradually moved much of its IT operations, including the enterprise data warehouse, into a cloud environment. That has helped NSF capitalize over the past two years on capabilities such as language processing and artificial intelligence, according to Aronson.
All of those technology upgrades, however, weren’t enough to solve a critical problem: how to mask the content of proposals for privacy protection while still giving program officers visibility into how one grant proposal is similar to or different from thousands of others.
What NSF ultimately needed, explained Chezian Sivagnanam, NSF’s chief enterprise architect, was a way to create a mathematical abstraction, or “fingerprint,” for each proposal that could be compared to other documents, each of which typically runs 15 pages in length and includes a variety of images. However, that fingerprinting process also has to work in a way that can’t be reverse-engineered, in order to prevent someone from exposing the underlying content.
Data scientists to the rescue
Aronson said the big “breakthrough for us” occurred when NSF began “bringing people into our organization who understood how to apply data science principles and look at problems through the lens of a data scientist. It opened a wider world to us because of the knowledge base of what other data scientists had already created.”
That led to discussions with a number of data science experts and a variety of testing sprints, according to Sivagnanam. “Luckily, we found an algorithm being used within NIH, called Word2Vec, that basically converts document content into numbers. It then applies mathematical statistics on top of these numbers and looks for cosine similarities to relate one document with another,” he explained.
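The comparison Sivagnanam describes can be illustrated with a minimal Python sketch. It is not NSF's or NIH's implementation: the tiny `TOY_VECTORS` table is a hypothetical stand-in for a trained Word2Vec model, and averaging word vectors into one document vector is only one common way to build a document-level "fingerprint." The cosine similarity calculation itself is standard.

```python
import math

# Hypothetical toy embedding table standing in for a trained Word2Vec model.
# Real models learn hundreds of dimensions from a large text corpus.
TOY_VECTORS = {
    "solar": [0.9, 0.1, 0.0],
    "battery": [0.8, 0.2, 0.1],
    "storage": [0.7, 0.3, 0.1],
    "protein": [0.0, 0.9, 0.2],
    "folding": [0.1, 0.8, 0.3],
}


def fingerprint(text):
    """Average the vectors of known words to get one vector per document."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vectors:
        raise ValueError("no known words in document")
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


doc_a = fingerprint("solar battery storage")
doc_b = fingerprint("battery storage")
doc_c = fingerprint("protein folding")

sim_ab = cosine_similarity(doc_a, doc_b)  # two energy-storage proposals
sim_ac = cosine_similarity(doc_a, doc_c)  # energy vs. biology proposal
```

In this toy setup the two energy-storage documents score close to 1, while the energy and biology documents score much lower, which is the kind of signal a program officer would want surfaced.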
There was just one more catch, he said. “Once we converted the documents, we needed to put the results in a common infrastructure.” That’s where the capabilities of a blockchain and a distributed ledger entered the picture.
“The idea is to score each proposal [for similarities] then build this ledger, scale this up, so that going forward, every grant that comes in through all these organizations will get in this ledger,” Sivagnanam said. “The ledger has the analytical capability that when it finds a close scoring match, it then automatically triggers an alert to the respective program officers on both sides saying, ‘Hey, there is something similar between these proposals. You may want to talk about them further,’ as a pre-conditional exercise.”
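The scoring-and-alert behavior described here might look something like the following sketch. Everything in it is an assumption for illustration: the `FingerprintLedger` class, the 0.95 threshold, the agency and proposal identifiers, and the three-number fingerprints are hypothetical, and a plain in-memory list stands in for the shared ledger.

```python
import math
from dataclasses import dataclass, field


def cosine_similarity(a, b):
    """Cosine of the angle between two fingerprint vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


@dataclass
class FingerprintLedger:
    """Hypothetical in-memory stand-in for the shared ledger."""
    threshold: float = 0.95  # assumed cutoff for a "close scoring match"
    entries: list = field(default_factory=list)  # (agency, proposal_id, vector)

    def submit(self, agency, proposal_id, vector):
        """Score a new fingerprint against all earlier ones, then record it.

        Returns one alert tuple per close match so that the program
        officers on both sides can be notified.
        """
        alerts = []
        for other_agency, other_id, other_vector in self.entries:
            score = cosine_similarity(vector, other_vector)
            if score >= self.threshold:
                alerts.append(
                    (agency, proposal_id, other_agency, other_id, round(score, 3))
                )
        self.entries.append((agency, proposal_id, vector))
        return alerts


ledger = FingerprintLedger()
first_alerts = ledger.submit("NSF", "P-001", [0.80, 0.20, 0.07])
second_alerts = ledger.submit("NIH", "P-002", [0.75, 0.25, 0.10])
third_alerts = ledger.submit("NSF", "P-003", [0.05, 0.85, 0.25])
```

Only the second submission produces an alert, pairing the NIH proposal with the earlier NSF one. Note that only fingerprints, never proposal text, are stored or compared.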
NSF’s blockchain development work gained traction early last year from the General Services Administration’s 10x program — a kind of incubator investment fund that supports promising technology projects that can scale across the federal government or improve public services.
“The blockchain part is important,” Sivagnanam said, “because the proposals may come in two or three years after one another. So you need to make sure there are immutable records…with the blockchain monitoring what historically has happened.”
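One way to picture the "immutable records" point is a hash-chained log, in which each record stores the hash of the one before it, so altering an old entry breaks every later link. The sketch below is a simplified, single-machine illustration using Python's standard `hashlib`. It is not a distributed blockchain (there is no network or consensus), and the record fields, identifiers and years are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first record


def record_hash(body):
    """SHA-256 over a canonical JSON encoding of the record body."""
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_record(chain, proposal_id, fingerprint, year):
    """Append a record that commits to the hash of the previous record."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    body = {
        "proposal_id": proposal_id,
        "fingerprint": fingerprint,
        "year": year,
        "prev_hash": prev_hash,
    }
    chain.append({**body, "hash": record_hash(body)})


def verify(chain):
    """Return True only if every hash and every link is intact."""
    prev_hash = GENESIS_HASH
    for record in chain:
        body = {k: v for k, v in record.items() if k != "hash"}
        if record["prev_hash"] != prev_hash or record["hash"] != record_hash(body):
            return False
        prev_hash = record["hash"]
    return True


chain = []
append_record(chain, "P-001", [0.80, 0.20, 0.07], 2017)
append_record(chain, "P-002", [0.75, 0.25, 0.10], 2019)

valid_before = verify(chain)

# Tampering with an older record is detected on the next verification.
chain[0]["fingerprint"] = [0.0, 0.0, 1.0]
valid_after = verify(chain)
```

Here `valid_before` is `True` and `valid_after` is `False`, which is the property that lets a proposal arriving years later be compared against a history no one has quietly rewritten.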
Aronson made it clear that the process is, for now, still in “manual” mode. “We’ve got the logic together that allows us to create the fingerprints. And we’re able to identify potential overlap. But in order to make it a real-time capability, we need something that will automatically convert a proposal to a fingerprint, add it to the blockchain and then communicate to everybody else on the chain, so it would be a constant conversation,” she said.
“Because we really didn’t have everything in place two or three years ago, this idea really wouldn’t have been actionable,” she said.
She now believes the time is no longer far off when, if a proposal comes to NSF, but is better suited for the Department of Energy, it could be routed and flagged within minutes instead of months. “It would allow us as a federal government to be more efficient at our distribution of innovative proposal ideas.”