Part 3 of the Evolving Government series with Dcode42
The promises of insight and efficiency from machine learning are large. Data scientists across the government are building models to predict everything from weather patterns and disease outbreaks to social unrest in foreign countries and insider threats. The unheralded first step in this process is preparing training data for machine learning algorithms. Without it, the machines have nothing to learn from.
Preparing the data, however, is tedious work. A recent episode of HBO’s Silicon Valley parodied a guest lecturer at Stanford trying to trick his class into labeling data for his algorithm. Hand annotation is one solution: Amazon’s Mechanical Turk is one of several services where you can hire people to label your training data. This works well when the task is straightforward for a human but still hard for a computer.
The tasks can be hard for humans too. This is often true in the public sector, where the information and labels the government needs for its predictions require experts to get right. In cases where the labels require nuanced knowledge, organizations usually do the work themselves. “When I was a fellow at Harvard’s Institute for Quantitative Social Science, we had a great labeling resource for our tough, multilingual data sets: undergrads,” said Patrick Lam, the lead data scientist at Thresher.
But often in government agencies, there are only a few experts with the skills to understand nuanced data, or the data is too sensitive to outsource. A D.C.-based startup called Thresher aims to tackle this challenge. “Our clients are not trying to distinguish between a Chihuahua and a blueberry muffin,” noted Dr. Evann Smith, a senior data scientist at Thresher. “They need labels that teach computers the difference between nuanced concepts, such as news vs. propaganda in multiple Arabic dialects, nuclear waste vs. weapons discussions on social media, and hereditary vs. acquired myopathies based on doctors’ clinical notes. Getting these labels wrong can drastically change the outcomes of government data science efforts.”
Thresher’s approach is to use keywords as proxies for labels. Keywords work well because a single keyword can label many documents and provides a transparent way to check the labels. The problem with keywords is that people aren’t very good at coming up with them, because human language is hard. We invent language all the time: to be funny, to shape a narrative, or to hide. Thinking of every slang term, idiom, and codeword in every language is impossible.
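The keyword-as-proxy idea can be made concrete with a minimal sketch. This is an illustration of the general technique, not Thresher’s actual implementation, and the category names and keyword lists are invented for the example: any document containing one of a category’s keywords receives that category’s label, which makes every labeling decision easy to audit.

```python
# Illustrative keyword-based labeling (invented keywords, not Thresher's
# real implementation): each label owns a keyword list, and a document
# gets a label whenever any of that label's keywords appears in it.

KEYWORDS = {
    "nuclear_waste": ["spent fuel", "dry cask", "repository"],
    "nuclear_weapons": ["warhead", "enrichment", "fissile"],
}

def label_document(text, keyword_map=KEYWORDS):
    """Return the set of labels whose keywords appear in the text."""
    lowered = text.lower()
    return {label for label, words in keyword_map.items()
            if any(word in lowered for word in words)}

docs = [
    "Officials discussed dry cask storage at the repository site.",
    "The report covers fissile material and enrichment limits.",
]
labels = [label_document(doc) for doc in docs]
```

Because the rule is just substring matching against a visible list, a reviewer can trace any label straight back to the keyword that produced it.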
Thresher solves this problem for its clients with natural language processing and machine learning methods developed when Thresher’s chairman, Harvard professor Gary King, reverse engineered Chinese government censorship of Chinese social media. Chinese netizens were using codewords to hide from the Chinese government, so when King’s researchers went to label the data for modeling, they were missing key conversations. Thresher’s methods surface these rare words for experts to evaluate and use to label their training data.
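To give a flavor of how candidate codewords might be surfaced for expert review, here is a toy co-occurrence sketch. It is not the method described above, just a simple stand-in: starting from a known seed keyword, it counts which other words appear in the same documents, so a human expert can inspect the candidates and promote the real codewords to the label set.

```python
# Toy codeword discovery (an invented stand-in, not Thresher's method):
# rank words by how often they co-occur with a seed keyword, then hand
# the ranked candidates to a human expert for evaluation.
from collections import Counter

def candidate_codewords(docs, seed, top_n=3):
    """Count words appearing alongside the seed term, most frequent first."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        if seed in tokens:
            counts.update(token for token in tokens if token != seed)
    return [word for word, _ in counts.most_common(top_n)]

docs = [
    "protest at the river crossing tonight",
    "river crossing moved to friday",
    "weather is mild today",
]
candidates = candidate_codewords(docs, "river")  # "crossing" ranks first
```

In this fabricated example, "crossing" co-occurs with the seed in both matching documents, so it tops the list; real systems would use far more sophisticated statistics, but the expert-in-the-loop review step is the same.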
Thresher’s approach of using machine learning to create better data for machine learning is a bit meta, but it has been effective in a variety of industries, including government. They have labeled data for financial services companies trying to better predict retirement and wealth-creation events, and for research companies trying to segment people into categories based purely on the language they use.
See Thresher and other AI tools impacting government
Dcode42’s program on bringing artificial intelligence to the public sector has expanded Thresher’s understanding of the complicated data sets the government needs labeled, and of the opportunity for Thresher to help, from health care to homeland security. See Thresher and other Dcode42 companies at the Artificial Intelligence and Machine Learning Government Demo Day on July 25 to understand how these types of solutions can solve many mission-critical agency problems.