Machine learning is not like blockchain: there’s something useful there, beyond the hype. That’s why we’ve been quietly working on our machine-learning roadmap for quite some time already. We believe that machine-learning technologies can help project and program teams systematically improve the quality of the data they collect, save them time and money, and ultimately improve the decisions that are informed by their data.
Of course, most sophisticated, quality-obsessed data-collectors already use a range of statistical checks to identify potential data-quality problems. And while we’ve tried to make this practice as simple as possible (with our “quality checks”) and organizations like Innovations for Poverty Action have tried to make it as routine as possible (with their “high-frequency checks”), the trouble is that these checks require premeditation, statistical know-how, and manual effort to adapt the checks as data and needs change. Old-fashioned statistical checks are good and helpful, but we believe that the right application of machine-learning technologies can do considerably better – all with less need for premeditation, statistical know-how, or manual effort.
Note that I said “the right application of machine-learning technologies.” Amazon, Google, and others have already made the fundamentals of machine learning readily accessible to the world’s developers, and we could easily integrate some shallow, potentially-flashy machine-learning something-or-another in SurveyCTO, to capitalize on the hype and appear innovative. Instead, we’ve taken a slow, deliberate approach to building up the necessary infrastructure to leverage these technologies in the most effective way, in the service of data quality, and with privacy and data security at the absolute center of every design decision.
Key steps we’ve already taken:
- Humans who understand the context need to be able to look at data and know the difference between good and poor quality. After all, if a human can’t tell the difference, how can a machine? (Spoiler alert: humans need to train the machines.) We’ve actually been working on this problem since the founding of SurveyCTO, collecting information about question timing, random audio recordings, and more, and then making all of the collected data and corroborating quality information easily and safely reviewable. The addition of our Data Explorer in SurveyCTO 2.20 was a major step down this road, and we’ve been steadily expanding users’ ability to quickly and effectively review incoming data, including aggregate views, individual views, and the results of statistical checks.
- Data-quality review processes need to be built into every data-collection project. It’s not enough that we have tools that in theory allow teams to review the quality of incoming data; teams need to actually use those tools in a systematic way. In SurveyCTO 2.40 and 2.41, we made major improvements to our review and correction workflow options, making it easier and easier to systematically review some or all incoming submissions, make comments and corrections while reviewing that data, and smartly flag subsets of incoming submissions to review more closely (using, e.g., random selection and the results of automated quality checks).
- Those who review data have to be able to render a judgment on that data. In order to train machine learning algorithms to proactively identify potential quality problems, those who are reviewing data have to be able to classify its quality. In SurveyCTO 2.41, we introduced a simple classification system for just this purpose. For every reviewed submission, the reviewer must now classify it in one of four categories: Good (no problems found), Okay (minor or no problems), Poor (serious and/or many problems), or Fake (fake or fraudulent responses). While these categories are subjective and imperfect, we believe that they are simple and reasonably universal, and that they will give us the outcome classification data necessary to train potential machine-learning models.
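To make the steps above a bit more concrete, here is a minimal Python sketch of how reviewer classifications could become training labels. It is an illustration only, not SurveyCTO's actual implementation: the `Submission` record, its `failed_checks` field, and the `select_for_review` and `training_labels` helpers are all hypothetical names. Only the four category names come from the description above.

```python
import random
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class Quality(Enum):
    """The four reviewer classifications described above."""
    GOOD = "Good"  # no problems found
    OKAY = "Okay"  # minor or no problems
    POOR = "Poor"  # serious and/or many problems
    FAKE = "Fake"  # fake or fraudulent responses


@dataclass
class Submission:
    """Hypothetical record; field names are assumptions for this sketch."""
    key: str
    failed_checks: int = 0             # number of automated quality checks failed
    quality: Optional[Quality] = None  # set by a human reviewer, if reviewed


def select_for_review(submissions: List[Submission],
                      sample_rate: float,
                      rng: random.Random) -> List[Submission]:
    """Flag every submission that failed a check, plus a random sample of the rest."""
    return [s for s in submissions
            if s.failed_checks > 0 or rng.random() < sample_rate]


def training_labels(submissions: List[Submission]) -> Dict[str, str]:
    """Collect reviewer classifications as labels; unreviewed submissions are skipped."""
    return {s.key: s.quality.value for s in submissions if s.quality is not None}
```

In this sketch, a submission only contributes a label once a human has classified it, which mirrors the point above that humans need to train the machines.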
Our next steps:
- Enriching non-PII meta-data with machine-learning algorithms in mind. Some meta-data, like random audio recordings, might be very useful to human reviewers but not very useful to machine-learning algorithms. On the other hand, some other meta-data, like data from device sensors, might be very useful to machine-learning algorithms but less so to humans. We’re actively researching and experimenting with sensor data streams, in order to add new meta-data options to SurveyCTO. We’re particularly interested in ways that we can condense sensor data down into non-PII (non-personally-identifiable) statistics that might help machine-learning algorithms predict the quality of a submission without posing any risk of revealing sensitive data.
- Piloting and experimenting with different approaches to machine learning. While we couldn’t really move forward with machine-learning research and experimentation without the kind of outcome classification we built into SurveyCTO 2.41, now we can. So we’ve begun discussions with key partner organizations about pilot projects that would collect and classify data in a way that would allow us to begin testing our theories about how machine-learning models might be applied. As we expand the range of meta-data available from, e.g., device sensors, those efforts to pilot, research, and experiment will increase.
- Scaling up the most-promising machine-learning technologies. SurveyCTO’s review and correction workflow has already been designed with machine learning in mind, and in fact we even know just where a question like “Would you like to use machine learning technologies to help identify submissions requiring closer review?” would go. As our R&D and pilots produce promising machine-learning technologies, we plan to roll them out in SurveyCTO updates – and then continue to refine them as growing adoption brings us additional data and learning. Our hope is to accelerate the impact we can have on the quality of data collected all around the world.
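As a rough illustration of the first step above, condensing a sensor stream into a handful of summary statistics might look something like the Python sketch below. Everything in it is an assumption made for the example: the choice of accelerometer readings, the `summarize_motion` name, the particular statistics, and the `moving_threshold` parameter. Whether a given statistic is safe to treat as non-PII depends on the sensor and the context, and the sketch makes no claim about that.

```python
import math
import statistics
from typing import Dict, Sequence, Tuple


def summarize_motion(readings: Sequence[Tuple[float, float, float]],
                     moving_threshold: float = 1.0) -> Dict[str, float]:
    """Reduce raw (x, y, z) accelerometer readings to a few aggregate numbers.

    Only the aggregates are returned; the raw stream is not kept.
    The threshold and the chosen statistics are illustrative assumptions.
    """
    if not readings:
        raise ValueError("at least one reading is required")

    # Magnitude of each reading, which discards the direction of movement.
    magnitudes = [math.sqrt(x * x + y * y + z * z) for x, y, z in readings]

    return {
        "count": float(len(magnitudes)),
        "mean_magnitude": statistics.fmean(magnitudes),
        "sd_magnitude": statistics.pstdev(magnitudes),
        "max_magnitude": max(magnitudes),
        # Share of readings whose magnitude exceeds the threshold.
        "share_above_threshold": sum(m > moving_threshold for m in magnitudes)
        / len(magnitudes),
    }
```

A model would then see only a short vector of numbers per submission rather than the underlying sensor data, which is the general idea behind predicting quality without having to see most of the data.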
As with everything we do, it’s taking us longer to leverage machine-learning technologies in part because of our focus on data security. If you wanted to throw all of your raw data, including photos and audio recordings, into the cloud and pass it along to Amazon, Google, or others to churn through their machine-learning systems, you might be able to begin leveraging machine learning much more quickly. But we believe that your data is too important, too sensitive, and too private to just chuck into the cloud and pass through all kinds of third-party systems. As always, our focus is on minimizing the number of people who have the ability to see sensitive data. And, in this world of machine learning, that means designing systems that allow machine-learning systems to make good decisions about data quality without having to see most of the data.
It’s a tall order, and it’s likely that not a single one of our competitors will even bother trying. They’ll just carry on building systems that allow loads of people – within their organization, in all of the organizations they partner with or utilize, etc. – to see sensitive data. We’ve just never allowed ourselves that luxury, and we don’t plan to start in this new world of machine learning. (Read about our guiding principles…)
It’s taken a long while to build up the quality-control infrastructure to (a) collect the data necessary for machine learning, (b) institute the quality-control processes necessary to give machine-learning technologies a place to slot in, and (c) build up a large, worldwide user base of data-collectors who can benefit from these technologies. But now that we’ve got all that, we’re entering into an exciting phase. Stay tuned to see how it plays out in the coming months…