Thursday, March 17, 2011

Crowdsourced data is not a substitute for real statistics

Guest Beneblog by Patrick Ball, Jeff Klingner, and Kristian Lum

After the earthquake in Haiti, Ushahidi organized a centralized text messaging system to allow people to inform others about people trapped under damaged buildings and other humanitarian crises. This system was extremely effective at communicating specific needs in a timely way that required very little additional infrastructure. We think that this is important and valuable. However, we worry that crowdsourced data are not a good data source for doing statistics or finding patterns.

An analysis team from the European Commission's Joint Research Center analyzed the text messages gathered through Ushahidi together with data on damaged buildings collected by the World Bank and the UN from satellite images. Then they used spatial statistical techniques to show that the pattern of aggregated text messages predicted where the damaged buildings were concentrated.

Ushahidi member Patrick Meier interpreted the JRC results as suggesting that "unbounded crowdsourcing (non-representative sampling) largely in the form of SMS from the disaster affected population in Port-au-Prince can predict, with surprisingly high accuracy and statistical significance, the location and extent of structural damage post-earthquake."

One problem with this conclusion is that there are important areas of building damage where very few text messages were recorded, such as the neighborhood of Saint Antoine, east of the National Palace. But even the overall statistical correlation of text messages and building damage is not useful, because the text messages are really just reflecting the underlying building density.

Benetech statistical consultant Dr. Kristian Lum has analyzed data from the same sources that the JRC team used. She found that after controlling for the prior existence of buildings in a particular location, the text message stream adds little to no useful information to the prediction of patterns of damaged building locations. This is not surprising, as most of the text messages in this data set were requests for food, water, or medical help, rather than reports of damage.

In fact, once you control for the presence of any buildings (damaged or undamaged), the text message stream seems to have a weak negative correlation with the presence of damaged buildings. That is, the presence of text messages suggests there are fewer (not more) damaged buildings in a particular area. It may be that people move away from damaged buildings (perhaps to places where humanitarian assistance is being given) before texting.
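A minimal sketch of this logic on synthetic data (not the analysis Dr. Lum ran, and not the Haiti data): every coefficient below is invented for illustration, and the quantities are stylized indices per grid cell rather than literal counts. It shows how text messages can correlate strongly with damage overall, yet carry a negative partial correlation once building density is controlled for.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def simulate(n_cells=2000, seed=0):
    """Synthetic grid cells; all coefficients are assumptions, not estimates."""
    rng = random.Random(seed)
    buildings, damaged, sms = [], [], []
    for _ in range(n_cells):
        b = rng.uniform(0, 100)       # building density index
        excess = rng.gauss(0, 5)      # damage beyond what density alone predicts
        d = 0.3 * b + excess          # damage index mostly tracks density
        # Texts mostly track density, and dip slightly where damage is unusually high.
        s = 0.1 * b - 0.05 * excess + rng.gauss(0, 1)
        buildings.append(b)
        damaged.append(d)
        sms.append(s)
    return buildings, damaged, sms


buildings, damaged, sms = simulate()
raw = pearson(sms, damaged)
controlled = partial_corr(sms, damaged, buildings)
print(f"raw correlation of texts and damage:      {raw:.2f}")
print(f"partial correlation, controlling density: {controlled:.2f}")
```

In this toy setup the raw correlation is strongly positive only because both series follow building density; the partial correlation is negative by construction.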

Here's the bottom line: if you have a map of buildings from before the earthquake, you already know more about the likely location of damaged buildings than if you relied on an SMS stream, based on the Haiti data presented. That is, to find the most damaged buildings, you should go to where there are the most buildings! The text message stream doesn't help the decision process. Indeed, it would seem to be slightly more likely to lead you to areas that have fewer damaged buildings. Crowd-sourcing has many valuable uses in a crisis, but identifying spatial patterns of damaged buildings isn't one of them.

14 comments:

cathbuzz said...

Good post, and I agree with your assessment. There are many reasons that text messages are not a good indicator of damage or other problems at a disaster site. For example, some areas may not be on the grid any longer after a disaster--we have seen this issue with the recent Japanese tsunami. Additionally, no text messages could indicate no survivors or no survivors that use that technology. Thanks for the post!

David Sasaki said...

I appreciate this analysis, and especially the links, but it seems to me that the title is both an abstraction and distraction from the issue at hand. What I still don't understand is how one distinguishes between "crowdsourced data" and "real statistics."

If I understand this blog post correctly, it points out that while text messages submitted to Ushahidi correlate to buildings that were destroyed in the earthquake, text messages do not correlate to buildings that were destroyed in neighborhoods where no text messages were sent.

I think that is a fairly obvious observation that we can all agree on. If we want to get data to relief workers about where to prioritize their efforts, it seems that text message reports are just one stream of data that needs to be combined with the inspection of satellite imagery (ideally by many people, i.e., "crowdsourcing"), monitoring of media reports, and field reports by professional humanitarian workers.

What I don't understand is why the title of this post makes it seem like an either/or distinction when it is a lack of information that is still the greatest challenge to humanitarian efforts.

Jon Gos said...

Great post, thanks for sharing your thoughts. I wanted to chime in to second what David states above. I don't think there's been a scenario we've seen where crowdsourced data was solely relied upon for making critical decisions.

In fact, a number of traditional platforms used by disaster response organizations of all types are seeking methods of incorporating crowdsourced data into their datasets. Crowdsourcing simply provides an excess of data, far more efficiently, and quicker than previous methods of data collection.

With integration into systems like ArcGIS, ESRI, Google Earth, and others, aggregated human volunteered data becomes an additional datasource. So I think the whole industry agrees with the title of this post, that crowdsourced data certainly isn't a replacement for anything, nor is it always comprehensive in coverage (but then again neither are methods of polling or surveying). However, it's sometimes a useful addition.

What crowdsourced data does provide is an admittedly chaotic dataset that may contain insight or additional context for what is occurring on the ground as assessed by those who are witnessing the event first-hand. Much like in a legal system, witnesses simply add context (and occasionally confidence) for those who ultimately have to make decisions. And also like in a court, if those decision making humans are doing their jobs, they're taking into account more than just the accounts of witnesses.

In fact, this has been a critical distinction made by many of the qualitative assessments of the emergency related deployments of our platform. You can find all of these collected reports and research papers at http://community.ushahidi.com

Finally, the role visualization tools provide (maps of crowdsourced data) is one of translation: helping the data make sense to people who don't have the luxury of being 'real' statisticians but who may, in fact, be interested in understanding an event on a superficial level. As a non-Ushahidi example, the data visualizations of news organizations like the New York Times or The Guardian are standard journalistic practice now. Sometimes the visual representation of statistical findings, though flawed, offers a better method of presenting complex datasets to non-mathematicians.

But then again, this entire conversation assumes that 'real statistics' are, or have ever been, infallible. I think the odds are against you on that one. ;-)

- Jon Gosier, Ushahidi

Jim Fruchterman said...

The key question we’re wrestling with is whether a source of information adds significant value over other sources of information, for making a certain kind of decision. Let’s imagine a difference between little bits of truth and a big overall picture. The first three phone calls I get, say as the fire department dispatcher, after a disaster are each one bit of truth. The first call may be: was that an earthquake? The next may be, there’s a water main broken at State and 4th Streets. The third may be, send a rescue team to 123 Maple Street because a building has collapsed. Each of these bits of truth is true. But, they don’t give me the big picture of the disaster: they give me individual pixels.

But, let’s say that my job is to coordinate overall disaster response. Where do I direct teams to make maximum impact? My goal is then not to simply respond to each bit of information: it’s to try to get a sense of the big picture to make the best decision. Let’s say I’m measured on saving the most lives with a limited amount of resources. In my example above, only one of those three bits of information dealt with something that might save a life. And, 123 Maple Street may not have any people trapped and needing saving.

In normal times, the responder doesn’t have a difficult decision. Send a couple of fire trucks to 123 Maple Street, and send the water department to State and 4th. But, in a big disaster, I need more information.

The key point of our blog post is that if you have to react immediately to the Haitian earthquake, you’d be better off with a detailed building map (or maybe a really good set of satellite imagery), rather than the text message stream, to get the big picture of where you would find the densest areas of building collapse. You don't need to crowdsource, you can just look at the map. If you have more time, you can get much better data with the remote sensing approach (satellite or aerial pictures before and after the disaster). Our point is that there is a better, cheaper, easier to find source than the text stream, and consequently, the text stream doesn't add value to big picture decisionmaking.

Why is that? Because putting the source location of a bunch of text messages on a map, only some of which are about life-threatening issues, is not the best predictor of where buildings collapsed, where there was the most severe damage. A building map is comprehensive: it gives you the overall picture of where buildings are. The text-stream is a bunch of pixels that only partially describe the big picture. Each one is true (we’re happy to assume this), but only a tiny slice of the big picture, the patterns we care about.

Jim Fruchterman said...

Continuing on David's question:

More technically, how you distinguish between crowdsourced data and real statistics is that real statistics use a model that adjusts for what isn't in the sample. Real statistics have some way of thinking about what's excluded from the sample. That can be through probability sampling and projection via sampling weights, or through modeling the sampling process via multiple systems estimation, or something similar. The key point is that real statistics enable the analyst to make estimates that adjust for what is not observed (i.e., "unbiased estimates"). Even more importantly, real statistics enable the analyst to estimate the variance of those estimates, what lay people sometimes call the "margin of error."
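As a hedged illustration of "modeling the sampling process," here is the simplest two-list special case of multiple systems estimation (Chapman's form of the Lincoln-Petersen estimator). The counts are hypothetical, and the method assumes the two lists are independent and cover a closed population; real analyses typically use three or more lists and richer models. The sketch yields both an estimate of the total, including what neither list observed, and an approximate margin of error.

```python
import math


def chapman_estimate(n1, n2, m):
    """Chapman's bias-corrected two-list population estimate.

    n1, n2: number of records on each list; m: records appearing on both.
    Returns (estimated total, approximate standard error).
    """
    n_hat = (n1 + 1) * (n2 + 1) / (m + 1) - 1
    var = ((n1 + 1) * (n2 + 1) * (n1 - m) * (n2 - m)) / ((m + 1) ** 2 * (m + 2))
    return n_hat, math.sqrt(var)


# Hypothetical counts, invented for illustration only.
list_a, list_b, overlap = 400, 300, 60
total, se = chapman_estimate(list_a, list_b, overlap)
observed = list_a + list_b - overlap          # distinct records actually seen
low, high = total - 1.96 * se, total + 1.96 * se
print(f"observed {observed}, estimated total {total:.0f} "
      f"(approx. 95% interval {low:.0f} to {high:.0f})")
```

The gap between the observed count and the estimated total is the part of the population that neither list captured, which a raw tally of reports cannot reveal.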

Without adjusting for sampling bias, all data gathered without a random basis -- including crowdsourced samples -- are subject to many unknown biases. They're called "convenience samples" because they're samples that were convenient to the researcher. They have no necessary relationship with the underlying pattern of reality. Indeed, convenience samples will correlate with reality only by coincidence.

In the case of using text messages to predict where the most buildings were damaged, we found that once you control for a really obvious thing you almost certainly could know easily (where there already are buildings), putting text messages on the map would slightly lead you away from the areas of most damage. Why? We don’t know why, but there could be many reasons. Perhaps the SMS short code was not well known throughout the area. Maybe a cell tower was down in the area of greatest damage. Maybe SMS texts asking for water or food outnumbered those dealing with building damage, and were more likely to come from neighborhoods with less damage? We don’t know.

The benefit of crowd-sourced information is not getting the big picture. It’s getting little bits of the big picture, often in areas where you would otherwise get no information (so, we're agreeing on that point). Our advice is not to disregard those little bits of truth, but also not to weigh them down by expecting them to tell us all or most of the big picture truth.

David Sasaki said...

Hi Jim,

Thanks for the response. It seems to me that you're comparing apples and oranges in your distinction of "crowdsourced data" and "real statistics."

"More technically, how you distinguish between crowdsourced data and real statistics is that real statistics use a model that adjusts for what isn't in the sample. Real statistics have some way of thinking about what's excluded from the sample."

This conflates a data input (i.e., text messages or location of buildings) with statistical methods like probability sampling. Of course it is possible to apply probability sampling to all sorts of data, including what you call "crowdsourced data." What we're really talking about here is reliability of data - and in this specific post only two sources of data are cited - 1,645 text messages from the Corbane dataset and a GIS dataset of damaged buildings from UNITAR.

Let's take it as given that a map of buildings in Port au Prince is the best initial predictor of where damaged buildings might be found. Are you then suggesting that humanitarian workers should ignore all other types of information, including SMS reports from citizens asking for help? I understand the importance of representative sampling. For example, here in Mexico 60% of Twitter users are said to reside in the capital city. So to treat Twitter as representative of national opinion would be ludicrous. But to ignore Twitter completely because it is not representative seems equally ludicrous.

Finally, I think it is important to add a footnote to the sentence "Benetech statistical consultant Dr. Kristian Lum has analyzed data from the same sources that the JRC team used." From KL's post:

Because the paper does not list the exact boundaries they used to define Port-au-Prince in their data set, I tried to recreate their data set based on the number of events they reported to have included in the analysis and guessing what the boundaries of their plots were by finding landmarks on a map. After many hours of trying to find a subset of these larger datasets to match SMS and building damage data sets used in the above analysis perfectly, I emerged with something that is hopefully sufficiently similar. Although it looks like I cut off a little bit of space over on the right when trying to match their dataset, for all intents and purposes, I think I've got the same thing. They've got 1645 SMS messages, and I've got 1651. They use 33,800 damaged building locations, while I use 33,153. Although the plots that I have reproduced (Figures 3 and 4) are not *exactly* the same as those presented in the paper (above), I think they are similar enough to conclude I am doing the same thing they are given that the datasets are slightly different and some of these plots require some tuning parameters.

Jim Fruchterman said...

Hey, David and Jon, thanks for continuing the conversation. I'll just keep circling back to our narrow focus: the usefulness of crowd-sourced data, for doing statistics, for finding patterns.

We're focused on a specific claim from Ushahidi about the use of crowdsourced data to do statistics. From the original Ushahidi blog:

"One of the inherent concerns about crowdsourced crisis information is that the data is not statistically representative and hence 'useless' for any serious kind of statistical analysis. But my colleague Christina Corbane and her team at the European Commission’s Joint Research Center (JRC) have come up with some interesting findings that prove otherwise. They used the reports mapped on the Ushahidi-Haiti platform to show that this crowdsourced data can help predict the spatial distribution of structural damage in Port-au-Prince."

Our point is that the cited research does not prove what it claims to prove, and that crowdsourced data is not a great predictor of spatial damage. And, yes, our contention is that crowdsourced data is not all that useful for serious statistical analysis.

The fitness of a particular tool is based on the job you want to do with it. And, its fitness is not an absolute: it's relative to other tools that are available.

A critique of the usefulness of a tool for a particular job frequently gets interpreted as a blanket critique of the tool. Pointing out that a hammer is not a great tool for opening beer bottles is not the same thing as saying we should stop using hammers to pound in nails (but have you seen my cool new nail gun?).

So, please do not interpret a narrow and specific scientific critique of a scientific claim about crowdsourcing as a blanket critique. SMS is great for some things, and not so great for other things.

Jim Fruchterman said...

Specifically onto David's Twitter comment: you've actually made our narrow point eloquently here, and I want to emphasize our agreement.

If you analyze the Mexican Twitter feed for comments about politicians, you've learned some interesting things about how some people in Mexico feel about those politicians. To use my earlier language, each of those comments is a bit of truth. Some bits of truth are more interesting than others.

You might even do some descriptive statistics: on the Twitter stream, candidate A is mentioned twice as often as candidate B. But, nobody is talking about candidate A's haircut, and hundreds of people mentioned candidate B's haircut. So far, so good.

It's when people leap to a conclusion about a big picture question that you'll hear from scientists like us. In the Mexican political example, there might be somebody who says that because candidate A is mentioned twice as often as candidate B, that means he'll get elected. A ludicrous claim, I'm sure you agree.

And, if I were a pollster (or a political scientist) in Mexico, I'd point out why doing statistics on the Mexico City dominated twitter feed is not as good as doing a real survey of likely voters.
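To make that concrete, here is a minimal sketch of post-stratification, one simple way of adjusting for who is over-represented in a sample. All numbers are hypothetical: only the 60% capital share of Twitter users echoes the figure quoted earlier in this thread, while the population shares and support rates are invented for illustration.

```python
def naive_share(sample):
    """Unweighted share of supporters: treats the sample as if it were the population."""
    total = sum(n for n, _ in sample.values())
    supporters = sum(n * p for n, p in sample.values())
    return supporters / total


def poststratified_share(sample, population_share):
    """Reweight each region's observed support by its share of the target population."""
    return sum(population_share[region] * p for region, (_, p) in sample.items())


# region -> (users sampled, share of them backing candidate A); hypothetical values
sample = {"capital": (600, 0.70), "elsewhere": (400, 0.40)}
# assumed share of likely voters living in each region; hypothetical values
population_share = {"capital": 0.20, "elsewhere": 0.80}

naive = naive_share(sample)
weighted = poststratified_share(sample, population_share)
print(f"naive estimate:    {naive:.2f}")
print(f"weighted estimate: {weighted:.2f}")
```

With these made-up inputs the naive figure is 0.58 and the reweighted one is 0.46, so the apparent majority disappears. Note that this only corrects for region; it does nothing about how Twitter users differ from non-users within each region, which is why a designed survey is still the better tool.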

As Jon points out, not all statistics end up being correct. But, they have a very helpful characteristic: they tend to be right about claims (that are amenable to stats) a lot more often than any other way I'm aware of. But, I'd have to do some research to back that claim up statistically!

Jon Gos said...

Jim,

If a tool's fitness cannot be absolute, then neither can its fallibility.

You give a great example with the hammer and nail gun. Your point is that people should only use the best tool for a given job. From that stance, we have to leave people free to discover what the best tools for that job will be. And that means exploring the viability of new tools and methodologies that others will be quick to point out don't work (or aren't as good as their alternatives). Of course this doesn't imply that critique of the method isn't welcome or expected.

In the quote you've included from our site, we cite a statistician who found a correlation between one crowdsourced dataset collected during the 2010 Haiti earthquake and research on the same subject produced with traditional methods. They seem to have found a useful correlation. That's one dataset, around one event. My point above is that even in this scenario, I doubt it was the only source of information used for critical decision making.

The result in this particular case was that crowdsourcing was found to be useful by that group. The question then becomes "for what?". And if the answer is 'real statistics' then I'm sure you're going to have a big problem with that. ;-)

However, I'm not sure where they, or we, ever made the claim that it was the BEST tool for all scenarios -- it's just stated that crowdsourcing was found to be useful. If some are finding it useful, others will continue to investigate exactly how useful and for what.

Thanks for the lengthy discussion, we appreciate the analysis!

Jim Fruchterman said...

"If a tool's fitness cannot be absolute, then neither can it's fallibility."

Wow. This is so wrong and so pernicious, that I'll need more time to fully articulate the ways it's wrong.

Probably worth a fresh blog post rather than a comment. Stay tuned!

David Sasaki said...

One problem I have with this discussion is that it routinely conflates three very different things: data inputs, tools, and analysis methods. As I interpret the original post, it is a criticism of the recognition given to text messages over building coordinates as a reliable data input to measure what geographic areas are most in need of humanitarian relief.

Jim, I still don't understand what it is that you're proposing. Are you suggesting that relief workers in Haiti gave too much significance to text messages and not enough to building locations? That would be an interesting argument if you could prove it, but I think a lot of data is missing.

Given the mobile phone penetration here in Mexico City and the corruption variable when it comes to getting building permits, I hope that relief workers don't ignore SMS messages of people asking for help the next time there is an earthquake here. And I hope that's not your suggestion.

Jim Fruchterman said...

David, will take a fresh tack on this as a new blog post, on why we're discussing statistics and claims about patterns. And why this matters.

But, in short, we're not talking about ignoring SMS messages, any more than we'd suggest ignoring 911 emergency calls.

Going back to the first paragraph of the original post:

"This system was extremely effective at communicating specific needs in a timely way that required very little additional infrastructure. We think that this is important and valuable. However, we worry that crowdsourced data are not a good data source for doing statistics or finding patterns."

Jim Fruchterman said...

More on this in a new Beneblog post.

Unknown said...

I have also heard a couple different questions. OP was that Social Media is not a good predictor of building damage. ok.
It _is_ however an awesome predictor of where a human is underneath a damaged building.

As for social media and statistical regression: there are types of electronic interaction that could provide tremendously valuable statistical data, but those would be inherently different than systems designed to be responsive to individual need.

I agree with the commenter above who noted that the actual intent of the post and its subsequent conclusions are not logically tied together.

Humanitarians are looking for validation of their premises and toolset. Engineers might be more interested in OpenStreetmap style crowdsourcing.

My disappointment is that the conversation is about passive monitoring of social channels rather than a call to organize the myriad of awesome tools out there into an ecosystem of citizen-centric life-assurance technology.