How to Find Meaning in a Maelstrom of Data
by Larry Greenemeier
http://blogs.scientificamerican.com/observations/2011/12/16/how-to-find-meaning-in-a-maelstrom-of-data/
All of the data in the world—and the amount is growing at a frightening rate—won’t help researchers solve the big problems if they can’t make sense of it. Which is why a team of researchers from Harvard University and the Broad Institute of Harvard and M.I.T. has developed analytical data-mining software that can find an oasis of meaning in a desert of numbers. They’ve used the software to find insights on the socioeconomic impact of obesity, bacteria in the gut and baseball.
The software teases out relationships among data points (potentially millions of them) and measures the strength of these connections. As the researchers report in a paper appearing in the December 16 issue of the journal Science, most data-mining tools used today can either find correlations between data or determine how solid those connections are—few can do both.
“When we started this project we wanted a way to summarize what was in these datasets in a very simple way, asking what were the variables in these datasets that are most strongly associated,” says David Reshef, a co-first author of the paper and graduate student in the Harvard-M.I.T. Health Sciences and Technology program. “It’s a very simple question but it turned out to be very complicated because variables can be related in lots of different ways and there are various methods for finding different patterns.”
David Reshef—working with younger brother Yakir Reshef, Broad Institute associate member Pardis Sabeti and Harvard computer science professor Michael Mitzenmacher—tested the tool on social, economic, health and political data from the World Health Organization (WHO) and its partners. The data pool was large, covering 200 countries and containing 357 data variables per country, including household income and obesity.
The tool is part of a larger program the researchers call MINE (Maximal Information-based Nonparametric Exploration). In the WHO test, it examined every possible pair of variables (more than 60,000 pairs) and produced a list of relationships ranked by the strength of one variable’s statistical dependence on the other (that is, how strongly the two variables are related).
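The all-pairs search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scoring function is a crude binned mutual-information estimate chosen as a stand-in, not the statistic MINE actually uses, and the function names and toy table are hypothetical. It also checks the arithmetic: 357 variables give 63,546 distinct pairs, which matches the "more than 60,000" figure.

```python
import math
from itertools import combinations


def pair_count(n_variables):
    """Number of unordered pairs that can be formed from n_variables columns."""
    return math.comb(n_variables, 2)


def binned_mutual_information(xs, ys, bins=4):
    """Crude dependence score: mutual information (bits) on an equal-width grid.

    Stand-in for illustration only; this is NOT the statistic used by MINE.
    """
    def to_bins(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0  # a constant column falls into one bin
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = to_bins(xs), to_bins(ys)
    n = len(xs)
    joint, px, py = {}, {}, {}
    for i, j in zip(bx, by):
        joint[(i, j)] = joint.get((i, j), 0) + 1
        px[i] = px.get(i, 0) + 1
        py[j] = py.get(j, 0) + 1
    # Sum of p(i, j) * log2(p(i, j) / (p(i) * p(j))) over occupied grid cells.
    return sum(
        (c / n) * math.log2(c * n / (px[i] * py[j]))
        for (i, j), c in joint.items()
    )


def rank_pairs(table, score=binned_mutual_information):
    """Score every pair of columns and return (score, a, b), strongest first."""
    scored = [
        (score(table[a], table[b]), a, b)
        for a, b in combinations(sorted(table), 2)
    ]
    return sorted(scored, reverse=True)


# Hypothetical toy table (column name -> values); not the WHO data.
x = list(range(-8, 9))
table = {
    "x": x,
    "linear": [2 * v + 1 for v in x],
    "parabola": [v * v for v in x],
    "flat": [5 for _ in x],
}
ranking = rank_pairs(table)
for score, a, b in ranking:
    print(f"{a:>8} ~ {b:<8} {score:.3f}")

print(pair_count(357))  # pairs among the 357 WHO variables
```

The number of pairs grows roughly with the square of the number of variables, which is why a ranked list matters more than the raw scores once the dataset is large.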
One identified relationship, for example, was between household income and female obesity. From this pairing, the researchers saw that the data from many countries follow a parabolic curve, with obesity rates rising with income but peaking and tapering off after income reaches a certain level. However, in the Pacific Islands, where female obesity is a sign of status, the rate of obesity followed a completely separate trend from the rest of the countries in the world, climbing rapidly even at low incomes.
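A rise-then-fall pattern like this one is a useful illustration of why a single linear measure can miss a real relationship. The sketch below uses synthetic numbers (not the WHO data) shaped as an inverted U and computes the ordinary Pearson correlation by hand. The variable names and units are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Synthetic inverted-U data: the rate climbs with income, peaks, then falls.
income = list(range(0, 21))                      # hypothetical income units
rate = [100 - (x - 10) ** 2 for x in income]     # hypothetical rate

r_all = pearson(income, rate)             # whole curve: close to zero
r_rising = pearson(income[:11], rate[:11])  # rising half only: strongly positive

print(f"whole curve: {r_all:.3f}")
print(f"rising half: {r_rising:.3f}")
```

Over the whole curve the upward and downward halves cancel, so the linear correlation is essentially zero even though one variable is completely determined by the other.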
The idea is to use MINE to generate new ideas and connections that no one has thought to look for before, says Yakir Reshef, a co-first author of the paper and a Fulbright scholar at the Weizmann Institute of Science in Israel. “The interdisciplinary nature of the project shows to us the widespread application of this work,” he adds. “It doesn’t matter whether it’s global health data, genomic data or Internet search statistics—on some level it’s all the same.” The researchers explain their work in more detail on their Web site and in a video accompanying their paper.
In another test, they took nearly 6,700 pieces of data, collected by Harvard colleague Peter Turnbaugh, on microorganisms that live in the gut. The software made more than 22 million comparisons and narrowed in on a few hundred patterns of interest that had not been observed before.
The researchers also tested the software on baseball. They found that the statistics most closely related to a player’s salary were hits, total bases and an aggregate statistic that reflects how many runs a player generates for a team. During the 2008 season the Tampa Bay Rays, Atlanta Braves and current world-champion St. Louis Cardinals (not surprisingly) proved to have the fewest overpaid players relative to the number of “overperforming” players on their rosters. Predictably, the New York Yankees finished dead last. It’s not easy to find overperforming players when your payroll is the highest in baseball.