Category Archives: statistics

When is an average not an average?

First, I have not followed up on the reference in this Instapundit blog post, so I do not know whether this is really how the British Met Office computes average temperatures:

In fact, the Met still asserts we are in the midst of an unusually warm winter — as one of its staffers sniffily protested in an internet posting to a newspaper last week: “This will be the warmest winter in living memory, the data has already been recorded. For your information, we take the highest 15 readings between November and March and then produce an average. As November was a very seasonally warm month, then all the data will come from those readings.”

But suppose for a minute that the average for a whole season (or year) really is based on only fifteen readings, and that all (or most) of the highest fifteen came from November. Would that really represent the average for the winter? Can you think of why they would not use every day’s reading from November through March? Should they use just the high for the day, or the high and the low for each day? And where should the readings come from? How many locations around England would be enough for what would seem like a good average calculation? What if averages in the past were computed differently? Could you make comparisons?

Average seems like such an easy concept – but it often is quite tricky, to say nothing of trying to understand the variability of temperatures. (Do they compute standard deviations?)

For example, the fifteen-day record could occur in a season where the other one hundred and thirty-five days were pretty cold…
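To see how far apart the two calculations can be, here is a minimal Python sketch. The readings are invented for illustration (fifteen mild days and one hundred and thirty-five cold ones); they are not Met Office data, and the function name top_n_average is my own.

```python
from statistics import mean

# Hypothetical daily readings in degrees C, NOT real Met Office data:
# 15 mild days (think November) followed by 135 cold days.
readings = [12.0] * 15 + [2.0] * 135


def top_n_average(values, n=15):
    """Average of only the n highest values."""
    return mean(sorted(values, reverse=True)[:n])


top15 = top_n_average(readings)   # uses only the 15 warmest days
overall = mean(readings)          # uses every day in the season

print(f"Average of the 15 highest readings: {top15:.1f}")
print(f"Average of all {len(readings)} readings: {overall:.1f}")
```

With these made-up numbers the "highest fifteen" average says nothing about the other one hundred and thirty-five days, which is exactly the worry.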

What do we know?

I am naturally somewhat of a skeptic, but it is hard to not read something and think about how smart the scientists are, and how much they know, and what will soon be possible. One of the things that is really interesting is brain research. Suppose we could know how we work?

All kinds of new tools allow researchers to “watch” your brain work. Or can they? How do they “know”?

Here is some interesting reading about the difficulties in measuring, knowing and statistics.

Here is another that makes you think twice when we assume that science is a clean process that follows a straight line to the truth, getting it right along the way.

And here’s a cartoon view of how this works (or doesn’t work).

Simpson’s paradox

Averages seem like simple things, but not always. Consider this simple baseball example:

Tony and Joe are competitive friends and so they compare batting averages. At the All-Star break, Tony is batting .300 and Joe is only batting .290. Joe mentions that batting in the second half of the season is more important, and so he and Tony agree to compare their batting averages for the second half of the season (and only the second half). When they finally meet, it turns out that Tony batted .390 in the second half of the season. Joe did better, too, but only batted .375. Tony wins both halves of the season.

Question: whose batting average was higher for the entire year? Turns out we don’t know, and it could very easily be Joe! (I’ll post an example later)

What the paradox states is that averages for subgroups can show a relationship that reverses when the subgroups are combined, because each subgroup carries a different weight in the overall average (here, a different number of at-bats in each half). So Tony could win both halves of the season, but have a lower batting average for the entire season.
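Here is one way it can happen, sketched in Python. The hits and at-bats are made up; I picked them only so that each half matches the averages in the story (.300 and .390 for Tony, .290 and .375 for Joe).

```python
# Hypothetical (hits, at_bats) per half, chosen only to reproduce
# the batting averages in the story. These are not real statistics.
tony = {"first": (90, 300), "second": (39, 100)}
joe = {"first": (29, 100), "second": (150, 400)}


def half_average(player, half):
    """Batting average for one half of the season."""
    hits, at_bats = player[half]
    return hits / at_bats


def season_average(player):
    """Batting average over the whole season: total hits / total at-bats."""
    total_hits = sum(hits for hits, _ in player.values())
    total_at_bats = sum(at_bats for _, at_bats in player.values())
    return total_hits / total_at_bats


for name, player in (("Tony", tony), ("Joe", joe)):
    print(
        name,
        f"first half {half_average(player, 'first'):.3f},",
        f"second half {half_average(player, 'second'):.3f},",
        f"season {season_average(player):.4f}",
    )
```

Tony wins each half, but most of his at-bats came in his weaker first half, while most of Joe's came in his stronger second half. With these numbers Joe finishes the season at 179 for 500 and Tony at 129 for 400, so Joe has the higher season average.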

I do not know the details about Climategate, but it is very interesting to me, with all the statistics. Are average temperatures going up or down or whatever? Here’s an article that, about midway through, mentions this batting average paradox. I wonder if I have a new example of the paradox involving averages.

Means and medians

Here is an article from the Wall Street Journal about the housing market in Detroit. Here’s how the story starts:

On a grassy lot on a quiet block on a graceful boulevard stands the answer to a perplexing question: Why does the typical house in Detroit sell for $7,100?

The brick-and-stucco home at 1626 W. Boston Blvd. has watched almost a century of Detroit’s ups and downs, through industrial brilliance and racial discord, economic decline and financial collapse. Its owners have played a part in it all. There was the engineer whose innovation elevated auto makers into kings; the teacher who watched fellow whites flee to the suburbs; the black plumber who broke the color barrier; the cop driven out by crime.

The last individual owner was a subprime borrower, who lost the house when investors foreclosed.

Then the article cites some statistics:

And the median selling price for a home stood at a paltry $7,100 as of July, according to First American CoreLogic Inc., a real-estate research firm — down from $73,000 three years earlier. A typical house in Cleveland sells for $65,000. One in St. Louis goes for $120,000.

Now I understand that median house prices are the standard way of talking about what is “typical.” Using the mean creates problems when most houses are of one value, but there are some outlier, high-priced houses.
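A quick Python sketch of that problem, using invented prices (nine ordinary houses and one very expensive one, not figures from the article):

```python
from statistics import mean, median

# Hypothetical sale prices in dollars, for illustration only:
# nine ordinary houses and one high-priced outlier.
prices = [100_000] * 9 + [2_000_000]

mean_price = mean(prices)
median_price = median(prices)

print(f"Mean price:   ${mean_price:,.0f}")
print(f"Median price: ${median_price:,.0f}")
```

The single expensive house pulls the mean well above what nine of the ten houses sold for, while the median stays at the ordinary price.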

But in this article, is median misleading in a different way? Did the house on Boston Blvd sell for $7,100? How many houses were sold in Detroit? Voluntarily? Can the dynamics associated with foreclosures make the median a poor choice to report?
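To make the worry concrete, here is a small Python sketch. The prices and the split between foreclosure and voluntary sales are entirely hypothetical (I do not have the CoreLogic data); the point is only how the mix of sales can drive the median.

```python
from statistics import median

# Hypothetical sale prices in dollars, NOT data from the article:
# six foreclosure sales at very low prices, four voluntary sales.
foreclosure_sales = [7_000] * 6
voluntary_sales = [80_000] * 4

all_sales = foreclosure_sales + voluntary_sales

median_all = median(all_sales)
median_voluntary = median(voluntary_sales)

print(f"Median of all sales:       ${median_all:,.0f}")
print(f"Median of voluntary sales: ${median_voluntary:,.0f}")
```

When foreclosure sales make up more than half of the transactions, the median of all sales describes the foreclosures and tells you nothing about what a voluntary seller received.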

Often we try to summarize with a few statistics, and that can be useful. But here I think there is a need for more information to really understand this situation. The reporter probably had access to that data, if they wanted to report it.

Statistics versus anecdotes

When you report on something and you have statistics and anecdotes, which should win? Which should be the basis for your conclusions? If you had data on a large part of the US population and then you talked to your brother-in-law, which would be more important to report?

Check out this NYTimes article about the Exodus from Facebook, and think about how statistics and anecdotes are used. Ouch.