Category Archives: statistics

Regressions in The Atlantic (who would have thought?)

Interesting article in The Atlantic about how unions affect state economies.  Some scatterplots with regression lines (but no regression statistics, alas) and some correlation coefficients.  Geek stuff, for sure.

Some observations, though.  The first scatterplot shows union participation rate as the independent variable and “income level” as the dependent variable.  This seems to be per capita income, but it is not clear whether it is the mean or the median.  The choice could be quite important.
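To see why the choice matters, here is a small Python sketch. The incomes are made up for illustration (they are not from the article): a single very high earner pulls the mean far above the median, so the same "state" could look rich or ordinary depending on which one is plotted.

```python
from statistics import mean, median

# Hypothetical incomes (in dollars) for a tiny "state"; not real data.
incomes = [28_000, 31_000, 35_000, 40_000, 46_000, 52_000, 900_000]

mean_income = mean(incomes)      # pulled upward by the one large value
median_income = median(incomes)  # the middle value, unaffected by the outlier

print(f"mean:   {mean_income:,.0f}")
print(f"median: {median_income:,.0f}")
```

With a skewed distribution like this one, the mean is roughly four times the median, even though six of the seven people earn less than a third of the mean.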

If you look at the states on the left (lower income) part of the graph versus the right (higher income) part of the graph, what might you wonder about?  How about the cost of living in each of the states?  As our business school students quickly realize, the same salary offer in NYC versus other locations means a potentially significant difference in discretionary income.

The article does a clever academic bait-and-switch (I learned how to do this in graduate school).  It starts out talking about public sector unions, but then quickly drops the “public sector” part and does all of the statistical work on unions in general (both private and public).  Why do this?  One reason is that it might be easier to get the data.  Another is that the more appropriate data set does not tell the story the author wants to tell.  (Note: I am not saying that is what the author is doing, but that alarm goes off in my head when I see data that really doesn’t fit being used in an argument.)

Here’s a quote from the article:

To put it baldly, unions are associated with the country’s economic winners, not its losers.   And it’s not that unionized states work more–unionization is negatively correlated with hours worked (-.36). States with higher levels of union membership work less hours per week but make more money–higher levels of union memberships are positively correlated with wage per hour (.48).

Now suppose this is the union for a successful private company.  We are describing a classic win-win here, right?  Everyone is sharing in the success of the company (although I know some will wonder whether management is getting more than its share; that’s for another discussion).  And fewer hours for more pay only works if the company’s customers continue to buy the product at the price the company tries to charge.

Now suppose we say the same thing for public workers.  Fewer hours, more pay.  Who is paying?  And are they getting value for that?  Could be.  But maybe not.  So do these two correlation coefficients allow us to make “bald” assertions like this?  Maybe not…

Can you see the problems with the other two scatterplots?  What does the author claim they “show?”  Are there reasonable other explanations for the relationships?  Hidden variables (like state cost of living I mentioned above)?
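Here is a rough Python sketch of how a hidden variable can produce a correlation like the ones quoted. Everything in it is simulated under assumptions I chose for illustration: in this made-up world, cost of living drives both the union rate and income, and the union rate has no direct effect on income at all. The helper names (pearson, partial_corr) are my own, not from any library.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def partial_corr(xs, ys, zs):
    """Correlation of xs and ys after controlling for zs
    (first-order partial correlation)."""
    r_xy, r_xz, r_yz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

rng = random.Random(0)  # fixed seed so the run is repeatable

# 50 simulated "states". Cost of living (scaled 0 to 1) is the hidden variable.
cost_of_living = [rng.uniform(0, 1) for _ in range(50)]
# Both observed variables depend on cost of living plus independent noise.
union_rate = [10 + 20 * c + rng.gauss(0, 2) for c in cost_of_living]
income = [30_000 + 30_000 * c + rng.gauss(0, 3_000) for c in cost_of_living]

raw_r = pearson(union_rate, income)
adjusted_r = partial_corr(union_rate, income, cost_of_living)

print(f"raw correlation:              {raw_r:.2f}")
print(f"controlling for cost of living: {adjusted_r:.2f}")
```

The raw correlation between union rate and income comes out strongly positive, yet it shrinks substantially once cost of living is controlled for. This does not show that the article's correlations are spurious; it only shows that a correlation coefficient by itself cannot rule that out.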

What are your conclusions (and what question did you ask)?

Polling is a tricky business, and trying to understand issues and the public’s view requires a certain amount of art. When I teach my classes, I try to emphasize to my students that they will see lots of “results” that come from “research” that may or may not line up.

Here’s an example from the NYTimes. The article begins by recognizing that there is some ambivalence about unions:

Labor unions are not exactly popular, though: A third of those surveyed viewed them favorably, a quarter viewed them unfavorably, and the rest said they were either undecided or had not heard enough about them.

Here is the result the article trumpets in the headline:

Americans oppose weakening the bargaining rights of public employee unions by a margin of nearly two to one: 60 percent to 33 percent.

Strong evidence, apparently. But, here is the actual question that leads to the result:

Collective bargaining refers to negotiations between an employer and a labor union’s members to determine the conditions of employment. Some states are trying to take away some collective bargaining rights of public employee unions. Do you favor or oppose taking away some collective bargaining rights of the unions?

I added the emphasis here – employer. Who exactly is the employer?  Who does the public employee union negotiate with?  Those answers are clear if we’re talking about a private company.  But who is the employer of a public employee?

The question introduces a premise.  Does that premise affect the choices of those asked this particular question? Would an alternate premise change the results?  Suppose the question mentioned that taxpayers are responsible for the result of the negotiations, but are not necessarily represented during the negotiations?

(btw, I think it is great that we regularly get to see the exact question, so that we can think more carefully about what we just “learned.”)

How do you measure that?

Here is an interesting editorial on the global warming debate. I have not looked at the underlying paper, but this paragraph is interesting:

As it happens, the project’s initial findings, published last month, show no evidence of an intensifying weather trend. “In the climate models, the extremes get more extreme as we move into a doubled CO2 world in 100 years,” atmospheric scientist Gilbert Compo, one of the researchers on the project, tells me from his office at the University of Colorado, Boulder. “So we were surprised that none of the three major indices of climate variability that we used show a trend of increased circulation going back to 1871.”

I think I understand how to measure someone’s height. Measuring their income is a bit trickier, but you can come up with a reasonable procedure. In pharmaceutical research, how to measure whether the drug “worked” is a potentially important decision: do we measure the size of the tumor, the increase in lifespan, or the quality of life?

Can you imagine what the three major indices of climate variability are? Could you come up with three different ones that would support a paper that has the opposite result (weather is weirder)?

When and how do we know things like this?

Graphs can say one thing, or another

First, I am not posting this to make any statement about Paul Krugman, or anything like that.  But when comparing time series data, choosing the starting point can be more important than it seems, as this blogger demonstrates.  You could imagine investing strategies being compared in a similar fashion.  One graph shows strategy A is better, but with the same data and a different start date, another graph would show that strategy B is better.
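A minimal Python sketch of the investing example, using two invented price series (hypothetical numbers, not taken from the blog post or from any real strategy). The data never changes; only the starting point does.

```python
def total_return(prices, start):
    """Fractional gain from prices[start] to the final price."""
    return prices[-1] / prices[start] - 1

# Hypothetical period-end values for two strategies; not real data.
strategy_a = [100, 80, 90, 100, 110]   # dips early, then recovers
strategy_b = [100, 105, 108, 110, 112]  # slow and steady

def winner(start):
    """Name of the strategy with the higher return measured from `start`."""
    a = total_return(strategy_a, start)
    b = total_return(strategy_b, start)
    return "A" if a > b else "B"

for start in (0, 1):
    a = total_return(strategy_a, start)
    b = total_return(strategy_b, start)
    print(f"start at period {start}: A {a:+.1%}, B {b:+.1%}, winner {winner(start)}")
```

Measured from period 0, B wins (12% versus 10%). Measured from period 1, right after A's dip, A wins by a wide margin (37.5% versus about 6.7%). Whoever picks the start date picks the winner.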

Another issue is the choice of the other data sets.  They can be chosen to look “fair” even though those time series contain special features that make the analysis less than fair.  For example, company X might be compared with a specially chosen company Y because special charges for Y in a particular quarter make some observation about X seem more true.

Simple comparisons are not always so simple.

A post about something I know little about

One of the questions I like to pose in my classes is “how do you know?”  Analyzing data and possible decisions leads to lots of wild speculation, and this question sometimes brings us back to recognizing that widely held theories are actually quite tenuous.  One observation, a “black swan” event, can bring down a whole theory.  I like to use examples from physics, even though I am not a physicist, because I find that students often think that we “know” stuff there.  Science is settled, even if other areas are not.

Here is an interesting (to me) article about measuring light and comparing the measurements to accepted physical theories.  At the end, something to think about:

There’s also a possibility that the explanation could be even more far-reaching, such as that the universe is not expanding and that the big bang theory is wrong.

How certain are you of the things you know?