Category Archives: statistics

Excel INDEX and Named Ranges

Excel allows you to refer to cells by address, such as B6 or C17:D26, but ranges can also be given meaningful names like Market_Share. Here is a video introducing named ranges and showing how to use INDEX to access cells within a named range:

Named Ranges and INDEX
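The core idea: a named range is just a labeled block of cells, and INDEX(range, n) pulls out the n-th value in it. As a rough sketch of the same lookup outside Excel (the range name and values here are made up for illustration), in Python:

```python
# A named range is a labeled block of values; INDEX(range, n) returns
# the n-th entry. "Market_Share" and its values are made-up examples.
ranges = {"Market_Share": [0.12, 0.18, 0.25, 0.45]}

def index(named_range, n):
    # Excel's INDEX is 1-based, so shift to Python's 0-based indexing.
    return ranges[named_range][n - 1]

print(index("Market_Share", 3))  # 0.25, as =INDEX(Market_Share, 3) would give
```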

Using INDEX when creating graphs of data is quite useful. Here are some examples:

Single Variable Scatterplots

Histograms (using the FREQUENCY function)

XY Scatterplots

Excel Graphs

There are many kinds of graphs in Excel, and this page looks at a few of the most common types of graphs.

The first is just to do a scatterplot of all the data points. This allows us to “see” the variability of the data:

Scatterplots

A similar approach is to graph each point as a column. This gives us a bit more sense of how “big” each value is:

Column Charts

Similar to a column chart (one column per data point) is a histogram. Instead of plotting each point, the graph shows the number of points that fall in various ranges (for example, 1-10, 11-20, etc.). There is a built-in tool for creating histograms in the Analysis ToolPak:

Histograms using the Data Analysis Tool Pack

Histograms can also be constructed using the Excel function FREQUENCY():

Histograms using the FREQUENCY() function
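FREQUENCY(data, bins) takes a list of bin upper edges and returns, for each edge, the count of values at or below it (and not already counted in an earlier bin), plus a final count for everything above the last edge. Here is a sketch of that counting logic in Python, with made-up data:

```python
# Count how many data points fall in each bin, in the spirit of Excel's
# FREQUENCY(data, bins); bins lists the upper edge of each interval.
def frequency(data, bins):
    counts = [0] * (len(bins) + 1)
    for x in data:
        for i, upper in enumerate(bins):
            if x <= upper:
                counts[i] += 1
                break
        else:
            counts[-1] += 1  # value is above the last bin edge
    return counts

data = [3, 7, 12, 15, 18, 22, 27, 9]
print(frequency(data, [10, 20]))  # [3, 3, 2]: three values <= 10, three in 11-20, two above
```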

And the last graph type we will look at is an XY scatterplot:

XY Scatterplots

Excel Statistics

We can use Excel to compute descriptive statistics for a data set. Here are a few videos showing how to do some basic tasks.

The filter command is often a good place to start with a new dataset:

Using Filter

The Analysis ToolPak provides a way to quickly calculate the descriptive statistics for a dataset:

Analysis Toolpack Descriptive Statistics

Sometimes the Analysis ToolPak is missing from the Data tab. If so, here is how to install it:

Install the Analysis Toolpack

Descriptive statistics can also be computed using the built-in functions Excel provides:

Basic Statistical Functions
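Several of these functions (AVERAGE, MEDIAN, STDEV.S, and so on) have direct counterparts in other tools; as a point of comparison, Python's built-in statistics module computes the same quantities (the data here is made up):

```python
import statistics

data = [4.1, 5.0, 6.2, 5.5, 4.8, 7.3, 5.9]

print(statistics.mean(data))    # like Excel's AVERAGE
print(statistics.median(data))  # like MEDIAN; here 5.5
print(statistics.stdev(data))   # like STDEV.S, the sample standard deviation
```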

SAS/JMP Quick Start Guide

This is a quick start guide for SAS/JMP, one of the statistical software packages used at Goizueta. It is Excel-friendly and requires no programming. I have created some videos to help get you started once you have the software installed on your computer.

Here is a quick overview and some helpful preference changes that will make JMP a bit easier to use:

01-Overview

JMP will easily import data from Excel:

02-Import Data

Descriptive statistics with JMP:

03-Descriptive Statistics

JMP version of Excel pivot tables:

04-Tabulate

Scatterplots:

05-Scatterplots

Create single regression models:

06-Single Regression

Multiple regression models, with categorical data and nonlinear relationships:

07-Multiple Regression

Rediscovery

Something I was reading today triggered my memory, and I looked up this pearl of wisdom from St. Thomas Aquinas that I had forgotten. It is going to be added to my decision analysis course syllabus:

Never deny, seldom affirm, always distinguish


What should we teach in college?

Lawrence Summers has an op-ed column in the NY Times about curriculum issues:

http://www.nytimes.com/2012/01/22/education/edlife/the-21st-century-education.html?pagewanted=all

Here’s my favorite part:

6. Courses of study will place much more emphasis on the analysis of data. Gen. George Marshall famously told a Princeton commencement audience that it was impossible to think seriously about the future of postwar Europe without giving close attention to Thucydides on the Peloponnesian War. Of course, we’ll always learn from history. But the capacity for analysis beyond simple reflection has greatly increased (consider Gen. David Petraeus’s reliance on social science in preparing the army’s counterinsurgency manual).

As the “Moneyball” story aptly displays in the world of baseball, the marshalling of data to test presumptions and locate paths to success is transforming almost every aspect of human life. It is not possible to make judgments about one’s own medical care without some understanding of probability, and certainly the financial crisis speaks to the consequences of the failure to appreciate “black swan events” and their significance. In an earlier era, when many people were involved in surveying land, it made sense to require that almost every student entering a top college know something of trigonometry. Today, a basic grounding in probability statistics and decision analysis makes far more sense.

Populations and samples

Here’s a nice article in the NYTimes:

http://www.nytimes.com/2012/01/24/health/research/study-finds-no-childhood-obesity-link-to-school-junk-food.html

Researchers at Pennsylvania State University tracked the body mass indexes of 19,450 students from fifth through eighth grade. In fifth grade, 59 percent of the children attended a school where candy, snacks or sugar-sweetened beverages were sold. By eighth grade, 86 percent did so.

The researchers compared children’s weight in schools where junk food was sold and in schools where it was banned. The scientists also evaluated eighth graders who moved into schools that sold junk food with those who did not, and children who never attended a school that sold snacks with those who did. And they compared children who always attended schools with snacks with those who moved out of such schools.

No matter how the researchers looked at the data, they could find no correlation at all between obesity and attending a school where sweets and salty snacks were available.

Think about what population the researchers are interested in.  Is the sample they chose a good sample?  Does the size of the sample make you more or less confident of the results?

Now consider a question they could be studying: are children who eat junk food more or less obese?  Think about whether their data can answer this.

They are doing something slightly different, though.  Can you articulate the result carefully?  If you were not careful, are there other ways to frame the result that are not supported by the data but sound as though they are?

Problem framing

Decision analysis considers “framing” to be something important that can easily be overlooked (and lead to awful results).  For example, a couple could frame the decision as which house to purchase, treating it purely as an investment decision, and miss the option of renting and investing the money in something besides real estate.

It is also true that people can find “meaning” in data that is nothing but random noise, or overlook important results that do not conform to the conventional wisdom.  Treatment for ulcers falls into the latter category – the conventional wisdom about the cause of ulcers turned out to be false for many people.

Here is an interesting example of how we frame data.  After you watch the video, make sure you go back and listen when the lyrics are not on the screen.  They really are there!  Or are they?

Plus or minus three standard deviations is not all there is

As a rule of thumb, we can expect most data to fall within a certain range, given the variability of the data.  There are very few ten-foot-tall people, for example.  But the assumptions of the normal distribution apply sometimes (heights) and not always.  This is especially true in financial markets, a fact that gave rise to many books, including The Black Swan (the book, not the movie).  Today, oil prices moved down, and compared to recent movements in the market, the change was over five standard deviations.  Under a normal distribution such a move should almost never happen, yet moves like this happen all the time, because parts of the world are not normal.
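To put a number on it: if daily price changes really were normally distributed, a move of five or more standard deviations (in either direction) would be vanishingly rare. A quick back-of-the-envelope check in Python, assuming a standard normal distribution:

```python
import math

# P(|Z| > 5) for a standard normal variable, via the complementary error function.
p = math.erfc(5 / math.sqrt(2))
print(p)  # about 5.7e-7

# At roughly 252 trading days per year, that works out to one such move
# every several thousand years; markets produce them far more often,
# because real returns have much fatter tails than the normal distribution.
print(1 / (p * 252))  # expected number of years between five-sigma days
```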