Lecture 1

Lecture 1


Outline


Statistics and Variation


Statistics and statistics

Variation


Data

Analysis


Data Table

data-table


Metadata


Variable Types


Categorical

categorical-var


Quantitative


Identifier


Other


Example

Business analysts hoping to provide information helpful to grape growers compiled these data about vineyards in California and Michigan.


Example Variables


Surveys and Sampling


Sampling


Randomization


Sample Size Matters


Populations and Parameters


Sampling Designs


Simple Random Sample (SRS)

A sample drawn so that every possible sample has an equal chance of being selected is called a simple random sample.


Stratified Sampling

Reduced sampling variability is the most important benefit of stratifying.


Cluster and Multistage Sampling

Clustering sampling

Multistage samples


Systematic Samples


Sampling Designs (Example 1)

Researchers waited outside a bar they had randomly selected from a list of such establishments. They stopped every 10th person who came out of the bar and asked whether he or she thought drinking and driving was a serious problem. Identify the population of interest, population parameter, sampling frame and method.


Sampling Designs (Example 1, cont)


Sampling Designs (Example 2)

An amusement park has opened a new roller coaster. It is so popular that people are waiting for up to 3 hours for a 2-minute ride. Concerned about how patrons feel about this, they survey every 10th person on the line for the roller coaster, starting from a randomly selected individual. Identify sampling frame. Is the sample likely to be representative?


Sampling Designs (Example 2, cont)


Bad Sampling


Voluntary Response Sample


Convenience Sampling


Bad Sampling Frame

An SRS from an incomplete sampling frame introduces bias because the individuals included may differ from the ones not in the frame.


Undercoverage


Example 1

We want to know what percentage of local doctors accept Medicaid patients. We call the offices of 50 doctors randomly selected from local Yellow Pages listings. Is this sampling method appropriate? If not, identify the problem.

Is this method appropriate?


Example 1 (cont.)

We want to know what percentage of local doctors accept Medicaid patients. We call the offices of 50 doctors randomly selected from local Yellow Pages listings. Is this sampling method appropriate? If not, identify the problem.

Method appropriate: Depends on the Yellow Page listing used. If from regular listings, this is fair if all doctors are listed. If from ads, then probably not as those doctors may not be typical.


Example 2

We want to know what percentage of local businesses anticipate hiring additional employees in the upcoming months. We randomly selected a page in the local Yellow Pages and call every business listed there. Is this sampling method appropriate? If not, identify the problem.

Is this method appropriate?


Example 2 (cont.)

We want to know what percentage of local businesses anticipate hiring additional employees in the upcoming months. We randomly selected a page in the local Yellow Pages and call every business listed there. Is this sampling method appropriate? If not, identify the problem.

Method appropriate: Not appropriate. This cluster sample will probably contain listings for only one or two business types.


Displaying and Describing Categorical Data


Summarizing a Categorical Variable

Search EngineVisitsVisits (%)
Google50 62943.05%
Direct22 17318.85%
Bing12 27310.44%
Facebook32 53227.66%
Total117607100.00%

Bar Chart

A bar chart displays the distribution of a categorical variable, showing the counts for each category next to each other for easy comparison.

cat-bar-chart


Relative Frequency Bar Chart

The relative frequency bar chart looks the same as the bar chart, but shows the proportion of visits in each category rather than counts.

cat-rf-bar-chart


Pie Chart

Pie charts show the whole group of cases as a circle sliced into pieces with sizes proportional to the fraction of the whole in each category.

cat-pie-chart


Exploring Two Categorical Variables

Example: Data was collected on the strength of consumers’ preferences for regional foods in their country. The data is displayed in the frequency table and clarified with a pie chart.

|ct-data| ct-pie-chart| |––––––––––-|–––––––––––––––-|


Contingency Tables

To show how opinions on regional foods varied by countries, we can display the data in a contingency table where we have added the countries as a new variable.

contingency-table


Contingency Tables (cont.)


Conditional Distributions

Variables may be restricted to show the distribution for just those cases that satisfy a specified condition. This is called a conditional distribution.

conditional-dist


Segmented Bar Charts

Data can be displayed by dividing up bars rather than circles. The result is a segmented bar chart where a bar is divided proportionally into segments corresponding to the percentage in each group.

conditional-dist-sbc


Example 1

GFK Roper Reports Worldwide survey in 2004, asked “How important is acquiring wealth to you?” The percent who responded that it was of more than average importance were: 71.9% China, 59.6% France, 76.1% India, 45.5% UK, and 45.3% USA.

cat-var-example1


Example 1 (cont.)

cat-var-example1


Example 2

A survey of the entering MBA students at a university in the United States classified the country of origin of the students, as seen in the table.

cat-var-example1


Example 2 (cont.)

cat-var-example1


Example 2 (cont.)

cat-var-example1

What is the marginal distribution of origin?


Example 2 (cont.)

cat-var-example1

The marginal distribution of origin is


Example 2 (cont.)

cat-var-example1

Do you think that origin of the MBA student is independent of the MBA programs?


Example 2 (cont.)

cat-var-example1

Origin of the MBA student is not independent of the MBA programs because the distributions appear to be different. For example, the % from Latin America among those in Two-Yr programs is nearly 20% while those in Evening Programs is less than 1%.