Population proportion

A population proportion, generally denoted by P and in some textbooks by \pi,[1] is a parameter that describes a percentage value associated with a population. For example, the 2010 United States Census showed that 83.7% of the American population was identified as not being Hispanic or Latino. The value of 83.7% is a population proportion. In general, the population proportion, like any other population parameter, is unknown. A census can be conducted to determine the actual value of a population parameter, but in most statistical practice a census is impractical because of its cost and the time it requires.

A population proportion is usually estimated through an unbiased sample statistic obtained from an observational study or experiment. For example, the National Technological Literacy Conference conducted a national survey of 2,000 adults to determine the percentage of adults who are economically illiterate. The study showed that 72% of the 2,000 adults sampled did not understand what a gross domestic product is.[2] The value of 72% is a sample proportion. The sample proportion is generally denoted by \hat{p} and in some textbooks by p.

Mathematical definition

A Venn diagram illustration of a set R and its subset S. The proportion can be calculated by measuring how much of R is occupied by S.

A proportion is mathematically defined as the ratio of the number of elements in a subset S to the number of elements in a set R.

As such, the population proportion can be defined as follows:

P= \frac{X}{N} where X is the count of successes in the population and N is the size of the population.

This mathematical definition can be generalized to provide the definition for the sample proportion:

\hat{p}= \frac{x}{n} where x is the count of successes in the sample and n is the size of the sample obtained from the population.[3]
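
As a small illustration of the definition, the sample proportion can be computed by counting the successes in a sample and dividing by the sample size. The sketch below uses Python. The function name `sample_proportion` and the eight yes/no responses are assumptions made up for illustration and do not come from the cited sources.

```python
def sample_proportion(observations):
    """Return x / n, where x counts the successes (truthy values) in the sample."""
    n = len(observations)
    if n == 0:
        raise ValueError("the sample must contain at least one observation")
    x = sum(1 for obs in observations if obs)
    return x / n


# Hypothetical sample of n = 8 yes/no responses containing x = 6 successes.
responses = [True, True, False, True, True, False, True, True]
p_hat = sample_proportion(responses)
print(p_hat)  # 0.75
```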

Estimation

One of the main focuses of study in inferential statistics is determining the "true" value of a parameter. Generally, the actual value for a parameter will never be found unless a census is conducted on the population of study. However, there are statistical methods that can be used to get a reasonable estimation for a parameter. These methods include confidence intervals and hypothesis testing.

Estimating the value of a population proportion has important applications in agriculture, business, economics, education, engineering, environmental studies, medicine, law, political science, psychology, and sociology.

A population proportion can be estimated with a confidence interval known as a one-sample proportion Z-interval, whose formula is given below:

\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} where \hat{p} is the sample proportion, n is the sample size, and z^* is the upper \frac{1-C}{2} critical value of the standard normal distribution for a level of confidence C.[4]
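
The interval formula can be sketched in Python as follows. This is a minimal illustration, not a reference implementation. The function name `proportion_z_interval` is an assumption, and the critical value is taken from the standard library's `statistics.NormalDist` rather than from a printed table. The usage line takes the survey figures quoted earlier (72% of 2,000 adults) and assumes the 72% is exact, so that x = 1440.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_interval(x, n, confidence=0.95):
    """Return (lower, upper) for p_hat +/- z* sqrt(p_hat (1 - p_hat) / n)."""
    if not 0 < confidence < 1:
        raise ValueError("confidence must lie strictly between 0 and 1")
    p_hat = x / n
    # z* leaves an upper-tail area of (1 - C) / 2 under the standard normal curve.
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Assumed counts: 72% of 2,000 adults taken as exactly x = 1440 successes.
lower, upper = proportion_z_interval(1440, 2000, confidence=0.95)
print(f"({lower:.4f}, {upper:.4f})")
```

Under these assumptions the interval runs from roughly 0.70 to 0.74.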

Proof

In order to derive the formula for the one-sample proportion Z-interval, the sampling distribution of sample proportions needs to be taken into consideration. The mean of the sampling distribution of sample proportions is usually denoted as \mu_{\hat{p}} = P and its standard deviation is denoted as \sigma_{\hat{p}} = \sqrt{\frac{P(1-P)}{n}}. Since the value of P is unknown, the unbiased statistic \hat{p} is used in place of P: the mean is estimated by \hat{p} and the standard deviation by the standard error \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}. By the Central Limit Theorem, the sampling distribution of sample proportions is approximately normal.
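
The stated mean and standard deviation of the sampling distribution can be checked empirically with a small simulation. The sketch below is only an illustration: the population proportion P = 0.3, the sample size, the number of repetitions, and the random seed are all arbitrary assumptions, and each sample is drawn as independent success/failure trials.

```python
import random
from math import sqrt
from statistics import fmean, pstdev

random.seed(0)  # fixed seed so that the run is repeatable

P = 0.3             # assumed population proportion (hypothetical)
n = 200             # size of each sample
num_samples = 5000  # number of repeated samples

# Draw many samples and record the sample proportion p_hat = x / n of each one.
p_hats = []
for _ in range(num_samples):
    x = sum(1 for _trial in range(n) if random.random() < P)
    p_hats.append(x / n)

simulated_mean = fmean(p_hats)
simulated_sd = pstdev(p_hats)
theoretical_sd = sqrt(P * (1 - P) / n)

print(f"mean of p_hat: {simulated_mean:.4f} (theory: {P})")
print(f"sd of p_hat:   {simulated_sd:.4f} (theory: {theoretical_sd:.4f})")
```

With these settings the simulated mean and standard deviation should land close to P and \sqrt{\frac{P(1-P)}{n}}, though the exact digits depend on the seed.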

Suppose the following probability is calculated: P\left(-z^*<\frac{\hat{p}-P}{\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}}<z^*\right) = C, where 0<C<1 and \pm z^* are the standard critical values.

The sampling distribution of sample proportions is approximately normal when it satisfies the requirements of the Central Limit Theorem.

The inequality -z^*<\frac{\hat{p}-P}{\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}}<z^* can be algebraically rewritten as follows:

-z^*<\frac{\hat{p}-P}{\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}}<z^*
\Rightarrow -z^*\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}<\hat{p}-P<z^*\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
\Rightarrow -\hat{p}-z^*\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}<-P<-\hat{p}+z^*\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
\Rightarrow \hat{p}-z^*\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}<P<\hat{p}+z^*\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}

From the algebraic work above, it follows that, at a level of confidence C, P is estimated to lie between the values \hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}.

Conditions for inference

In general, the formula used for estimating a population proportion requires substitutions of known numerical values. However, these numerical values cannot be "blindly" substituted into the formula because statistical inference requires that the estimation of an unknown parameter be justifiable. In order for a parameter's estimation to be justifiable, there are three conditions that need to be verified:

  1. The data's individual observations have to be obtained from a simple random sample of the population of interest.
  2. The data's individual observations have to display normality. This can be verified mathematically with the following definition:
    • Let n be the sample size of a given random sample and let \hat{p} be its sample proportion. If n\hat{p}\geq10 and n(1-\hat{p})\geq10, then the data's individual observations display normality.
  3. The data's individual observations have to be independent of each other. This can be verified mathematically with the following definition:
    • Let N be the size of the population of interest and let n be the sample size of a simple random sample of the population. If N\geq10n, then the data's individual observations are independent of each other.
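
The two numerical checks in the list above can be sketched in Python. The function name `check_inference_conditions` and the survey figures are hypothetical. Whether the data come from a simple random sample cannot be computed from the counts, so the sketch simply takes that as a flag supplied by the caller.

```python
def check_inference_conditions(x, n, population_size, is_simple_random_sample):
    """Report whether each of the three conditions holds for a sample.

    x is the count of successes, n the sample size, and population_size is N.
    Whether the sample is a simple random sample cannot be derived from the
    data, so the caller has to state it.
    """
    return {
        "simple_random_sample": bool(is_simple_random_sample),
        # n * p_hat = x and n * (1 - p_hat) = n - x, so counts avoid rounding error.
        "normality": x >= 10 and (n - x) >= 10,
        # N >= 10n
        "independence": population_size >= 10 * n,
    }


# Hypothetical survey: 45 successes among 60 respondents drawn from 5,000 people.
report = check_inference_conditions(45, 60, 5000, is_simple_random_sample=True)
print(report)
print(all(report.values()))  # True
```

Working with the counts x and n - x instead of the products n\hat{p} and n(1-\hat{p}) is a design choice that sidesteps floating-point rounding; the two forms are algebraically the same.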

The conditions for a simple random sample (SRS), normality, and independence are referred to in some statistics textbooks as the conditions for the inference toolbox.

Example

Suppose a presidential election is taking place in a democracy. A random sample of 400 eligible voters in the democracy's voter population shows that 272 voters support candidate B. A political scientist wants to determine what percentage of the voter population supports candidate B.

To answer the political scientist's question, a one-sample proportion Z-interval with a confidence level of 95% can be constructed in order to estimate the population proportion of eligible voters in this democracy who support candidate B.

Solution

It is known from the random sample that \hat{p} = \frac{272}{400} = 0.68 with sample size n = 400.

Before a confidence interval is constructed, the conditions for inference will be verified.

(400)(0.68) \geq 10 \Rightarrow 272 \geq 10 and (400)(1-0.68) \geq 10 \Rightarrow 128 \geq 10

The condition for normality has been met.

N \geq 10(400) \Rightarrow N \geq 4000

The population size N for this democracy's voters can be assumed to be at least 4,000.

Hence, the condition for independence has been met.

With the conditions for inference verified, it is permissible to construct a confidence interval.

Let \hat{p} = 0.68, n = 400, and C = 0.95.

To solve for z^*, the expression \frac{1-C}{2} is used.

\frac{1-C}{2} = \frac{1-0.95}{2} = \frac{0.05}{2} = 0.0250

The standard normal curve with z^*, which gives an upper tail area of 0.0250 and an area of 0.9750 for Z \leq z^*.

A table with standard normal probabilities for Z \leq z.

By examining a standard normal bell curve, the value for z^* can be determined by identifying which standard score gives the standard normal curve an upper tail area of 0.0250 or an area of 1 - 0.0250 = 0.9750. The value for z^* can also be found through a table of standard normal probabilities.

From a table of standard normal probabilities, the value of Z that gives an area of 0.9750 is 1.96. Hence, the value for z^* is 1.96.
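
As an alternative to reading a printed table, the critical value can be computed numerically. The short Python sketch below uses `statistics.NormalDist` from the standard library (Python 3.8 or later), which is an assumption about the available tooling and not something the cited textbooks prescribe.

```python
from statistics import NormalDist

C = 0.95
upper_tail = (1 - C) / 2  # 0.0250, the area to the right of z*

# inv_cdf returns the z with P(Z <= z) equal to the given area, here 0.9750.
z_star = NormalDist().inv_cdf(1 - upper_tail)
print(round(z_star, 2))  # 1.96

# Going the other way: the area to the left of 1.96 under the standard normal curve.
area_left = NormalDist().cdf(1.96)
print(round(area_left, 4))  # 0.975
```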

The values \hat{p} = 0.68, n = 400, and z^* = 1.96 can now be substituted into the formula for the one-sample proportion Z-interval:

\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
\Rightarrow (0.68) \pm (1.96) \sqrt{\frac{(0.68)(1-0.68)}{(400)}}
\Rightarrow 0.68 \pm 1.96 \sqrt{0.000544}
\Rightarrow \bigl(0.63429, 0.72571\bigr)

Based on the conditions for inference and the formula for the one-sample proportion Z-interval, it can be concluded with a 95% confidence level that the percentage of the voter population in this democracy that supports candidate B is between 63.429% and 72.571%.
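
The arithmetic of this example can be reproduced step by step in Python. The sketch uses the rounded table value z^* = 1.96 from the text; the variable names are chosen here for illustration only.

```python
from math import sqrt

x, n = 272, 400  # supporters of candidate B and sample size, from the example
z_star = 1.96    # table value for a 95% confidence level

p_hat = x / n
standard_error = sqrt(p_hat * (1 - p_hat) / n)
margin = z_star * standard_error
lower, upper = p_hat - margin, p_hat + margin

print(f"p_hat = {p_hat:.2f}")                       # p_hat = 0.68
print(f"standard error = {standard_error:.5f}")     # standard error = 0.02332
print(f"interval = ({lower:.5f}, {upper:.5f})")     # interval = (0.63429, 0.72571)
```

The printed interval agrees with the values obtained by hand above.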

Value of the parameter in the confidence interval range

A commonly asked question in inferential statistics is whether the parameter is included within a confidence interval. The only way to answer this question with certainty is to conduct a census. Referring to the example given above, the probability that the population proportion lies in the computed interval is either 1 or 0: the parameter is either in the interval or it is not. The main purpose of a confidence interval is to indicate a range of plausible values for the parameter.

Common errors and misinterpretations from estimation

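A very common error that arises from the construction of a confidence interval is the belief that a level of confidence such as C = 95% means that there is a 95% chance that the computed interval contains the parameter. This is incorrect. The level of confidence describes how reliable the method is over repeated sampling, not the probability that one particular interval contains the parameter: about 95% of the intervals built this way from many random samples would contain the parameter. The values of C fall strictly between 0 and 1.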
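
One way to see what the level of confidence describes is to repeat the sampling many times and count how often the resulting interval contains the true proportion. The Python sketch below does this under stated assumptions: the true proportion P = 0.68 is simply assumed to be known, samples are drawn as independent trials, and the seed and number of repetitions are arbitrary.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(1)  # fixed seed so that the run is repeatable

P = 0.68              # assumed true population proportion (hypothetical)
n = 400               # sample size
C = 0.95              # level of confidence
num_intervals = 2000  # number of repeated samples

z_star = NormalDist().inv_cdf(1 - (1 - C) / 2)

covered = 0
for _ in range(num_intervals):
    x = sum(1 for _trial in range(n) if random.random() < P)
    p_hat = x / n
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    # Each individual interval either contains P or it does not.
    if p_hat - margin < P < p_hat + margin:
        covered += 1

coverage = covered / num_intervals
print(f"fraction of intervals containing P: {coverage:.3f}")
```

The fraction should come out near 0.95, although any single interval in the loop either covers P or misses it.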

See also

Binomial proportion confidence intervals

Confidence intervals

Inferential statistics

Parameter

Statistical hypothesis testing

References

  1. Introduction to Statistical Investigations. Wiley. ISBN 978-1-118-95667-0.
  2. Ott, R. Lyman. An Introduction to Statistical Methods and Data Analysis. ISBN 0-534-93150-2.
  3. Weisstein, Eric. CRC Concise Encyclopedia of Mathematics. Chapman & Hall/CRC.
  4. Hinders, Duane. Annotated Teacher's Edition The Practice of Statistics. ISBN 0-7167-7703-7.
This article is issued from Wikipedia - version of the Thursday, April 07, 2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.