
Chapter 9 Testing Associations I: Difference of Means, F-test, and χ2 Test

9.1 Between a Discrete and a Continuous Variable: The t-test

For this part, you need to recall (from Section 7.2.1, https://pressbooks.bccampus.ca/simplestats/chapter/7-2-1-between-a-discrete-and-a-continuous-variable/) how we described bivariate associations between two variables, one of which is treated as discrete and one as continuous. In this case we essentially compared the groups (categories of the discrete variable) by their mean (or median) value on the continuous variable. We examined the potential association between such variables visually through boxplots and numerically through a difference of means.

Now the question in front of us is: even if we do see a difference in the means of the different groups in sample data, how certain can we be that this association is real and reflective of the population? As we learned in Chapter 8, to answer this question, we need to test the difference for statistical significance.

We start with a few theoretical notes, which we will then apply to the example I used in Chapter 7 about the potential gender difference in average income. In this way we will be able to test whether the difference observed in the NHS 2011 data ($16,401 in favour of men to be precise) is statistically significant or not. In the latter half of this section we will see what happens when there are more than two groups' means to compare.

Testing the difference of two means. Recall from Section 8.3 (https://pressbooks.bccampus.ca/simplestats/chapter/8-3-hypothesis-testing/) that we tested whether the employees who took a training course indeed had a higher average productivity by simply calculating the z-value (or, using the estimated standard error, the t-value with a given df) for the mean and then finding its associated p-value. We could then compare the p-value to the preselected α-level and make a conclusion regarding the null hypothesis.

You will be happy to know that testing a difference of means follows the same principle: obtain the z-value (or rather, the t-value), get the associated p-value, and compare it to the α. What is not the same is that now we are testing expressly a difference of two means — so we need the t-value for the difference. It turns out we can calculate one as easily as ever, as long as we have the standard error of the difference [1].

The standard error of a difference of two means is a combination of their separate standard errors:

\sigma_{\overline{x}_1-\overline{x}_2} = \sqrt{\frac{\sigma_1^2}{N_1}+\frac{\sigma_2^2}{N_2}} = standard error of the difference of two means

where the subscripts refer to the first and second group being compared.

The z-value for a difference of two means follows the ordinary z-value formula, but with the difference taking the place of the single mean:

z=\frac{(\overline{x}_1 -\overline{x}_2)-(\mu_1 -\mu_2)}{\sigma_{\overline{x}_1-\overline{x}_2}}

However, under the null hypothesis we hypothesize that there is no difference in the population means, so that \mu_1=\mu_2 and thus \mu_1-\mu_2=0. Accounting for that in the formula, along with substituting the standard error with its own formula from above, we get:

z=\frac{\overline{x}_1 -\overline{x}_2}{\sqrt{\frac{\sigma_1^2}{N_1}+\frac{\sigma_2^2}{N_2}}}

Finally, since we generally don't know the population parameters but work with sample data, we estimate the population standard deviations σ with the sample standard deviations s, thus moving to the t-value through which we test the difference for statistical significance:

t=\frac{\overline{x}_1 -\overline{x}_2}{\sqrt{\frac{s_1^2}{N_1}+\frac{s_2^2}{N_2}}} = t-test for the difference of means [2]

Note that unlike the single-value case where df=N-1, when working with a difference of means of two groups the df=N-2.

Before your eyes glaze over (completely), rest assured that SPSS calculates this for you; I only provide it here to show you that the logic of hypothesis testing is the same, only the formulas change to accommodate the testing of a difference of means rather than a single mean.

From this point on, it's easy: you only need to check the p-value of the t-value you have obtained (given the specific df)[3], compare it to the significance level, and voila — you have yourself a significance test!
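If it helps to see the recipe as something executable, here is a minimal sketch in Python (the textbook itself works in SPSS, so the language, the function name, and the use of scipy are my own additions). It follows the formulas above: the separate-variance standard error of the difference, the t-value under the null hypothesis, df = N - 2, and a two-tailed p-value.

```python
# A minimal sketch (not from the text): the t-test for a difference of means,
# computed from summary statistics exactly as in the formulas above.
from math import sqrt
from scipy import stats  # only used for the t distribution's tail probability

def t_test_difference_of_means(mean1, s1, n1, mean2, s2, n2):
    """Return (t, df, two-sided p) for H0: mu1 = mu2, using df = N - 2."""
    se_diff = sqrt(s1**2 / n1 + s2**2 / n2)  # standard error of the difference
    t = (mean1 - mean2) / se_diff            # (mu1 - mu2) drops out: it is 0 under H0
    df = n1 + n2 - 2                         # df = N - 2, as noted above
    p = 2 * stats.t.sf(abs(t), df)           # two-tailed p-value
    return t, df, p
```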

Let's see how this all works out in an example. A few sections back I promised to test the gender differences in average income, didn't I?

As in Example 7.2 in Section 7.2.1, I use a random sample of about 3 percent of the entire NHS 2011 data, this time resulting in N=21,902[4].

We are still interested in whether women and men on average earn differently per year, i.e., whether gender affects income:

There are 11,323 women (N_f =11,323) and 10,579 men (N_m =10,579) in the sample. The men earn an average of $48,113 (\overline{x}_m =48113) and the women earn an average of $31,519 (\overline{x}_f =31519). The respective standard deviations are $68,214 for men (s_m =68214) and $34,760 for women (s_f =34760).

The difference of means is therefore:

\overline{x}_m -\overline{x}_f =48113-31519=16594

The question is whether this $16,594 is due to sampling variation (i.e., statistically not different from a population difference of means of $0), or unusual enough that a population difference of $0 is unlikely (i.e., the difference is statistically significant).

To test this, we need to calculate the standard error of the difference. Once we have the standard error of the difference, we can calculate the t-value.

The standard error of the difference is:

s_{\overline{x}_m-\overline{x}_f} = \sqrt{\frac{s_m^2}{N_m}+\frac{s_f^2}{N_f}}=\sqrt{\frac{68214^2}{10579}+\frac{34760^2}{11323}}=\sqrt{439848+106708}=739.3

The t-value is then:

t=\frac{\overline{x}_m -\overline{x}_f}{s_{\overline{x}_m-\overline{x}_f}}=\frac{16594}{739.3}=22.446

Given the large N, even just looking at the t-value should make it clear that the difference is statistically significant — after all, in a two-tailed test, the t-value is significant at 1.96 and on (for α=0.05) and at 2.58 and on (for α=0.01).

Still, this is not the way to report a test — this is: With t=22.446, df=21,900, and p=0.000 [5], i.e., p<0.001 [6], we have enough evidence to reject the null hypothesis. Indeed, we can conclude with 99.99% certainty that there is a statistically significant difference between the average annual incomes of men and women (i.e., that the difference exists in the population).
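As a quick check of the arithmetic (not part of the text, which reads these numbers off the SPSS output), plugging the sample figures into the same formulas in Python gives essentially the same t-value and a p-value indistinguishable from zero:

```python
# Checking the worked example from its summary statistics (Python as a stand-in for SPSS).
from math import sqrt
from scipy import stats

n_m, mean_m, s_m = 10579, 48113, 68214   # men
n_f, mean_f, s_f = 11323, 31519, 34760   # women

se_diff = sqrt(s_m**2 / n_m + s_f**2 / n_f)    # about 739.3
t = (mean_m - mean_f) / se_diff                # about 22.446
p = 2 * stats.t.sf(abs(t), n_m + n_f - 2)      # effectively 0, far below 0.001
print(round(se_diff, 1), round(t, 3), p)
```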

We can check this with a confidence interval too, again substituting the difference in place of a single value[7]:

95% CI: \overline{x}_m - \overline{x}_f \pm 1.96\times s_{\overline{x}_m-\overline{x}_f} = 16594 \pm 1.96 \times 739.3 = 16594 \pm 1449 = (15145; 18043)

That is, we can say that the difference of average annual incomes between men and women will be between $15,145 and $18,043 with 95% certainty; or that 19 out of 20 such studies will find a difference of $16,594 \pm $1,449. (We also see the correspondence with hypothesis testing: since the interval does not contain 0, 0 is not a plausible value for the difference.)
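The confidence interval can be checked the same way (again my own sketch using the summary statistics from above, not the SPSS output):

```python
# 95% CI for the difference of means, from the same summary statistics.
from math import sqrt

n_m, mean_m, s_m = 10579, 48113, 68214
n_f, mean_f, s_f = 11323, 31519, 34760

diff = mean_m - mean_f                         # 16594
se_diff = sqrt(s_m**2 / n_m + s_f**2 / n_f)    # about 739.3
margin = 1.96 * se_diff                        # about 1449
print(diff - margin, diff + margin)            # roughly (15145, 18043); note 0 is not inside
```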

Inference is not doing too badly, no?

Again, SPSS will provide all the calculations but I advise you to still test your understanding of the procedure with the following exercise.

Studies find that due to the gendered social construction of aging (i.e., women are considered "older" and "mature" at younger ages than men), male actors are frequently paired with much younger female actors (Buchanan 2013; Follows 2015). For example, the average age of male and female Academy Award acting nominees is telling: the average age of men in the Best Actor category is 43.4 years, while the average age of women in the Best Actress category is 37.2 years (Beckwith & Hester, 2018 [http://thedataface.com/2018/03/culture/oscar-nominees-age]).

Let's say that you want to investigate this phenomenon yourself. You randomly select 100 male and 100 female Academy Award nominees, and calculate their age at nomination for an Academy Award. You find that men's average age is 45 years and women's is 36 years, with standard deviations of 15 years for men and 20 years for women. Test the hypothesis that the average age for women is different from that of men for the population of all Best Actor/Actress Oscar nominees. Create a 95% CI for the difference to see its correspondence with the hypothesis test.
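If you want to verify your hand calculation afterwards, the same few lines of Python can be reused with the exercise's numbers (again a sketch of my own, not part of the exercise; try the formulas by hand first):

```python
# Checking the Oscar-nominee exercise from its summary statistics.
from math import sqrt
from scipy import stats

n_m, mean_m, s_m = 100, 45, 15    # male nominees
n_f, mean_f, s_f = 100, 36, 20    # female nominees

se_diff = sqrt(s_m**2 / n_m + s_f**2 / n_f)
t = (mean_m - mean_f) / se_diff
p = 2 * stats.t.sf(abs(t), n_m + n_f - 2)      # compare to alpha = 0.05
ci = (mean_m - mean_f - 1.96 * se_diff,
      mean_m - mean_f + 1.96 * se_diff)        # 95% CI for the difference
print(t, p, ci)
```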

Now that you understand the principle of testing the difference of two means, let's see what we can do about non-binary discrete variables in the next section. The SPSS guidelines for doing a t-test are below, followed by a brief sketch of how the same analysis might be run outside SPSS.

  • From the Main Menu, select Analyze, and from the pull-down menu, click on Compare Means and Independent Samples T Test;
  • Select your continuous variable from the list of variables on the left and, using the top arrow, move it to the Test Variable(s) empty space on the right;
  • Select your discrete variable from the list of variables on the left and, using the bottom arrow, move it to the Grouping Variable empty space on the right;
  • Click on Define Groups, and in the new window, keep Use specified values selected; in the empty spaces for Group 1 and Group 2, enter the numeric values[8] corresponding to the two categories of your discrete variable; click Continue.
  • In the Independent Samples T Test window click Options…; you can request a specific confidence interval in the new window (the default is 95%); click Continue;
  •  Click OK once back to the Independent Samples T Test window.
  • SPSS will produce two tables in the Output window: a Group Statistics one (where you can see the sample size, mean, standard deviation, and standard error for each group (category in the discrete variable)), and an Independent Samples Test one (where you can find the t-value, df, p-value, mean difference, standard error of the difference, and the requested confidence interval)[9].
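If you are working outside SPSS, a roughly equivalent analysis from raw data can be run in Python with pandas and scipy. This is my own sketch, not the textbook's procedure: the file name and the variable names income and sex are placeholders, and scipy's equal_var=False option matches the separate-variance standard error used above (though it then computes the Welch degrees of freedom rather than N - 2).

```python
# A rough non-SPSS equivalent of the menu steps above (placeholder file and column names).
import pandas as pd
from scipy import stats

data = pd.read_csv("nhs_sample.csv")                  # hypothetical data file
men = data.loc[data["sex"] == 1, "income"].dropna()
women = data.loc[data["sex"] == 2, "income"].dropna()

# Group statistics, comparable to SPSS's first output table
print(men.describe(), women.describe())

# Independent-samples t-test; equal_var=False uses separate variances,
# like SPSS's "equal variances not assumed" row.
t, p = stats.ttest_ind(men, women, equal_var=False)
print(t, p)
```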


Source: https://pressbooks.bccampus.ca/simplestats/chapter/9-1-between-a-discrete-and-a-continuous-variable/
