Members of the Board of Editors recognize the importance of providing a resource for researchers to ensure quality and accuracy of reporting in the Journal. This second monograph of a periodic series focuses on study sample selection, sample size, common statistical procedures using parametric methods, and the presentation of statistical methods and results. Attention to sample selection and sample size is critical to avoid study bias. When outcome variables adhere to a normal distribution, parametric procedures can be used for statistical inference. Documentation that clearly outlines the steps used in the research process will advance the science of evidence-based practice in nutrition and dietetics. Real examples from problem sets and published literature are provided, as well as references to books and online resources.
This is the second in a series of articles developed to interpret the guidelines for authors submitting manuscripts to the Journal of the American Dietetic Association (
) and provide relevant examples and interpretation for how to proceed with manuscript preparation to advance the field of nutrition and its practical applications. Our purpose here is to review study sample selection, sample size, common statistical procedures using parametric methods, and the presentation of statistical results. The first article in this series (
) focused on study design and the development of testable research hypotheses, the beginning of all good-quality research. A future article will highlight nonparametric statistics. Another in this series will address appropriate measurement tools and methods of analysis, such as sensitivity, specificity, validity, reliability, and relative validity. In addition, issues of judgment, such as making appropriate inferences based on the study design and results, a priori hypothesis testing, post hoc analyses, and extrapolation will be examined in future articles, as will common epidemiologic methods, including the appropriate use and reporting of odds ratios, relative risk, confidence intervals, and statistical significance, as well as the concepts of chance, confounding, and interaction. The goal of the series is to serve as a review for some readers and provide new information for others to advance the field of nutrition and its practical applications.
) clarified that the hypothesis specifies the population being studied. It is rare that an investigator would have the opportunity, time, or resources to measure an entire population of individuals. Consider the hypothesis, introduced previously (
), “There is no statistically significant difference at the P<0.05 level in plasma clotting times among Asian-American men between ages 45 and 60 years taking 3 g/day n-3 fatty acids as combined docosahexaenoic acid and eicosapentaenoic acid in capsule form or a placebo for 6 consecutive weeks.” Despite the narrow age range and the focus on a specific group of men, the population represented by this hypothesis still represents at least 840,000 individuals who live in all 50 states (
). Thus, it would be almost impossible to test this hypothesis on the entire population of interest. Instead, researchers recruit a sample of the population to make conclusions about the entire population, even if only one location in the United States is used for subject recruitment. The examples outlined below highlight the importance of using sampling methods with minimal bias so that results can be assumed to apply to the population beyond the sample.
The difficulty with selecting a sample is ensuring that the sample is representative of the entire target population. The most straightforward approach to achieve this goal is to employ a simple random sample, in which individuals are selected in such a way that every individual has an equal chance of being selected. Methods of achieving this are available in several resources (
For the hypothesis above, each treatment group, in a completely randomized experimental design (ie, subjects randomized to receive n-3 fatty acids or subjects randomized to receive placebo), would be selected as part of a simple random sample. If a researcher in San Francisco had a list from the Department of Motor Vehicles of men between ages 45 and 60 years identifying themselves as Asian and residing in the San Francisco Bay area, this list could be used as the basis of selecting a simple random sample. However, those individuals without a driver’s license would not be on the list and would not be included in the process of choosing the sample. This is an example of undercoverage. A complete list of any population is rarely available, so the researcher needs to be aware of any possible bias (something that can lead to conclusions that are systematically different from the truth) created by those individuals not on the list. The Behavioral Risk Factor Surveillance System of the Centers for Disease Control and Prevention conducts interviews based on probability samples of telephone numbers (
). On average, telephone lists miss about 6% of the population; those individuals without phones are more likely to be low income or homeless and probably differ from the rest of the population.
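The mechanics of drawing a simple random sample from an available list can be illustrated in a few lines of code. The sampling frame and sample size below are hypothetical, and Python's standard random module is shown as one option:

```python
import random

# Hypothetical sampling frame: ID numbers for 10,000 eligible men on an
# available (and, as noted above, likely incomplete) list.
frame = list(range(10_000))

random.seed(42)  # fixed seed so the draw can be reproduced

# Simple random sample of 200: every individual on the frame has an
# equal chance of selection, and sampling is without replacement.
sample = random.sample(frame, k=200)

print(len(sample), len(set(sample)))  # 200 unique selections
```

Note that code can only randomize over the frame it is given; the undercoverage problem described above (individuals missing from the list) remains no matter how the draw is made.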
A more serious source of bias is nonresponse, which can occur when individuals selected through a simple random sample refuse to participate because of transportation barriers, lack of time, or disinterest. In addition, the recruiter may be unsuccessful at contacting an individual even after several tries. A nonresponse rate >25% would cause concern about potential bias (
). The American Association for Public Opinion Research has developed definitions for response rates, cooperation rates, refusal (nonresponse) rates, and contact rates that are useful in describing overall quality of the final data (
). The response rate would be a fundamental piece of information to include in a manuscript. In addition, when making conclusions about the results of a study, the response rates would be taken into consideration.
When a comprehensive list of the target population is not available, a commonly used sampling approach is to focus recruitment on specific geographic or community areas to capture a large proportion of the target group through posted flyers, newspaper ads, or media ads. This is referred to as a convenience sample. In this case, the subjects choose themselves rather than being randomly selected, and these self-selected individuals can introduce some type of systematic bias. This is particularly true for opinion polls, where individuals who have strong opinions either for or against the topic are more likely to participate. Thus, a research study about taste properties of flavored milk would be more likely to attract individuals who like and drink milk. Nonetheless, the convenience sample is often the only viable method available for most clinical studies. Researchers can minimize the potential bias by constructing a systematic, purposive methodology of recruitment and outlining these steps in the report. For example, a concerted effort to select consecutively every accessible person who meets the study criteria will help minimize the volunteerism effect.
Another aspect of selecting a study sample is establishing selection criteria as applicable to the research question. These criteria would be specific inclusion criteria or exclusion criteria. In the example above, the research hypothesis dictated the inclusion of the demographic characteristics of Asian origin, male sex, and a specific age group. Other inclusion criteria to consider would be clinical characteristics, such as generally healthy with no diagnosis of cardiovascular disease. If the research is limited to a specific location, then the inclusion criteria may include patients seeking treatment at a specific medical center, with even further specification of time, such as January to June of a particular year. The exclusion criteria address threats to the subjects or to the quality of the data. For example, recruiting individuals outside of the San Francisco Bay area would be a threat to the data because the budget may not support recruitment efforts and transportation for subjects beyond the target area. Thus, excluding individuals outside of the catchment area will help preserve sample size, response rate, and retention. In contrast, if an individual is receiving treatment for diagnosed cardiovascular disease, the addition of 3 g/day n-3 fatty acids may interfere with treatment. Thus, to avoid putting a subject at risk for side effects, current treatment for cardiovascular disease may be an exclusion criterion.
A researcher is responsible for describing the validity of the sample as appropriate for answering the research question. This would include reference to the sample design, methods of recruitment, the rate of nonresponse, inclusions and exclusions, and a final sample size that is large enough to meet the study needs. These factors need to be considered to make conclusions about how much the sample can be generalized to the population. Lohse and colleagues (
) conducted an extensive survey among women aged 18 years and older participating in the Special Supplemental Nutrition Program for Women, Infants, and Children. The study sample was a convenience sample; however, the authors clearly outline the eligibility criteria and the purposive sampling plan employed to minimize bias and best represent the Special Supplemental Nutrition Program for Women, Infants, and Children population. When biological factors are examined in observational and experimental studies, generalizing the results to a wider population is more acceptable than in descriptive studies that enumerate the distribution of a factor in a sample population (
). For example, the strength of fruits and vegetables as a risk modifier for certain cancers tends to be more consistent among diverse populations than the prevalence of individuals consuming five or more fruits and vegetables per day. Therefore, the decision to generalize results from a single sample to a wider population requires consideration of many issues including sampling design, participation rates, and biological processes.
Importance of Estimating Sample Size
One of the common mistakes in research is failure to estimate an appropriate sample size before embarking on a research project. If the sample size is too small, even the best study cannot detect an important effect, and this can contribute to further confusion surrounding a topic. The process of estimating sample size can be technically complex, and a statistician can assist with this process. A reference written specifically for nutrition and dietetics that outlines seven steps to estimating sample size will be helpful to researchers (
). In addition, sample size calculators are available online. One example is a site created with support from the National Institutes of Health General Clinical Research Center Program (http://hedwig.mgh.harvard.edu/sample_size/size.html). Other programs can be found by using an online search engine to search for “sample size calculators.”
Access to these calculators is a wonderful convenience; however, it does not preclude a researcher from completing the steps to determine the data needed for the calculations. To prepare for a visit to a statistician or a Web site calculator, a researcher needs to complete at a minimum the steps outlined in Figure 1. The researcher is the most qualified individual to state the outcome to be measured, how the outcome will be measured, what will be a meaningful result, and how much variation exists in the selected measure. A statistician can assist with the selection of an appropriate statistical test, as well as provide guidance with regard to choosing an appropriate level of error and power for the study. A statistician cannot provide advice unless the researcher has either completed a pilot study or extracted from the published literature information from other studies measuring the same or similar outcomes.
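The arithmetic behind such calculators can be sketched for the common case of comparing two means. The sketch below uses the standard normal-approximation formula; the standard deviation, detectable difference, error level, and power are hypothetical inputs of the kind a researcher must supply after completing the steps in Figure 1:

```python
import math
from statistics import NormalDist

def n_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sample comparison of
    means (normal approximation; a statistician may refine this)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

# Illustration: detect a 5 mg/dL difference in mean LDL cholesterol when
# the standard deviation is assumed (from a pilot or the literature) to
# be 10 mg/dL, at alpha = .05 and 80% power.
print(n_per_group(sigma=10, delta=5))  # 63 per group
```

A statistician may adjust such an estimate further, for example with a t-distribution correction or an allowance for expected nonresponse and attrition.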
Before data analysis, all data need to be checked even if data entry involved verification methods (dual entry), scanning, computer entry, or Web-based entry. Useful first steps are to run frequency analyses of every variable and then review the results to ensure that the output matches the expected values for each variable. Errors, implausible values, and outlier values need to be checked and any errors need to be corrected. A height of 93 in among 5-year-olds is outside of the expected range of 39 to 47 in and most likely represents an error that needs to be corrected before data analysis can begin. If a more realistic value is not available and the value of 93 in is considered biologically implausible, then the value is best changed to missing. On the other hand, if the value is checked and it is within the realm of reality (eg, 49 in) then it would be considered an outlier. Because an outlier represents an observational value, any analyses should include the outlier value. Under these circumstances, analysis methods to consider using include separation into groups, such as quartiles, and nonparametric analytical methods (a topic of a future article in this series). As an alternative, results can be presented with and without outliers. Once the data have been checked and edited, the first step of analysis is to simply look at the data using descriptive statistics. The purpose of this step is to become familiar with the data and to create a description of the study population (
For quantitative variables, these might be stemplots or frequency histograms. Use of stemplots to examine the shape of a distribution and to detect outliers is thoroughly discussed by Moore and McCabe (
). There are no simple rules for dealing with outliers in data unless, of course, the outlier represents an error, in which case it may be removed. Otherwise, the researchers need to communicate clearly any decisions regarding the handling of an outlier. Other summary statistics would be means, medians, standard deviations, and ranges of values. For categorical variables, frequency distributions would be completed.
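These first checking steps can be sketched in code. The heights below are hypothetical and echo the 5-year-old example above; the expected range of 39 to 47 in drives the flagging rule:

```python
from collections import Counter

# Hypothetical heights (inches) recorded for 5-year-olds; 93 is a
# likely data-entry error, 49 a plausible outlier.
heights = [42, 44, 41, 45, 93, 43, 49, 40, 44, 46]

# Frequency check: review whether the output matches expected values.
print(Counter(heights))

# Flag values outside the expected range of 39 to 47 in for review
# against source records before analysis begins.
flagged = [h for h in heights if not 39 <= h <= 47]
print(flagged)  # [93, 49]

# A confirmed error with no correction available becomes missing (None);
# a verified-but-extreme value (49) is retained as an outlier.
cleaned = [None if h == 93 else h for h in heights]
```

The flagging rule is only a screen; the decision to correct, set to missing, or retain each flagged value still rests with the researcher, as described above.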
The data may become more meaningful when groups are created and the proportion of subjects in each group is calculated (eg, overweight and not overweight). The results of this step often become the first table in a manuscript that outlines the characteristics of the study sample (eg, sex, age, and body mass index). See Figure 2 for an example of a characteristics table adapted from Boushey and colleagues (
) who included in the statistical analyses section, “To examine the characteristics of adolescents …, descriptive statistics were calculated on the adolescents who responded to all of the measures used in the analysis.”
Upon completing a study, a researcher ultimately wants to know the outcome of the hypothesis (eg, Did the intervention make a difference? Are vitamin levels in location A different from vitamin levels in location B?). An observed effect that is so large that it would rarely occur by chance is called statistically significant. To make a conclusion about a result being statistically significant, the appropriate statistical test needs to be completed.
Before data collection, a researcher would have planned the inferential statistics to be used for data analysis (
). The decision trees in these flowcharts are instructive because they highlight the importance of determining in advance the hypothesis, study design, and types of variables to be measured (these concepts were covered in the first article in this series [
) and an interactive online flowchart is available from the Institute for Social Research at The University of Michigan (http://www.microsiris.com/Statistical%20Decision%20Tree/). Two examples of following a decision tree are outlined in Figure 3 and Figure 4. A review of several concepts, such as normal distributions and independent or dependent samples, will assist with appreciating the importance of making appropriate choices for statistical inference.
Density Curves and Normal Distributions
The most widely recognized measure of central tendency is the arithmetic mean, or average. For the following five serum low-density lipoprotein (LDL) cholesterol level measurements: 74 mg/dL, 94 mg/dL, 113 mg/dL, 121 mg/dL, and 135 mg/dL, the mean is 107 mg/dL. The mean is the sum of the five values divided by the number of observations. When summarizing data, it is appropriate to report the mean when the data are normally distributed. Statistical tests that compare means, such as two-sample t tests and analysis of variance (ANOVA), are called parametric tests and assume that the data from the samples being compared are normally distributed. Adherence to a normal distribution can be evaluated by creating a density curve or plotting the data as a histogram and overlaying a normal curve, as shown in Figure 5A for estimated dietary calcium intake among sixth-grade girls. Computer programs (Figure 6) enable researchers to easily examine whether data adhere to the parameters of a normal distribution, avoiding tedious plotting by hand. An alternative to the histogram is a normal probability plot, which plots a variable’s cumulative proportions against the cumulative proportions of a normal distribution. If the selected variable adheres to the normal distribution, the points cluster around a straight line, as shown in Figure 5B for the same data shown in Figure 5A.
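The straight-line check behind a normal probability plot can also be computed directly: the correlation between the sorted data and the corresponding normal quantiles should be close to 1 when the data are approximately normal. The calcium values below are hypothetical, and the plotting-position convention used is one of several in common use:

```python
from statistics import NormalDist, mean, stdev

def normal_plot_r(data):
    """Correlation between sorted data and standard normal quantiles --
    the numeric analogue of checking whether a normal probability plot
    is close to a straight line (values near 1 suggest normality)."""
    n = len(data)
    xs = sorted(data)
    # Plotting positions (i + 0.5)/n, one common convention
    qs = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    mx, mq = mean(xs), mean(qs)
    cov = sum((x - mx) * (q - mq) for x, q in zip(xs, qs)) / (n - 1)
    return cov / (stdev(xs) * stdev(qs))

# Hypothetical, roughly symmetric calcium intakes (mg/day)
calcium = [612, 745, 798, 810, 835, 852, 880, 901, 955, 1100]
print(round(normal_plot_r(calcium), 3))  # close to 1
```

Formal tests of normality exist as well; this statistic is simply the numeric counterpart of the visual inspection described above.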
When the data are not normally distributed, with skewed distributions or extreme values, the mean is not a very good measure of central tendency. For example, in the LDL cholesterol example given above, if the last listed value, 135 mg/dL, is changed to 180 mg/dL, the mean changes dramatically to 116 mg/dL. In this case, the median is a better descriptor of the data. The median is the data value that splits the data array in half: half of the data values are below the median and half above it. For the LDL cholesterol examples above, the median is 113 mg/dL. Notice that it does not change with the addition of an extreme value. When data are not normally distributed, it is important to report medians, not means, unless the data can be successfully transformed to normality, for example by using a log or trigonometric transformation. Dietary and nutrient data are often skewed, as shown in Figure 7 (A1 and B1). For a researcher to proceed with parametric statistical inference, it is necessary to transform the data by applying a function such as the logarithm or a cube root. There are systematic principles that describe how transformations perform and can speed the process of applying an appropriate transformation (
). For example, right-skewed data, as shown in Figure 7 (A1 and B1), can usually be transformed to a more normal distribution with the use of the natural logarithm, as shown in Figure 7 (A2 and B2). It is important to note that the mean of the transformed data, and not the mean of the raw, untransformed data, must be reported in this case. Because the transformed mean is often not very meaningful to a reader, reporting the median can also be useful.
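A minimal sketch of such a transformation, using hypothetical right-skewed intake values, shows the mean being pulled above the median by the long right tail and the natural logarithm compressing that tail:

```python
import math
from statistics import mean, median

# Hypothetical right-skewed intake values (eg, mg/day of a nutrient)
intakes = [120, 150, 160, 180, 200, 210, 250, 300, 420, 900]

# The mean exceeds the median, a sign of right skew.
print(mean(intakes), median(intakes))

# The natural-log transform compresses the right tail toward symmetry.
log_intakes = [math.log(x) for x in intakes]

# When inference uses the transform, report the transformed mean (or,
# more interpretably, the median of the raw data) -- not the raw mean.
print(mean(log_intakes))
```

After any transformation, the adherence of the transformed values to a normal distribution should be rechecked before proceeding with parametric tests.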
If the data are not normally distributed and cannot be successfully transformed, then parametric tests cannot be used. Nonparametric tests must be used instead, such as the Mann-Whitney test or the Kruskal-Wallis test, which use proportions or rankings as the measures for comparison. As an alternative, the data can be divided into groups to create a categorical variable. For categorical data, the χ2 test (independent samples) or the McNemar test (dependent samples) can be used. Assumptions and uses of nonparametric tests and tests using the binomial distribution will be a topic of a future article in this series.
A common mistake researchers make is not testing their data for normality. This can sometimes result in authors reporting means and standard deviations and using parametric tests for inferential statistics when the data are not normally distributed. Thus, the results are not valid. The analytical step of inspecting the data for adherence to a normal distribution is a fundamental piece of information to include. For example, authors can write, “Normal probability plots were used to assess the need for transformations. No variable required a transformation.” Or, if a variable needed transformation, specify the variable and transformation function used.
It is important to note that standard deviations also lose their relevancy if the data are not normally distributed. Standard deviations use all of the data, including extreme values, and the mean in their calculation. It is not appropriate to report standard deviations for data that do not meet the assumptions of a normal distribution.
Comparing Means of Independent and Dependent Samples
In the first example of using the flowchart decision tree (Figure 3), the exposure of interest was a categorical variable that created two distinct groups: girls who like milk and girls who dislike milk. The outcome of interest was dietary calcium intake, which is a quantitative variable. Thus, before applying a statistical test, it was essential to determine whether the calcium data met the assumptions of a normal distribution. For this particular case, the recommended statistical inference test was the two-sample t test or independent samples t test (same test, different name). This test is used when analysis involves a dichotomous categorical variable and a quantitative variable, and it is commonly used in dietetics and nutrition research. For example, the two-sample t test would be used to compare means of glycated hemoglobin (quantitative variable) between two groups (dichotomous categorical variable) of individuals with type 1 diabetes: one group using a traditional insulin injection regimen and one group using a tightly controlled, multiple injection strategy. Examples of articles that used the two-sample t test can be found in work by Lohse and colleagues (
Sometimes the two-sample t test is confused with the paired t test. These are not the same. A paired t test is used when comparing a quantitative variable with related samples. Examples of this include measurements of the same patients before and after an intervention or pretest and posttest scores of students participating in a nutrition education session. See Figure 4 for the decisions that occur in the flowchart decision tree when working with dependent samples. Another example of using the paired t test is provided by Klohe-Lehman and colleagues (
) when comparing pre- and postintervention scores for nutrition knowledge.
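The two statistics can be computed from their textbook formulas, which makes the distinction concrete: the two-sample t test pools variability across two independent groups, whereas the paired t test is a one-sample test on within-subject differences. All values below are hypothetical:

```python
import math
from statistics import mean, stdev

def two_sample_t(a, b):
    """Pooled two-sample t statistic (equal-variance form)."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))

def paired_t(before, after):
    """Paired t statistic: a one-sample t on within-subject differences."""
    d = [post - pre for pre, post in zip(before, after)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

# Hypothetical HbA1c values (%) for two independent treatment groups
group1 = [7.8, 8.1, 7.5, 8.4, 7.9]
group2 = [7.1, 7.4, 6.9, 7.6, 7.2]
print(round(two_sample_t(group1, group2), 2))

# Hypothetical pre/post knowledge scores for the same five students
pre = [62, 70, 55, 68, 74]
post = [71, 75, 60, 80, 78]
print(round(paired_t(pre, post), 2))
```

In practice, statistical software reports these statistics with their P values; the point of the sketch is that independent and dependent samples feed different formulas, so interchanging the two tests is an error.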
The availability of computer programs designed to calculate a wide variety of statistical tests (Figure 6) has greatly reduced the burden of completing descriptive and inferential statistics. Yet these programs do not have built-in systems to check whether the statistical test used is appropriate. A researcher is still responsible for ensuring that the statistical tests used are appropriate based on the research design, sampling frame, data distributions, and outcomes. More importantly, the interpretation of any statistical test can only be made by a researcher; it is not provided by the computer program. Any data preparation to transform variables or recode a quantitative variable to a categorical variable is still the task of the investigator. For example, the data preparation will differ when the study design involves independent samples vs dependent samples, as shown in Figure 8. Authors need to include information about any computer program used, including program name, version, version release date, company name, and company location.
It is important to appreciate that certain assumptions must be met when conducting t tests (or any test, for that matter). For the two-sample t test, the data from each sample must be normally distributed or be mathematically transformed as such. For the paired t test, the difference in the before and after measures must be normally distributed. These tests are robust when it comes to minor deviations from normality; however, nonparametric versions must be used when the violation of this assumption is more substantial (
Correlation is the measure to use when looking for a potential relationship between two quantitative variables. A common correlation of interest in research is the relationship between two similar measures to see if one can be substituted for another; for example, dietary calcium intake estimated from a food frequency questionnaire vs 24-hour food recalls (
). A correlation is often calculated when conducting cross-sectional research because relationships between variables are analyzed at one isolated point in time. Correlations do not distinguish between the explanatory variable and the response variable; rather, they quantify the relationship.
Correlations measure the direction and magnitude of a relationship between two quantitative variables by means of the correlation coefficient. The Pearson correlation coefficient is usually written as r and has an absolute value and a sign. The absolute value represents the magnitude of the association between variables; the sign of r indicates the direction of the relationship. Correlation coefficients can range from −1.0 to 1.0. The larger the absolute value of the coefficient, the stronger the relationship. The interpretation of the coefficient is dependent on the discipline and the variability that exists in the items being measured. A correlation of 0.9 may be very low under the condition of verifying a physical law using high-quality instruments, but may be regarded as very high in the social sciences, where few factors can actually be controlled. Further, the statistical significance of a correlation varies with sample size: at large sample sizes, even small correlations can be statistically significant. Thus, set cutpoints for interpretation of coefficients are in some ways arbitrary and should be used with discretion.
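The Pearson coefficient can be computed directly from its standard formula. The paired intake estimates below are hypothetical and mimic the questionnaire-vs-recall comparison described above:

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson correlation coefficient from the standard formula:
    sample covariance divided by the product of the standard deviations."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

# Hypothetical calcium intakes (mg/day) for the same ten subjects,
# estimated by a food frequency questionnaire and by 24-hour recalls.
ffq    = [620, 750, 800, 980, 1020, 1100, 680, 890, 940, 770]
recall = [580, 720, 850, 940, 1000, 1180, 640, 860, 900, 800]

r = pearson_r(ffq, recall)
print(round(r, 2))  # close to 1: the two methods rank subjects similarly
```

Recent Python versions also provide this calculation directly (statistics.correlation, Python 3.10 and later); the explicit formula is shown here to make the definition visible.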
If the sign is negative, then as one variable increases the other tends to decrease; this is described as a negative, or inverse, correlation. If the sign is positive, there is a direct, or positive, relationship: as one variable increases, the other variable also increases. A coefficient of −1.0 or 1.0 means there is a perfect correlation between the variables, either inverse or direct (
As with other parametric tests, assumptions must be met to conduct the test. The values for each variable being compared must be normally distributed or transformed as such. When describing the results of the correlation between two variables, provide the means and standard deviations of both variables, as well as the r value and the direction of the relationship as either positive or negative (
One-way ANOVA is used when a relationship is being examined between a quantitative dependent variable and a categorical independent variable with more than two groups; if the categorical variable has only two groups, a two-sample t test is used, as discussed above. This test of statistical inference is used to test the hypothesis that several means are equal. For example, a one-way ANOVA would be used to compare mean hemoglobin A1c (HbA1c) values among persons with type 2 diabetes receiving an exercise program, a low-glycemic-index diet, metformin, or a placebo. The ANOVA result only determines whether there is a statistically significant difference somewhere among the means, not which groups are different from one another. If the test statistic (F statistic) for an ANOVA is significant (eg, P<0.05), then the investigator can proceed with post hoc tests, which can distinguish which specific groups have statistically significantly different means. Common post hoc tests include the Bonferroni, Scheffe, and Tukey tests. There are principles to apply when selecting an appropriate post hoc test to determine which specific groups are statistically significantly different (
) for a step-by-step sequence of analysis using ANOVA accompanied by instructive visuals.
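The F statistic at the heart of one-way ANOVA can be computed from its definition: the between-group mean square divided by the within-group mean square. The HbA1c values below are hypothetical illustrations of the four treatment groups described above:

```python
from statistics import mean

def one_way_anova_f(*groups):
    """F statistic for one-way ANOVA: between-group mean square
    divided by within-group mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = mean([x for g in groups for x in g])
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical HbA1c values (%) under the four treatments
exercise  = [7.0, 7.2, 6.8, 7.1]
diet      = [6.9, 7.0, 6.7, 6.8]
metformin = [6.4, 6.6, 6.3, 6.5]
placebo   = [7.5, 7.6, 7.4, 7.7]
print(round(one_way_anova_f(exercise, diet, metformin, placebo), 1))
```

A significant F statistic would then justify post hoc comparisons (eg, Tukey tests, which statistical software computes) to locate which specific groups differ; the F statistic alone does not identify them.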
There are other ANOVA statistics that are used when the research question is more complex and requires more than a single variable. These techniques are called multifactor or multiway ANOVA, repeated measures ANOVA, and multivariate ANOVA (MANOVA). All types of ANOVA assume that the samples compared are normally distributed and that variances between samples are equal. If there is too much deviation from these assumptions, nonparametric versions of these tests must be used. Some of these ANOVA methods are not necessarily covered in an introductory statistics class; thus, researchers are encouraged to consult a statistician if their research questions are similar to the scenarios outlined below. For those individuals familiar with these tests who merely desire a refresher, there are comprehensive textbooks (
Multifactor ANOVA is used when a relationship is being examined between a continuous variable and more than one categorical variable. For example, continuing with the diabetes groups above, it may also be important to examine differences in HbA1c levels between African Americans and non-Hispanic whites as well as between treatment groups. With multifactor ANOVA, the independent effects of treatment group and race/ethnicity, as well as their joint effects, can be examined. The independent effects are called main effects and the joint effects are referred to as interactions. If the interactions are found to be statistically significant, then the relevance of the main effects is moot. For example, if metformin is shown to significantly lower HbA1c for the African-American group, but not the non-Hispanic white group, then the general question, “Is metformin a more effective treatment?” becomes irrelevant. The results would indicate that the effectiveness depends on whether the subject was in the African-American group or the non-Hispanic white group. Similar to one-way ANOVA, post hoc tests need to be used to find where specific differences between groups exist.
Repeated measures ANOVA is used to compare changes in a continuous variable over time or changes in a group of subjects when different treatments are used. If a researcher wanted to look at the effect of different diets on systolic blood pressure, one approach might be to recruit 50 people who would consume four diets in a randomized order. Each diet would be consumed for 2 weeks followed by a 2-week washout period. The four diets might be a typical American diet, the Dietary Approaches to Stop Hypertension diet, the Mediterranean diet, and a high animal protein/low carbohydrate diet. Systolic blood pressure would be measured for each person at the beginning of each diet period, at specified times within each diet period, and at the end of each diet period. Repeated measures ANOVA could be used to look at the differential effect of the diets on systolic blood pressure. Because all subjects would receive all diets, there would be a need to account for the variation within each subject; repeated measures ANOVA accounts for this within-subject variation as well as the between-subject variation. Just as with one-way ANOVA, post hoc tests are needed to determine where the differences between diets exist. The effectiveness of interventions is often assessed with repeated measures ANOVA (
MANOVA is used when a relationship between one or more categorical variables and more than one continuous variable is being examined. Using one of the examples given above, the research question now expands to examining the relationship between different treatment modalities (independent variables) for type 2 diabetes (ie, exercise program, low-glycemic-index diet, metformin, or placebo) and both HbA1c and LDL cholesterol levels (dependent variables). This multivariate statistical test allows for a result with one statistical test rather than completing multiple tests. The changes in HbA1c and serum LDL cholesterol levels could be tested separately; however, doing multiple tests on the same sample increases the chance of committing a Type I error. Thus, the completion of a single MANOVA decreases the risk of a Type I error. With MANOVA, post hoc tests will need to be conducted to ferret out the specific categorical differences. A description of the variables used and the inferential statistics is important. Rousset and colleagues (
) used MANOVA to assess the effectiveness of nutrition messages, and the influence of sex, in changing the consumption of six protein-rich foods.
All research begins with a research question that is the precursor to a testable hypothesis. Implementation of study design, sample size calculations, sampling methods, and inferential statistics provides the basis to assess whether the hypothesis is supported. The opportunities to develop meaningful research hypotheses occur frequently in the field of dietetics, and food and nutrition professionals are encouraged to pursue research methods to assist with building an evidence base for practice. The final step in research is preparing a report that may enter the peer-review process as a manuscript. To make the publishing process less intimidating, prospective authors can follow the information outlined in this series of articles devoted to publishing nutrition research.
Data presented in this article come from actual research problems and have been modified to illustrate the ideas presented.
American Dietetic Association. Journal of the American Dietetic Association guidelines for authors.