Clark 2019 - Constructing Validity - New Developments in Creating Objective Measuring Instruments

Clark, L. A., & Watson, D. (2019). Constructing validity: New developments in creating objective measuring instruments. Psychological Assessment, 31(12), 1412.

Summary

This paper, from Clark and Watson, is an update to their highly cited 1995 paper with a similar name. Those two papers, along with the Hinkin et al. 1997 piece on scale construction (and maybe even the Loevinger 1957 paper), provide a valuable set of recommendations that should be reviewed before embarking on further scale development.

Scale generation starts with conceptualization. This should include an in-depth literature review of the construct, its existing measures, and its near neighbors and their respective measures. Proceed only if a new scale could offer “either a theoretical or an empirical improvement over existing measures or fill an important measurement gap.” Consider the hierarchical level that your proposed scale will target (Stanton 2001 is a good example of a broad scale). Broad constructs need to be defined explicitly in terms of their lower-level components. Orphan and interstitial constructs are special cases, and conglomerate constructs are not recommended.

Page 6 provides a mind-bending explanation of how a single item might serve as an indicator of five different constructs; hence, item wording should be carefully considered. When putting together an initial pool of items, the authors recommend that “the initial pool should be broader and more comprehensive than one’s theoretical view of the target construct and include content that ultimately will be eliminated.” If in doubt, include it. Additionally, each subscale of the construct should be represented by an adequate number of items (referred to as sampling).

“Items should be simple, straightforward, and appropriate for the target population’s reading level. Avoid (1) expressions that may become dated quickly; (2) colloquialisms that may not be familiar across age, ethnicity, region, gender, and so forth; (3) items that virtually everyone (e.g., “Sometimes I am happier than at other times”) or no one (e.g., “I am always furious”) will endorse; and (4) complex or “double-barreled” items that assess more than one characteristic; for example, “I would never drink and drive for fear that I might be stopped by the police,” assesses both a behavior’s (non)occurrence and a putative motive.”

Contrary to the work cited by Stanton 2001, this article cites research showing that psychometric quality improved as the number of response options increased (up to six), compared with dichotomous responses. Research is also cited that showed “no systematic differences between odd [allowing a neutral middle point] versus even number of response options.” When alternate versions of a scale are developed (whether translations or short forms), revalidation is rarely done, though it should be. Scale generation should be iterative, and the first pass at the data should use exploratory factor analysis to identify factors (which would ideally correspond to your subscales). Later, the scale should be validated on a different sample using confirmatory factor analysis. The authors also recommend examining both oblique and orthogonal rotations.

“For most purposes, we recommend eliminating items (a) with primary loadings below .35-to-.40 (for broader scales; below .45-to-.50 for narrower scales) and (b) that have similar or stronger loadings on other factors…”
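The elimination rule in that quote can be sketched in a few lines of Python. This is only an illustration: the loadings are made-up numbers, `retain_items` is a hypothetical helper, and the `min_gap` parameter is my own assumption for what counts as a “similar” cross-loading (the quote gives no number for that).

```python
def retain_items(loadings, min_primary=0.40, min_gap=0.10):
    """Keep items with a strong primary loading and clearly weaker cross-loadings.

    loadings: dict mapping item name -> list of loadings, one per factor.
    min_primary: cutoff for the primary loading (.35-.40 for broader scales,
        .45-.50 for narrower ones, per the quoted recommendation).
    min_gap: ASSUMPTION, not from the paper; minimum distance between the
        primary loading and the strongest cross-loading.
    """
    kept = []
    for item, row in loadings.items():
        magnitudes = sorted((abs(x) for x in row), reverse=True)
        primary = magnitudes[0]
        strongest_cross = magnitudes[1] if len(magnitudes) > 1 else 0.0
        if primary >= min_primary and primary - strongest_cross >= min_gap:
            kept.append(item)
    return kept


# Made-up two-factor loadings for illustration only.
loadings = {
    "item1": [0.62, 0.10],   # strong primary, weak cross-loading
    "item2": [0.30, 0.05],   # primary loading below the cutoff
    "item3": [0.45, 0.42],   # cross-loading too similar to the primary
    "item4": [-0.55, 0.12],  # sign is ignored; only magnitude matters here
}
print(retain_items(loadings))
```

For a narrower scale, the stricter cutoff would be passed explicitly, e.g. `retain_items(loadings, min_primary=0.50)`.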

Item response theory (IRT) is addressed briefly. It enables computerized adaptive tests, which select the items that are maximally informative for each respondent based on their previous answers. This is similar to how the GRE gives progressively harder questions if you keep answering them correctly. The draw of IRT is that it can frequently reduce measure length by over 50%.
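To make “maximally informative” concrete, here is a minimal sketch of adaptive item selection. It assumes a two-parameter logistic (2PL) model, which is my choice for illustration and not something the paper prescribes; the item names and parameters are invented, and a real adaptive test would also re-estimate the respondent's trait level after each answer.

```python
import math


def prob_endorse(theta, a, b):
    """2PL probability of endorsing an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at trait level theta: a^2 * P * (1 - P)."""
    p = prob_endorse(theta, a, b)
    return a * a * p * (1.0 - p)


def most_informative(theta, items, administered=()):
    """Pick the not-yet-administered item with the highest information at theta."""
    candidates = {name: params for name, params in items.items()
                  if name not in administered}
    return max(candidates, key=lambda name: item_information(theta, *candidates[name]))


# Hypothetical item bank: name -> (discrimination a, difficulty b).
items = {"easy": (1.2, -1.5), "medium": (1.2, 0.0), "hard": (1.2, 1.5)}

print(most_informative(0.0, items))  # current trait estimate near the middle
print(most_informative(1.4, items))  # current trait estimate is high
```

With equal discriminations, an item is most informative when its difficulty sits near the respondent's current trait estimate, which is why the selected items get “harder” as the estimate rises.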

For initial data collection, scales for “near neighbor” constructs should be included as well, allowing the researchers to assess divergent validity. Measures of similar constructs should also be included so that convergent validity can be assessed. Items should then be revised, and the iterative cycle of scale generation should continue.

“Good scale construction is an iterative process involving an initial cycle of preliminary measure development, data collection, and psychometric evaluation, followed by at least one additional cycle of revision of both measure and construct, data collection, psychometric evaluation, revision. . .. The most often neglected aspect of this process is revision of the target construct’s conceptualization.”

Items that generate unbalanced, highly skewed response distributions should be carefully considered for elimination: if most people give the same answer, the item yields very little information. For internal consistency, the usual benchmark is a Cronbach’s alpha of at least .80 (see Nunnally 1978). The authors, however, recommend the more robust average interitem correlation (AIC) over alpha, aiming for an AIC in the range of .15-.50. To ensure unidimensionality, the individual interitem correlations should also fall between .15 and .50. Finally, incremental validity, which “demonstrat[es] that a measure adds significantly to the prediction of a criterion over and above what can be predicted by other sources of data,” should be assessed.
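The AIC and alpha checks are easy to compute by hand. The sketch below uses only the Python standard library and made-up responses; the function names are my own. `standardized_alpha` uses the standard formula relating alpha to the number of items and the mean interitem correlation, which shows why alpha climbs with scale length even when the AIC stays the same.

```python
from itertools import combinations
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation between two equal-length response lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


def interitem_correlations(items):
    """Correlations for every pair of items; `items` is a list of columns."""
    return [pearson(x, y) for x, y in combinations(items, 2)]


def cronbach_alpha(items):
    """Raw alpha from the item variances and the variance of the total score."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]
    item_variance = sum(variance(col) for col in items)
    return (k / (k - 1)) * (1 - item_variance / variance(totals))


def standardized_alpha(k, mean_r):
    """Alpha implied by k items with average interitem correlation mean_r."""
    return k * mean_r / (1 + (k - 1) * mean_r)


# Made-up data: three items (columns) answered by six people on a 1-5 scale.
items = [
    [1, 2, 3, 4, 5, 3],
    [2, 1, 4, 3, 5, 2],
    [3, 3, 2, 5, 4, 1],
]
rs = interitem_correlations(items)
print("AIC:", round(mean(rs), 2))
print("all pairs within .15-.50:", all(0.15 <= r <= 0.50 for r in rs))
print("alpha:", round(cronbach_alpha(items), 2))
# Same AIC, different scale lengths: alpha rises with length alone.
print(round(standardized_alpha(10, 0.30), 2), round(standardized_alpha(40, 0.30), 2))
```

Checking every pairwise correlation, and not just the mean, matters because a mean in range can hide clusters of very high and very low correlations.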

Application

If you had to read one article on scale generation, this is probably the one that I’d recommend. It gives a higher-level picture than the Hinkin article and provides a better review of current research than Clark and Watson’s 1995 article does.

This post is licensed under CC BY 4.0 by the author.
