Modeling General, Specific, and Method Variance in Personality Measures: Results for ZKA-PQ and NEO-PI-R

Contemporary models of personality assume a hierarchical structure in which broader traits contain narrower traits. Individual differences in response styles also constitute a source of score variance. In this study, the bifactor model is applied to separate these sources of variance for personality subscores. The procedure is illustrated using data for two personality inventories—NEO Personality Inventory–Revised and Zuckerman–Kuhlman–Aluja Personality Questionnaire. The inclusion of the acquiescence method factor generally improved the fit to acceptable levels for the Zuckerman–Kuhlman–Aluja Personality Questionnaire, but not for the NEO Personality Inventory–Revised. This effect was higher in subscales where the number of direct and reverse items is not balanced. Loadings on the specific factors were usually smaller than the loadings on the general factor. In some cases, part of the variance was due to domains being different from the main one. This information is of particular interest to researchers as they can identify which subscale scores have more potential to increase predictive validity.

In the most common scenario, a raw facet subscale score is obtained as a sum of item responses. When we do this, we cannot differentiate between facets where variance is explained only by the theoretically expected domain (i.e., pure facet scores) and facets where variance is explained by more than one domain (i.e., blended facet scores). This is why several authors (e.g., Anglim & Grant, 2014; Salgado, Moscoso, & Berges, 2013) consider that incremental validity should be shown for the residualized facet scores instead. Likewise, but less acknowledged, the relationship between the residualized facet score and an external criterion might still not be due to the narrow trait but to a specific subset of items or, in the case of self-report external measures, to method artifacts such as the acquiescence bias. Accordingly, Danner, Aichholzer, and Rammstedt (2015) have recently shown that acquiescence bias is moderately stable over time, is consistent across different question types, and can bias the relationship with other variables.
Taking all of the above into account, we propose that analyzing the internal structure of multidimensional questionnaires through multidimensional bifactor models that also incorporate an acquiescence method factor can be useful to separate several sources of variance (general, specific, and acquiescence bias) for each facet score. In addition, while the question about the usefulness of narrow traits should be answered empirically for each individual case (Ashton, Paunonen, & Lee, 2014), we suggest here that an upper limit for the magnitude with which each residualized facet score can predict external criteria can be determined by analyzing its specific variance. These proposed analyses are carried out in the present study for two personality inventories: the NEO-PI-R and the ZKA-PQ. In the next two sections, we provide details about the bifactor model and the procedures for modeling acquiescence bias.

The Bifactor Model
Application of bifactor models has increased dramatically in the past 10 years. Nowadays, the use of bifactor models is increasingly common in a variety of fields, such as intelligence (Canivez & Watkins, 2010a, 2010b; Gignac & Watkins, 2013), antisocial behavior (Tackett, Daoud, De Bolle, & Burt, 2013), and psychopathy (Patrick, Hicks, Nichol, & Krueger, 2007). In the personality field, Chen, Hayes, Carver, Laurenceau, and Zhang (2012) have illustrated the application of bifactor models to the extraversion domain of the NEO-PI-R in order to test its multifaceted structure.
Bifactor models (see Figure 1C) include a general factor (e.g., broad personality trait) on which all the items load and several orthogonal specific factors (e.g., narrow personality traits) that represent the common variance that is not explained by the general factor. Thus, the item common variance is decomposed directly into general and specific common variance. Taking the factor structure of the ZKA-PQ as an example, the variance of the facet scales depends simultaneously on the broader (e.g., neuroticism) and the narrower (e.g., anxiety, depression) latent constructs. We can explore which proportion of the subscale score variance is due only to the narrower latent trait using the omega reliability coefficient of the subscale (ω_s; Reise, Bonifay, & Haviland, 2012), which is a reliability estimate for the residualized facet score (i.e., after subtracting the effects of the broader domain factor). Each broad domain scale is assumed to tap a unique broad factor (e.g., neuroticism).

Modeling Acquiescence
It is generally agreed that individual differences in the response style are an important nuisance factor that can have systematic effects on the item covariance structure and can constitute an important source of misfit (e.g., Danner et al., 2015; McCrae, Herbst, & Costa, 2001; Podsakoff, MacKenzie, & Podsakoff, 2012; Rammstedt & Farmer, 2013; Soto, John, Gosling, & Potter, 2008). Thus, it is interesting to find ways of modeling this source of variance so that the model fit can be improved. One of the response styles that can cause distortions in the assessments is acquiescence bias. Acquiescence (disacquiescence) represents the preference for the positive (negative) side of the rating scale (Weijters, Baumgartner, & Schillewaert, 2013). This response style produces inconsistent responses to direct (i.e., positively worded) and reversed (i.e., negatively worded) items. Recently, Savalei and Falk (2014) compared different methods for dealing with acquiescence bias. Savalei and Falk suggest that the best way of modeling acquiescence is by adding a general method factor in the confirmatory factor analysis (CFA) context. The model then incorporates an additional general factor that is orthogonal to all substantive common factors. A person with a high level in this acquiescence method factor is expected to endorse all items no matter what their content. To model this process in a database where items have not been recoded, the factor loadings on this method factor are all fixed to 1. This is equivalent to setting the loadings to 1 for the direct items and to −1 for the reversed items in a recoded database. The parameter of interest to be estimated is the variance of the method factor. This variance is an indicator of the size of the acquiescence bias. It must be noted that in this model all method factor loadings are set to be equal for model identification.
In this sense, this model intends to capture an individual's tendency to use the response categories in a consistent manner across items but idiosyncratically among individuals. The parameter recovery of this CFA method has been shown to be robust to the violations of these implicit assumptions (Savalei & Falk, 2014).
In short, the main goal of this study is to decompose personality item variability into its relevant components (i.e., domain, narrow, and acquiescence factors). To do so, a factorial structure must be assumed. This is not an easy task because, as occurs with the NEO-PI-R and ZKA-PQ, personality tests are frequently composed of a large number of items and dimensions. For example, the NEO-PI-R has 5 primary dimensions and 30 narrow factors that are supposed to account for the variance of its 240 items. This large number of items and dimensions precludes modeling all the items in the same analysis. Thus, here we propose two levels of analysis. First, each domain will be analyzed separately at the item level. Second, the number of variables will be reduced using item parceling (Marsh, Lüdtke, Nagengast, Morin, & Von Davier, 2013). This allows all domains to be modeled at the same time, which is of particular interest because several facet scores may be interstitial (i.e., influenced by more than one personality domain). If all domains are not modeled within the same analysis, it may happen that the variance attributed to a facet (e.g., impulsiveness) that has a positive secondary loading on a broad domain other than the theoretically expected one (e.g., extraversion) will be overestimated. On the other hand, if two facets within the same broad domain have opposite loadings on another broad domain factor, the correlation between them will be suppressed, and this may also reduce their loading on the intended broad domain factor. Taking these limitations into account, we decided to explore these two levels of analysis. In view of all the above, the specific goals of the present study were (a) to test whether the inclusion of an acquiescence method factor substantially improves the fit of the model at the item and parcel levels and (b) to assess whether specific latent variance can be reliably measured through facet scores at the item and parcel levels.
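The equivalence between the two codings described above (loadings of 1 on unrecoded items versus +1/−1 on recoded items) can be checked with a small noise-free simulation. This is only an illustrative sketch: the four-item scale, the loadings, and the factor scores are hypothetical, and recoding is represented by a sign flip, which is the centered continuous analogue of reversing a rating scale.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical factor scores: one substantive trait and one acquiescence factor.
trait = rng.normal(size=n)
acq = rng.normal(scale=0.5, size=n)
trait -= trait.mean()
acq -= acq.mean()
# Make the two score vectors exactly orthogonal in this sample,
# mirroring the orthogonality assumed by the method-factor model.
acq -= trait * (trait @ acq) / (trait @ trait)

keys = np.array([1, 1, -1, -1])        # +1 direct item, -1 reversed item
lam = np.array([0.7, 0.6, 0.7, 0.6])   # hypothetical trait loadings

# Unrecoded responses: acquiescence pushes every item in the same direction.
raw = np.outer(trait, keys * lam) + np.outer(acq, np.ones(4))
# Recoding reverses the reversed items (sign flip in this centered sketch).
recoded = raw * keys

def slope_on(x, f):
    """Regression slope of item x on factor score f (both centered)."""
    return (x - x.mean()) @ f / (f @ f)

raw_slopes = np.array([slope_on(raw[:, j], acq) for j in range(4)])
rec_slopes = np.array([slope_on(recoded[:, j], acq) for j in range(4)])
```

In the unrecoded data every item has a slope of 1 on the acquiescence scores, whereas in the recoded data the slopes follow the item keying.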

Method
Participants and Procedure
García, Escorial, García, Blanch, and Aluja (2012) originally used the data for this study to explore the convergent and discriminant validity of the ZKA-PQ. This database includes the responses of 653 persons (317 men, 336 women) with a mean age of 44.9 years (SD = 17.16) to the ZKA-PQ and NEO-PI-R questionnaires. The sample was distributed among the following age ranges: 18 to 30 years, 20.7%; 31 to 40 years, 19.8%; 41 to 50 years, 20.7%; 51 to 60 years, 20.2%; and older than 60 years, 18.7%. Undergraduate students cooperated in the data collection of this study for course credit. The students had instructions to obtain one man and one woman from the following age subgroups: 18 to 30 years, 31 to 40 years, 41 to 50 years, 51 to 60 years, and >60 years. Subjects were anonymous volunteers recruited among the general population.

Instruments
The ZKA-PQ (Aluja et al., 2010) is composed of 200 items with a 4-point Likert-type response format ranging from 1 (strongly disagree) to 4 (strongly agree). This instrument measures four facets for each of Zuckerman's alternative five factors of personality: aggressiveness (physical aggression, verbal aggression, anger, and hostility), sensation seeking (thrill and adventure seeking, experience seeking, disinhibition, and boredom susceptibility/impulsivity), activity (work compulsion, general activity, restlessness, and work energy), extraversion (positive emotions, social warmth, exhibitionism, and sociability), and neuroticism (anxiety, depression, dependency, and low self-esteem). In the original study, the authors found a robust five-factor structure in two Spanish (calibration and validation) and American samples. Factorial congruency coefficients were always higher than .98.
In a previous study using the same data, García et al. (2012) found that 5 out of 20 facet subscales of the ZKA-PQ showed Cronbach's alpha coefficients lower than .70, and the five-factor scales had alpha values around .90. In the case of the NEO-PI-R, most of the facet subscales (26 out of 30) had reliability coefficients lower than .70, while the five factor scales had values between .83 and .87.

Statistical Analyses
Item-Level Analysis: Modeling Each Domain Separately. Following the theoretical structure of the questionnaires, each broad domain scale is assumed to tap a unique broad factor. A series of seven models were formulated and tested for each personality domain separately: (a) unidimensional model (i.e., all the items load only on a general factor), (b) oblique factor model (i.e., items of each facet load on different correlated specific factors), (c) bifactor model (i.e., each item loads simultaneously on the general factor and its corresponding specific factor). For each of these three models, a version including an acquiescence method factor was also tested. For this method factor, loadings of the direct items were fixed to +1, loadings of the reversed items were fixed to −1, and the variance of the factor was estimated. This is the common procedure for testing the effect of acquiescence bias when the reverse items are recoded (Cai, 2010; Maydeu-Olivares & Coffman, 2006; Savalei & Falk, 2014). Finally, a seventh model was tested for each domain where the nonsignificant loadings on the specific factors were set to zero. The ZKA-PQ and NEO-PI-R were analyzed separately. To illustrate these models, the representation for the neuroticism domain of the ZKA-PQ is depicted in Figure 1.
All models were estimated using Mplus 7 (L. K. Muthén & Muthén, 1998-2012) and the weighted least squares estimator, which is recommended for categorical data (B. Muthén, Du Toit, & Spisic, 1997). We used three goodness-of-fit statistics for model evaluation: the chi-square statistic, the comparative fit index (CFI), and the root mean square error of approximation (RMSEA). The conventional cutoff values for the CFI are .90 or greater for acceptable fit and .95 or greater for good fit (Hu & Bentler, 1999). RMSEA values between .05 and .08 represent an acceptable fit, whereas values lower than .05 indicate a good fit (McDonald & Ho, 2002). Direct statistical comparisons among nested models were computed. The interpretation of the chi-square difference test was based on the p values of the DIFFTEST option results in Mplus. In addition, nested and nonnested models were compared using the differences in CFI. A difference in CFI of .002 or less is typically adopted as evidence that the imposition of additional constraints does not lead to a significant loss of fit (Meade, Johnson, & Braddy, 2008).
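The cutoffs quoted above can be written down as a small helper. This is a sketch for illustration only: the function names and the labels "below cutoff" and "above cutoff" are hypothetical, and the two example calls use the CFI and RMSEA values reported in this article for the final parcel-level models.

```python
def classify_fit(cfi, rmsea):
    """Label CFI and RMSEA with the conventional cutoffs quoted in the text."""
    if cfi >= 0.95:
        cfi_label = "good"
    elif cfi >= 0.90:
        cfi_label = "acceptable"
    else:
        cfi_label = "below cutoff"

    if rmsea < 0.05:
        rmsea_label = "good"
    elif rmsea <= 0.08:
        rmsea_label = "acceptable"
    else:
        rmsea_label = "above cutoff"
    return cfi_label, rmsea_label


def loses_fit(cfi_constrained, cfi_reference, threshold=0.002):
    """True when the constrained model's CFI drops by more than the threshold."""
    # Rounding avoids spurious results from floating-point subtraction.
    drop = round(cfi_reference - cfi_constrained, 4)
    return drop > threshold


# Final parcel-level models as reported in the Results section.
zka_final = classify_fit(cfi=0.93, rmsea=0.05)
neo_final = classify_fit(cfi=0.89, rmsea=0.04)
```

Such labels are only a reading aid; as argued in the Discussion, rigid thumbs-up or thumbs-down rules for overall fit should be treated with caution.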
As stated previously, the purpose of the study was to separate the sources of variance for each facet score. That is, to obtain the percentage of variance due to the general factor, the specific factor, and the acquiescence method factor. This can easily be done with the application of the bifactor model. The variance of a facet sum score, X_f, can be defined as

\[ \mathrm{Var}(X_f) = \sum_{m} V_m + V_{AQ} + V_e \]

where m is a substantive factor (broad or narrow factor), V_m is the variance due to factor m, V_AQ is the variance due to the acquiescence (AQ) method factor, and V_e is the unique variance (i.e., uniqueness). Here follows the formulation of each of these elements:

\[ V_m = \Big( \sum_{j \in f} \lambda_{jm} \Big)^2 \mathrm{Var}(F_m) \]

\[ V_{AQ} = \Big( \sum_{j \in f} \lambda_{jAQ} \Big)^2 \mathrm{Var}(F_{AQ}) \]

\[ V_e = \sum_{j \in f} \mathrm{Var}(e_j) \]

In these equations, the sum is made for items included in the facet (j ∈ f), λ_jm is the loading of item j on the factor m, Var(F_m) is the variance of factor m, and Var(e_j) is the unique variance of item j. By dividing each V_m term by Var(X_f), the proportion of variance due to factor m can be obtained.
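A minimal sketch of this item-level decomposition is given below, assuming orthogonal general and specific factors with unit variance, acquiescence loadings fixed to +1 (direct) or −1 (reversed), and a facet score computed as the sum of its recoded items. The function name and all numeric inputs are hypothetical and are not estimates from this study.

```python
import numpy as np

def facet_variance_shares(lam_general, lam_specific, keys, uniq, var_aq,
                          var_general=1.0, var_specific=1.0):
    """Proportion of facet sum-score variance due to each source.

    lam_general, lam_specific : loadings of the facet's items
    keys   : +1 for direct items, -1 for reversed items (fixed AQ loadings)
    uniq   : unique variances of the items
    var_aq : estimated variance of the acquiescence method factor
    """
    lam_general = np.asarray(lam_general, dtype=float)
    lam_specific = np.asarray(lam_specific, dtype=float)
    keys = np.asarray(keys, dtype=float)
    uniq = np.asarray(uniq, dtype=float)

    v_general = lam_general.sum() ** 2 * var_general
    v_specific = lam_specific.sum() ** 2 * var_specific
    v_aq = keys.sum() ** 2 * var_aq   # (D - R)^2 * Var(F_AQ)
    v_e = uniq.sum()
    total = v_general + v_specific + v_aq + v_e

    return {
        "general": v_general / total,
        "specific": v_specific / total,
        "acquiescence": v_aq / total,
        "unique": v_e / total,
    }

# Hypothetical four-item facet with three direct items and one reversed item.
shares = facet_variance_shares(
    lam_general=[0.5, 0.5, 0.5, 0.5],
    lam_specific=[0.25, 0.25, 0.25, 0.25],
    keys=[1, 1, 1, -1],
    uniq=[0.5, 0.5, 0.5, 0.5],
    var_aq=0.25,
)
```

With these inputs the shares are .50 (general), .125 (specific), .125 (acquiescence), and .25 (unique). Setting the keys to a balanced pattern such as [1, 1, -1, -1] drives the acquiescence share to zero, which is why unbalanced subscales are the ones most affected.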
Parcel-Level Analysis: Modeling All Domains at the Same Time. Two parcels were constructed for each facet (first and second halves). Then, we applied exploratory structural equation modeling (ESEM; Asparouhov & Muthén, 2009) at the parcel level using the maximum likelihood estimator of Mplus. Unlike exploratory factor analysis, ESEM models can include both exploratory and confirmatory methods (e.g., equality constraints, fixed parameters, and correlated error terms). In our analyses, we defined five correlated ESEM factors corresponding to the five domains (see Figure 2 where the model is depicted). An oblique semispecified target rotation was defined for these factors (i.e., zero loadings were defined according to the theoretical model). Additionally, the inclusion of two CFA elements was tested: (a) a specific narrow factor for parcels within the same facet and (b) an independent acquiescence factor. Regarding the specific narrow factor, loadings of the parcels were constrained to be equal for identification purposes. As in the item-level analysis, loadings on the acquiescence factor were fixed so that they represented the direction of the variables. As far as we know, this is the first time that the acquiescence CFA model is applied at the parcel level. However, its application to item parcels is straightforward. Let us consider a parcel (P) composed of three indicators (two direct items, I_1 and I_2, and one reversed item, I_3). As in the case of the facet sum score, each item has different contributing sources of variation

\[ I_1 = \sum_m \lambda_{1m} F_m + F_{AQ} + e_1 \]
\[ I_2 = \sum_m \lambda_{2m} F_m + F_{AQ} + e_2 \]
\[ I_3 = \sum_m \lambda_{3m} F_m - F_{AQ} + e_3 \]

where λ_jm is the loading on the substantive common factors (F_m), F_AQ is the acquiescence method factor, and e_j are the uniquenesses. When these three items are grouped into a parcel, the equation at the parcel (P) level is

\[ P = \sum_m (\lambda_{1m} + \lambda_{2m} + \lambda_{3m}) F_m + (2 - 1) F_{AQ} + (e_1 + e_2 + e_3) \]

where the loading on the acquiescence method factor is expected to be 1 (i.e., two direct items minus one reversed item). In this model, variance of facet scores (i.e., X_f = P_1 + P_2) can be obtained as

\[ \mathrm{Var}(X_f) = \sum_m V_m + \sum_{m < m'} V_{mm'} + V_{AQ} + V_e \]

As pointed out above, V_m, V_AQ, and V_e represent the substantive, acquiescence, and unique variances. What is new here is the inclusion of the covariance among the substantive factors (i.e., V_mm′). All these elements are defined as

\[ V_m = (\lambda_{1m} + \lambda_{2m})^2 \, \mathrm{Var}(F_m) \]
\[ V_{mm'} = 2 (\lambda_{1m} + \lambda_{2m})(\lambda_{1m'} + \lambda_{2m'}) \, \mathrm{Cov}(F_m, F_{m'}) \]
\[ V_{AQ} = (D - R)^2 \, \mathrm{Var}(F_{AQ}) \]
\[ V_e = \mathrm{Var}(e_{P_1}) + \mathrm{Var}(e_{P_2}) \]

where λ_1m and λ_2m are the loadings on the factor m for the two parcels composing each facet, D is the number of direct items, R is the number of reversed items, V_m is the parcel variance due to the factor m, V_mm′ is the parcel variance due to the covariance between factors m and m′, and V_e is the unique variance. The relative contribution of variance due to these terms can be obtained by dividing each by Var(X_f). There are two points that should be made here. First, perfectly balanced parcels (i.e., the number of positively and negatively keyed items is the same) have a zero loading on the acquiescence method factor. Second, the contribution of the covariance between factors m and m′ can be negative (i.e., if the facet score has loadings of opposite sign on positively correlated factors or if the parcel has loadings of the same sign on negatively correlated factors). Finally, an estimation of the facet score reliability can be obtained as (for a general formulation, see Raykov & Marcoulides, 2011)

\[ r_{X_f X_f} = \frac{\sum_m V_m + \sum_{m < m'} V_{mm'} + V_{AQ}}{\mathrm{Var}(X_f)} \]

[Figure 2. Note. ZKA-PQ = Zuckerman-Kuhlman-Aluja Personality Questionnaire. Two five-item parcels are constructed for each independent specific factor (i.e., facets). There are six primary dimensions, namely, the correlated Big Five personality factors and the acquiescence method factor. All these primary dimensions are allowed to load on all items. P1, . . . , P40 represent the parcels; AGG1, . . . , NEU4 represent the specific factors; AGG, SES, ACT, EXT, and NEU represent the five domain factors; and AQ represents the acquiescence method factor.]
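The parcel-level decomposition can be sketched in the same way. The code below assumes that a facet score is the sum of two parcels, that the acquiescence loading of the facet equals D − R, and that reliability is taken as the share of facet variance not due to uniqueness (which matches the reliability decomposition reported in the Results, where the acquiescence share is counted as one of its sources). The function name and all numbers are hypothetical.

```python
import numpy as np

def parcel_facet_decomposition(lam1, lam2, phi, d, r, var_aq, uniq1, uniq2):
    """Variance components of a facet score X_f = P1 + P2.

    lam1, lam2 : loadings of parcels 1 and 2 on each substantive factor
    phi        : covariance matrix of the substantive factors
    d, r       : number of direct and reversed items in the facet
    var_aq     : variance of the acquiescence method factor
    uniq1/2    : unique variances of the two parcels
    """
    lam = np.asarray(lam1, dtype=float) + np.asarray(lam2, dtype=float)
    phi = np.asarray(phi, dtype=float)

    v_m = lam ** 2 * np.diag(phi)               # one term per factor
    cross = np.outer(lam, lam) * phi
    v_cov = cross.sum() - np.trace(cross)       # sum over m != m'
    v_aq = (d - r) ** 2 * var_aq
    v_e = uniq1 + uniq2
    total = v_m.sum() + v_cov + v_aq + v_e

    return {
        "factors": v_m / total,
        "covariance": v_cov / total,
        "acquiescence": v_aq / total,
        "unique": v_e / total,
        "reliability": 1.0 - v_e / total,
        "total": total,
    }

# Hypothetical balanced facet loading positively on one domain and
# negatively on a second domain that correlates -.50 with the first.
out = parcel_facet_decomposition(
    lam1=[0.5, -0.25], lam2=[0.5, -0.25],
    phi=[[1.0, -0.5], [-0.5, 1.0]],
    d=5, r=5, var_aq=0.25, uniq1=0.5, uniq2=0.25,
)
```

In this example the covariance term is positive (opposite-sign loadings on negatively correlated factors), the acquiescence term vanishes because the facet is balanced, and the resulting reliability is .70.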

Results
Results will be reported in two sections: one dealing with item-level CFA models for each domain separately and one dealing with parcel-level analyses modeling all domains at the same time. An example of the Mplus syntax for the item-level CFA models as well as full tables for all item loadings for the final item-level CFA models are provided in an online supplement (available online at http://asm.sagepub.com/supplemental). The rest of the materials can be requested from the corresponding author. Table 1 shows the goodness-of-fit statistics for all models for the ZKA-PQ. Before including the acquiescence method factor, the RMSEA indexes were close to being acceptable (<.09) for all the models, but the CFI was only acceptable for the bifactor model in the case of the aggressiveness scale (CFI = .92). In almost all the cases, the unidimensional model showed the worst fit and the bifactor model showed the best fit. The only exception was the activity scale, for which the oblique model showed the best fit. The inclusion of the acquiescence method factor improved the fit of the bifactor models to acceptable levels, except for the activity scale (CFI = .85). Removing the nonsignificant loadings on the specific factors did not worsen the fit values, according to the RMSEA and CFI indexes. Considering the difference in CFI, the bifactor model with the zero-loading constraints produced a better fit than the rest of the models (ΔCFI < −.002), except for the case of the bifactor model (ΔCFI > −.002). As mentioned above, the only exception was the activity scale. In this sense, the oblique model with and without the acquiescence method factor obtained the best fit (ΔCFI = .003 and .027, respectively). Results for the chi-square difference test led to the same conclusions. Table 2 shows the average item loadings on the general and specific factors in the final bifactor model.
The loadings on the general factor were significant for almost all the items (193 of 200), and the average item loadings varied from .39 (activity) to .47 (aggressiveness). On the contrary, there were 34 out of 200 nonsignificant loadings on the group factors. Average item loadings on the specific factors were usually smaller than the average item loadings on the general factor, except for physical aggression, exhibitionism, restlessness, and work energy. The standardized factor loadings on the acquiescence method factor were significant and varied from .17 (aggressiveness) to .22 (neuroticism).

Item-Level Analysis: Modeling Each Domain Separately
For the five domain scales, Table 3 shows that the percentage of variance due to the general factor ranged from 72% (activity) to 85% (sensation seeking). For the five domain scores, the percentage of common variance explained by the acquiescence method factor was low (smaller than 4%). On the other hand, for the subscales, the percentage of variance due to the specific factor was usually lower than what was explained by the domain factor. The highest values were found for exhibitionism (63%), physical aggression (57%), work energy (45%), work compulsion (38%), restlessness (38%), and thrill and adventure seeking (36%). High values mean that these subscales have a high degree of specificity after removing the common variance due to the general factor. The percentage of common variance explained by the acquiescence method factor was usually low (smaller than 3%) except for the boredom susceptibility/impulsivity and the work compulsion subscales (5% and 9%, respectively). Not surprisingly, the number of direct and reversed items is not well balanced in these two subscales. Table 4 shows the goodness-of-fit statistics for all models for the NEO-PI-R. The pattern of results is similar to that obtained for the ZKA-PQ. Before including the acquiescence method factor, the RMSEA indexes were acceptable (<.08) for all the models, but the CFI was unacceptable (CFI < .80). The unidimensional model showed the worst fit, and the bifactor model showed the best fit. Again, the inclusion of the acquiescence method factor increased the fit (although not to acceptable levels in this case). CFI values were close to .90 for conscientiousness and openness scales (.88 and .87, respectively), and the worst fit was found for the agreeableness scale (.82). Removing the nonsignificant loadings on the specific factors did not worsen the fit of the bifactor model. When examining the difference in CFI, the pattern of results was similar to the one observed for the ZKA-PQ.
In all cases, the bifactor model with the zero-loading constraints obtained a better fit than all the other models (ΔCFI < −.002) with the exception of the bifactor model (ΔCFI > −.002). These results are congruent with the ones obtained for the chi-square difference test. Differences against the final bifactor model with the acquiescence factor were generally nonsignificant (p > .05), with the exception of the openness, χ²(12) = 27.2, p = .007, and conscientiousness, χ²(18) = 45.7, p = .0003, scales. Table 5 shows the average item loadings on the general and specific factors in the final bifactor model. The loadings on the corresponding broader domain factor were significant for almost all the items (237 of 248), and the average item loadings varied from .31 (extraversion) to .39 (conscientiousness). Regarding the loadings on the specific factors, 62 out of 248 were nonsignificant. Average item loadings on the specific factors were usually smaller than average item loadings on the general factor, except for the extraversion domain (i.e., gregariousness, assertiveness, activity, and excitement seeking subscales) and some subscales in other domains (e.g., impulsiveness, fantasy, values, trust, and deliberation). Factor loadings on the acquiescence method factor were significant and varied from .21 (extraversion) to .25 (neuroticism). For the five domain scales, Table 6 shows the percentage of variance due to general and narrow factors for scale and subscale scores. The percentage of variance explained by the general factor ranged from 74% (extraversion) to 85% (neuroticism and conscientiousness). The percentage of common variance explained by the acquiescence method factor was low (smaller than 1%). On the other hand, the percentages of variance due to the specific factors were usually small. The highest values were found for the excitement seeking (52%), assertiveness (49%), gregariousness (48%), deliberation (48%), trust (44%), and fantasy (44%) subscales.
At the same time, there were other subscales where the percentage of common variance explained by the specific factor was smaller than 5% (vulnerability, positive emotions, tender-mindedness, and self-discipline). The percentage of common variance explained by the acquiescence method factor was usually low (smaller than 4%). The acquiescence method factor did have some effect in the scales in which only two items out of eight were reversed (depression, warmth, excitement seeking, tender-mindedness, and dutifulness). Table 7 shows the goodness of fit for all the tested models for the ZKA-PQ and the NEO-PI-R. In both cases, the oblique ESEM model did not fit the data satisfactorily, but including facet factors or additionally including the acquiescence factor improved the model significantly (p < .05 for the chi-square difference test in all cases). The fit indexes for the final models were acceptable for the ZKA-PQ (RMSEA = .05 and CFI = .93) and close to being acceptable for the NEO-PI-R (RMSEA = .04 and CFI = .89). For the final models in the ZKA-PQ and NEO-PI-R, the average factor loadings of parcels were higher on their corresponding domain factor than on their specific facet factor (e.g., for the ZKA-PQ, .60 vs. .32; for the NEO-PI-R, .48 vs. .31). The subscales with higher factor loadings on the specific factor than on the general factor were almost the same as those found in the item-level analyses (e.g., subscales in the extraversion domain of the ZKA-PQ). Tables 8 and 9 show the relative contribution of each source of variance to domain and facet score variance. Regarding the domain scores, the percentage of variance due to the domain latent factor was usually high (e.g., ranging from 71% to 80% for the ZKA-PQ, and ranging from 55% to 78% for the NEO-PI-R) and the contribution of facet specificity was low but not negligible (larger than 6% in almost all the scales).
Domain scores with a high degree of specificity were activity (in the ZKA-PQ) and extraversion (NEO-PI-R), whereas for neuroticism (for both questionnaires) and conscientiousness (in the NEO-PI-R) the effect of the facets was very small.

Parcel-Level Analysis: Modeling All Domains at the Same Time
Regarding the facet scores, the highest specificity values were obtained for physical aggression (42%), work compulsion (41%), exhibitionism (40%), work energy (28%), thrill and adventure seeking (25%), and experience seeking (17%) of the ZKA-PQ, and for gregariousness (41%), trust (30%), excitement seeking (29%), deliberation (29%), values (28%), and order (26%) of the NEO-PI-R. A low specificity was obtained for depression (0%), hostility (3%), anger (4%), sociability (6%), boredom susceptibility (6%), and disinhibition (7%) of the ZKA-PQ, and for angry/hostility (0%), dutifulness (0%), feelings (1%), self-discipline (1%), vulnerability (3%), and warmth (4%) of the NEO-PI-R. These results largely replicate those obtained at the item level (the correlation between specificity obtained at the parcel and item levels was .81 and .70, for the ZKA-PQ and the NEO-PI-R, respectively). However, there were also some differences. For some subscales (e.g., angry/hostility, warmth, assertiveness, impulsiveness in the NEO-PI-R), the estimated specificity was smaller when it was computed at the parcel level. This is due to the interstitial nature of these facet scales. An important part of their variance is due to different domains (e.g., angry/hostility, warmth, and assertiveness are also indicators of agreeableness and impulsiveness is also an indicator of extraversion and conscientiousness). When the domains were analyzed separately at the item level, the effect of other domains on the facet scores was not tested. For example, at the parcel level, the reliability for angry/hostility (≈64) can be decomposed into several sources: (a) related to the broad domain of neuroticism (43), (b) related to broad domains other than neuroticism (17), (c) due to the covariances between neuroticism and the other domains (3), (d) due to the covariances that do not include neuroticism (−1), (e) related to facet specificity (0), and (f) related to the acquiescence factor (1).
In this case, subscale loading was positive on neuroticism (.65) and negative on agreeableness (−.41). Given that these loadings are high, the relative contribution of these domains is high. In addition, these two domains are negatively correlated (r = −.09).
Thus, there is also an effect due to the covariation between these broad dimensions. This contribution would be small and positive because of this low negative covariation and the fact that the loadings on both broad dimensions are opposite in sign. In comparison to the results for the item-level analysis, we find that the contribution of the specific factor to the variance was nonnegligible at the item level when the neuroticism domain was analyzed separately (28), but this contribution was reduced to 0 at the parcel level, as this variance was due to other domains. Another divergent result was obtained for some facets (e.g., positive emotions and tendermindedness in the NEO-PI-R) where specificity was higher according to the new analyses (4 vs. 19 and 0 vs. 14). Finally, results regarding the percentage of variance due to the acquiescence factor were similar to those obtained at the item level (e.g., work compulsion in the ZKA-PQ and tendermindedness in the NEO-PI-R showed the largest effect of acquiescence).
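The sign of this covariance contribution can be checked directly from the three values reported above. The sketch below treats the two loadings as standardized loadings of the facet score and assumes unit factor variances; it is meant only to illustrate the direction and rough size of the term, not to reproduce the tabled percentages, which also involve the remaining domains.

```python
# Values reported in the text for angry/hostility (NEO-PI-R).
loading_neuroticism = 0.65
loading_agreeableness = -0.41
factor_correlation = -0.09

# Covariance term of the variance decomposition: 2 * lambda_m * lambda_m' * Cov(F_m, F_m').
covariance_contribution = (
    2 * loading_neuroticism * loading_agreeableness * factor_correlation
)
```

Two negative signs (one loading and the correlation) cancel, so the product is positive, and the low correlation keeps it small (about .05 under these assumptions).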

Discussion
The current study examined different competing measurement models for personality domains by comparing a bifactor model with unidimensional and oblique factor models. Neither the unidimensional model (assuming only a general factor) nor the oblique factor model (assuming only specific factors) fit the data well. The bifactor model was the best fitting model. However, it should be noted that the fit of the final bifactor model was not good in some cases. This is not uncommon in the confirmatory analysis of personality questionnaires and it is one of the main reasons why it has been argued that the use of CFA to analyze the structure of personality questionnaires at the item level should be limited. Indeed, only short versions such as the NEO Five-Factor Inventory (NEO-FFI; Costa & McCrae, 1992) have been often analyzed with CFA at the item level (e.g., Rolland, Parker, & Stumpf, 1998), showing generally poor levels of fit (e.g., Egan, Deary, & Austin, 2000;Gignac, Bates, & Jang, 2007;Panayiotou, Kokkinos, & Spanoudis, 2004;Schmitz, Hartkamp, Baldini, Rollnik, & Tress, 2001). In fact, McCrae and Costa (2004) revised the item composition of the NEO-FFI in response to this increasing criticism, although this revised version (i.e., NEO-FFI-R) does not clearly improve the psychometric structure of the former one (Aluja, García, Rossier, & García, 2005). However, it should be noted that in these short versions, facets are usually ignored, which can partially contribute to the model misfit (Gignac et al., 2007). Thus, some authors (McCrae, Zonderman, Costa, Bond, & Paunonen, 1996) have argued that CFA tests of hypothesized factor structures can be misleading. Hopwood and Donnellan (2010) consider that this bad fit should not be worrying, taking into account the considerable evidence for the criterion-referenced validity of these questionnaires, nor it should be surprising as it is difficult to write "perfect" items to assess personality. 
Sources of misfit can be minor factors (or correlated errors), which easily appear when the content or the phrasing of two items is similar, or other methodological artifacts (e.g., negatively worded items sharing variance above and beyond the general factor). On the other hand, correcting these sources of misfit can be difficult because additional parameters (e.g., based on modification indexes) might not replicate in cross-validation studies (Hopwood & Donnellan, 2010). Indeed, these authors also propose that broad conventions to provide "thumbs-up or -down" rules regarding overall model fit should be avoided. When scales are analyzed with CFA at the facet level, similar difficulties have been found. For example, Hopwood and Donnellan (2010) analyzed seven multiscale instruments, including the NEO-PI-R, with exploratory factor analysis and CFA and found considerably low fit indexes in CFA (e.g., for the NEO-PI-R the CFI index was .61). In this case, two explanations for misfit can be considered: (a) cross-loadings are excluded in CFA, whereas some subscales (e.g., impulsiveness) may be interstitial, loading on two or more broad domains, and (b) subscales within a broad domain (e.g., depression and vulnerability) might share some variance above and beyond the general factor, which would appear as correlated errors. Accordingly, in this study we found a reasonable fit for the final model, applied at the parcel level, in which cross-loadings were allowed by using ESEM. Fit for the ESEM model might still be improved, as facets within a broad domain (e.g., depression and vulnerability) might still share some variance above and beyond the general factor. However, we have avoided these ad hoc modifications here because they usually do not cross-validate to other samples (e.g., Hopwood & Donnellan, 2010).
Other sources of misfit are individual differences in response style. The two most frequently studied response styles are acquiescence bias and social desirability bias. In the literature, different strategies have been proposed to deal with these biases and the inclusion of method factors has emerged as an optimal solution. Modeling social desirability is more complex because it requires using a set of social desirability item markers (see, e.g., Ferrando, Lorenzo-Seva, & Chico, 2009). People are more likely to respond in a socially desirable manner in high-stakes assessments. In the present study, the data were collected from the general population in a low-stakes context. Thus, a social desirability questionnaire was not administered and we focused on acquiescence bias. We modeled this bias by including an additional general acquiescence method factor in the CFA context following the procedure described by Savalei and Falk (2014). In the present work, we have extended this procedure to applications in the ESEM context that are implemented at the parcel level. The inclusion of this factor improved the fit of all the subscales. This result is consistent with other studies that include method factors (e.g., Biderman, Nguyen, Cunningham, & Ghorbani, 2011; Marsh, 1996). Thus, there is increasing evidence that response style produces item covariance that is not explained by the substantive trait factors. This effect is stronger for some scales in which the number of direct and reversed items is not balanced (e.g., tender-mindedness in the NEO-PI-R and work compulsion in the ZKA-PQ). Another important reason for modeling acquiescence is that it can produce bias in the correlation with other self-report criteria (Danner et al., 2015). All things considered, it is crucial to stress again the importance of modeling response styles. Depending on the characteristics of the assessment (e.g., high-stakes vs. low-stakes, motivation), they may have a great impact on the validity of the test scores. Fortunately, new methods and software have been developed for dealing with social desirability and acquiescence style responses (see, e.g., Navarro, Vigil Colet, Ferrando, & Lorenzo-Seva, 2016).
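One way to see why unbalanced subscales are more exposed to acquiescence is the following minimal sketch. It assumes, purely for illustration, that every raw item response contains the same additive acquiescence term for a given person, so that reverse-scoring flips the sign of that term for reversed items. The function name and the item counts are hypothetical and do not correspond to any particular subscale.

```python
def net_acquiescence_weight(n_direct, n_reversed):
    """Net weight of a person's acquiescence term in a reverse-scored sum score.

    Assumption: each raw response carries the same additive acquiescence term,
    which enters with +1 for direct items and -1 for reversed items after scoring.
    """
    return n_direct - n_reversed


# Hypothetical eight-item subscales.
balanced = net_acquiescence_weight(n_direct=4, n_reversed=4)
unbalanced = net_acquiescence_weight(n_direct=7, n_reversed=1)
print(balanced, unbalanced)
```

Under this simplifying assumption the acquiescence term cancels in a balanced sum score and accumulates in an unbalanced one, which is in line with the larger effects reported for subscales with unequal numbers of direct and reversed items.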
In this final parcel factor model, we computed the variance due to the narrower traits after the variance in the broad domains was partialized out (i.e., specificities). Results show that there is a large variability in these specificities, depending on the domain and the questionnaire. These results can be useful for researchers, as they indicate that some subscale scores (instead of the total score in the domain) have more potential to increase predictive validity. Additionally, for some facets (e.g., angry/hostility, impulsiveness), specificity increases if the variance due to broad domains other than the main one is not controlled. Thus, we can hypothesize that the incremental validity of these interstitial parcels has room to be larger when other broad domains are not included as predictors. Here follows a summary of the main conclusions for the different domains:
• Regarding the neuroticism domain, there was no facet with high specificity. However, higher potential for incremental validity can be expected for the most interstitial facets, impulsiveness and angry/hostility in the NEO-PI-R, with a large part of variance related to other broad domains (e.g., impulsiveness was positively related to extraversion and negatively related to conscientiousness, and angry/hostility was negatively related to agreeableness). Interestingly, these facets have previously been differentiated from the remaining ones as representing a subfactor labeled volatility, which covers externalizing problems of disinhibition, difficulty controlling impulses, and irritability (DeYoung, Quilty, & Peterson, 2007).
• For the extraversion domain, many facets were highly specific (e.g., gregariousness), highly interstitial (e.g., warmth, positive emotions), or both (e.g., assertiveness, excitement seeking, exhibitionism). Thus, there is a large potential for incremental validity in this domain, especially when other broad domains are not included as predictors.
• For the conscientiousness domain, large specificities were found only for order and deliberation, the latter being highly interstitial. These results suggest that higher incremental validity can be expected from the inclusion of these facets. The differentiation between these facets and the others is consistent with Roberts, Chernyshenko, Stark, and Goldberg (2005), who, in a factor analysis of a large number of conscientiousness-related subscales, found that order and deliberation loaded on separate factors (order and self-control), different from the factor on which the remaining NEO measures loaded (industriousness). For the activity domain in the ZKA-PQ, large specificities were found only for work compulsion and work energy, which is highly interstitial, whereas the lowest specificity was found for general activity.
• For the agreeableness domain, several facets were highly specific (e.g., straightforwardness, modesty), highly interstitial (e.g., altruism), or both (e.g., trust). According to DeYoung et al. (2007), altruism and tender-mindedness subscales measure a compassion subfactor (i.e., compassionate emotional affiliation with others), whereas straightforwardness, modesty, and compliance measure politeness (i.e., consideration of and respect for others' needs and desires). Thus, results regarding the high specificity of straightforwardness and modesty are not expected. DeYoung et al. (2007) and Ashton and Lee (2005) suggest that these facets are markers of the honesty/humility dimension in the HEXACO model of personality, which is not well represented in the NEO-PI-R. For the aggressiveness domain in the ZKA-PQ, a large specificity was found for physical aggression and the most interstitial facet was hostility.
• For the openness domain, large specificities were found for fantasy, aesthetics, and values. The most interstitial facet was feelings. According to the taxonomy of Woo et al. (2014), fantasy, values, and aesthetics would refer to openness to aesthetic, cultural, and self-transformation experiences, whereas ideas is an indicator of intellect, reflecting openness to new intellectual stimulation, and should be more connected with plasticity and cognitive behavior. All these facets seem to have some potential, whereas no large gains of incremental validity are expected with the feelings and actions subscales. In the case of the sensation seeking domain of the ZKA-PQ, low values of specificities were found for disinhibition and boredom/susceptibility.
The bifactor approach is a useful tool that effectively partials out that common variance (Anglim & Grant, 2014) and provides information about the specific variance of the facet (i.e., the facet score variability that cannot be predicted from the remaining facets in the domain). In short, a low value for specific reliability sets an upper limit for incremental validity. However, it must be noted that the estimation of the variance due to specificity depends on the level of analysis. For example, the percentage of variance due to specificity in the warmth subscale was 4%. This is equivalent to saying that the subscale reliability, that is, the proportion of variance due to the relevant factor, is .04. However, if the analyses were made for each domain separately, the percentage of variance due to specificity would be 28% (1% + 23% + 4%), indicating that the subscale reliability is .28. This means that in this case the incremental validity is expected to be larger when the broad domains other than the main domain are not included as predictors.
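The dependence on the level of analysis can be made concrete with the warmth percentages quoted above. The sketch below assumes that the 1% and 23% components are the parts of the variance that are no longer separated out when the domain is analyzed on its own, and that they are then counted together with the 4% specific component. The function and variable names are hypothetical.

```python
def specificity_percent(specific_pct, absorbed_pcts=()):
    """Percentage of variance counted as specific, plus any components absorbed into it."""
    return specific_pct + sum(absorbed_pcts)


# Percentages for the warmth subscale as quoted in the text.
specific = 4
absorbed_when_separate = [1, 23]  # assumed to stem from sources outside the main domain

joint_analysis = specificity_percent(specific)
separate_analysis = specificity_percent(specific, absorbed_when_separate)
print(joint_analysis / 100, separate_analysis / 100)
```

Expressed as proportions, these values correspond to the .04 and .28 figures given in the paragraph.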
When presenting these results, it should be noted that the specificity estimates that we obtained are smaller than the indirect estimates obtained by Costa and McCrae (1995). These authors submitted the correlation matrix of the facet scales of the NEO-PI-R to a principal component analysis and then estimated the specificity values as the difference between the alpha coefficient and the communality. They found that 26 out of the 30 scales had specificities greater than .30. In our case, low values of specificity are found for some subscales (e.g., angry/hostility, depression) in comparison with their results. This does not necessarily mean that facets are irrelevant or unreliable. Compared with residualized facet scores, raw facet sum scores are more easily interpretable, reliable, and correlated within the same domain. In this sense, there is some evidence that favors, to some extent, the use of raw facet sum scores to increase predictive validity. For example, Aluja, Blanch, García, García, and Escorial (2012) found that the average R² prediction of personality disorder scale scores increased from .33 to .37 when ZKA-PQ subscale scores were used instead of broad domain scores. In other studies using the NEO-PI-R, Aluja, Cuevas, García, and García (2007) and Dyce and O'Connor (1998) reported the average R² in Spanish and American samples, respectively, when predicting personality disorders according to the NEO-PI-R domains and facets. The percentage of variance predicted after including domains was 35%, whereas the prediction after including facets improved that value by 3% and 4% for Spanish and American samples, respectively. Thus, the usefulness of facets is open to debate, and in the fields of personality and work performance, there has been much controversy about the incremental validity of facets.
Some authors favor the use of facets (e.g., Ashton et al., 2014; Judge, Rodell, Klinger, Simon, & Crawford, 2013; Paunonen, Rothstein, & Jackson, 1999), whereas other authors advise against their use (e.g., Ones & Viswesvaran, 1996; Salgado et al., 2013). Salgado et al. (2013) showed that a critical point for discussion is to separate common and specific facet variance. In their study, facets of conscientiousness did not predict job performance when the common factor variance (i.e., explained by the general factor of conscientiousness) was excluded. On the other hand, Ashton et al. (2014) found that unique variance of only one theoretically relevant facet (i.e., fairness) did show considerable incremental validity in predicting delinquency above and beyond two broad domains (honesty/humility and conscientiousness).
Another potential disadvantage of facets is the large number of them that have been defined. As the number of facets included in a criterion validity study increases, the risk of obtaining spurious predictor-criterion relationships also increases (O'Neill & Paunonen, 2013). Although the number of facets can be reduced by taking into account the previous research (Dudley et al., 2006; Paunonen & Ashton, 2001), it would raise a range of issues regarding facet selection (Anglim & Grant, 2014). Analyses such as those described here can help the researcher decide which facets have more potential to increase criterion validity. We have shown that the unreliability of the parcels to measure the specific facets might be a problem for some personality subscales. For example, incremental predictive validity of the facets is typically examined with regression models. Assuming orthogonal predictors, as is the case in the bifactor model approach, the increase in the coefficient of determination is equal to the squared correlation between the specific factor and the criterion variable. This correlation will be attenuated by the reliability. For example, assuming that the specific factor and the criterion variable correlate .40 and that the reliability for the specific factor scores is .50, this correlation will be reduced to .40 × √.50 = .28, and thus the increase would be reduced from .40² = .16 to .28² = .08. On the other hand, if one facet has a predictive power above and beyond what is expected from its reliability, some concerns can arise regarding the nature of the obtained relation, because it might be attributed to some specific items of the facet and not to the expected construct. In this sense, McCrae (2014) argues that the correlation of outcomes may reflect the effect of the domain (e.g., neuroticism), the narrower trait (e.g., anxiety), or something more specific (e.g., apprehension).
In the current work, we illustrate how the effect of different sources of variance can be established using factor analysis techniques.
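The attenuation example given in the text (a correlation of .40 and a reliability of .50) can be reproduced with a short sketch. It assumes orthogonal predictors and that only the specific factor scores are unreliable, as in the worked example. The function name is hypothetical.

```python
import math


def attenuated_increment(true_corr, reliability):
    """Return the attenuated correlation and the resulting increase in R-squared.

    Assumption: predictors are orthogonal and only the predictor scores are unreliable.
    """
    observed_corr = true_corr * math.sqrt(reliability)
    return observed_corr, observed_corr ** 2


# Values taken from the worked example in the text.
observed_corr, r2_increase = attenuated_increment(true_corr=0.40, reliability=0.50)
print(round(observed_corr, 2), round(r2_increase, 2))
```

Because the squared attenuated correlation equals the squared true correlation times the reliability, the increase in R² can never exceed the reliability itself, which is the sense in which a low specific reliability limits incremental validity.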
In the current study, we showed that the analysis of internal structure provides some insights into the reasons why a subscale might be related to external criteria. In a latent variable approach, reliability, internal structure, and validity are structurally related. In a different view, McCrae, Kurtz, Yamagata, and Terracciano (2011) conclude that internal consistency of scales can be useful as a check on data quality, but appears to be of limited utility for evaluating the potential validity of developed scales (see also McCrae, 2014). More research is needed to test if their conclusions are generalizable to several external criteria and other reliability indexes different from the alpha coefficient, which does not distinguish between different sources of common variance.
Assuming that reliability measures in factor analysis are important for validity, the key question seems to be how to increase the reliability of some subscale scores. One way to do this is to increase the length of the subscales. However, if we do that, the number of facets per domain should be reduced, since otherwise the test would be too long, and this would reduce the bandwidth of the instrument. The bifactor model results can also be useful in guiding the selection of items that increase the reliability of the intended factor (Stucky & Edelen, 2014). For example, items with a higher factor loading on the specific factor relative to the factor loading on the general factor might be better. Another promising approach is to use item response theory to develop a computerized adaptive personality test. In an adaptive test, each person responds to a different set of items (those that are more informative for measuring this person's latent trait level). Recently, Makransky, Mortensen, and Glas (2012) have shown that the reliability of the NEO PI-R facets could be substantially improved by applying multidimensional computerized adaptive testing. For this reason, the use of adaptive tests based on multidimensional models, such as the bifactor model, is a promising area of research (Seo & Weiss, 2015).

Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.