TDA Difficulty

Please cite these data with the following reference:

Condon, D. M., Coughlin, J., & Weston, S. J. (2022). Personality Trait Descriptors: 2,818 Trait Descriptive Adjectives characterized by familiarity, frequency of use, and prior use in psycholexical research. Journal of Open Psychology Data.

The primary goal of this project was to identify the average “difficulty” of 2,818 trait descriptive adjectives (TDAs) in American English. Difficulty, in this usage, denotes the extent to which people in our sample correctly matched the personality-relevant definition for each adjective to the adjective itself when presented as one of six options. The other five options presented as choices were randomly sampled from the remaining 2,817 adjectives in the set. To control for inevitable error resulting from similarity between distractor choices and the matching definition-adjective pair, each TDA was administered twice, with two different sets of distractors. On this page, we report the total number of responses and the average proportion correct across both forms.

In our view, the total proportion correct reflects both familiarity of the adjective and consensus among raters about its meaning, especially as a descriptor of personality. For example, most of the terms with a low proportion of correct responses are unfamiliar to many individuals (i.e., rarely encountered in everyday language) – “goatish” (.07), “pithy” (.10), “splenetic” (.10). Other terms – “dizzying” (.04), “smooth” (.09) – are familiar but ambiguous in meaning relative to personality. By contrast, terms with the highest proportion of correct responses are familiar and unambiguous – “jolly” (1.00), “thankful” (1.00), “unreliable” (.99). When using ratings of terms such as these to evaluate the structure of personality, it’s important to consider the extent to which each term is known to raters and consistently interpreted.

Statistics from the individual administrations of each term (i.e., each version or “form”) can be found here.

The data can be downloaded here. NOTE that the csv version of these data may mis-read two adjectives when opened in MS Excel: “false” and “blasé”. These must be fixed manually prior to working with the data in Excel (but it appears to work as expected in R).

Then, for reference and subsequent use in other personality projects, we sought to characterize these terms in several additional ways. To evaluate the frequency of usage for each term, there are at least two good options. One is to use the Corpus of Contemporary American English (“COCA”) and the second is to use data openly available from the Google Books project. Though extremely useful, the COCA resource is proprietary so it is not possible to post the frequency of term usage based on this list. The Google Books data is less complete (and has been documented to be problematic in several respects) but publicly accessible. The “GBooks frequency” column below reflects z-scores based on frequency among the first 58,600 or so terms. See this post on StackExchange for more information and attribution.

The three remaining columns reflect the inclusion or exclusion of each term in 3 previously reported sets of trait-descriptive adjectives. Many similar sets could also be included but these three were chosen based on prominence in the psycholexical study of English terms. The columns are dummy coded “1” for inclusion and “0” for exclusion in each set. The sets are the 1,710 terms of Goldberg & Norman (as described here and here), the 435 terms of Saucier & Goldberg (1996), and the 100 terms of the Big Five Factor Markers (Goldberg, 1992). When using any of these subsets, please be sure to reference the original sources.

Please consult the reference listed above for more information about this project.

library(here) # for engaging with working environment
library(Hmisc) # for weighted means
library(DT) # for viewing data tables
library(conflicted)
library(tidyverse) # for data cleaning and manipulation
conflict_prefer("here", "here")
conflict_prefer("filter", "dplyr")
conflict_prefer("count", "dplyr")
conflict_prefer("mutate", "dplyr")
conflict_prefer("summarise", "dplyr")

# Download the data manually from Dataverse using the doi link above
# https://doi.org/10.7910/DVN/5T80PF

# data = read.csv(here("data/TDA_data_scored.csv"))
# masterKey = read.csv(here("data/masterkey.csv"))

# Or use the following lines to scrap the data file in directly
library(dataverse)
library(data.table)

Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
writeBin(get_file("TDA_data_scored.tab", "doi:10.7910/DVN/5T80PF"), "TDA_data_scored.tab")
TDA_data_scored.tab <- fread("TDA_data_scored.tab", na.strings=getOption("<NA>","NA"))
data <- as.data.frame(TDA_data_scored.tab)
rm(TDA_data_scored.tab)

Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
writeBin(get_file("masterkey.tab", "doi:10.7910/DVN/5T80PF"), "masterkey.tab")
masterKey.tab <- fread("masterkey.tab", na.strings=getOption("<NA>","NA"))
masterKey <- as.data.frame(masterKey.tab)
rm(masterKey.tab)

masterKey$adjective <- sub("blas\x8e", "blasé", masterKey$adjective)
masterKey$adjective <- sub("FALSE", "false", masterKey$adjective)

data = data %>%
  dplyr::filter(included == "Yes") %>% # remove participants screend out for demographic and quality reasons
  select(starts_with("q_")) # select only TDA items

data_means = colMeans(data, na.rm = T) #calculate column means

all_item_means = data.frame( # create data frame with...
  item = names(data_means), # ...variable name and...
  prop_correct = data_means #... proportion correct
)

item_prop = masterKey %>% # join the masterKey (item name, correct adjective, and form)
  full_join(all_item_means) %>% # with proportion correct
  mutate(prop_correct = round(prop_correct, 2)) # round to 2 decimal places

# next we count the number of administrations of each item
item_prop$N = colSums(!is.na(data)) # count non-missing

# create table 
item_prop = item_prop %>%
  group_by(adjective) %>% # for each unique adjective
  mutate(weight = N/sum(N)) %>% # create a new variable that is the number of responses (per item) divided by the total number of responses (across both items)
  dplyr::summarise(
    N = sum(N), # how many total responses 
    prop = wtd.mean(prop_correct, weights = weight, na.rm=T) # average proportion, weighted by sample size
  ) %>%
  mutate(prop = round(prop, 2)) # round to 2

frequencies = read.csv(here("data/TDA_frequencies.csv"))

frequencies$adjective <- sub("blas\x8e", "blasé", frequencies$adjective)
frequencies$adjective <- sub("FALSE", "false", frequencies$adjective)

item_prop = item_prop %>% # join the item properties (adjective, N, prop_correct)
  full_join(frequencies) # with the coca.freq, gbooks.freq, and GN1710

item_prop %>%
  DT::datatable( # put into interactive HTML table
    colnames = c("Adjective", "N (total)", "Proportion Correct", "GBooks frequency", "G&N1710", "S&G435", "BFFM100"), #colnames
    options = list(
      lengthMenu = list(c(10, 50, -1), c('10', '50', 'All')),
        pageLength = 10
    ),
    filter = "top", # can filter
    rownames = F # don't need rownames
  )

write_csv(item_prop, file = here("data/TDA_properties.csv"))

# This output file has been posted on Dataverse at the doi provided above.

TDA Difficulty

Last updated 2022-01-04