Please cite these data with the following reference:

Condon, D. M., Coughlin, J., & Weston, S. J. (2022). Personality Trait Descriptors: 2,818 Trait Descriptive Adjectives characterized by familiarity, frequency of use, and prior use in psycholexical research. Journal of Open Psychology Data.

This page lists information from each of the two forms used to derive the average proportion correct for each of the 2,818 unique trait descriptive adjectives. See the top of this page for further explanation of the two different forms.

Specifically, the table below contains the item label (e.g., “q_2XXXX”), the trait descriptive adjective (e.g., “shaky”), the total number of respondents for the item (N), the proportion of respondents who correctly matched the term with its definition (e.g., .92), and the form corresponding to these statistics (A or B). Again, each adjective was administered on two different forms (A and B), with different alternative response options on each form. The alternative response options were generated randomly from the full pool of adjectives (i.e., the remaining 2,817 options).

Note that we have intentionally omitted the definitions and distractor response options corresponding to each item label. This was done to protect the validity of the items for subsequent use. Contact the lead author of the reference above for access to this information.

The mean proportion correct for each adjective across both forms can be found here.

The data can be downloaded here. NOTE that MS Excel will misread two adjectives in the csv version of these data: “false” and “blasé”. These must be fixed manually before working with the data in Excel (the file reads as expected in R).

Please consult the reference listed above for more information about this project.

library(here) # for engaging with working environment
library(Hmisc) # for weighted means
library(DT) # for viewing data tables
library(conflicted)
library(tidyverse) # for data cleaning and manipulation
conflict_prefer("here", "here")
conflict_prefer("filter", "dplyr")
conflict_prefer("count", "dplyr")
conflict_prefer("mutate", "dplyr")
conflict_prefer("summarise", "dplyr")


# Download the data manually from Dataverse using the doi link above
# https://doi.org/10.7910/DVN/5T80PF

#data = read.csv(here("data/TDA_data_scored.csv"))
#masterKey = read.csv(here("data/masterkey.csv"))
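
# Optional: confirm the manually downloaded files are in place before reading them
# (paths assume the same data/ folder used in the two lines above)
# file.exists(here("data/TDA_data_scored.csv"))
# file.exists(here("data/masterkey.csv"))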

# Or use the following lines to read the data files in directly from Dataverse
library(dataverse)
library(data.table)

Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
writeBin(get_file("TDA_data_scored.tab", "doi:10.7910/DVN/5T80PF"), "TDA_data_scored.tab")
TDA_data_scored.tab <- fread("TDA_data_scored.tab", na.strings = "NA")
data <- as.data.frame(TDA_data_scored.tab)
rm(TDA_data_scored.tab)
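
# Optional sanity check on the file read in above: dimensions and a few item
# columns (the "q_" prefix is assumed from the item labels described earlier)
dim(data)
head(grep("^q_", names(data), value = TRUE))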

Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
writeBin(get_file("masterkey.tab", "doi:10.7910/DVN/5T80PF"), "masterkey.tab")
masterKey.tab <- fread("masterkey.tab", na.strings = "NA")
masterKey <- as.data.frame(masterKey.tab)
rm(masterKey.tab)

masterKey$adjective <- sub("blas\x8e", "blasé", masterKey$adjective)
masterKey$adjective <- sub("FALSE", "false", masterKey$adjective)

data = data %>%
  dplyr::filter(included == "Yes") %>% # remove participants screened out for demographic and quality reasons
  select(starts_with("q_")) # select only TDA items

data_means = colMeans(data, na.rm = T) #calculate column means
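
# the scored items are assumed to be coded 1 = correct / 0 = incorrect, so each
# column mean is the proportion of respondents who answered that item correctly:
mean(c(1, 1, 0, 1, NA), na.rm = TRUE) # toy illustration; returns 0.75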

all_item_means = data.frame( # create data frame with...
  item = names(data_means), # ...variable name and...
  prop_correct = data_means #... proportion correct
)

item_prop = masterKey %>% # join the masterKey (item name, correct answer, and form)
  full_join(all_item_means) %>% # with proportion correct
  mutate(prop_correct = round(prop_correct, 2)) # round to 2 decimal places

# next we count the number of administrations of each item
item_prop$N = colSums(!is.na(data))[item_prop$item] # count non-missing responses, matched by item name

# here's the code to calculate the distractor index
# we removed this due to some uncertainty about interpretability
# data_re = read_csv(here("data/TDA_data_recoded.csv")) # use recoded dataset
# data_re = dplyr::filter(data_re, included == "Yes") # remove participants screened out for demographic and quality reasons

item_prop = item_prop %>%
  mutate(across(where(is.double), ~ round(.x, 2))) # round all numbers to 2 decimal places
item_prop %>%
 mutate(form = case_when(
   form == "C" ~ "A", # change these forms (C and D were fixes to poor or incorrect items)
   form == "D" ~ "B",
   TRUE ~ form
 )) %>%
  select(item, adjective, N, prop_correct, form) %>% #just show these columns
  DT::datatable( #make interactive html table
    colnames = c("Item", "Adjective", "N", "Proportion Correct", "Form"), #col names
    filter = "top", # can filter
    rownames = FALSE # row names not needed
  )
item_prop %>%
  select(item, adjective, N, prop_correct, form) %>%
  write_csv(file = here("data/item_difficulty.csv"))

# This output file has been posted on Dataverse at the doi provided above.
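
# For reference, a minimal sketch of how the mean proportion correct for each
# adjective across both forms (linked above) could be reproduced from this
# output file. Weighting each form's proportion by its N (via Hmisc::wtd.mean)
# is an assumption for illustration, not necessarily the exact procedure used.
item_difficulty = read_csv(here("data/item_difficulty.csv"))
adjective_means = item_difficulty %>%
  group_by(adjective) %>%
  summarise(
    mean_prop_correct = Hmisc::wtd.mean(prop_correct, weights = N, na.rm = TRUE), # N-weighted mean across forms A and B
    total_N = sum(N) # total administrations across both forms
  )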