Creation

This page documents the DIDEC creation process in more detail, with links to scripts and resources.

Experimental setup

We selected 307 images from MS COCO that also occur in the SALICON dataset and in Visual Genome, for maximal compatibility. We ran two experiments:

  1. Free viewing, where participants looked at a sequence of 102 or 103 images.
  2. Production viewing, where participants were asked to provide spoken descriptions of each of the images.
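The image selection described above amounts to intersecting the image IDs of the three datasets. The sketch below shows only that intersection step. The function name and the toy IDs are hypothetical, and it assumes the SALICON and Visual Genome entries have already been mapped to COCO image IDs. How the final 307 images were picked from the overlap is not covered.

```python
def shared_image_ids(coco_ids, salicon_ids, visual_genome_coco_ids):
    """Return the sorted COCO image IDs present in all three collections."""
    return sorted(set(coco_ids) & set(salicon_ids) & set(visual_genome_coco_ids))


# Toy IDs, invented for illustration only (not real dataset IDs).
coco = [101, 102, 103, 104, 105]
salicon = [102, 103, 105, 106]
visual_genome = [100, 102, 105]

candidates = shared_image_ids(coco, salicon, visual_genome)
print(candidates)
```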

Our participants were Dutch students who received university credits for taking part in the experiment. See the Downloads page for all the experimental data, including consent forms, instructions, and config files for our experiment.

Data processing

ExperimentCenter can only export the recorded sound as .avi files, which contain a full recording of the screen and of the eye movements as our participants describe the images. We used the ffmpy module in Python to convert the .avi files to .mp3 files. (ffmpy relies on the ffmpeg library in the background.) We then automatically transcribed the .mp3 files using the built-in Dictation function in macOS. Because Dictation isn't built for transcribing audio recordings, we used the following strategy:
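As an illustration of the .avi-to-.mp3 conversion step only, a minimal ffmpy sketch might look as follows. The helper names, the directory layout, and the choice of the `-vn` option (discard the video stream) are assumptions for this example, not the settings of the original script. It also assumes that ffmpy is installed and that an ffmpeg binary with MP3 support is on the PATH.

```python
from pathlib import Path


def mp3_path_for(avi_path, out_dir):
    """Map e.g. recordings/P01.avi to <out_dir>/P01.mp3 (hypothetical layout)."""
    return Path(out_dir) / (Path(avi_path).stem + ".mp3")


def convert_avi_to_mp3(avi_path, out_dir):
    """Extract the audio track of one screen recording as an .mp3 file."""
    # Imported here so the path helper above works without ffmpy installed.
    import ffmpy

    target = mp3_path_for(avi_path, out_dir)
    job = ffmpy.FFmpeg(
        inputs={str(avi_path): None},
        outputs={str(target): "-vn"},  # -vn: drop the video, keep the audio
    )
    job.run()  # runs the ffmpeg command line built by ffmpy
    return target
```

Calling `convert_avi_to_mp3("recordings/P01.avi", "audio")` would write `audio/P01.mp3`, provided the `audio` directory already exists.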

Then, we manually aligned the recordings with their transcriptions. We converted the files containing the transcriptions and the file IDs to JSON, which serves as the input to our annotation tool, where all transcriptions are manually corrected. We also asked our annotator to mark corrections, repetitions, and (filled) pauses. Our annotation tool is described here in more detail.
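The conversion to JSON could be sketched roughly as below. The field names `file_id` and `transcription`, the file IDs, and the example sentences are all hypothetical. The actual schema expected by the annotation tool is not specified on this page.

```python
import json


def build_annotation_input(transcriptions):
    """Turn a {file_id: raw transcription} mapping into a JSON string."""
    records = [
        {"file_id": file_id, "transcription": text.strip()}
        for file_id, text in sorted(transcriptions.items())
    ]
    # ensure_ascii=False keeps any non-ASCII characters readable in the output.
    return json.dumps(records, ensure_ascii=False, indent=2)


# Invented example transcriptions, for illustration only.
example = {
    "rec_002": "een hond op het strand ",
    "rec_001": "twee mensen aan een tafel",
}
payload = build_annotation_input(example)
print(payload)
```

Sorting by file ID keeps the output stable between runs, which makes it easier to compare versions of the file before and after manual correction.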

Clean-up and data analysis

We carried out several different analyses.