Creation

This page documents the DIDEC creation process in more detail, with links to scripts and resources.

Experimental setup

We selected 307 images from MS COCO that also occur in the SALICON dataset and in Visual Genome, for maximal compatibility. We ran two experiments:

Free viewing, where participants looked at a sequence of 102 or 103 images.
Production viewing, where participants were asked to provide spoken descriptions of each of the images.
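
The figure of 102 or 103 images per participant is what you get when 307 images are divided as evenly as possible over three lists. The sketch below illustrates that arithmetic. The number of lists (three) and the function name are assumptions made for illustration; they are not taken from our experiment scripts.

```python
def split_into_lists(image_ids, n_lists=3):
    """Split image IDs into n_lists contiguous lists whose sizes differ by at most one.

    n_lists=3 is an assumption that is consistent with 307 images
    yielding lists of 102 or 103 images.
    """
    base, extra = divmod(len(image_ids), n_lists)
    lists, start = [], 0
    for i in range(n_lists):
        # The first `extra` lists each receive one additional image.
        size = base + (1 if i < extra else 0)
        lists.append(image_ids[start:start + size])
        start += size
    return lists


image_ids = list(range(307))  # placeholder IDs, not real MS COCO IDs
sizes = [len(chunk) for chunk in split_into_lists(image_ids)]
print(sizes)
```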

Our participants were Dutch students who received university credits for taking part in the experiment. See the Downloads page for all the experimental data, including consent forms, instructions, and config files for our experiment.

Data processing

ExperimentCenter can only export the recorded sound as part of .avi files, which contain a full recording of the screen and of the eye movements as our participants describe the images. We used the ffmpy module in Python to convert the .avi files to .mp3 files. (ffmpy relies on the ffmpeg library in the background.) We then automatically transcribed the .mp3 files using the built-in Dictation function in macOS. Because Dictation isn't built for transcribing audio recordings, we used the following strategy:

Merge the recordings using PyDub to transcribe 100 files at a time. We recorded the voice command for 'new line' to separate the recordings and have each description start on a new line.
Use Audacity and the Soundflower plugin to redirect the recordings to the microphone. See this tutorial for an explanation of how to set everything up.
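
The conversion and merging steps can be sketched roughly as follows. This is a minimal illustration and not our actual script: the function names, the file naming, and the path to the recorded 'new line' command are hypothetical. It assumes that ffmpy, PyDub, and ffmpeg are installed; both libraries are imported inside the functions so that the batching helper can be used without them.

```python
from pathlib import Path

BATCH_SIZE = 100  # from the text: transcribe 100 files at a time


def avi_to_mp3(avi_path):
    """Extract the audio track of one .avi recording as .mp3 (needs ffmpy and ffmpeg)."""
    import ffmpy

    mp3_path = str(Path(avi_path).with_suffix(".mp3"))
    # "-vn" drops the video stream; ffmpeg infers MP3 from the file extension.
    ffmpy.FFmpeg(inputs={str(avi_path): None}, outputs={mp3_path: "-vn"}).run()
    return mp3_path


def batches(paths, size=BATCH_SIZE):
    """Group the sorted recording paths into batches of at most `size` files."""
    paths = sorted(paths)
    return [paths[i:i + size] for i in range(0, len(paths), size)]


def merge_batch(mp3_paths, newline_command_path, out_path):
    """Concatenate recordings, with the spoken 'new line' command after each one."""
    from pydub import AudioSegment

    separator = AudioSegment.from_mp3(newline_command_path)
    merged = AudioSegment.empty()
    for path in mp3_paths:
        merged += AudioSegment.from_mp3(path) + separator
    merged.export(out_path, format="mp3")
    return out_path


# Hypothetical file names, used only to show the batching.
example = [f"recording_{i:03d}.mp3" for i in range(250)]
batch_sizes = [len(batch) for batch in batches(example)]
print(batch_sizes)
```

Sorting the paths before batching keeps the order of the merged audio predictable, which makes it easier to match each transcribed line back to its recording afterwards.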

Then, we manually aligned the recordings with their transcriptions. We converted the files containing the transcriptions and the file IDs to a JSON format, which serves as the input to our annotation tool, where all transcriptions are manually corrected. We also asked our annotator to mark corrections, repetitions, and (filled) pauses. Our annotation tool is described here in more detail.
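
As an illustration of the conversion to JSON, the sketch below pairs each file ID with one transcribed line. The field names (`file_id`, `transcription`) and the one-description-per-line layout are assumptions for the example; the actual input format of the annotation tool may differ.

```python
import json


def build_annotation_input(file_ids, transcript_text):
    """Pair each file ID with one transcribed line (hypothetical JSON layout)."""
    lines = [line.strip() for line in transcript_text.splitlines()]
    if len(lines) != len(file_ids):
        # A mismatch means the alignment has to be checked by hand first.
        raise ValueError(f"{len(file_ids)} file IDs but {len(lines)} transcribed lines")
    return [
        {"file_id": file_id, "transcription": line}
        for file_id, line in zip(file_ids, lines)
    ]


# Placeholder IDs and made-up Dutch descriptions, not real DIDEC data.
records = build_annotation_input(
    ["pp01_img001", "pp01_img002"],
    "een hond op een bank\neen man met een paraplu\n",
)
as_json = json.dumps(records, ensure_ascii=False, indent=2)
print(as_json)
```

Raising an error on a length mismatch, instead of silently truncating, fits the fact that the alignment was checked manually.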

Clean-up and data analysis

We carried out several different analyses.

Following the annotation stage, we inspected the annotations and corrected typing errors in the annotation labels (e.g. >corr> instead of <corr> to indicate a correction). We then produced the final annotation file, and computed general descriptive statistics. Download code and data here.
We then compared the eye-tracking data for the two different tasks. Download here.
We then looked at image specificity, and whether it can be predicted from eye-tracking data. Download here, and download our reimplementation of the image specificity metric here, or see the GitHub page.
We compared spoken and written language. Download here.
We computed length statistics for the Flickr30K validation set (for comparison). Download here.
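
The label clean-up and the length statistics mentioned above could look roughly like this. It is a sketch under stated assumptions and not the code offered for download: only the `<corr>` label appears on this page, so the other label names (`rep`, `pause`) are hypothetical, as are the function names and the whitespace-based token count.

```python
import re
from statistics import mean

# Only "corr" is taken from the text; "rep" and "pause" are hypothetical labels.
KNOWN_LABELS = ("corr", "rep", "pause")

# A known label name enclosed by any mix of "<" and ">" characters,
# optionally with a "/" for a closing label.
_LABEL = re.compile(r"[<>](/?)(" + "|".join(KNOWN_LABELS) + r")[<>]")


def normalize_labels(text):
    """Rewrite mistyped labels such as '>corr>' to the canonical '<corr>'."""
    return _LABEL.sub(r"<\1\2>", text)


def token_count(description):
    """Count whitespace-separated tokens, leaving annotation labels out."""
    without_labels = _LABEL.sub(" ", normalize_labels(description))
    return len(without_labels.split())


def mean_length(descriptions):
    """Mean description length in tokens."""
    return mean(token_count(d) for d in descriptions)


fixed = normalize_labels("een >corr> hond")
print(fixed)
```

Restricting the pattern to a fixed set of label names avoids rewriting angle brackets that happen to occur elsewhere in a transcription.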