Data Sprint #20: Human Memory and Cognition
Models of human cognition hold that information processing occurs in a series of stages. Cognitive psychology, in particular, is concerned with the internal mental processes that begin with the appearance of an external stimulus and result in a behavioral response.
Explore the human cognitive processes behind narrative generation, with a focus on the language used in stories about experienced versus imagined events. Using data science, investigate and characterize the cognitive processes involved in storytelling, contrasting imagination with the recollection of events.
Build a machine learning model that classifies the cognitive process behind each story as imagined, recalled, or retold.
Submissions are evaluated using Weighted F1 Score.
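For intuition about the metric, here is a minimal sketch of a weighted F1 score in plain Python. It assumes the platform uses the standard definition (per-class F1 averaged with weights proportional to each class's count in the true labels, as in scikit-learn's `f1_score(..., average="weighted")`); the function name `weighted_f1` and the toy labels are illustrative only.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's share of y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n_true / total) * f1
    return score


# Toy labels, invented for illustration.
y_true = ["recalled", "recalled", "imagined", "retold"]
y_pred = ["recalled", "imagined", "imagined", "retold"]
print(weighted_f1(y_true, y_pred))  # 0.75
```

Because each class is weighted by its frequency, mistakes on the most common story type cost more than mistakes on a rare one.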
How do we do it?
Once we release the data, anyone can download it, build a model, and make a submission. We give competitors a set of data (training data) with both the independent and dependent variables.
We also release a second set of data (the test dataset) with only the independent variables; the corresponding dependent variable is hidden. You submit your predicted values of the dependent variable for this set, and we compare them against the actual values.
The predictions are evaluated based on the evaluation metric defined in the datathon.
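The train-then-predict workflow described above can be sketched as follows. This is only one possible baseline, not a recommended or official approach: it assumes scikit-learn is installed, uses TF-IDF features with logistic regression as an arbitrary choice, and runs on a handful of invented toy stories rather than the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the training data: story text (independent variable)
# and story type (dependent variable). All values are invented.
train_stories = [
    "Last week I went to my sister's wedding and cried during the vows.",
    "I remember the day my dog ran away and how we searched for hours.",
    "I imagine I would feel proud walking across the stage at graduation.",
    "If I won the lottery I think I would travel the world for a year.",
    "As I wrote before, the wedding last month was emotional for everyone.",
    "Thinking back again on the day my dog ran away still makes me sad.",
]
train_labels = ["recalled", "recalled", "imagined", "imagined", "retold", "retold"]

# Toy stand-in for the test data: only the independent variable is available.
test_stories = [
    "I went to a concert yesterday and lost my phone in the crowd.",
    "I imagine moving to a new city would be exciting and scary.",
]

# Fit on the labeled training set, then predict labels for the unlabeled test set.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_stories, train_labels)
predictions = model.predict(test_stories)
print(list(predictions))
```

With the real data, the same pattern would fit on the `story` and `memType` columns of the training set and predict on the `story` column of the test set.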
About the Data
The dataset contains short stories about recalled and imagined events.
These are the columns in the data:
- `AssignmentId`: Unique ID of this story
- `WorkTimeInSeconds`: Time in seconds that it took the worker to do the entire HIT (reading instructions, story writing, questions)
- `WorkerId`: Unique ID of the worker (random string, not MTurk worker ID)
- `annotatorAge`: Lower limit of the age bucket of the worker. Buckets are: 18-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55+
- `annotatorGender`: Gender of the worker
- `annotatorRace`: Race/ethnicity of the worker
- `distracted`: How distracted were you while writing your story? (5-point Likert)
- `draining`: How taxing/draining was writing for you emotionally? (5-point Likert)
- `frequency`: How often do you think about or talk about this event? (5-point Likert)
- `importance`: How impactful, important, or personal is this story/event to you? (5-point Likert)
- `logTimeSinceEvent`: Log of time (days) since the recalled event happened
- `mainEvent`: Short phrase describing the main event described
- `memType`: Type of story (recalled, imagined, retold); this is the target variable
- `mostSurprising`: Short phrase describing what the most surprising aspect of the story was
- `openness`: Continuous variable representing the openness to experience of the worker
- `recAgnPairId`: ID of the recalled story that corresponds to this retold story (null for imagined stories). Group on this variable to get the recalled-retold pairs.
- `recImgPairId`: ID of the recalled story that corresponds to this imagined story (null for retold stories). Group on this variable to get the recalled-imagined pairs.
- `similarity`: How similar to your life does this event/story feel to you? (5-point Likert)
- `similarityReason`: Free text annotation of similarity
- `story`: Story about the imagined or recalled event (15-25 sentences)
- `stressful`: How stressful was this writing task? (5-point Likert)
- `summary`: Summary of the events in the story (1-3 sentences)
- `timeSinceEvent`: Time (number of days) since the recalled event happened
Likert scaling is a bipolar scaling method, measuring either positive or negative response to a statement.
Note: Feel free to Google some of the terms that are new to you!
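The column list suggests grouping on `recAgnPairId` to obtain recalled-retold pairs. A minimal pandas sketch of that grouping is shown here on a tiny hypothetical frame; the IDs are invented, and the assumption that a recalled story carries its own ID in `recAgnPairId` should be checked against the real data.

```python
import pandas as pd

# Hypothetical miniature frame; IDs and values are invented for illustration.
toy = pd.DataFrame(
    {
        "AssignmentId": ["a1", "a2", "a3"],
        "memType": ["recalled", "retold", "imagined"],
        # Assumed: the recalled story and its retelling share the same pair ID.
        "recAgnPairId": ["a1", "a1", None],
        "recImgPairId": ["a1", None, "a1"],
    }
)

# Drop rows with no pair ID (imagined stories), then collect story types per pair.
retold_pairs = (
    toy.dropna(subset=["recAgnPairId"])
    .groupby("recAgnPairId")["memType"]
    .agg(list)
    .to_dict()
)
print(retold_pairs)  # {'a1': ['recalled', 'retold']}
```

The same pattern with `recImgPairId` would yield the recalled-imagined pairs.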
To load the training data in your Jupyter notebook, use the command below:
import pandas as pd
train_data = pd.read_csv("https://raw.githubusercontent.com/dphi-official/Datasets/master/hippocorpus/train_set_label.csv")
Saving Prediction File & Sample Submission
You can find more details on how to save a prediction file here: https://discuss.dphi.tech/t/how-to-submit-predictions/548
Sample submission: Submit a CSV file with a header row. The header name must be `prediction`; any other name causes an evaluation error. A sample submission file can be found here
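As a rough illustration of the required format, the sketch below writes a single-column CSV whose header is `prediction` and reads it back to confirm. The placeholder predictions and the output file name are hypothetical; the linked guide remains the authoritative description of the submission process.

```python
import os
import tempfile

import pandas as pd

# Placeholder predictions; in practice these come from your model,
# one per row of the test data and in the same order.
predictions = ["recalled", "imagined", "retold"]

# One column named exactly "prediction"; index=False avoids an extra index column.
submission = pd.DataFrame({"prediction": predictions})
out_path = os.path.join(tempfile.gettempdir(), "submission.csv")  # hypothetical name
submission.to_csv(out_path, index=False)

# Read the file back to verify the header before uploading.
check = pd.read_csv(out_path)
print(list(check.columns))  # ['prediction']
```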
Load the test data and name it `test_data`, using the command below:
test_data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/hippocorpus/test_set_label.csv')
The target column (`memType`) is deliberately absent from the test data because you need to predict it.
This dataset is sourced from Microsoft Research Open Data.
Maarten Sap, Eric Horvitz, Yejin Choi, Noah A. Smith, and James Pennebaker (2020) _Recollection versus Imagination: Exploring Human Memory and Cognition via Neural Language Models._ ACL.
To participate in this challenge, you must either create a team or join an existing one.