{"cells":[{"cell_type":"markdown","metadata":{"id":"0ZgsJOlTV-cA"},"source":["# Tutorial on `edu-convokit` for the TalkMoves dataset\n","\n","Welcome to the tutorial on `edu-convokit` for the [TalkMoves dataset](https://github.com/SumnerLab/TalkMoves). This tutorial will walk you through the process of using `edu-convokit` to pre-process, annotate and analyze the TalkMoves dataset.\n","\n","If you are looking for a tutorial on the individual components of `edu-convokit`, please refer to the following tutorials to get started:\n","- [Text Pre-processing Colab](https://colab.research.google.com/drive/1a-EwYwkNYHSNcNThNTXe6DNpsis0bpQK)\n","- [Annotation Colab](https://colab.research.google.com/drive/1rBwEctFtmQowZHxralH2OGT5uV0zRIQw)\n","- [Analysis Colab](https://colab.research.google.com/drive/1xfrq5Ka3FZH7t9l87u4sa_oMlmMvuTfe)\n","\n","This tutorial will use all of the components!"]},{"cell_type":"markdown","metadata":{"id":"qyot5jipV-cB"},"source":["## Installation\n","\n","Let's start by installing `edu-convokit` and importing the necessary modules.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":21593,"status":"ok","timestamp":1703936794770,"user":{"displayName":"Rose Wang","userId":"08647070137360066467"},"user_tz":480},"id":"E1IPd9HnV-cC","outputId":"170889eb-0ae8-44bf-f12c-31399220d963"},"outputs":[{"name":"stdout","output_type":"stream","text":["Collecting git+https://github.com/rosewang2008/edu-convokit.git\n"," Cloning https://github.com/rosewang2008/edu-convokit.git to /tmp/pip-req-build-dgphjpe_\n"," Running command git clone --filter=blob:none --quiet https://github.com/rosewang2008/edu-convokit.git /tmp/pip-req-build-dgphjpe_\n"," Resolved https://github.com/rosewang2008/edu-convokit.git to commit 1e094c8836a3e3112cc1f996f5f12aeff013777c\n"," Preparing metadata (setup.py) ... 
\u001b[?25l\u001b[?25hdone\n","Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (4.66.1)\n","Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (1.23.5)\n","Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (1.11.4)\n","Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (3.8.1)\n","Requirement already satisfied: torch in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (2.1.0+cu121)\n","Requirement already satisfied: transformers in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (4.35.2)\n","Collecting clean-text (from edu-convokit==0.0.1)\n"," Downloading clean_text-0.6.0-py3-none-any.whl (11 kB)\n","Requirement already satisfied: openpyxl in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (3.1.2)\n","Requirement already satisfied: spacy in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (3.6.1)\n","Requirement already satisfied: gensim in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (4.3.2)\n","Collecting num2words==0.5.10 (from edu-convokit==0.0.1)\n"," Downloading num2words-0.5.10-py3-none-any.whl (101 kB)\n","\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m101.6/101.6 kB\u001b[0m \u001b[31m1.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n","\u001b[?25hRequirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (1.2.2)\n","Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (3.7.1)\n","Requirement already satisfied: seaborn in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (0.12.2)\n","Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from 
edu-convokit==0.0.1) (1.5.3)\n","Collecting docopt>=0.6.2 (from num2words==0.5.10->edu-convokit==0.0.1)\n"," Downloading docopt-0.6.2.tar.gz (25 kB)\n"," Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n","Collecting emoji<2.0.0,>=1.0.0 (from clean-text->edu-convokit==0.0.1)\n"," Downloading emoji-1.7.0.tar.gz (175 kB)\n","\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m175.4/175.4 kB\u001b[0m \u001b[31m5.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n","\u001b[?25h Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n","Collecting ftfy<7.0,>=6.0 (from clean-text->edu-convokit==0.0.1)\n"," Downloading ftfy-6.1.3-py3-none-any.whl (53 kB)\n","\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m53.4/53.4 kB\u001b[0m \u001b[31m4.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n","\u001b[?25hRequirement already satisfied: smart-open>=1.8.1 in /usr/local/lib/python3.10/dist-packages (from gensim->edu-convokit==0.0.1) (6.4.0)\n","Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->edu-convokit==0.0.1) (1.2.0)\n","Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib->edu-convokit==0.0.1) (0.12.1)\n","Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->edu-convokit==0.0.1) (4.46.0)\n","Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->edu-convokit==0.0.1) (1.4.5)\n","Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->edu-convokit==0.0.1) (23.2)\n","Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->edu-convokit==0.0.1) (9.4.0)\n","Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->edu-convokit==0.0.1) 
(3.1.1)\n","Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.10/dist-packages (from matplotlib->edu-convokit==0.0.1) (2.8.2)\n","Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk->edu-convokit==0.0.1) (8.1.7)\n","Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk->edu-convokit==0.0.1) (1.3.2)\n","Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages (from nltk->edu-convokit==0.0.1) (2023.6.3)\n","Requirement already satisfied: et-xmlfile in /usr/local/lib/python3.10/dist-packages (from openpyxl->edu-convokit==0.0.1) (1.1.0)\n","Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->edu-convokit==0.0.1) (2023.3.post1)\n","Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn->edu-convokit==0.0.1) (3.2.0)\n","Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.11 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (3.0.12)\n","Requirement already satisfied: spacy-loggers<2.0.0,>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (1.0.5)\n","Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (1.0.10)\n","Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (2.0.8)\n","Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (3.0.9)\n","Requirement already satisfied: thinc<8.2.0,>=8.1.8 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (8.1.12)\n","Requirement already satisfied: wasabi<1.2.0,>=0.9.1 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (1.1.2)\n","Requirement already 
satisfied: srsly<3.0.0,>=2.4.3 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (2.4.8)\n","Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (2.0.10)\n","Requirement already satisfied: typer<0.10.0,>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (0.9.0)\n","Requirement already satisfied: pathy>=0.10.0 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (0.10.3)\n","Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (2.31.0)\n","Requirement already satisfied: pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (1.10.13)\n","Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (3.1.2)\n","Requirement already satisfied: setuptools in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (67.7.2)\n","Requirement already satisfied: langcodes<4.0.0,>=3.2.0 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (3.3.0)\n","Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch->edu-convokit==0.0.1) (3.13.1)\n","Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from torch->edu-convokit==0.0.1) (4.5.0)\n","Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch->edu-convokit==0.0.1) (1.12)\n","Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch->edu-convokit==0.0.1) (3.2.1)\n","Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch->edu-convokit==0.0.1) (2023.6.0)\n","Requirement already satisfied: triton==2.1.0 in /usr/local/lib/python3.10/dist-packages (from 
torch->edu-convokit==0.0.1) (2.1.0)\n","Requirement already satisfied: huggingface-hub<1.0,>=0.16.4 in /usr/local/lib/python3.10/dist-packages (from transformers->edu-convokit==0.0.1) (0.19.4)\n","Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers->edu-convokit==0.0.1) (6.0.1)\n","Requirement already satisfied: tokenizers<0.19,>=0.14 in /usr/local/lib/python3.10/dist-packages (from transformers->edu-convokit==0.0.1) (0.15.0)\n","Requirement already satisfied: safetensors>=0.3.1 in /usr/local/lib/python3.10/dist-packages (from transformers->edu-convokit==0.0.1) (0.4.1)\n","Requirement already satisfied: wcwidth<0.3.0,>=0.2.12 in /usr/local/lib/python3.10/dist-packages (from ftfy<7.0,>=6.0->clean-text->edu-convokit==0.0.1) (0.2.12)\n","Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.7->matplotlib->edu-convokit==0.0.1) (1.16.0)\n","Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.13.0->spacy->edu-convokit==0.0.1) (3.3.2)\n","Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.13.0->spacy->edu-convokit==0.0.1) (3.6)\n","Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.13.0->spacy->edu-convokit==0.0.1) (2.0.7)\n","Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.13.0->spacy->edu-convokit==0.0.1) (2023.11.17)\n","Requirement already satisfied: blis<0.8.0,>=0.7.8 in /usr/local/lib/python3.10/dist-packages (from thinc<8.2.0,>=8.1.8->spacy->edu-convokit==0.0.1) (0.7.11)\n","Requirement already satisfied: confection<1.0.0,>=0.0.1 in /usr/local/lib/python3.10/dist-packages (from thinc<8.2.0,>=8.1.8->spacy->edu-convokit==0.0.1) (0.1.4)\n","Requirement already satisfied: MarkupSafe>=2.0 in 
/usr/local/lib/python3.10/dist-packages (from jinja2->spacy->edu-convokit==0.0.1) (2.1.3)\n","Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch->edu-convokit==0.0.1) (1.3.0)\n","Building wheels for collected packages: edu-convokit, docopt, emoji\n"," Building wheel for edu-convokit (setup.py) ... \u001b[?25l\u001b[?25hdone\n"," Created wheel for edu-convokit: filename=edu_convokit-0.0.1-py3-none-any.whl size=25946 sha256=bacc5ae8cec78f73dd6432b9a641058237be062d59c7dcfcac080e9a19077bf3\n"," Stored in directory: /tmp/pip-ephem-wheel-cache-a92ctwua/wheels/29/43/ec/d2472df0eb2af8f1e7d67d0710a4b3eb93fe983b15f8d7b841\n"," Building wheel for docopt (setup.py) ... \u001b[?25l\u001b[?25hdone\n"," Created wheel for docopt: filename=docopt-0.6.2-py2.py3-none-any.whl size=13706 sha256=19f3926503485ba42f4fb35754933106263ea928f13b10e358a34f5f263f839a\n"," Stored in directory: /root/.cache/pip/wheels/fc/ab/d4/5da2067ac95b36618c629a5f93f809425700506f72c9732fac\n"," Building wheel for emoji (setup.py) ... 
\u001b[?25l\u001b[?25hdone\n"," Created wheel for emoji: filename=emoji-1.7.0-py3-none-any.whl size=171033 sha256=0024d11da3567b1c7f328fd06e05831297bd61be31635baec2d057a050286c56\n"," Stored in directory: /root/.cache/pip/wheels/31/8a/8c/315c9e5d7773f74b33d5ed33f075b49c6eaeb7cedbb86e2cf8\n","Successfully built edu-convokit docopt emoji\n","Installing collected packages: emoji, docopt, num2words, ftfy, clean-text, edu-convokit\n","Successfully installed clean-text-0.6.0 docopt-0.6.2 edu-convokit-0.0.1 emoji-1.7.0 ftfy-6.1.3 num2words-0.5.10\n"]}],"source":["!pip install git+https://github.com/rosewang2008/edu-convokit.git\n"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":21995,"status":"ok","timestamp":1703936816761,"user":{"displayName":"Rose Wang","userId":"08647070137360066467"},"user_tz":480},"id":"nf3j5W8AV-cC","outputId":"f2b9c0a7-dc15-4dc9-efb3-751c2b4e0a0a"},"outputs":[{"name":"stderr","output_type":"stream","text":["WARNING:root:Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.\n","[nltk_data] Downloading package stopwords to /root/nltk_data...\n","[nltk_data] Unzipping corpora/stopwords.zip.\n"]}],"source":["from edu_convokit.preprocessors import TextPreprocessor\n","from edu_convokit.annotation import Annotator\n","from edu_convokit.analyzers import (\n"," QualitativeAnalyzer,\n"," QuantitativeAnalyzer,\n"," LexicalAnalyzer,\n"," TemporalAnalyzer\n",")\n","# For helping us load data\n","from edu_convokit import utils\n","\n","import os\n","import tqdm"]},{"cell_type":"markdown","metadata":{"id":"iwnxZ8sLV-cC"},"source":["## 📑 Data\n","\n","Let's download the dataset under `raw_data/`.\n","Note that we're only downloading a subsample of the dataset for this tutorial; this cuts down the annotation time.\n","If you would like to annotate the entire dataset, feel free to upload the entire dataset to this 
Colab!"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":550,"status":"ok","timestamp":1703936817307,"user":{"displayName":"Rose Wang","userId":"08647070137360066467"},"user_tz":480},"id":"e66OQumaV-cD","outputId":"fc466ccb-a7be-4573-cc28-9a0dcbbd0a40"},"outputs":[{"name":"stdout","output_type":"stream","text":["--2023-12-30 11:46:56-- https://raw.githubusercontent.com/rosewang2008/edu-convokit/master/data/talkmoves.zip\n","Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n","Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n","HTTP request sent, awaiting response... 200 OK\n","Length: 346774 (339K) [application/zip]\n","Saving to: ‘talkmoves.zip’\n","\n","\rtalkmoves.zip 0%[ ] 0 --.-KB/s \rtalkmoves.zip 100%[===================>] 338.65K --.-KB/s in 0.02s \n","\n","2023-12-30 11:46:56 (17.2 MB/s) - ‘talkmoves.zip’ saved [346774/346774]\n","\n"]}],"source":["# We will put the data here:\n","DATA_DIR = \"raw_data\"\n","!mkdir -p $DATA_DIR\n","\n","# We will put the annotated data here:\n","ANNOTATIONS_DIR = \"annotations\"\n","!mkdir -p $ANNOTATIONS_DIR\n","\n","# Download the data\n","!wget \"https://raw.githubusercontent.com/rosewang2008/edu-convokit/master/data/talkmoves.zip\"\n","\n","# Unzip the data (-n skips files that already exist, -q keeps the output quiet)\n","!unzip -n -q talkmoves.zip -d $DATA_DIR\n","\n","# The unzipped data then lives in raw_data/talkmoves\n","DATA_DIR = \"raw_data/talkmoves\""]},{"cell_type":"code","execution_count":null,"metadata":{"id":"uFT_56rtV-cD"},"outputs":[],"source":["# We'll set the important variables specific to this dataset. 
If you open one of the files, you'll see that the\n","# speaker and text columns are defined as:\n","TEXT_COLUMN = \"Sentence\"\n","SPEAKER_COLUMN = \"Speaker\"\n","\n","# We will also define the annotation columns.\n","# For the purposes of this tutorial, we will only be using talktime, student_reasoning, and uptake.\n","TALK_TIME_COLUMN = \"talktime\"\n","STUDENT_REASONING_COLUMN = \"student_reasoning\"\n","UPTAKE_COLUMN = \"uptake\""]},{"cell_type":"markdown","metadata":{"id":"28sYVNWJV-cD"},"source":["It is important to know how the teacher/tutor and the students are represented in the dataset.\n","Let's load some examples to see which speaker labels are used."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"LUqIcq6YV-cD"},"outputs":[],"source":["# Collect the files in DATA_DIR that have a supported extension\n","files = os.listdir(DATA_DIR)\n","files = [os.path.join(DATA_DIR, f) for f in files if utils.is_valid_file_extension(f)]\n","\n","# Merge all of the files into a single dataframe\n","df = utils.merge_dataframes_in_list(files)"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":519},"executionInfo":{"elapsed":4,"status":"ok","timestamp":1703936943212,"user":{"displayName":"Rose Wang","userId":"08647070137360066467"},"user_tz":480},"id":"yZzWrdaxV-cD","outputId":"30c85bab-849b-46ea-ba8e-804b7361e10a"},"outputs":[{"data":{"text/html":["\n","
| | Unnamed: 0 | TimeStamp | Turn | Speaker | Sentence | Teacher Tag | Student Tag |\n","
|---|---|---|---|---|---|---|---|\n","
| 5 | NaN | NaN | 1.0 | T/R1 | Do you remember it looks like this. | 1 - None | NaN |\n","
| 94 | 94.0 | NaN | 49.0 | T | How many of you disagree? | 3 - Getting Students to Relate | NaN |\n","
| 53 | 53.0 | NaN | NaN | Erik and Brian | Yeah | NaN | 2 - Relating to Another Student |\n","
| 135 | 135.0 | NaN | 86.0 | Mark | If, if the blue was one whole, what would the ... | NaN | 3 - Asking for More Information |\n","
| 193 | 193.0 | NaN | 85.0 | T | Or the people who aren't sure want to tell us... | 2 - Keeping Everyone Together | NaN |\n","
| 93 | 93.0 | NaN | 17.0 | T | Joey? | 2 - Keeping Everyone Together | NaN |\n","
| 34 | 34.0 | NaN | 11.0 | T | I want to call the white rod one half. | 1 - None | NaN |\n","
| 42 | 42.0 | NaN | 31.0 | Alan | [Puts three light green rods on top of the blu... | NaN | 5 - Providing Evidence / Explaining Reasoning |\n","
| 46 | NaN | NaN | 31.0 | T/R1 | Do the number names change? | 2 - Keeping Everyone Together | NaN |\n","
| 839 | 839.0 | NaN | 481.0 | T | Okay, but why, how could she be sure? | 3 - Getting Students to Relate | NaN |\n","