{"cells":[{"cell_type":"markdown","metadata":{"id":"orrpsbmYU7pL"},"source":["# Tutorial on `edu-convokit` for the Amber dataset\n","\n","Welcome to the tutorial on `edu-convokit` for the [Amber dataset](https://github.com/laurenceholt/amber/tree/main). This tutorial will walk you through the process of using `edu-convokit` to pre-process, annotate and analyze the Amber dataset.\n","\n","If you are looking for a tutorial on the individual components of `edu-convokit`, please refer to the following tutorials to get started:\n","- [Text Pre-processing Colab](https://colab.research.google.com/drive/1a-EwYwkNYHSNcNThNTXe6DNpsis0bpQK)\n","- [Annotation Colab](https://colab.research.google.com/drive/1rBwEctFtmQowZHxralH2OGT5uV0zRIQw)\n","- [Analysis Colab](https://colab.research.google.com/drive/1xfrq5Ka3FZH7t9l87u4sa_oMlmMvuTfe)\n","\n","This tutorial will use all of the components!"]},{"cell_type":"markdown","metadata":{"id":"XvZ0-3nfU7pM"},"source":["## Installation\n","\n","Let's start by installing `edu-convokit` and importing the necessary modules.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":26345,"status":"ok","timestamp":1703933148182,"user":{"displayName":"Rose Wang","userId":"08647070137360066467"},"user_tz":480},"id":"nGJhCqQYU7pN","outputId":"69afe5e0-9e09-441e-fe0b-ea39a207042a"},"outputs":[{"name":"stdout","output_type":"stream","text":["Collecting git+https://github.com/rosewang2008/edu-convokit.git\n"," Cloning https://github.com/rosewang2008/edu-convokit.git to /tmp/pip-req-build-repqq1x3\n"," Running command git clone --filter=blob:none --quiet https://github.com/rosewang2008/edu-convokit.git /tmp/pip-req-build-repqq1x3\n"," Resolved https://github.com/rosewang2008/edu-convokit.git to commit 2c36eabaf3d4dff1d8c1e89ae4f175ec80617f7e\n"," Preparing metadata (setup.py) ... 
\u001b[?25l\u001b[?25hdone\n"," Created wheel for emoji: filename=emoji-1.7.0-py3-none-any.whl size=171033 sha256=a5a47ae4916b773f7e0049304c63c50d066bee6292e0860693efecda135b802f\n"," Stored in directory: /root/.cache/pip/wheels/31/8a/8c/315c9e5d7773f74b33d5ed33f075b49c6eaeb7cedbb86e2cf8\n","Successfully built edu-convokit docopt emoji\n","Installing collected packages: emoji, docopt, num2words, ftfy, clean-text, edu-convokit\n","Successfully installed clean-text-0.6.0 docopt-0.6.2 edu-convokit-0.0.1 emoji-1.7.0 ftfy-6.1.3 num2words-0.5.10\n"]}],"source":["!pip install git+https://github.com/rosewang2008/edu-convokit.git\n"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":25557,"status":"ok","timestamp":1703933173736,"user":{"displayName":"Rose Wang","userId":"08647070137360066467"},"user_tz":480},"id":"lMFU9vJgU7pN","outputId":"3fec19d2-1f41-47cb-8f4b-f5ac3fcdc97e"},"outputs":[{"name":"stderr","output_type":"stream","text":["WARNING:root:Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.\n","[nltk_data] Downloading package stopwords to /root/nltk_data...\n","[nltk_data] Unzipping corpora/stopwords.zip.\n"]}],"source":["from edu_convokit.preprocessors import TextPreprocessor\n","from edu_convokit.annotation import Annotator\n","from edu_convokit.analyzers import (\n"," QualitativeAnalyzer,\n"," QuantitativeAnalyzer,\n"," LexicalAnalyzer,\n"," TemporalAnalyzer\n",")\n","# For helping us load data\n","from edu_convokit import utils\n","\n","import os\n","import tqdm"]},{"cell_type":"markdown","metadata":{"id":"XWo_LjzOU7pO"},"source":["## 📑 Data\n","\n","Let's download the dataset under `raw_data/`."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":793,"status":"ok","timestamp":1703933279655,"user":{"displayName":"Rose 
Wang","userId":"08647070137360066467"},"user_tz":480},"id":"ORG_wOuDU7pO","outputId":"a1b8f1ff-e612-4511-817e-cf4a35664676"},"outputs":[{"name":"stdout","output_type":"stream","text":["--2023-12-30 10:47:59-- https://raw.githubusercontent.com/rosewang2008/edu-convokit/master/data/amber.zip\n","Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n","Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n","HTTP request sent, awaiting response... 200 OK\n","Length: 547335 (535K) [application/zip]\n","Saving to: ‘amber.zip’\n","\n","amber.zip 100%[===================>] 534.51K --.-KB/s in 0.04s \n","\n","2023-12-30 10:47:59 (11.9 MB/s) - ‘amber.zip’ saved [547335/547335]\n","\n"]}],"source":["# We will put the raw data here:\n","DATA_DIR = \"raw_data\"\n","!mkdir -p $DATA_DIR\n","\n","# We will put the annotated data here:\n","ANNOTATIONS_DIR = \"annotations\"\n","!mkdir -p $ANNOTATIONS_DIR\n","\n","# Download the data\n","!wget \"https://raw.githubusercontent.com/rosewang2008/edu-convokit/master/data/amber.zip\"\n","\n","# Unzip the data (-n: never overwrite existing files, -q: quiet)\n","!unzip -n -q amber.zip -d $DATA_DIR\n","\n","# The unzipped transcripts are then under raw_data/amber\n","DATA_DIR = \"raw_data/amber\""]},{"cell_type":"code","execution_count":null,"metadata":{"id":"fouesc5MU7pO"},"outputs":[],"source":["# We'll set the important variables specific to this dataset. 
If you open one of the files, you'll see that the\n","# speaker and text columns are defined as:\n","TEXT_COLUMN = \"dialogue\"\n","SPEAKER_COLUMN = \"speaker\"\n","\n","# We also define the names of the columns that will hold the annotations.\n","# This tutorial uses only three: talktime, student_reasoning, and uptake.\n","TALK_TIME_COLUMN = \"talktime\"\n","STUDENT_REASONING_COLUMN = \"student_reasoning\"\n","UPTAKE_COLUMN = \"uptake\""]},{"cell_type":"markdown","metadata":{"id":"BQQHbS-9U7pO"},"source":["It is important to know how the tutor and the student are labeled in the dataset.\n","Let's load a few examples and look at the `speaker` column."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":206},"executionInfo":{"elapsed":101,"status":"ok","timestamp":1703933296967,"user":{"displayName":"Rose Wang","userId":"08647070137360066467"},"user_tz":480},"id":"c0hX2oMzU7pO","outputId":"87f64d4e-0147-419e-f5d8-b15e3f91a69f"},"outputs":[{"data":{"text/html":["\n","
|  | start | stop | speaker | dialogue |\n",
"|---|---|---|---|---|\n",
"| 0 | 00:00.00 | 00:06.00 | Tutor | All right. Do you see the tools on the left-ha... |\n",
"| 1 | 00:06.00 | 00:07.00 | Student | Yes. |\n",
"| 2 | 00:07.00 | 00:19.00 | Tutor | All right. We're going to take a look at those... |\n",
"| 3 | 00:19.00 | 00:31.00 | Tutor | So the third button down is a pencil. If you c... |\n",
"| 4 | 00:31.00 | 00:37.00 | Tutor | If you want to type anything, you can use the ... |\n","