{"cells":[{"cell_type":"markdown","metadata":{"id":"pmM0zX-SW4M4"},"source":["# Tutorial on `edu-convokit` for the NCTE dataset\n","\n","Welcome to the tutorial on `edu-convokit` for the [NCTE dataset](https://github.com/ddemszky/classroom-transcript-analysis). This tutorial will walk you through the process of using `edu-convokit` to pre-process, annotate and analyze the NCTE dataset.\n","\n","If you are looking for a tutorial on the individual components of `edu-convokit`, please refer to the following tutorials to get started:\n","- [Text Pre-processing Colab](https://colab.research.google.com/drive/1a-EwYwkNYHSNcNThNTXe6DNpsis0bpQK)\n","- [Annotation Colab](https://colab.research.google.com/drive/1rBwEctFtmQowZHxralH2OGT5uV0zRIQw)\n","- [Analysis Colab](https://colab.research.google.com/drive/1xfrq5Ka3FZH7t9l87u4sa_oMlmMvuTfe)\n","\n","This tutorial will use all of the components!"]},{"cell_type":"markdown","metadata":{"id":"fxcdKqHeW4M6"},"source":["## Installation\n","\n","Let's start by installing `edu-convokit` and importing the necessary modules.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":16955,"status":"ok","timestamp":1703933634344,"user":{"displayName":"Rose Wang","userId":"08647070137360066467"},"user_tz":480},"id":"EHd_2SRNW4M6","outputId":"e21ff79d-1a0a-4883-9ab1-84d1b809fd6a"},"outputs":[{"name":"stdout","output_type":"stream","text":["Collecting git+https://github.com/rosewang2008/edu-convokit.git\n"," Cloning https://github.com/rosewang2008/edu-convokit.git to /tmp/pip-req-build-ncsfguml\n"," Running command git clone --filter=blob:none --quiet https://github.com/rosewang2008/edu-convokit.git /tmp/pip-req-build-ncsfguml\n"," Resolved https://github.com/rosewang2008/edu-convokit.git to commit 5c1128c8f94d7574bc61cc56f29ce64bdca4ae30\n"," Preparing metadata (setup.py) ... 
\u001b[?25l\u001b[?25hdone\n","Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (4.66.1)\n","Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (1.23.5)\n","Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (1.11.4)\n","Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (3.8.1)\n","Requirement already satisfied: torch in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (2.1.0+cu121)\n","Requirement already satisfied: transformers in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (4.35.2)\n","Collecting clean-text (from edu-convokit==0.0.1)\n"," Downloading clean_text-0.6.0-py3-none-any.whl (11 kB)\n","Requirement already satisfied: openpyxl in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (3.1.2)\n","Requirement already satisfied: spacy in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (3.6.1)\n","Requirement already satisfied: gensim in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (4.3.2)\n","Collecting num2words==0.5.10 (from edu-convokit==0.0.1)\n"," Downloading num2words-0.5.10-py3-none-any.whl (101 kB)\n","\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m101.6/101.6 kB\u001b[0m \u001b[31m1.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n","\u001b[?25hRequirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (1.2.2)\n","Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (3.7.1)\n","Requirement already satisfied: seaborn in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (0.12.2)\n","Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from 
edu-convokit==0.0.1) (1.5.3)\n","Collecting docopt>=0.6.2 (from num2words==0.5.10->edu-convokit==0.0.1)\n"," Downloading docopt-0.6.2.tar.gz (25 kB)\n"," Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n","Collecting emoji<2.0.0,>=1.0.0 (from clean-text->edu-convokit==0.0.1)\n"," Downloading emoji-1.7.0.tar.gz (175 kB)\n","\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m175.4/175.4 kB\u001b[0m \u001b[31m3.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n","\u001b[?25h Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n","Collecting ftfy<7.0,>=6.0 (from clean-text->edu-convokit==0.0.1)\n"," Downloading ftfy-6.1.3-py3-none-any.whl (53 kB)\n","\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m53.4/53.4 kB\u001b[0m \u001b[31m5.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n","\u001b[?25hRequirement already satisfied: smart-open>=1.8.1 in /usr/local/lib/python3.10/dist-packages (from gensim->edu-convokit==0.0.1) (6.4.0)\n","Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->edu-convokit==0.0.1) (1.2.0)\n","Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib->edu-convokit==0.0.1) (0.12.1)\n","Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->edu-convokit==0.0.1) (4.46.0)\n","Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->edu-convokit==0.0.1) (1.4.5)\n","Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->edu-convokit==0.0.1) (23.2)\n","Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->edu-convokit==0.0.1) (9.4.0)\n","Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->edu-convokit==0.0.1) 
(3.1.1)\n","Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.10/dist-packages (from matplotlib->edu-convokit==0.0.1) (2.8.2)\n","Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk->edu-convokit==0.0.1) (8.1.7)\n","Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk->edu-convokit==0.0.1) (1.3.2)\n","Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages (from nltk->edu-convokit==0.0.1) (2023.6.3)\n","Requirement already satisfied: et-xmlfile in /usr/local/lib/python3.10/dist-packages (from openpyxl->edu-convokit==0.0.1) (1.1.0)\n","Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->edu-convokit==0.0.1) (2023.3.post1)\n","Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn->edu-convokit==0.0.1) (3.2.0)\n","Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.11 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (3.0.12)\n","Requirement already satisfied: spacy-loggers<2.0.0,>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (1.0.5)\n","Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (1.0.10)\n","Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (2.0.8)\n","Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (3.0.9)\n","Requirement already satisfied: thinc<8.2.0,>=8.1.8 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (8.1.12)\n","Requirement already satisfied: wasabi<1.2.0,>=0.9.1 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (1.1.2)\n","Requirement already 
satisfied: srsly<3.0.0,>=2.4.3 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (2.4.8)\n","Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (2.0.10)\n","Requirement already satisfied: typer<0.10.0,>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (0.9.0)\n","Requirement already satisfied: pathy>=0.10.0 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (0.10.3)\n","Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (2.31.0)\n","Requirement already satisfied: pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (1.10.13)\n","Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (3.1.2)\n","Requirement already satisfied: setuptools in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (67.7.2)\n","Requirement already satisfied: langcodes<4.0.0,>=3.2.0 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (3.3.0)\n","Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch->edu-convokit==0.0.1) (3.13.1)\n","Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from torch->edu-convokit==0.0.1) (4.5.0)\n","Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch->edu-convokit==0.0.1) (1.12)\n","Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch->edu-convokit==0.0.1) (3.2.1)\n","Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch->edu-convokit==0.0.1) (2023.6.0)\n","Requirement already satisfied: triton==2.1.0 in /usr/local/lib/python3.10/dist-packages (from 
torch->edu-convokit==0.0.1) (2.1.0)\n","Requirement already satisfied: huggingface-hub<1.0,>=0.16.4 in /usr/local/lib/python3.10/dist-packages (from transformers->edu-convokit==0.0.1) (0.19.4)\n","Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers->edu-convokit==0.0.1) (6.0.1)\n","Requirement already satisfied: tokenizers<0.19,>=0.14 in /usr/local/lib/python3.10/dist-packages (from transformers->edu-convokit==0.0.1) (0.15.0)\n","Requirement already satisfied: safetensors>=0.3.1 in /usr/local/lib/python3.10/dist-packages (from transformers->edu-convokit==0.0.1) (0.4.1)\n","Requirement already satisfied: wcwidth<0.3.0,>=0.2.12 in /usr/local/lib/python3.10/dist-packages (from ftfy<7.0,>=6.0->clean-text->edu-convokit==0.0.1) (0.2.12)\n","Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.7->matplotlib->edu-convokit==0.0.1) (1.16.0)\n","Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.13.0->spacy->edu-convokit==0.0.1) (3.3.2)\n","Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.13.0->spacy->edu-convokit==0.0.1) (3.6)\n","Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.13.0->spacy->edu-convokit==0.0.1) (2.0.7)\n","Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.13.0->spacy->edu-convokit==0.0.1) (2023.11.17)\n","Requirement already satisfied: blis<0.8.0,>=0.7.8 in /usr/local/lib/python3.10/dist-packages (from thinc<8.2.0,>=8.1.8->spacy->edu-convokit==0.0.1) (0.7.11)\n","Requirement already satisfied: confection<1.0.0,>=0.0.1 in /usr/local/lib/python3.10/dist-packages (from thinc<8.2.0,>=8.1.8->spacy->edu-convokit==0.0.1) (0.1.4)\n","Requirement already satisfied: MarkupSafe>=2.0 in 
/usr/local/lib/python3.10/dist-packages (from jinja2->spacy->edu-convokit==0.0.1) (2.1.3)\n","Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch->edu-convokit==0.0.1) (1.3.0)\n","Building wheels for collected packages: edu-convokit, docopt, emoji\n"," Building wheel for edu-convokit (setup.py) ... \u001b[?25l\u001b[?25hdone\n"," Created wheel for edu-convokit: filename=edu_convokit-0.0.1-py3-none-any.whl size=24909 sha256=257e42119604caf33f42981c96e1407bdb7733bc8082e870a022def70e9310c2\n"," Stored in directory: /tmp/pip-ephem-wheel-cache-vp1s8qol/wheels/29/43/ec/d2472df0eb2af8f1e7d67d0710a4b3eb93fe983b15f8d7b841\n"," Building wheel for docopt (setup.py) ... \u001b[?25l\u001b[?25hdone\n"," Created wheel for docopt: filename=docopt-0.6.2-py2.py3-none-any.whl size=13706 sha256=f02c574964e73b6f4c06b4c638b3da8b4cda7eac2dfe1604b8aff6f95701bfb3\n"," Stored in directory: /root/.cache/pip/wheels/fc/ab/d4/5da2067ac95b36618c629a5f93f809425700506f72c9732fac\n"," Building wheel for emoji (setup.py) ... 
\u001b[?25l\u001b[?25hdone\n"," Created wheel for emoji: filename=emoji-1.7.0-py3-none-any.whl size=171033 sha256=0a5d69f4ac14f376271ac4011aefc99eb821dcd5b2584b28eefd54bca2b8cfc0\n"," Stored in directory: /root/.cache/pip/wheels/31/8a/8c/315c9e5d7773f74b33d5ed33f075b49c6eaeb7cedbb86e2cf8\n","Successfully built edu-convokit docopt emoji\n","Installing collected packages: emoji, docopt, num2words, ftfy, clean-text, edu-convokit\n","Successfully installed clean-text-0.6.0 docopt-0.6.2 edu-convokit-0.0.1 emoji-1.7.0 ftfy-6.1.3 num2words-0.5.10\n"]}],"source":["!pip install edu-convokit"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":14881,"status":"ok","timestamp":1703933649222,"user":{"displayName":"Rose Wang","userId":"08647070137360066467"},"user_tz":480},"id":"x0Q58WdEW4M7","outputId":"cc67e4d1-0930-46f8-cc6a-ad3c19164f04"},"outputs":[{"name":"stderr","output_type":"stream","text":["WARNING:root:Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.\n","[nltk_data] Downloading package stopwords to /root/nltk_data...\n","[nltk_data] Unzipping corpora/stopwords.zip.\n"]}],"source":["from edu_convokit.preprocessors import TextPreprocessor\n","from edu_convokit.annotation import Annotator\n","from edu_convokit.analyzers import (\n"," QualitativeAnalyzer,\n"," QuantitativeAnalyzer,\n"," LexicalAnalyzer,\n"," TemporalAnalyzer,\n"," GPTConversationAnalyzer\n",")\n","# For helping us load data\n","from edu_convokit import utils\n","\n","import os\n","import tqdm"]},{"cell_type":"markdown","metadata":{"id":"plFdw3csW4M7"},"source":["## 📑 Data\n","\n","Let's download the dataset under `raw_data/`.\n","Note that we're only downloading a subsample of the NCTE dataset for this tutorial; this cuts down the annotation time.\n","If you would like to annotate the entire dataset, feel free to upload the entire dataset to this 
Colab!"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":701,"status":"ok","timestamp":1703933659477,"user":{"displayName":"Rose Wang","userId":"08647070137360066467"},"user_tz":480},"id":"8w8_hOq8W4M7","outputId":"8e8ca0e5-c4a9-4a40-8f6a-9eebe0e10333"},"outputs":[{"name":"stdout","output_type":"stream","text":["--2023-12-30 10:54:18-- https://raw.githubusercontent.com/rosewang2008/edu-convokit/master/data/ncte.zip\n","Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...\n","Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.\n","HTTP request sent, awaiting response... 200 OK\n","Length: 346717 (339K) [application/zip]\n","Saving to: ‘ncte.zip’\n","\n","\rncte.zip 0%[ ] 0 --.-KB/s \rncte.zip 100%[===================>] 338.59K --.-KB/s in 0.03s \n","\n","2023-12-30 10:54:19 (9.57 MB/s) - ‘ncte.zip’ saved [346717/346717]\n","\n"]}],"source":["# We will put the data here:\n","DATA_DIR = \"raw_data\"\n","!mkdir -p $DATA_DIR\n","\n","# We will put the annotated data here:\n","ANNOTATIONS_DIR = \"annotations\"\n","!mkdir -p $ANNOTATIONS_DIR\n","\n","# Download the data\n","!wget \"https://raw.githubusercontent.com/rosewang2008/edu-convokit/master/data/ncte.zip\"\n","\n","# Unzip the data\n","!unzip -n -q ncte.zip -d $DATA_DIR\n","\n","# The data directory is then raw_data/ncte\n","DATA_DIR = \"raw_data/ncte\""]},{"cell_type":"code","execution_count":null,"metadata":{"id":"dATniCmtW4M7"},"outputs":[],"source":["# We'll set the important variables specific to this dataset. 
If you open one of the files, you'll see that the\n","# speaker and text columns are defined as:\n","TEXT_COLUMN = \"text\"\n","SPEAKER_COLUMN = \"speaker\"\n","\n","# We will also define the annotation columns.\n","# For the purposes of this tutorial, we will only be using talktime, student_reasoning, and uptake.\n","TALK_TIME_COLUMN = \"talktime\"\n","STUDENT_REASONING_COLUMN = \"student_reasoning\"\n","UPTAKE_COLUMN = \"uptake\""]},{"cell_type":"markdown","metadata":{"id":"hGYZAQzKW4M7"},"source":["One thing that will be important is knowing how the teacher/tutor and student are represented in the dataset.\n","Let's load some examples and see how they are represented."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":223},"executionInfo":{"elapsed":257,"status":"ok","timestamp":1703933669958,"user":{"displayName":"Rose Wang","userId":"08647070137360066467"},"user_tz":480},"id":"KrBZyToSW4M7","outputId":"764874f7-b558-4251-ec99-f286e27f8349"},"outputs":[{"name":"stdout","output_type":"stream","text":["['teacher' 'multiple students' 'student']\n"]},{"data":{"text/html":["\n","
|  | text | speaker | talktime_words | math_density | uptake | student_reasoning | focusing_questions |
|---|---|---|---|---|---|---|---|
| 0 | Okay. I think it’s working. Alright, so the ... | teacher | 17 | 0 | NaN | NaN | 0.0 |
| 1 | Yes. | multiple students | 1 | 0 | NaN | NaN | NaN |
| 2 | Student M, you don’t have your homework? | teacher | 7 | 0 | NaN | NaN | 0.0 |
| 3 | No. | student | 1 | 0 | NaN | NaN | NaN |
| 4 | Did you hand it in? | teacher | 5 | 0 | NaN | NaN | 0.0 |