{"cells":[{"cell_type":"markdown","metadata":{"id":"k42uhYCNL17R"},"source":["# Tutorial on Text Pre-Processing for Education Language Data\n","\n","Welcome to this tutorial on using [`edu-convokit`](https://github.com/rosewang2008/edu-convokit) for text pre-processing.\n","Text pre-processing is a critical step in handling education language data.\n","- It ensures the data is clean (education data is notoriously messy).\n","- It ensures the data is standardized, ready for annotation and analysis.\n","- It ensures that the students and educators are anonymized; this is important to protect the privacy of individuals involved and allow for safe secondary data analysis.\n","\n","`edu-convokit` is designed to support these purposes.\n","\n","## 📚 Learning Objectives\n","\n","In this tutorial, you will learn how to use `TextPreprocessor` to:\n","\n","- Section Link 🔗: Anonymize your data when you know the names of your students and educators.\n","- Section Link 🔗: Anonymize your data when you do _not_ know the names of your students and educators.\n","- Section Link 🔗: Standardize your data for downstream feature annotation.\n","\n","Without further ado, let's get started!"]},{"cell_type":"markdown","metadata":{"id":"8kPOLmVwL17T"},"source":["## Installation\n","\n","Let's first install `edu-convokit`.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"4YQtoaRaL17U","executionInfo":{"status":"ok","timestamp":1703931918080,"user_tz":480,"elapsed":19709,"user":{"displayName":"Rose Wang","userId":"08647070137360066467"}},"outputId":"c839492b-efb6-4269-c669-6d40b8c24ff7","colab":{"base_uri":"https://localhost:8080/"}},"outputs":[{"output_type":"stream","name":"stdout","text":["Collecting git+https://github.com/rosewang2008/edu-convokit.git\n"," Cloning https://github.com/rosewang2008/edu-convokit.git to /tmp/pip-req-build-580kdce9\n"," Running command git clone --filter=blob:none --quiet https://github.com/rosewang2008/edu-convokit.git 
/tmp/pip-req-build-580kdce9\n"," Resolved https://github.com/rosewang2008/edu-convokit.git to commit 8eb087b51abfa36a7031bf1de4e3dc40d8848186\n"," Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n","Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (4.66.1)\n","Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (1.23.5)\n","Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (1.11.4)\n","Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (3.8.1)\n","Requirement already satisfied: torch in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (2.1.0+cu121)\n","Requirement already satisfied: transformers in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (4.35.2)\n","Requirement already satisfied: clean-text in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (0.6.0)\n","Requirement already satisfied: openpyxl in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (3.1.2)\n","Requirement already satisfied: spacy in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (3.6.1)\n","Requirement already satisfied: gensim in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (4.3.2)\n","Requirement already satisfied: num2words==0.5.10 in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (0.5.10)\n","Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (1.2.2)\n","Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (3.7.1)\n","Requirement already satisfied: seaborn in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (0.12.2)\n","Requirement already satisfied: pandas in 
/usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (2.1.4)\n","Requirement already satisfied: docopt>=0.6.2 in /usr/local/lib/python3.10/dist-packages (from num2words==0.5.10->edu-convokit==0.0.1) (0.6.2)\n","Requirement already satisfied: emoji<2.0.0,>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from clean-text->edu-convokit==0.0.1) (1.7.0)\n","Requirement already satisfied: ftfy<7.0,>=6.0 in /usr/local/lib/python3.10/dist-packages (from clean-text->edu-convokit==0.0.1) (6.1.3)\n","Requirement already satisfied: smart-open>=1.8.1 in /usr/local/lib/python3.10/dist-packages (from gensim->edu-convokit==0.0.1) (6.4.0)\n","Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->edu-convokit==0.0.1) (1.2.0)\n","Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib->edu-convokit==0.0.1) (0.12.1)\n","Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->edu-convokit==0.0.1) (4.46.0)\n","Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->edu-convokit==0.0.1) (1.4.5)\n","Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->edu-convokit==0.0.1) (23.2)\n","Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->edu-convokit==0.0.1) (9.4.0)\n","Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->edu-convokit==0.0.1) (3.1.1)\n","Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.10/dist-packages (from matplotlib->edu-convokit==0.0.1) (2.8.2)\n","Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk->edu-convokit==0.0.1) (8.1.7)\n","Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from 
nltk->edu-convokit==0.0.1) (1.3.2)\n","Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages (from nltk->edu-convokit==0.0.1) (2023.6.3)\n","Requirement already satisfied: et-xmlfile in /usr/local/lib/python3.10/dist-packages (from openpyxl->edu-convokit==0.0.1) (1.1.0)\n","Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->edu-convokit==0.0.1) (2023.3.post1)\n","Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas->edu-convokit==0.0.1) (2023.4)\n","Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn->edu-convokit==0.0.1) (3.2.0)\n","Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.11 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (3.0.12)\n","Requirement already satisfied: spacy-loggers<2.0.0,>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (1.0.5)\n","Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (1.0.10)\n","Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (2.0.8)\n","Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (3.0.9)\n","Requirement already satisfied: thinc<8.2.0,>=8.1.8 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (8.1.12)\n","Requirement already satisfied: wasabi<1.2.0,>=0.9.1 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (1.1.2)\n","Requirement already satisfied: srsly<3.0.0,>=2.4.3 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (2.4.8)\n","Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in /usr/local/lib/python3.10/dist-packages (from 
spacy->edu-convokit==0.0.1) (2.0.10)\n","Requirement already satisfied: typer<0.10.0,>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (0.9.0)\n","Requirement already satisfied: pathy>=0.10.0 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (0.10.3)\n","Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (2.31.0)\n","Requirement already satisfied: pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (1.10.13)\n","Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (3.1.2)\n","Requirement already satisfied: setuptools in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (67.7.2)\n","Requirement already satisfied: langcodes<4.0.0,>=3.2.0 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (3.3.0)\n","Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch->edu-convokit==0.0.1) (3.13.1)\n","Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from torch->edu-convokit==0.0.1) (4.5.0)\n","Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch->edu-convokit==0.0.1) (1.12)\n","Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch->edu-convokit==0.0.1) (3.2.1)\n","Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch->edu-convokit==0.0.1) (2023.6.0)\n","Requirement already satisfied: triton==2.1.0 in /usr/local/lib/python3.10/dist-packages (from torch->edu-convokit==0.0.1) (2.1.0)\n","Requirement already satisfied: huggingface-hub<1.0,>=0.16.4 in /usr/local/lib/python3.10/dist-packages (from transformers->edu-convokit==0.0.1) (0.19.4)\n","Requirement already satisfied: 
pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers->edu-convokit==0.0.1) (6.0.1)\n","Requirement already satisfied: tokenizers<0.19,>=0.14 in /usr/local/lib/python3.10/dist-packages (from transformers->edu-convokit==0.0.1) (0.15.0)\n","Requirement already satisfied: safetensors>=0.3.1 in /usr/local/lib/python3.10/dist-packages (from transformers->edu-convokit==0.0.1) (0.4.1)\n","Requirement already satisfied: wcwidth<0.3.0,>=0.2.12 in /usr/local/lib/python3.10/dist-packages (from ftfy<7.0,>=6.0->clean-text->edu-convokit==0.0.1) (0.2.12)\n","Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.7->matplotlib->edu-convokit==0.0.1) (1.16.0)\n","Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.13.0->spacy->edu-convokit==0.0.1) (3.3.2)\n","Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.13.0->spacy->edu-convokit==0.0.1) (3.6)\n","Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.13.0->spacy->edu-convokit==0.0.1) (2.0.7)\n","Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.13.0->spacy->edu-convokit==0.0.1) (2023.11.17)\n","Requirement already satisfied: blis<0.8.0,>=0.7.8 in /usr/local/lib/python3.10/dist-packages (from thinc<8.2.0,>=8.1.8->spacy->edu-convokit==0.0.1) (0.7.11)\n","Requirement already satisfied: confection<1.0.0,>=0.0.1 in /usr/local/lib/python3.10/dist-packages (from thinc<8.2.0,>=8.1.8->spacy->edu-convokit==0.0.1) (0.1.4)\n","Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->spacy->edu-convokit==0.0.1) (2.1.3)\n","Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch->edu-convokit==0.0.1) 
(1.3.0)\n"]}],"source":["!pip install git+https://github.com/rosewang2008/edu-convokit.git"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"aYeytk12L17V"},"outputs":[],"source":["from edu_convokit.preprocessors import TextPreprocessor\n","\n","# For helping us flexibly load data\n","from edu_convokit import utils"]},{"cell_type":"markdown","metadata":{"id":"LcKP8AqWL17V"},"source":["## 📑 Data\n","\n","Let's load the data we'll be working with. We're going to be using a transcript from the [TalkMoves dataset](https://github.com/SumnerLab/TalkMoves)."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"U2BkkCwFL17V","outputId":"593dc399-3654-41bf-eb95-d8427eab3752","colab":{"base_uri":"https://localhost:8080/","height":554},"executionInfo":{"status":"ok","timestamp":1703931933263,"user_tz":480,"elapsed":490,"user":{"displayName":"Rose Wang","userId":"08647070137360066467"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["--2023-12-30 10:25:32-- https://raw.githubusercontent.com/rosewang2008/edu-convokit/master/data/talkmoves/Boats%20and%20Fish%202_Grade%204.xlsx\n","Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...\n","Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.\n","HTTP request sent, awaiting response... 
200 OK\n","Length: 10528 (10K) [application/octet-stream]\n","Saving to: ‘Boats and Fish 2_Grade 4.xlsx.3’\n","\n","\r Boats and 0%[ ] 0 --.-KB/s \rBoats and Fish 2_Gr 100%[===================>] 10.28K --.-KB/s in 0s \n","\n","2023-12-30 10:25:32 (61.8 MB/s) - ‘Boats and Fish 2_Grade 4.xlsx.3’ saved [10528/10528]\n","\n"]},{"output_type":"execute_result","data":{"text/plain":[" Unnamed: 0 TimeStamp Turn Speaker \\\n","25 25 NaN 14.0 David \n","26 26 NaN 14.0 David \n","27 27 NaN 15.0 T \n","28 28 NaN 15.0 T \n","29 29 NaN 16.0 Beth \n","30 30 NaN 17.0 David \n","31 31 NaN NaN T \n","32 32 NaN 17.0 David \n","33 33 NaN 17.0 Meredith and David \n","34 34 NaN 18.0 David \n","\n"," Sentence \\\n","25 Yeah, I know, and put ‘em up to there, and tha... \n","26 Hey, wait a minute, hey wait, maybe that’s it,... \n","27 Now take six of the ones \n","28 Which is bigger? \n","29 One half \n","30 I think one half is... \n","31 Yes, David and Meredith? \n","32 What do you have? \n","33 Well \n","34 we think \n","\n"," Teacher Tag Student Tag \n","25 NaN 4 - Making a Claim \n","26 NaN 4 - Making a Claim \n","27 1 - None NaN \n","28 8 - Press for Accuracy NaN \n","29 NaN 4 - Making a Claim \n","30 NaN 2 - Relating to Another Student \n","31 2 - Keeping Everyone Together NaN \n","32 NaN 2 - Relating to Another Student \n","33 NaN 1 - None \n","34 NaN 1 - None "],"text/html":["\n","
\n"]},"metadata":{},"execution_count":3}],"source":["!wget \"https://raw.githubusercontent.com/rosewang2008/edu-convokit/master/data/talkmoves/Boats and Fish 2_Grade 4.xlsx\"\n","\n","data_fname = \"Boats and Fish 2_Grade 4.xlsx\"\n","df = utils.load_data(data_fname) # Handles loading data from different file types including: .csv, .xlsx, .json\n","\n","# Show these rows because they contain names in the speaker and text columns.\n","df.iloc[25:35]"]},{"cell_type":"markdown","metadata":{"id":"mKpDbA76L17W"},"source":["### Some things to observe about the data...\n","\n","💡 Note: `edu-convokit` cares about two key columns: a column for the speaker and a column for the text.\n","- In the TalkMoves dataset, the speaker is in the `Speaker` column and the text is in the `Sentence` column. We store these column names in two variables because they are used throughout the tutorial.\n","\n","💡 Note: Names occur in both the speaker and text columns.\n","- e.g., names like David and Meredith appear in both columns.\n","- The teacher is always shortened to \"T\" in the speaker column.\n","\n","💡 Note: The utterances from the same speaker are not always grouped together.\n","- We'll fix this in the section on standardizing the data for downstream annotation and analysis."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"XXWRYTXJL17W"},"outputs":[],"source":["# Creating variables for the columns we want to use\n","TEXT_COLUMN = \"Sentence\"\n","SPEAKER_COLUMN = \"Speaker\"\n"]},{"cell_type":"markdown","metadata":{"id":"odHyvBDKL17W"},"source":["## 📝 Anonymizing Data with Known Names\n","\n","We will now anonymize the data when we know the names of the students and educators in the dataset.\n","In our experience, this is the most common scenario in education language data.\n","For example, the names may come from a roster or a list of students in a class, or be officially 
recorded in a database.\n","\n","To do this, we need to create a list of names that we want to anonymize, and a list of replacement names that we want to use to replace the names in the dataset.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"PVyIHOcdL17W","outputId":"af399ea4-e28c-4f22-b82b-60d43df552b5","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1703931933558,"user_tz":480,"elapsed":4,"user":{"displayName":"Rose Wang","userId":"08647070137360066467"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["['T' 'David' 'Meredith' 'Beth' 'Meredith and David' 'T 2']\n"]}],"source":["# Show the names of the speakers. In your use case, you might load this from a file or database.\n","print(df[SPEAKER_COLUMN].unique())"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"-eSSt5ggL17X","outputId":"bd8ae703-252f-413b-ebc9-99d74c2bc31a","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1703931933558,"user_tz":480,"elapsed":3,"user":{"displayName":"Rose Wang","userId":"08647070137360066467"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["['[STUDENT_0]', '[STUDENT_1]', '[STUDENT_2]']\n"]}],"source":["# Create list of names and replacement names. 
We will make the replacement names unique so that we can easily find them later.\n","known_names = [\"David\", \"Meredith\", \"Beth\"]\n","known_replacement_names = [f\"[STUDENT_{i}]\" for i in range(len(known_names))]\n","print(known_replacement_names)"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"cwB9NqieL17X"},"outputs":[],"source":["# Now let's anonymize the names in the text!\n","processor = TextPreprocessor()\n","df = processor.anonymize_known_names(\n"," df=df,\n"," text_column=TEXT_COLUMN,\n"," names=known_names,\n"," replacement_names=known_replacement_names,\n"," # We will directly replace the names in the text column.\n"," # If you want to keep the original text, you can set `target_text_column` to a new column name.\n"," target_text_column=TEXT_COLUMN\n",")"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"G5c836CqL17Y","outputId":"09aacde0-d8da-4838-a6f6-1e17d6d49a71","colab":{"base_uri":"https://localhost:8080/","height":363},"executionInfo":{"status":"ok","timestamp":1703931933744,"user_tz":480,"elapsed":188,"user":{"displayName":"Rose Wang","userId":"08647070137360066467"}}},"outputs":[{"output_type":"execute_result","data":{"text/plain":[" Unnamed: 0 TimeStamp Turn Speaker \\\n","25 25 NaN 14.0 David \n","26 26 NaN 14.0 David \n","27 27 NaN 15.0 T \n","28 28 NaN 15.0 T \n","29 29 NaN 16.0 Beth \n","30 30 NaN 17.0 David \n","31 31 NaN NaN T \n","32 32 NaN 17.0 David \n","33 33 NaN 17.0 Meredith and David \n","34 34 NaN 18.0 David \n","\n"," Sentence \\\n","25 Yeah, I know, and put ‘em up to there, and tha... \n","26 Hey, wait a minute, hey wait, maybe that’s it,... \n","27 Now take six of the ones \n","28 Which is bigger? \n","29 One half \n","30 I think one half is... \n","31 Yes, [STUDENT_0] and [STUDENT_1]? \n","32 What do you have? 
\n","33 Well \n","34 we think \n","\n"," Teacher Tag Student Tag \n","25 NaN 4 - Making a Claim \n","26 NaN 4 - Making a Claim \n","27 1 - None NaN \n","28 8 - Press for Accuracy NaN \n","29 NaN 4 - Making a Claim \n","30 NaN 2 - Relating to Another Student \n","31 2 - Keeping Everyone Together NaN \n","32 NaN 2 - Relating to Another Student \n","33 NaN 1 - None \n","34 NaN 1 - None "],"text/html":["\n","
\n"]},"metadata":{},"execution_count":8}],"source":["# Let's see what the anonymized text looks like!\n","df.iloc[25:35]"]},{"cell_type":"markdown","metadata":{"id":"UpPLDTchL17Y"},"source":["💡 Note: Nice, we can see that the text has been anonymized (e.g., line 31)!\n","\n","However, the speaker names have not been anonymized. Let's fix that."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"ieT9faoxL17Y","outputId":"72d2fac9-6a31-4892-e10f-60293db4c9d5","colab":{"base_uri":"https://localhost:8080/","height":363},"executionInfo":{"status":"ok","timestamp":1703931933744,"user_tz":480,"elapsed":8,"user":{"displayName":"Rose Wang","userId":"08647070137360066467"}}},"outputs":[{"output_type":"execute_result","data":{"text/plain":[" Unnamed: 0 TimeStamp Turn Speaker \\\n","25 25 NaN 14.0 [STUDENT_0] \n","26 26 NaN 14.0 [STUDENT_0] \n","27 27 NaN 15.0 T \n","28 28 NaN 15.0 T \n","29 29 NaN 16.0 [STUDENT_2] \n","30 30 NaN 17.0 [STUDENT_0] \n","31 31 NaN NaN T \n","32 32 NaN 17.0 [STUDENT_0] \n","33 33 NaN 17.0 [STUDENT_1] and [STUDENT_0] \n","34 34 NaN 18.0 [STUDENT_0] \n","\n"," Sentence \\\n","25 Yeah, I know, and put ‘em up to there, and tha... \n","26 Hey, wait a minute, hey wait, maybe that’s it,... \n","27 Now take six of the ones \n","28 Which is bigger? \n","29 One half \n","30 I think one half is... \n","31 Yes, [STUDENT_0] and [STUDENT_1]? \n","32 What do you have? \n","33 Well \n","34 we think \n","\n"," Teacher Tag Student Tag \n","25 NaN 4 - Making a Claim \n","26 NaN 4 - Making a Claim \n","27 1 - None NaN \n","28 8 - Press for Accuracy NaN \n","29 NaN 4 - Making a Claim \n","30 NaN 2 - Relating to Another Student \n","31 2 - Keeping Everyone Together NaN \n","32 NaN 2 - Relating to Another Student \n","33 NaN 1 - None \n","34 NaN 1 - None "],"text/html":["\n","
\n"]},"metadata":{},"execution_count":9}],"source":["df = processor.anonymize_known_names(\n","    df=df,\n","    text_column=SPEAKER_COLUMN,\n","    names=known_names,\n","    replacement_names=known_replacement_names,\n","    target_text_column=SPEAKER_COLUMN\n",")\n","\n","df.iloc[25:35]"]},{"cell_type":"markdown","metadata":{"id":"OlMZqjFdL17Y"},"source":["🎉 Great, now we have anonymized the speaker names as well! Two more things worth noting:\n","- We have a record of the original names and their replacements (`known_names` and `known_replacement_names`), so we can map back to the original names if needed.\n","- The anonymized names are consistent: `[STUDENT_0]` in `SPEAKER_COLUMN` refers to the same person as `[STUDENT_0]` in `TEXT_COLUMN`.\n","\n","\n","This concludes the section on anonymizing data with known names.\n","The next section will cover anonymizing data when you do *not* know the names of the students and educators in your dataset."]},{"cell_type":"markdown","metadata":{"id":"UPS9UK-YL17Y"},"source":["## 📝 Anonymizing Data with Unknown Names\n","\n","We will now anonymize the data when we **do not know** the names of the students and educators in the dataset.\n","Note that the anonymization will be imperfect: we do not know the names in advance, and identifying names consistently is a hard task (cf. 
[named entity recognition](https://en.wikipedia.org/wiki/Named-entity_recognition))---so use this with caution!\n","We will show some of these failure modes in the tutorial."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"wDoDXMC1L17Y","outputId":"34f6a952-9aaa-41f2-c1cd-d21c8e7aab35","colab":{"base_uri":"https://localhost:8080/","height":363},"executionInfo":{"status":"ok","timestamp":1703931933744,"user_tz":480,"elapsed":6,"user":{"displayName":"Rose Wang","userId":"08647070137360066467"}}},"outputs":[{"output_type":"execute_result","data":{"text/plain":[" Unnamed: 0 TimeStamp Turn Speaker \\\n","25 25 NaN 14.0 David \n","26 26 NaN 14.0 David \n","27 27 NaN 15.0 T \n","28 28 NaN 15.0 T \n","29 29 NaN 16.0 Beth \n","30 30 NaN 17.0 David \n","31 31 NaN NaN T \n","32 32 NaN 17.0 David \n","33 33 NaN 17.0 Meredith and David \n","34 34 NaN 18.0 David \n","\n"," Sentence \\\n","25 Yeah, I know, and put ‘em up to there, and tha... \n","26 Hey, wait a minute, hey wait, maybe that’s it,... \n","27 Now take six of the ones \n","28 Which is bigger? \n","29 One half \n","30 I think one half is... \n","31 Yes, David and Meredith? \n","32 What do you have? \n","33 Well \n","34 we think \n","\n"," Teacher Tag Student Tag \n","25 NaN 4 - Making a Claim \n","26 NaN 4 - Making a Claim \n","27 1 - None NaN \n","28 8 - Press for Accuracy NaN \n","29 NaN 4 - Making a Claim \n","30 NaN 2 - Relating to Another Student \n","31 2 - Keeping Everyone Together NaN \n","32 NaN 2 - Relating to Another Student \n","33 NaN 1 - None \n","34 NaN 1 - None "],"text/html":["\n","
\n"]},"metadata":{},"execution_count":10}],"source":["# Let's start fresh with the original data\n","df = utils.load_data(data_fname)\n","df.iloc[25:35]"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"UZssHkshL17Y","outputId":"e9119804-46ec-41fc-9b24-351e9f80fbd1","colab":{"base_uri":"https://localhost:8080/","height":397},"executionInfo":{"status":"ok","timestamp":1703931938852,"user_tz":480,"elapsed":5114,"user":{"displayName":"Rose Wang","userId":"08647070137360066467"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["Names: ['Beth', 'David']\n","Replacement names: ['[PERSON0]', '[PERSON1]']\n"]},{"output_type":"execute_result","data":{"text/plain":[" Unnamed: 0 TimeStamp Turn Speaker \\\n","25 25 NaN 14.0 [PERSON1] \n","26 26 NaN 14.0 [PERSON1] \n","27 27 NaN 15.0 T \n","28 28 NaN 15.0 T \n","29 29 NaN 16.0 [PERSON0] \n","30 30 NaN 17.0 [PERSON1] \n","31 31 NaN NaN T \n","32 32 NaN 17.0 [PERSON1] \n","33 33 NaN 17.0 Meredith and [PERSON1] \n","34 34 NaN 18.0 [PERSON1] \n","\n"," Sentence \\\n","25 Yeah, I know, and put ‘em up to there, and tha... \n","26 Hey, wait a minute, hey wait, maybe that’s it,... \n","27 Now take six of the ones \n","28 Which is bigger? \n","29 One half \n","30 I think one half is... \n","31 Yes, David and Meredith? \n","32 What do you have? \n","33 Well \n","34 we think \n","\n"," Teacher Tag Student Tag \n","25 NaN 4 - Making a Claim \n","26 NaN 4 - Making a Claim \n","27 1 - None NaN \n","28 8 - Press for Accuracy NaN \n","29 NaN 4 - Making a Claim \n","30 NaN 2 - Relating to Another Student \n","31 2 - Keeping Everyone Together NaN \n","32 NaN 2 - Relating to Another Student \n","33 NaN 1 - None \n","34 NaN 1 - None "],"text/html":["\n","
\n"]},"metadata":{},"execution_count":11}],"source":["processor = TextPreprocessor()\n","df, (names, replacement_names) = processor.anonymize_unknown_names(\n"," df=df,\n"," text_column=SPEAKER_COLUMN,\n"," target_text_column=SPEAKER_COLUMN,\n"," # Will return the names and replacement names that were used.\n"," return_names=True\n",")\n","\n","print(f\"Names: {names}\")\n","print(f\"Replacement names: {replacement_names}\")\n","df.iloc[25:35]\n"]},{"cell_type":"markdown","metadata":{"id":"jCr5oWbVL17Y"},"source":["💡 Note: Observe that the name \"Meredith\" has not been anonymized.\n","`anonymize_unknown_names` currently uses spaCy's named entity recognition model to identify names. This is an imperfect model and will not identify all names, as we can see here.\n","\n","There are ways we can improve this. For example:\n","- We can manually add \"Meredith\" to the list of names to anonymize and run `anonymize_known_names` again.\n","- We can cross-reference names from the [SSA database](https://www.ssa.gov/oact/babynames/limits.html) to identify names that are not identified by the model. However, this will lead to a high false positive rate, i.e., words that are not actually names will be identified as names."]},{"cell_type":"markdown","metadata":{"id":"aK_qUAPgL17Y"},"source":["To complete the anonymization process, we will use the `names` and `replacement_names` returned from `anonymize_unknown_names` to anonymize the text. 
This makes the anonymization consistent between the speaker and text columns."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"N3j3E-reL17Z","outputId":"696dabc1-8af3-4b88-b958-c0a021ac3d0c","colab":{"base_uri":"https://localhost:8080/","height":363},"executionInfo":{"status":"ok","timestamp":1703931939103,"user_tz":480,"elapsed":256,"user":{"displayName":"Rose Wang","userId":"08647070137360066467"}}},"outputs":[{"output_type":"execute_result","data":{"text/plain":[" Unnamed: 0 TimeStamp Turn Speaker \\\n","25 25 NaN 14.0 [PERSON1] \n","26 26 NaN 14.0 [PERSON1] \n","27 27 NaN 15.0 T \n","28 28 NaN 15.0 T \n","29 29 NaN 16.0 [PERSON0] \n","30 30 NaN 17.0 [PERSON1] \n","31 31 NaN NaN T \n","32 32 NaN 17.0 [PERSON1] \n","33 33 NaN 17.0 Meredith and [PERSON1] \n","34 34 NaN 18.0 [PERSON1] \n","\n"," Sentence \\\n","25 Yeah, I know, and put ‘em up to there, and tha... \n","26 Hey, wait a minute, hey wait, maybe that’s it,... \n","27 Now take six of the ones \n","28 Which is bigger? \n","29 One half \n","30 I think one half is... \n","31 Yes, [PERSON1] and Meredith? \n","32 What do you have? \n","33 Well \n","34 we think \n","\n"," Teacher Tag Student Tag \n","25 NaN 4 - Making a Claim \n","26 NaN 4 - Making a Claim \n","27 1 - None NaN \n","28 8 - Press for Accuracy NaN \n","29 NaN 4 - Making a Claim \n","30 NaN 2 - Relating to Another Student \n","31 2 - Keeping Everyone Together NaN \n","32 NaN 2 - Relating to Another Student \n","33 NaN 1 - None \n","34 NaN 1 - None "],"text/html":["\n","
\n"]},"metadata":{},"execution_count":12}],"source":["df = processor.anonymize_known_names(\n"," df=df,\n"," text_column=TEXT_COLUMN,\n"," target_text_column=TEXT_COLUMN,\n"," names=names,\n"," replacement_names=replacement_names\n",")\n","\n","# David is anonymized but Meredith is not (cf. row 31).\n","df.iloc[25:35]"]},{"cell_type":"markdown","metadata":{"id":"w_famd8QL17Z"},"source":["## 📝 Standardizing Data for Downstream Annotation and Analysis\n","\n","We will now standardize the data for downstream annotation and analysis.\n","One common standardization is to group the utterances from the same speaker together.\n","We will show how you can do this on the anonymized data.\n","\n","For other standardizations, please refer to [`edu-convokit`'s documentation](TODO), or feel free to open a feature request or pull request on our [GitHub](https://github.com/rosewang2008/edu-convokit)."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"HzkWhG82L17Z","outputId":"312992d3-8fd2-458a-e544-973c40d7fa74","colab":{"base_uri":"https://localhost:8080/","height":363},"executionInfo":{"status":"ok","timestamp":1703931945572,"user_tz":480,"elapsed":6473,"user":{"displayName":"Rose Wang","userId":"08647070137360066467"}}},"outputs":[{"output_type":"execute_result","data":{"text/plain":[" Unnamed: 0 TimeStamp Turn Speaker \\\n","25 25 NaN 14.0 [PERSON1] \n","26 26 NaN 14.0 [PERSON1] \n","27 27 NaN 15.0 T \n","28 28 NaN 15.0 T \n","29 29 NaN 16.0 [PERSON0] \n","30 30 NaN 17.0 [PERSON1] \n","31 31 NaN NaN T \n","32 32 NaN 17.0 [PERSON1] \n","33 33 NaN 17.0 Meredith and [PERSON1] \n","34 34 NaN 18.0 [PERSON1] \n","\n"," Sentence \\\n","25 Yeah, I know, and put ‘em up to there, and tha... \n","26 Hey, wait a minute, hey wait, maybe that’s it,... \n","27 Now take six of the ones \n","28 Which is bigger? \n","29 One half \n","30 I think one half is... \n","31 Yes, [STUDENT_0] and [STUDENT_1]? \n","32 What do you have? 
\n","33 Well \n","34 we think \n","\n"," Teacher Tag Student Tag \n","25 NaN 4 - Making a Claim \n","26 NaN 4 - Making a Claim \n","27 1 - None NaN \n","28 8 - Press for Accuracy NaN \n","29 NaN 4 - Making a Claim \n","30 NaN 2 - Relating to Another Student \n","31 2 - Keeping Everyone Together NaN \n","32 NaN 2 - Relating to Another Student \n","33 NaN 1 - None \n","34 NaN 1 - None "],"text/html":["\n","
\n"]},"metadata":{},"execution_count":13}],"source":["# First let's start fresh with the original data & anonymize it like we did before.\n","df = utils.load_data(data_fname)\n","processor = TextPreprocessor()\n","\n","# Anonymize text\n","df = processor.anonymize_known_names(\n"," df=df,\n"," text_column=TEXT_COLUMN,\n"," names=known_names,\n"," replacement_names=known_replacement_names,\n"," target_text_column=TEXT_COLUMN\n",")\n","\n","# Anonymize speakers\n","df, (names, replacement_names) = processor.anonymize_unknown_names(\n"," df=df,\n"," text_column=SPEAKER_COLUMN,\n"," target_text_column=SPEAKER_COLUMN,\n"," return_names=True\n",")\n","\n","df.iloc[25:35]"]},{"cell_type":"markdown","metadata":{"id":"mYgh27y9L17Z"},"source":["Now we'll group utterances from the same speaker together."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"OtQ1KMp6L17Z","outputId":"fdebfb34-f99a-4a04-8a4f-fa69b5426607","colab":{"base_uri":"https://localhost:8080/","height":363},"executionInfo":{"status":"ok","timestamp":1703931945572,"user_tz":480,"elapsed":4,"user":{"displayName":"Rose Wang","userId":"08647070137360066467"}}},"outputs":[{"output_type":"execute_result","data":{"text/plain":[" Sentence Speaker\n","25 dark green [PERSON1]\n","26 If you put it up to a whole Meredith\n","27 I’m sorry, what’s the number name for dark green T\n","28 One Meredith and [PERSON1]\n","29 Ok. T\n","30 And you put six ones up to the dark green Meredith\n","31 Hold on, I’m a little confused. Tell me again.... T\n","32 One sixth Meredith\n","33 One sixth. T\n","34 And then these, this would be [PERSON1]"],"text/html":["\n","
\n"]},"metadata":{},"execution_count":14}],"source":["df = processor.merge_utterances_from_same_speaker(\n"," df=df,\n"," text_column=TEXT_COLUMN,\n"," speaker_column=SPEAKER_COLUMN,\n"," # We're going to directly replace the text in the text column.\n"," target_text_column=TEXT_COLUMN\n",")\n","\n","df.iloc[25:35]"]},{"cell_type":"markdown","metadata":{"id":"YoiLqghQL17Z"},"source":["We can see that the utterances from the same speaker are now grouped together!"]},{"cell_type":"markdown","metadata":{"id":"LNl0hUUTL17Z"},"source":["## 📝 Conclusion and Where to Go From Here\n","\n","In this tutorial, we learned how to use `TextPreprocessor` to:\n","1. Anonymize your data when you know the names of your students and educators.\n","2. Anonymize your data when you do _not_ know the names of your students and educators.\n","3. Standardize your data for downstream feature annotation.\n","\n","The next natural step is to annotate your data with features of interest. Here are some resources to get you started:\n","- [`edu-convokit`'s documentation on `Annotator`](https://edu-convokit.readthedocs.io/en/latest/annotation.html)\n","- [`edu-convokit`'s tutorial on `Annotator`](https://colab.research.google.com/drive/1rBwEctFtmQowZHxralH2OGT5uV0zRIQw)\n","\n","\n","If you have any questions, please feel free to reach out to us on [`edu-convokit`'s GitHub](https://github.com/rosewang2008/edu-convokit).\n","\n","👋 Happy exploring your data with `edu-convokit`!"]},{"cell_type":"code","source":[],"metadata":{"id":"Xd9Gq3A0qKvI"},"execution_count":null,"outputs":[]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.11.5"},"colab":{"provenance":[]}},"nbformat":4,"nbformat_minor":0}