Tutorial on Text Pre-Processing for Education Language Data

Welcome to this tutorial on using `edu-convokit <https://github.com/rosewang2008/edu-convokit>`__ for text pre-processing. Text pre-processing is a critical step in handling education language data:

- It ensures the data is clean (education data is notoriously messy).
- It ensures the data is standardized and ready for annotation and analysis.
- It ensures that students and educators are anonymized, which protects the privacy of the individuals involved and allows for safe secondary data analysis.

edu-convokit is designed to support these purposes.

📚 Learning Objectives

In this tutorial, you will learn how to use TextPreprocessor to:

Section Link 🔗: Anonymize your data when you know the names of your students and educators.
Section Link 🔗: Anonymize your data when you do not know the names of your students and educators.
Section Link 🔗: Standardize your data for downstream feature annotation.

Without further ado, let’s get started!

Installation

Let’s first install edu-convokit.

[ ]:

!pip install git+https://github.com/rosewang2008/edu-convokit.git

Collecting git+https://github.com/rosewang2008/edu-convokit.git
  Cloning https://github.com/rosewang2008/edu-convokit.git to /tmp/pip-req-build-580kdce9
  Running command git clone --filter=blob:none --quiet https://github.com/rosewang2008/edu-convokit.git /tmp/pip-req-build-580kdce9
  Resolved https://github.com/rosewang2008/edu-convokit.git to commit 8eb087b51abfa36a7031bf1de4e3dc40d8848186
  Preparing metadata (setup.py) ... done
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (4.66.1)
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (1.23.5)
Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (1.11.4)
Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (3.8.1)
Requirement already satisfied: torch in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (2.1.0+cu121)
Requirement already satisfied: transformers in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (4.35.2)
Requirement already satisfied: clean-text in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (0.6.0)
Requirement already satisfied: openpyxl in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (3.1.2)
Requirement already satisfied: spacy in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (3.6.1)
Requirement already satisfied: gensim in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (4.3.2)
Requirement already satisfied: num2words==0.5.10 in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (0.5.10)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (1.2.2)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (3.7.1)
Requirement already satisfied: seaborn in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (0.12.2)
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from edu-convokit==0.0.1) (2.1.4)
Requirement already satisfied: docopt>=0.6.2 in /usr/local/lib/python3.10/dist-packages (from num2words==0.5.10->edu-convokit==0.0.1) (0.6.2)
Requirement already satisfied: emoji<2.0.0,>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from clean-text->edu-convokit==0.0.1) (1.7.0)
Requirement already satisfied: ftfy<7.0,>=6.0 in /usr/local/lib/python3.10/dist-packages (from clean-text->edu-convokit==0.0.1) (6.1.3)
Requirement already satisfied: smart-open>=1.8.1 in /usr/local/lib/python3.10/dist-packages (from gensim->edu-convokit==0.0.1) (6.4.0)
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->edu-convokit==0.0.1) (1.2.0)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib->edu-convokit==0.0.1) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->edu-convokit==0.0.1) (4.46.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->edu-convokit==0.0.1) (1.4.5)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->edu-convokit==0.0.1) (23.2)
Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->edu-convokit==0.0.1) (9.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->edu-convokit==0.0.1) (3.1.1)
Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.10/dist-packages (from matplotlib->edu-convokit==0.0.1) (2.8.2)
Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk->edu-convokit==0.0.1) (8.1.7)
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk->edu-convokit==0.0.1) (1.3.2)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages (from nltk->edu-convokit==0.0.1) (2023.6.3)
Requirement already satisfied: et-xmlfile in /usr/local/lib/python3.10/dist-packages (from openpyxl->edu-convokit==0.0.1) (1.1.0)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->edu-convokit==0.0.1) (2023.3.post1)
Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas->edu-convokit==0.0.1) (2023.4)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn->edu-convokit==0.0.1) (3.2.0)
Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.11 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (3.0.12)
Requirement already satisfied: spacy-loggers<2.0.0,>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (1.0.5)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (1.0.10)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (2.0.8)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (3.0.9)
Requirement already satisfied: thinc<8.2.0,>=8.1.8 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (8.1.12)
Requirement already satisfied: wasabi<1.2.0,>=0.9.1 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (1.1.2)
Requirement already satisfied: srsly<3.0.0,>=2.4.3 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (2.4.8)
Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (2.0.10)
Requirement already satisfied: typer<0.10.0,>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (0.9.0)
Requirement already satisfied: pathy>=0.10.0 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (0.10.3)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (2.31.0)
Requirement already satisfied: pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (1.10.13)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (3.1.2)
Requirement already satisfied: setuptools in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (67.7.2)
Requirement already satisfied: langcodes<4.0.0,>=3.2.0 in /usr/local/lib/python3.10/dist-packages (from spacy->edu-convokit==0.0.1) (3.3.0)
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch->edu-convokit==0.0.1) (3.13.1)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from torch->edu-convokit==0.0.1) (4.5.0)
Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch->edu-convokit==0.0.1) (1.12)
Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch->edu-convokit==0.0.1) (3.2.1)
Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch->edu-convokit==0.0.1) (2023.6.0)
Requirement already satisfied: triton==2.1.0 in /usr/local/lib/python3.10/dist-packages (from torch->edu-convokit==0.0.1) (2.1.0)
Requirement already satisfied: huggingface-hub<1.0,>=0.16.4 in /usr/local/lib/python3.10/dist-packages (from transformers->edu-convokit==0.0.1) (0.19.4)
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers->edu-convokit==0.0.1) (6.0.1)
Requirement already satisfied: tokenizers<0.19,>=0.14 in /usr/local/lib/python3.10/dist-packages (from transformers->edu-convokit==0.0.1) (0.15.0)
Requirement already satisfied: safetensors>=0.3.1 in /usr/local/lib/python3.10/dist-packages (from transformers->edu-convokit==0.0.1) (0.4.1)
Requirement already satisfied: wcwidth<0.3.0,>=0.2.12 in /usr/local/lib/python3.10/dist-packages (from ftfy<7.0,>=6.0->clean-text->edu-convokit==0.0.1) (0.2.12)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.7->matplotlib->edu-convokit==0.0.1) (1.16.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.13.0->spacy->edu-convokit==0.0.1) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.13.0->spacy->edu-convokit==0.0.1) (3.6)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.13.0->spacy->edu-convokit==0.0.1) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0,>=2.13.0->spacy->edu-convokit==0.0.1) (2023.11.17)
Requirement already satisfied: blis<0.8.0,>=0.7.8 in /usr/local/lib/python3.10/dist-packages (from thinc<8.2.0,>=8.1.8->spacy->edu-convokit==0.0.1) (0.7.11)
Requirement already satisfied: confection<1.0.0,>=0.0.1 in /usr/local/lib/python3.10/dist-packages (from thinc<8.2.0,>=8.1.8->spacy->edu-convokit==0.0.1) (0.1.4)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->spacy->edu-convokit==0.0.1) (2.1.3)
Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch->edu-convokit==0.0.1) (1.3.0)

[ ]:

from edu_convokit.preprocessors import TextPreprocessor

# For helping us flexibly load data
from edu_convokit import utils

📑 Data

Let’s load the data we’ll be working with. We’re going to be using a transcript from the TalkMoves dataset.

[ ]:

!wget "https://raw.githubusercontent.com/rosewang2008/edu-convokit/master/data/talkmoves/Boats and Fish 2_Grade 4.xlsx"

data_fname = "Boats and Fish 2_Grade 4.xlsx"
df = utils.load_data(data_fname) # Handles loading data from different file types including: .csv, .xlsx, .json

# Show rows 25-34 because they contain names in both the speaker and text columns.
df.iloc[25:35]

--2023-12-30 10:25:32--  https://raw.githubusercontent.com/rosewang2008/edu-convokit/master/data/talkmoves/Boats%20and%20Fish%202_Grade%204.xlsx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10528 (10K) [application/octet-stream]
Saving to: ‘Boats and Fish 2_Grade 4.xlsx.3’


  Boats and   0%[                    ]       0  --.-KB/s

Boats and Fish 2_Gr 100%[===================>] 10.28K --.-KB/s in 0s

2023-12-30 10:25:32 (61.8 MB/s) - ‘Boats and Fish 2_Grade 4.xlsx.3’ saved [10528/10528]

	Unnamed: 0	TimeStamp	Turn	Speaker	Sentence	Teacher Tag	Student Tag
25	25	NaN	14.0	David	Yeah, I know, and put ‘em up to there, and tha...	NaN	4 - Making a Claim
26	26	NaN	14.0	David	Hey, wait a minute, hey wait, maybe that’s it,...	NaN	4 - Making a Claim
27	27	NaN	15.0	T	Now take six of the ones	1 - None	NaN
28	28	NaN	15.0	T	Which is bigger?	8 - Press for Accuracy	NaN
29	29	NaN	16.0	Beth	One half	NaN	4 - Making a Claim
30	30	NaN	17.0	David	I think one half is...	NaN	2 - Relating to Another Student
31	31	NaN	NaN	T	Yes, David and Meredith?	2 - Keeping Everyone Together	NaN
32	32	NaN	17.0	David	What do you have?	NaN	2 - Relating to Another Student
33	33	NaN	17.0	Meredith and David	Well	NaN	1 - None
34	34	NaN	18.0	David	we think	NaN	1 - None

Some things to observe about the data…

💡 Note: edu-convokit cares about two key columns: one for the speaker and one for the text. In the TalkMoves dataset, the speaker is in the Speaker column and the text is in the Sentence column. We will store these column names in two variables because they are used throughout the tutorial.

💡 Note: Names such as David and Meredith appear in both the speaker and text columns. The teacher is always shortened to “T” in the speaker column.

💡 Note: Consecutive utterances from the same speaker are often split across separate rows (e.g., rows 25 and 26). We’ll fix this in the section on standardizing the data for downstream annotation and analysis.

[ ]:

# Creating variables for the columns we want to use
TEXT_COLUMN = "Sentence"
SPEAKER_COLUMN = "Speaker"

📝 Anonymizing Data with Known Names

We will now anonymize the data when we know the names of the students and educators in the dataset. In our experience, this is the most common scenario in education language data: the names come from a class roster or are officially recorded in a database.

To do this, we need to create a list of names that we want to anonymize, and a list of replacement names that we want to use to replace the names in the dataset.

[ ]:

# Show the names of the speakers. In your use case, you might load this from a file or database.
print(df[SPEAKER_COLUMN].unique())

['T' 'David' 'Meredith' 'Beth' 'Meredith and David' 'T 2']
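If no roster is at hand, one rough way to seed the list of known names is to split the speaker labels themselves. The sketch below is plain Python and independent of edu-convokit; the helper `names_from_speaker_labels` and the set of teacher labels are hypothetical, and it assumes joint turns are written as "A and B", as in this transcript.

```python
def names_from_speaker_labels(labels, non_name_labels):
    """Collect individual names from speaker labels, in order of first appearance."""
    names = []
    for label in labels:
        if label in non_name_labels:
            continue  # skip role labels such as "T"
        # Assumption: joint speakers are joined with " and ", e.g. "Meredith and David".
        for part in label.split(" and "):
            part = part.strip()
            if part and part not in names:
                names.append(part)
    return names


speaker_labels = ["T", "David", "Meredith", "Beth", "Meredith and David", "T 2"]
candidate_names = names_from_speaker_labels(speaker_labels, non_name_labels={"T", "T 2"})
print(candidate_names)  # ['David', 'Meredith', 'Beth']
```

A list built this way should still be reviewed by hand, since names that only occur inside the text column will not show up in the speaker labels.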

[ ]:

# Create list of names and replacement names. We will make the replacement names unique so that we can easily find them later.
known_names = ["David", "Meredith", "Beth"]
known_replacement_names = [f"[STUDENT_{i}]" for i in range(len(known_names))]
print(known_replacement_names)

['[STUDENT_0]', '[STUDENT_1]', '[STUDENT_2]']
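To build intuition for what a known-name replacement involves, here is a minimal standalone sketch using Python's `re` module. It is not edu-convokit's implementation; the function `replace_names` is hypothetical, and the whole-word matching is an assumption made here so that "Beth" does not match inside a longer word such as "Bethany".

```python
import re


def replace_names(text, names, replacement_names):
    """Replace each whole-word occurrence of a name with its paired placeholder."""
    for name, replacement in zip(names, replacement_names):
        pattern = rf"\b{re.escape(name)}\b"
        # A function replacement avoids any special handling of backslashes in the placeholder.
        text = re.sub(pattern, lambda _match, r=replacement: r, text)
    return text


demo_names = ["David", "Meredith", "Beth"]
demo_replacements = [f"[STUDENT_{i}]" for i in range(len(demo_names))]

print(replace_names("Yes, David and Meredith?", demo_names, demo_replacements))
# Yes, [STUDENT_0] and [STUDENT_1]?
print(replace_names("Bethany and Beth", demo_names, demo_replacements))
# Bethany and [STUDENT_2]
```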

[ ]:

# Now let's anonymize the names in the text!
processor = TextPreprocessor()
df = processor.anonymize_known_names(
    df=df,
    text_column=TEXT_COLUMN,
    names=known_names,
    replacement_names=known_replacement_names,
    # We will directly replace the names in the text column.
    # If you want to keep the original text, you can set `target_text_column` to a new column name.
    target_text_column=TEXT_COLUMN
)

[ ]:

# Let's see what the anonymized text looks like!
df.iloc[25:35]

	Unnamed: 0	TimeStamp	Turn	Speaker	Sentence	Teacher Tag	Student Tag
25	25	NaN	14.0	David	Yeah, I know, and put ‘em up to there, and tha...	NaN	4 - Making a Claim
26	26	NaN	14.0	David	Hey, wait a minute, hey wait, maybe that’s it,...	NaN	4 - Making a Claim
27	27	NaN	15.0	T	Now take six of the ones	1 - None	NaN
28	28	NaN	15.0	T	Which is bigger?	8 - Press for Accuracy	NaN
29	29	NaN	16.0	Beth	One half	NaN	4 - Making a Claim
30	30	NaN	17.0	David	I think one half is...	NaN	2 - Relating to Another Student
31	31	NaN	NaN	T	Yes, [STUDENT_0] and [STUDENT_1]?	2 - Keeping Everyone Together	NaN
32	32	NaN	17.0	David	What do you have?	NaN	2 - Relating to Another Student
33	33	NaN	17.0	Meredith and David	Well	NaN	1 - None
34	34	NaN	18.0	David	we think	NaN	1 - None

💡 Note: Nice, we can see that the text has been anonymized (e.g., row 31)!

However, the speaker names have not been anonymized. Let’s fix that.

[ ]:

df = processor.anonymize_known_names(
    df=df,
    text_column=SPEAKER_COLUMN,
    names=known_names,
    replacement_names=known_replacement_names,
    target_text_column=SPEAKER_COLUMN
)

df.iloc[25:35]

	Unnamed: 0	TimeStamp	Turn	Speaker	Sentence	Teacher Tag	Student Tag
25	25	NaN	14.0	[STUDENT_0]	Yeah, I know, and put ‘em up to there, and tha...	NaN	4 - Making a Claim
26	26	NaN	14.0	[STUDENT_0]	Hey, wait a minute, hey wait, maybe that’s it,...	NaN	4 - Making a Claim
27	27	NaN	15.0	T	Now take six of the ones	1 - None	NaN
28	28	NaN	15.0	T	Which is bigger?	8 - Press for Accuracy	NaN
29	29	NaN	16.0	[STUDENT_2]	One half	NaN	4 - Making a Claim
30	30	NaN	17.0	[STUDENT_0]	I think one half is...	NaN	2 - Relating to Another Student
31	31	NaN	NaN	T	Yes, [STUDENT_0] and [STUDENT_1]?	2 - Keeping Everyone Together	NaN
32	32	NaN	17.0	[STUDENT_0]	What do you have?	NaN	2 - Relating to Another Student
33	33	NaN	17.0	[STUDENT_1] and [STUDENT_0]	Well	NaN	1 - None
34	34	NaN	18.0	[STUDENT_0]	we think	NaN	1 - None

🎉 Great, now we have anonymized the speaker names as well! Two other benefits:

- We have a record of the original names and the anonymized names, so we can map back to the original names if needed.
- The anonymized names are consistent: [STUDENT_0] in the SPEAKER_COLUMN refers to the same person as [STUDENT_0] in the TEXT_COLUMN.
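The record of original and anonymized names can be kept as a simple mapping. The following is a minimal sketch in plain Python, not part of edu-convokit; `restore_names` is a hypothetical helper, and in practice the mapping would be stored securely and separately from the anonymized data.

```python
import json

known_names = ["David", "Meredith", "Beth"]
known_replacement_names = [f"[STUDENT_{i}]" for i in range(len(known_names))]

# Record of placeholder -> original name, serialized so it can be saved to a protected file.
placeholder_to_name = dict(zip(known_replacement_names, known_names))
serialized = json.dumps(placeholder_to_name, indent=2)


def restore_names(text, mapping):
    """Replace each placeholder in `text` with the original name it stands for."""
    for placeholder, name in mapping.items():
        text = text.replace(placeholder, name)
    return text


restored = restore_names("Yes, [STUDENT_0] and [STUDENT_1]?", json.loads(serialized))
print(restored)  # Yes, David and Meredith?
```

Because each placeholder ends with a closing bracket, `[STUDENT_1]` can never be mistaken for a prefix of `[STUDENT_10]`, which keeps the reverse mapping unambiguous.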

This concludes the tutorial on anonymizing data with known names. The next section will cover anonymizing data when you do not know the names of the students and educators in your dataset.

📝 Anonymizing Data with Unknown Names

We will now anonymize the data when we do not know the names of the students and educators in the dataset. Note that the anonymization will be imperfect: identifying names consistently without a name list is a hard task (cf. named entity recognition), so use this with caution! We will show some of these failure modes in the tutorial.

[ ]:

# Let's start fresh with the original data
df = utils.load_data(data_fname)
df.iloc[25:35]

	Unnamed: 0	TimeStamp	Turn	Speaker	Sentence	Teacher Tag	Student Tag
25	25	NaN	14.0	David	Yeah, I know, and put ‘em up to there, and tha...	NaN	4 - Making a Claim
26	26	NaN	14.0	David	Hey, wait a minute, hey wait, maybe that’s it,...	NaN	4 - Making a Claim
27	27	NaN	15.0	T	Now take six of the ones	1 - None	NaN
28	28	NaN	15.0	T	Which is bigger?	8 - Press for Accuracy	NaN
29	29	NaN	16.0	Beth	One half	NaN	4 - Making a Claim
30	30	NaN	17.0	David	I think one half is...	NaN	2 - Relating to Another Student
31	31	NaN	NaN	T	Yes, David and Meredith?	2 - Keeping Everyone Together	NaN
32	32	NaN	17.0	David	What do you have?	NaN	2 - Relating to Another Student
33	33	NaN	17.0	Meredith and David	Well	NaN	1 - None
34	34	NaN	18.0	David	we think	NaN	1 - None

[ ]:

processor = TextPreprocessor()
df, (names, replacement_names) = processor.anonymize_unknown_names(
    df=df,
    text_column=SPEAKER_COLUMN,
    target_text_column=SPEAKER_COLUMN,
    # Will return the names and replacement names that were used.
    return_names=True
)

print(f"Names: {names}")
print(f"Replacement names: {replacement_names}")
df.iloc[25:35]

Names: ['Beth', 'David']
Replacement names: ['[PERSON0]', '[PERSON1]']

	Unnamed: 0	TimeStamp	Turn	Speaker	Sentence	Teacher Tag	Student Tag
25	25	NaN	14.0	[PERSON1]	Yeah, I know, and put ‘em up to there, and tha...	NaN	4 - Making a Claim
26	26	NaN	14.0	[PERSON1]	Hey, wait a minute, hey wait, maybe that’s it,...	NaN	4 - Making a Claim
27	27	NaN	15.0	T	Now take six of the ones	1 - None	NaN
28	28	NaN	15.0	T	Which is bigger?	8 - Press for Accuracy	NaN
29	29	NaN	16.0	[PERSON0]	One half	NaN	4 - Making a Claim
30	30	NaN	17.0	[PERSON1]	I think one half is...	NaN	2 - Relating to Another Student
31	31	NaN	NaN	T	Yes, David and Meredith?	2 - Keeping Everyone Together	NaN
32	32	NaN	17.0	[PERSON1]	What do you have?	NaN	2 - Relating to Another Student
33	33	NaN	17.0	Meredith and [PERSON1]	Well	NaN	1 - None
34	34	NaN	18.0	[PERSON1]	we think	NaN	1 - None

💡 Note: Observe that the name “Meredith” has not been anonymized. anonymize_unknown_names currently uses spaCy’s named entity recognition model to identify names. This model is imperfect and will not identify all names, as we can see here.

There are ways we can improve this. For example:

- We can manually add “Meredith” to the list of names to anonymize and run anonymize_known_names again.
- We can cross-reference names from the SSA database to catch names that the model misses. However, this will lead to a high false positive rate, i.e., words that are not actually names will be identified as names.
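The false positive problem with name-list lookups can be seen in a small standalone sketch. The name list below is a tiny hypothetical stand-in for a real name database, and `flag_names` is an illustrative helper, not an edu-convokit function.

```python
import re

# Hypothetical mini name list; a real list would contain many thousands of entries.
name_list = {"David", "Meredith", "Beth", "Will", "Mark", "Grace"}


def flag_names(text, names, case_sensitive=True):
    """Return the tokens in `text` that match an entry in `names`."""
    tokens = re.findall(r"[A-Za-z]+", text)
    if case_sensitive:
        return [t for t in tokens if t in names]
    lowered = {n.lower() for n in names}
    return [t for t in tokens if t.lower() in lowered]


utterance = "Will you mark the half, Meredith?"
print(flag_names(utterance, name_list, case_sensitive=False))  # ['Will', 'mark', 'Meredith']
print(flag_names(utterance, name_list, case_sensitive=True))   # ['Will', 'Meredith']
```

Only "Meredith" is a name in this utterance. Case-sensitive matching removes "mark", but the sentence-initial "Will" is still flagged, which is why list-based results usually need manual review.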

To complete the anonymization process, we will use the names and replacement_names returned from anonymize_unknown_names to anonymize the text. This makes the anonymization consistent between the speaker and text columns.

[ ]:

df = processor.anonymize_known_names(
    df=df,
    text_column=TEXT_COLUMN,
    target_text_column=TEXT_COLUMN,
    names=names,
    replacement_names=replacement_names
)

# David is anonymized but Meredith is not (see row 31).
df.iloc[25:35]

	Unnamed: 0	TimeStamp	Turn	Speaker	Sentence	Teacher Tag	Student Tag
25	25	NaN	14.0	[PERSON1]	Yeah, I know, and put ‘em up to there, and tha...	NaN	4 - Making a Claim
26	26	NaN	14.0	[PERSON1]	Hey, wait a minute, hey wait, maybe that’s it,...	NaN	4 - Making a Claim
27	27	NaN	15.0	T	Now take six of the ones	1 - None	NaN
28	28	NaN	15.0	T	Which is bigger?	8 - Press for Accuracy	NaN
29	29	NaN	16.0	[PERSON0]	One half	NaN	4 - Making a Claim
30	30	NaN	17.0	[PERSON1]	I think one half is...	NaN	2 - Relating to Another Student
31	31	NaN	NaN	T	Yes, [PERSON1] and Meredith?	2 - Keeping Everyone Together	NaN
32	32	NaN	17.0	[PERSON1]	What do you have?	NaN	2 - Relating to Another Student
33	33	NaN	17.0	Meredith and [PERSON1]	Well	NaN	1 - None
34	34	NaN	18.0	[PERSON1]	we think	NaN	1 - None
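One way to close the gap is to extend the returned lists with the names the model missed before running the known-names step again. The sketch below is plain Python; `add_missed_names` is a hypothetical helper, and it assumes the placeholders follow the `[PERSON<i>]` pattern seen in the output above, numbered consecutively from 0.

```python
def add_missed_names(names, replacement_names, missed, prefix="PERSON"):
    """Return extended copies of both lists with fresh placeholders for `missed` names."""
    names = list(names)
    replacement_names = list(replacement_names)
    for name in missed:
        if name in names:
            continue  # already covered by an existing placeholder
        # Assumption: placeholders are numbered from 0, so the next free index is the list length.
        replacement_names.append(f"[{prefix}{len(replacement_names)}]")
        names.append(name)
    return names, replacement_names


names, replacement_names = add_missed_names(
    ["Beth", "David"], ["[PERSON0]", "[PERSON1]"], missed=["Meredith"]
)
print(names)              # ['Beth', 'David', 'Meredith']
print(replacement_names)  # ['[PERSON0]', '[PERSON1]', '[PERSON2]']
```

The extended lists could then be passed as `names` and `replacement_names` to anonymize_known_names for both the speaker and text columns, in the same way as the cells above.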

📝 Standardizing Data for Downstream Annotation and Analysis

We will now standardize the data for downstream annotation and analysis. One common standardization is to group the utterances from the same speaker together. We will show how you can do this on the anonymized data.

For other standardizations, please refer to `edu-convokit’s documentation <TODO>`__, or feel free to open a feature request or pull request on our GitHub.

[ ]:

# First let's start fresh with the original data & anonymize it like we did before.
df = utils.load_data(data_fname)
processor = TextPreprocessor()

# Anonymize text
df = processor.anonymize_known_names(
    df=df,
    text_column=TEXT_COLUMN,
    names=known_names,
    replacement_names=known_replacement_names,
    target_text_column=TEXT_COLUMN
)

# Anonymize speakers with the NER-based method (as before, it misses "Meredith")
df, (names, replacement_names) = processor.anonymize_unknown_names(
    df=df,
    text_column=SPEAKER_COLUMN,
    target_text_column=SPEAKER_COLUMN,
    return_names=True
)

df.iloc[25:35]

	Unnamed: 0	TimeStamp	Turn	Speaker	Sentence	Teacher Tag	Student Tag
25	25	NaN	14.0	[PERSON1]	Yeah, I know, and put ‘em up to there, and tha...	NaN	4 - Making a Claim
26	26	NaN	14.0	[PERSON1]	Hey, wait a minute, hey wait, maybe that’s it,...	NaN	4 - Making a Claim
27	27	NaN	15.0	T	Now take six of the ones	1 - None	NaN
28	28	NaN	15.0	T	Which is bigger?	8 - Press for Accuracy	NaN
29	29	NaN	16.0	[PERSON0]	One half	NaN	4 - Making a Claim
30	30	NaN	17.0	[PERSON1]	I think one half is...	NaN	2 - Relating to Another Student
31	31	NaN	NaN	T	Yes, [STUDENT_0] and [STUDENT_1]?	2 - Keeping Everyone Together	NaN
32	32	NaN	17.0	[PERSON1]	What do you have?	NaN	2 - Relating to Another Student
33	33	NaN	17.0	Meredith and [PERSON1]	Well	NaN	1 - None
34	34	NaN	18.0	[PERSON1]	we think	NaN	1 - None

Now we’ll group utterances from the same speaker together.

[ ]:

df = processor.merge_utterances_from_same_speaker(
    df=df,
    text_column=TEXT_COLUMN,
    speaker_column=SPEAKER_COLUMN,
    # We're going to directly replace the text in the text column.
    target_text_column=TEXT_COLUMN
)

df.iloc[25:35]

	Sentence	Speaker
25	dark green	[PERSON1]
26	If you put it up to a whole	Meredith
27	I’m sorry, what’s the number name for dark green	T
28	One	Meredith and [PERSON1]
29	Ok.	T
30	And you put six ones up to the dark green	Meredith
31	Hold on, I’m a little confused. Tell me again....	T
32	One sixth	Meredith
33	One sixth.	T
34	And then these, this would be	[PERSON1]

We can see that consecutive utterances from the same speaker are now merged into a single row! Note that the returned DataFrame contains only the text and speaker columns.
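The idea behind this step can be illustrated without the library. The sketch below merges adjacent rows that share a speaker using `itertools.groupby`; it is not edu-convokit's implementation, `merge_consecutive` is a hypothetical helper, and joining the texts with a single space is an assumption.

```python
from itertools import groupby


def merge_consecutive(rows):
    """Merge adjacent (speaker, text) rows that share a speaker, joining texts with spaces."""
    merged = []
    for speaker, group in groupby(rows, key=lambda row: row[0]):
        merged.append((speaker, " ".join(text for _, text in group)))
    return merged


rows = [
    ("T", "Now take six of the ones"),
    ("T", "Which is bigger?"),
    ("[PERSON0]", "One half"),
]
for speaker, text in merge_consecutive(rows):
    print(f"{speaker}: {text}")
# T: Now take six of the ones Which is bigger?
# [PERSON0]: One half
```

Only adjacent rows are merged, so a speaker who talks again later in the transcript still gets a separate row for that later turn.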

📝 Conclusion and Where to Go From Here

In this tutorial, we learned how to use TextPreprocessor to:

1. Anonymize your data when you know the names of your students and educators.
2. Anonymize your data when you do not know the names of your students and educators.
3. Standardize your data for downstream feature annotation.

The next natural step is to annotate your data with features of interest. Here are some resources to get you started:

- `edu-convokit’s documentation on Annotator <https://edu-convokit.readthedocs.io/en/latest/annotation.html>`__
- `edu-convokit’s tutorial on Annotator <https://colab.research.google.com/drive/1rBwEctFtmQowZHxralH2OGT5uV0zRIQw>`__

If you have any questions, please feel free to reach out to us on `edu-convokit’s GitHub <https://github.com/rosewang2008/edu-convokit>`__.

👋 Happy exploring your data with edu-convokit!
