Tutorial on Text Pre-Processing for Education Language Data
Welcome to this tutorial on using `edu-convokit <https://github.com/rosewang2008/edu-convokit>`__ for text pre-processing. Text pre-processing is a critical step in handling education language data:
- It ensures the data is clean (education data is notoriously messy).
- It ensures the data is standardized and ready for annotation and analysis.
- It ensures that students and educators are anonymized, which protects the privacy of the individuals involved and allows for safe secondary data analysis.
edu-convokit is designed to support these purposes.
Learning Objectives
In this tutorial, you will learn how to use TextPreprocessor to:
- Anonymize your data when you know the names of your students and educators.
- Anonymize your data when you do not know the names of your students and educators.
- Standardize your data for downstream feature annotation.
Without further ado, let's get started!
Installation
Let's first install edu-convokit.
[ ]:
!pip install git+https://github.com/rosewang2008/edu-convokit.git
[ ]:
from edu_convokit.preprocessors import TextPreprocessor
# For helping us flexibly load data
from edu_convokit import utils
Data
Let's load the data we'll be working with. We're going to use a transcript from the TalkMoves dataset.
[ ]:
!wget "https://raw.githubusercontent.com/rosewang2008/edu-convokit/master/data/talkmoves/Boats and Fish 2_Grade 4.xlsx"
data_fname = "Boats and Fish 2_Grade 4.xlsx"
df = utils.load_data(data_fname) # Handles loading data from different file types including: .csv, .xlsx, .json
# Show these lines because they contain names in the speaker and text columns.
df[25:35]
--2023-12-30 10:25:32-- https://raw.githubusercontent.com/rosewang2008/edu-convokit/master/data/talkmoves/Boats%20and%20Fish%202_Grade%204.xlsx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10528 (10K) [application/octet-stream]
Saving to: ‘Boats and Fish 2_Grade 4.xlsx.3’
Boats and 0%[ ] 0 --.-KB/s
Boats and Fish 2_Gr 100%[===================>] 10.28K --.-KB/s in 0s
2023-12-30 10:25:32 (61.8 MB/s) - ‘Boats and Fish 2_Grade 4.xlsx.3’ saved [10528/10528]
 | Unnamed: 0 | TimeStamp | Turn | Speaker | Sentence | Teacher Tag | Student Tag |
---|---|---|---|---|---|---|---|
25 | 25 | NaN | 14.0 | David | Yeah, I know, and put 'em up to there, and tha... | NaN | 4 - Making a Claim |
26 | 26 | NaN | 14.0 | David | Hey, wait a minute, hey wait, maybe that's it,... | NaN | 4 - Making a Claim |
27 | 27 | NaN | 15.0 | T | Now take six of the ones | 1 - None | NaN |
28 | 28 | NaN | 15.0 | T | Which is bigger? | 8 - Press for Accuracy | NaN |
29 | 29 | NaN | 16.0 | Beth | One half | NaN | 4 - Making a Claim |
30 | 30 | NaN | 17.0 | David | I think one half is... | NaN | 2 - Relating to Another Student |
31 | 31 | NaN | NaN | T | Yes, David and Meredith? | 2 - Keeping Everyone Together | NaN |
32 | 32 | NaN | 17.0 | David | What do you have? | NaN | 2 - Relating to Another Student |
33 | 33 | NaN | 17.0 | Meredith and David | Well | NaN | 1 - None |
34 | 34 | NaN | 18.0 | David | we think | NaN | 1 - None |
Some things to observe about the data...
Note: edu-convokit cares about two key columns: a column for the speaker and a column for the text.
- In the TalkMoves dataset, the speaker is in the Speaker column and the text is in the Sentence column. We can create two variables to store these column names, as they will be used throughout the tutorial.
Note: Names occur in both the speaker and text columns.
- For example, David and Meredith appear in both columns.
- The teacher is always shortened to "T" in the speaker column.
Note: The utterances from the same speaker are not always grouped together.
- We'll fix this in the section on standardizing the data for downstream annotation and analysis.
[ ]:
# Creating variables for the columns we want to use
TEXT_COLUMN = "Sentence"
SPEAKER_COLUMN = "Speaker"
Anonymizing Data with Known Names
We will now anonymize the data when we know the names of the students and educators in the dataset. In our experience, this is the most common scenario in education language data: the names are known because, for example, they come from a class roster or are officially recorded in a database.
To do this, we need to create a list of names that we want to anonymize, and a list of replacement names that we want to use to replace the names in the dataset.
[ ]:
# Show the names of the speakers. In your use case, you might load this from a file or database.
print(df[SPEAKER_COLUMN].unique())
['T' 'David' 'Meredith' 'Beth' 'Meredith and David' 'T 2']
[ ]:
# Create list of names and replacement names. We will make the replacement names unique so that we can easily find them later.
known_names = ["David", "Meredith", "Beth"]
known_replacement_names = [f"[STUDENT_{i}]" for i in range(len(known_names))]
print(known_replacement_names)
['[STUDENT_0]', '[STUDENT_1]', '[STUDENT_2]']
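To make the pairing of names and replacement names concrete, here is a minimal, library-free sketch of whole-word name replacement. It is only an illustration of the idea and is not edu-convokit's actual implementation; the helper `replace_known_names` is a hypothetical name introduced here.

```python
import re

def replace_known_names(text, names, replacement_names):
    """Replace each whole-word occurrence of a name with its placeholder.

    Hypothetical helper for illustration; not part of edu-convokit.
    """
    mapping = dict(zip(names, replacement_names))
    if not mapping:
        return text
    # Longest names first, so a multi-word name such as "Mary Ann"
    # is tried before the shorter "Mary".
    alternatives = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in alternatives) + r")\b"
    )
    return pattern.sub(lambda m: mapping[m.group(1)], text)

known_names = ["David", "Meredith", "Beth"]
known_replacement_names = [f"[STUDENT_{i}]" for i in range(len(known_names))]

example = replace_known_names(
    "Yes, David and Meredith?", known_names, known_replacement_names
)
print(example)  # Yes, [STUDENT_0] and [STUDENT_1]?
```

The word boundaries (`\b`) in this sketch keep a short name such as "Beth" from matching inside a longer word such as "Bethany". Whether the library applies the same rule is not something this tutorial states.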
[ ]:
# Now let's anonymize the names in the text!
processor = TextPreprocessor()
df = processor.anonymize_known_names(
df=df,
text_column=TEXT_COLUMN,
names=known_names,
replacement_names=known_replacement_names,
# We will directly replace the names in the text column.
# If you want to keep the original text, you can set `target_text_column` to a new column name.
target_text_column=TEXT_COLUMN
)
[ ]:
# Let's see what the anonymized text looks like!
df.iloc[25:35]
 | Unnamed: 0 | TimeStamp | Turn | Speaker | Sentence | Teacher Tag | Student Tag |
---|---|---|---|---|---|---|---|
25 | 25 | NaN | 14.0 | David | Yeah, I know, and put 'em up to there, and tha... | NaN | 4 - Making a Claim |
26 | 26 | NaN | 14.0 | David | Hey, wait a minute, hey wait, maybe that's it,... | NaN | 4 - Making a Claim |
27 | 27 | NaN | 15.0 | T | Now take six of the ones | 1 - None | NaN |
28 | 28 | NaN | 15.0 | T | Which is bigger? | 8 - Press for Accuracy | NaN |
29 | 29 | NaN | 16.0 | Beth | One half | NaN | 4 - Making a Claim |
30 | 30 | NaN | 17.0 | David | I think one half is... | NaN | 2 - Relating to Another Student |
31 | 31 | NaN | NaN | T | Yes, [STUDENT_0] and [STUDENT_1]? | 2 - Keeping Everyone Together | NaN |
32 | 32 | NaN | 17.0 | David | What do you have? | NaN | 2 - Relating to Another Student |
33 | 33 | NaN | 17.0 | Meredith and David | Well | NaN | 1 - None |
34 | 34 | NaN | 18.0 | David | we think | NaN | 1 - None |
Note: Nice, we can see that the text has been anonymized (e.g., row 31)!
However, the speaker names have not been anonymized. Let's fix that.
[ ]:
df = processor.anonymize_known_names(
df=df,
text_column=SPEAKER_COLUMN,
names=known_names,
replacement_names=known_replacement_names,
target_text_column=SPEAKER_COLUMN
)
df.iloc[25:35]
 | Unnamed: 0 | TimeStamp | Turn | Speaker | Sentence | Teacher Tag | Student Tag |
---|---|---|---|---|---|---|---|
25 | 25 | NaN | 14.0 | [STUDENT_0] | Yeah, I know, and put 'em up to there, and tha... | NaN | 4 - Making a Claim |
26 | 26 | NaN | 14.0 | [STUDENT_0] | Hey, wait a minute, hey wait, maybe that's it,... | NaN | 4 - Making a Claim |
27 | 27 | NaN | 15.0 | T | Now take six of the ones | 1 - None | NaN |
28 | 28 | NaN | 15.0 | T | Which is bigger? | 8 - Press for Accuracy | NaN |
29 | 29 | NaN | 16.0 | [STUDENT_2] | One half | NaN | 4 - Making a Claim |
30 | 30 | NaN | 17.0 | [STUDENT_0] | I think one half is... | NaN | 2 - Relating to Another Student |
31 | 31 | NaN | NaN | T | Yes, [STUDENT_0] and [STUDENT_1]? | 2 - Keeping Everyone Together | NaN |
32 | 32 | NaN | 17.0 | [STUDENT_0] | What do you have? | NaN | 2 - Relating to Another Student |
33 | 33 | NaN | 17.0 | [STUDENT_1] and [STUDENT_0] | Well | NaN | 1 - None |
34 | 34 | NaN | 18.0 | [STUDENT_0] | we think | NaN | 1 - None |
Great, now we have anonymized the speaker names as well! Some other benefits:
- We have a record of the original names and the anonymized names, so we can go back to the original names if needed.
- The anonymized names are consistent: [STUDENT_0] in the SPEAKER_COLUMN refers to the same person as [STUDENT_0] in the TEXT_COLUMN.
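Because the two lists form a one-to-one record, going back to the original names is a matter of applying the mapping in reverse. The following is a minimal sketch under that assumption; `restore_names` is a hypothetical helper, not an edu-convokit function.

```python
def restore_names(text, names, replacement_names):
    """Map placeholders back to the original names.

    Hypothetical helper for illustration; not part of edu-convokit.
    """
    for name, placeholder in zip(names, replacement_names):
        text = text.replace(placeholder, name)
    return text

known_names = ["David", "Meredith", "Beth"]
known_replacement_names = [f"[STUDENT_{i}]" for i in range(len(known_names))]

restored = restore_names(
    "Yes, [STUDENT_0] and [STUDENT_1]?", known_names, known_replacement_names
)
print(restored)  # Yes, David and Meredith?
```

The bracketed placeholder format helps here: "[STUDENT_1]" is not a substring of "[STUDENT_10]", so plain string replacement stays unambiguous. Since this record re-identifies individuals, it should be stored separately from the anonymized data.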
This concludes the section on anonymizing data with known names. The next section covers anonymizing data when you do not know the names of the students and educators in your dataset.
Anonymizing Data with Unknown Names
We will now anonymize the data when we do not know the names of the students and educators in the dataset. Note that the anonymization will be imperfect: without a list of names, we must detect them automatically, and identifying names consistently is a hard task (cf. named entity recognition), so use this with caution! We will show some of these failure modes in the tutorial.
[ ]:
# Let's start fresh with the original data
df = utils.load_data(data_fname)
df.iloc[25:35]
 | Unnamed: 0 | TimeStamp | Turn | Speaker | Sentence | Teacher Tag | Student Tag |
---|---|---|---|---|---|---|---|
25 | 25 | NaN | 14.0 | David | Yeah, I know, and put 'em up to there, and tha... | NaN | 4 - Making a Claim |
26 | 26 | NaN | 14.0 | David | Hey, wait a minute, hey wait, maybe that's it,... | NaN | 4 - Making a Claim |
27 | 27 | NaN | 15.0 | T | Now take six of the ones | 1 - None | NaN |
28 | 28 | NaN | 15.0 | T | Which is bigger? | 8 - Press for Accuracy | NaN |
29 | 29 | NaN | 16.0 | Beth | One half | NaN | 4 - Making a Claim |
30 | 30 | NaN | 17.0 | David | I think one half is... | NaN | 2 - Relating to Another Student |
31 | 31 | NaN | NaN | T | Yes, David and Meredith? | 2 - Keeping Everyone Together | NaN |
32 | 32 | NaN | 17.0 | David | What do you have? | NaN | 2 - Relating to Another Student |
33 | 33 | NaN | 17.0 | Meredith and David | Well | NaN | 1 - None |
34 | 34 | NaN | 18.0 | David | we think | NaN | 1 - None |
[ ]:
processor = TextPreprocessor()
df, (names, replacement_names) = processor.anonymize_unknown_names(
df=df,
text_column=SPEAKER_COLUMN,
target_text_column=SPEAKER_COLUMN,
# Will return the names and replacement names that were used.
return_names=True
)
print(f"Names: {names}")
print(f"Replacement names: {replacement_names}")
df.iloc[25:35]
Names: ['Beth', 'David']
Replacement names: ['[PERSON0]', '[PERSON1]']
 | Unnamed: 0 | TimeStamp | Turn | Speaker | Sentence | Teacher Tag | Student Tag |
---|---|---|---|---|---|---|---|
25 | 25 | NaN | 14.0 | [PERSON1] | Yeah, I know, and put 'em up to there, and tha... | NaN | 4 - Making a Claim |
26 | 26 | NaN | 14.0 | [PERSON1] | Hey, wait a minute, hey wait, maybe that's it,... | NaN | 4 - Making a Claim |
27 | 27 | NaN | 15.0 | T | Now take six of the ones | 1 - None | NaN |
28 | 28 | NaN | 15.0 | T | Which is bigger? | 8 - Press for Accuracy | NaN |
29 | 29 | NaN | 16.0 | [PERSON0] | One half | NaN | 4 - Making a Claim |
30 | 30 | NaN | 17.0 | [PERSON1] | I think one half is... | NaN | 2 - Relating to Another Student |
31 | 31 | NaN | NaN | T | Yes, David and Meredith? | 2 - Keeping Everyone Together | NaN |
32 | 32 | NaN | 17.0 | [PERSON1] | What do you have? | NaN | 2 - Relating to Another Student |
33 | 33 | NaN | 17.0 | Meredith and [PERSON1] | Well | NaN | 1 - None |
34 | 34 | NaN | 18.0 | [PERSON1] | we think | NaN | 1 - None |
Note: Observe that the name "Meredith" has not been anonymized. anonymize_unknown_names currently uses spaCy's named entity recognition model to identify names. This model is imperfect and will not identify all names, as we can see here.
There are ways we can improve this. For example:
- We can manually add "Meredith" to the list of names to anonymize and run anonymize_known_names again.
- We can cross-reference names from the SSA database to catch names that the model misses. However, this will lead to a high false positive rate, i.e., words that are not actually names will be identified as names.
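The trade-off in the second option can be illustrated with a small sketch. The name set below is a tiny hypothetical stand-in for a real first-name list, and `candidate_names` is a helper invented for this example; neither comes from edu-convokit or from the SSA data itself.

```python
import re

# Hypothetical stand-in for a real first-name list.
COMMON_FIRST_NAMES = {"David", "Meredith", "Beth", "Will", "May", "Grace"}

def candidate_names(text, name_list=COMMON_FIRST_NAMES):
    """Return capitalized tokens that also appear in the name list."""
    tokens = re.findall(r"\b[A-Z][a-z]+\b", text)
    return sorted({t for t in tokens if t in name_list})

# Catches the name that the NER model missed...
print(candidate_names("Yes, David and Meredith?"))    # ['David', 'Meredith']
# ...but also flags an ordinary word that happens to be a first name.
print(candidate_names("Will you show me the half?"))  # ['Will']
```

A list lookup like this has no notion of context, which is why it would likely need a manual review step before the flagged words are replaced.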
To complete the anonymization process, we will use the names and replacement_names returned from anonymize_unknown_names to anonymize the text. This makes the anonymization consistent between the speaker and text columns.
[ ]:
df = processor.anonymize_known_names(
df=df,
text_column=TEXT_COLUMN,
target_text_column=TEXT_COLUMN,
names=names,
replacement_names=replacement_names
)
# David is anonymized but Meredith is not (see row 31).
df.iloc[25:35]
 | Unnamed: 0 | TimeStamp | Turn | Speaker | Sentence | Teacher Tag | Student Tag |
---|---|---|---|---|---|---|---|
25 | 25 | NaN | 14.0 | [PERSON1] | Yeah, I know, and put 'em up to there, and tha... | NaN | 4 - Making a Claim |
26 | 26 | NaN | 14.0 | [PERSON1] | Hey, wait a minute, hey wait, maybe that's it,... | NaN | 4 - Making a Claim |
27 | 27 | NaN | 15.0 | T | Now take six of the ones | 1 - None | NaN |
28 | 28 | NaN | 15.0 | T | Which is bigger? | 8 - Press for Accuracy | NaN |
29 | 29 | NaN | 16.0 | [PERSON0] | One half | NaN | 4 - Making a Claim |
30 | 30 | NaN | 17.0 | [PERSON1] | I think one half is... | NaN | 2 - Relating to Another Student |
31 | 31 | NaN | NaN | T | Yes, [PERSON1] and Meredith? | 2 - Keeping Everyone Together | NaN |
32 | 32 | NaN | 17.0 | [PERSON1] | What do you have? | NaN | 2 - Relating to Another Student |
33 | 33 | NaN | 17.0 | Meredith and [PERSON1] | Well | NaN | 1 - None |
34 | 34 | NaN | 18.0 | [PERSON1] | we think | NaN | 1 - None |
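"Meredith" is still visible in rows 31 and 33. One way to follow the manual fix suggested earlier is to extend the returned lists with the missed name before calling anonymize_known_names again. The sketch below assumes the placeholder numbering simply continues in the "[PERSON<i>]" format seen in the output; `add_missed_names` is a hypothetical helper, not an edu-convokit function.

```python
def add_missed_names(names, replacement_names, missed, prefix="PERSON"):
    """Return extended copies of both lists with fresh placeholders.

    Hypothetical helper for illustration; not part of edu-convokit.
    """
    names = list(names)
    replacement_names = list(replacement_names)
    for name in missed:
        if name not in names:
            # The next index is the current length, e.g. 2 after [PERSON0], [PERSON1].
            replacement_names.append(f"[{prefix}{len(names)}]")
            names.append(name)
    return names, replacement_names

names, replacement_names = add_missed_names(
    ["Beth", "David"], ["[PERSON0]", "[PERSON1]"], missed=["Meredith"]
)
print(names)              # ['Beth', 'David', 'Meredith']
print(replacement_names)  # ['[PERSON0]', '[PERSON1]', '[PERSON2]']
```

The extended lists can then be passed as names and replacement_names to anonymize_known_names for both the speaker and text columns, exactly as in the cells above.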
Standardizing Data for Downstream Annotation and Analysis
We will now standardize the data for downstream annotation and analysis. One common standardization is to group the utterances from the same speaker together. We will show how you can do this on the anonymized data.
For other standardizations, please refer to `edu-convokit's documentation <TODO>`__, or feel free to open a feature request or pull request on our GitHub.
[ ]:
# First let's start fresh with the original data & anonymize it like we did before.
df = utils.load_data(data_fname)
processor = TextPreprocessor()
# Anonymize text
df = processor.anonymize_known_names(
df=df,
text_column=TEXT_COLUMN,
names=known_names,
replacement_names=known_replacement_names,
target_text_column=TEXT_COLUMN
)
# Anonymize speakers
df, (names, replacement_names) = processor.anonymize_unknown_names(
df=df,
text_column=SPEAKER_COLUMN,
target_text_column=SPEAKER_COLUMN,
return_names=True
)
df.iloc[25:35]
 | Unnamed: 0 | TimeStamp | Turn | Speaker | Sentence | Teacher Tag | Student Tag |
---|---|---|---|---|---|---|---|
25 | 25 | NaN | 14.0 | [PERSON1] | Yeah, I know, and put 'em up to there, and tha... | NaN | 4 - Making a Claim |
26 | 26 | NaN | 14.0 | [PERSON1] | Hey, wait a minute, hey wait, maybe that's it,... | NaN | 4 - Making a Claim |
27 | 27 | NaN | 15.0 | T | Now take six of the ones | 1 - None | NaN |
28 | 28 | NaN | 15.0 | T | Which is bigger? | 8 - Press for Accuracy | NaN |
29 | 29 | NaN | 16.0 | [PERSON0] | One half | NaN | 4 - Making a Claim |
30 | 30 | NaN | 17.0 | [PERSON1] | I think one half is... | NaN | 2 - Relating to Another Student |
31 | 31 | NaN | NaN | T | Yes, [STUDENT_0] and [STUDENT_1]? | 2 - Keeping Everyone Together | NaN |
32 | 32 | NaN | 17.0 | [PERSON1] | What do you have? | NaN | 2 - Relating to Another Student |
33 | 33 | NaN | 17.0 | Meredith and [PERSON1] | Well | NaN | 1 - None |
34 | 34 | NaN | 18.0 | [PERSON1] | we think | NaN | 1 - None |
Now we'll group utterances from the same speaker together.
[ ]:
df = processor.merge_utterances_from_same_speaker(
df=df,
text_column=TEXT_COLUMN,
speaker_column=SPEAKER_COLUMN,
# We're going to directly replace the text in the text column.
target_text_column=TEXT_COLUMN
)
df.iloc[25:35]
 | Sentence | Speaker |
---|---|---|
25 | dark green | [PERSON1] |
26 | If you put it up to a whole | Meredith |
27 | I'm sorry, what's the number name for dark green | T |
28 | One | Meredith and [PERSON1] |
29 | Ok. | T |
30 | And you put six ones up to the dark green | Meredith |
31 | Hold on, I'm a little confused. Tell me again.... | T |
32 | One sixth | Meredith |
33 | One sixth. | T |
34 | And then these, this would be | [PERSON1] |
We can see that the utterances from the same speaker are now grouped together!
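For intuition, merging consecutive utterances from the same speaker can be sketched in plain Python with itertools.groupby. This is not how merge_utterances_from_same_speaker is necessarily implemented; in particular, joining the texts with a single space is an assumption made for this example, and `merge_consecutive` is a hypothetical helper.

```python
from itertools import groupby

def merge_consecutive(utterances):
    """Merge adjacent (speaker, text) pairs that share the same speaker.

    Hypothetical helper for illustration; not part of edu-convokit.
    """
    merged = []
    # groupby only groups *adjacent* items, which is exactly what we want:
    # a speaker who talks again later starts a new merged turn.
    for speaker, group in groupby(utterances, key=lambda u: u[0]):
        merged.append((speaker, " ".join(text for _, text in group)))
    return merged

turns = [
    ("T", "Now take six of the ones"),
    ("T", "Which is bigger?"),
    ("[PERSON0]", "One half"),
    ("[PERSON1]", "I think one half is..."),
]
for speaker, text in merge_consecutive(turns):
    print(f"{speaker}: {text}")
```

Note that the merged output above keeps only the Sentence and Speaker columns, so per-utterance annotations such as the Teacher Tag and Student Tag columns are no longer present after merging.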
Conclusion and Where to Go From Here
In this tutorial, we learned how to use TextPreprocessor to:
1. Anonymize your data when you know the names of your students and educators.
2. Anonymize your data when you do not know the names of your students and educators.
3. Standardize your data for downstream feature annotation.
The next natural step is to annotate your data with features of interest. Here are some resources to get you started:
- `edu-convokit's documentation on Annotator <https://edu-convokit.readthedocs.io/en/latest/annotation.html>`__
- `edu-convokit's tutorial on Annotator <https://colab.research.google.com/drive/1rBwEctFtmQowZHxralH2OGT5uV0zRIQw>`__
If you have any questions, please feel free to reach out to us on `edu-convokit's GitHub <https://github.com/rosewang2008/edu-convokit>`__.
Happy exploring your data with edu-convokit!