Tutorial on Text Pre-Processing for Education Language Data

Welcome to this tutorial on using `edu-convokit <https://github.com/rosewang2008/edu-convokit>`__ for text pre-processing. Text pre-processing is a critical step in handling education language data:

  • It ensures the data is clean (education data is notoriously messy).

  • It ensures the data is standardized, ready for annotation and analysis.

  • It ensures that the students and educators are anonymized; this is important to protect the privacy of individuals involved and allow for safe secondary data analysis.

edu-convokit is designed to support these purposes.

📚 Learning Objectives

In this tutorial, you will learn how to use TextPreprocessor to:

  • Anonymize your data when you know the names of your students and educators.

  • Anonymize your data when you do not know the names of your students and educators.

  • Standardize your data for downstream feature annotation.

Without further ado, let’s get started!

Installation

Let’s first install edu-convokit.

[ ]:
!pip install git+https://github.com/rosewang2008/edu-convokit.git
[ ]:
from edu_convokit.preprocessors import TextPreprocessor

# For helping us flexibly load data
from edu_convokit import utils

📑 Data

Let’s load the data we’ll be working with. We’re going to be using a transcript from the TalkMoves dataset.

[ ]:
!wget "https://raw.githubusercontent.com/rosewang2008/edu-convokit/master/data/talkmoves/Boats and Fish 2_Grade 4.xlsx"

data_fname = "Boats and Fish 2_Grade 4.xlsx"
df = utils.load_data(data_fname) # Handles loading data from different file types including: .csv, .xlsx, .json

# Show these lines because they contain names in the speaker and text columns.
df.iloc[25:35]
--2023-12-30 10:25:32--  https://raw.githubusercontent.com/rosewang2008/edu-convokit/master/data/talkmoves/Boats%20and%20Fish%202_Grade%204.xlsx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10528 (10K) [application/octet-stream]
Saving to: ‘Boats and Fish 2_Grade 4.xlsx.3’

  Boats and   0%[                    ]       0  --.-KB/s

Boats and Fish 2_Gr 100%[===================>] 10.28K --.-KB/s in 0s

2023-12-30 10:25:32 (61.8 MB/s) - ‘Boats and Fish 2_Grade 4.xlsx.3’ saved [10528/10528]

    Unnamed: 0  TimeStamp  Turn  Speaker                      Sentence                                            Teacher Tag                    Student Tag
25  25          NaN        14.0  David                        Yeah, I know, and put ‘em up to there, and tha...   NaN                            4 - Making a Claim
26  26          NaN        14.0  David                        Hey, wait a minute, hey wait, maybe that’s it,...   NaN                            4 - Making a Claim
27  27          NaN        15.0  T                            Now take six of the ones                            1 - None                       NaN
28  28          NaN        15.0  T                            Which is bigger?                                    8 - Press for Accuracy         NaN
29  29          NaN        16.0  Beth                         One half                                            NaN                            4 - Making a Claim
30  30          NaN        17.0  David                        I think one half is...                              NaN                            2 - Relating to Another Student
31  31          NaN        NaN   T                            Yes, David and Meredith?                            2 - Keeping Everyone Together  NaN
32  32          NaN        17.0  David                        What do you have?                                   NaN                            2 - Relating to Another Student
33  33          NaN        17.0  Meredith and David           Well                                                NaN                            1 - None
34  34          NaN        18.0  David                        we think                                            NaN                            1 - None

Some things to observe about the data…

💡 Note: edu-convokit cares about two key columns: a column for the speaker and a column for the text.

  • In the TalkMoves dataset, the speaker is in the Speaker column and the text is in the Sentence column. We can create two variables to store these column names, as they will be used throughout the tutorial.

💡 Note: Names occur in both the speaker and text columns.

  • For example, names like David and Meredith appear in both columns.

  • The teacher is always shortened to “T” in the speaker column.

💡 Note: The utterances from the same speaker are not always grouped together.

  • We’ll fix this in the section on standardizing the data for downstream annotation and analysis.

[ ]:
# Creating variables for the columns we want to use
TEXT_COLUMN = "Sentence"
SPEAKER_COLUMN = "Speaker"

📝 Anonymizing Data with Known Names

We will now anonymize the data when we know the names of the students and educators in the dataset. In our experience, this is the most common scenario in education language data: the names are known because they come from a roster or class list, or are officially recorded in a database.

To do this, we need to create a list of names that we want to anonymize, and a list of replacement names that we want to use to replace the names in the dataset.
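As a rough illustration of what replacing known names involves, the sketch below swaps each name for its placeholder using whole-word, case-sensitive regular-expression matching. It is a standalone approximation written for this tutorial, not edu-convokit's implementation: the helper `replace_known_names` and its matching rules are assumptions, and `anonymize_known_names` may behave differently (for example, in how it treats case or possessives).

```python
import re


def replace_known_names(text, names, replacement_names):
    """Replace whole-word occurrences of each name with its placeholder.

    Hypothetical helper for illustration only; not part of edu-convokit.
    """
    if len(names) != len(replacement_names):
        raise ValueError("names and replacement_names must have the same length")
    if not names:
        return text
    mapping = dict(zip(names, replacement_names))
    # Try longer names first so a multi-word name such as "Mary Ann"
    # is matched before the shorter "Mary".
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in ordered) + r")\b")
    return pattern.sub(lambda match: mapping[match.group(1)], text)


names = ["David", "Meredith", "Beth"]
placeholders = [f"[STUDENT_{i}]" for i in range(len(names))]

anonymized = replace_known_names("Yes, David and Meredith?", names, placeholders)
print(anonymized)  # Yes, [STUDENT_0] and [STUDENT_1]?
```

Matching on word boundaries is a deliberate choice here: it keeps a short name like "Beth" from being replaced inside a longer word such as "Bethany".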

[ ]:
# Show the names of the speakers. In your use case, you might load this from a file or database.
print(df[SPEAKER_COLUMN].unique())
['T' 'David' 'Meredith' 'Beth' 'Meredith and David' 'T 2']
[ ]:
# Create list of names and replacement names. We will make the replacement names unique so that we can easily find them later.
known_names = ["David", "Meredith", "Beth"]
known_replacement_names = [f"[STUDENT_{i}]" for i in range(len(known_names))]
print(known_replacement_names)
['[STUDENT_0]', '[STUDENT_1]', '[STUDENT_2]']
[ ]:
# Now let's anonymize the names in the text!
processor = TextPreprocessor()
df = processor.anonymize_known_names(
    df=df,
    text_column=TEXT_COLUMN,
    names=known_names,
    replacement_names=known_replacement_names,
    # We will directly replace the names in the text column.
    # If you want to keep the original text, you can set `target_text_column` to a new column name.
    target_text_column=TEXT_COLUMN
)
[ ]:
# Let's see what the anonymized text looks like!
df.iloc[25:35]
    Unnamed: 0  TimeStamp  Turn  Speaker                      Sentence                                            Teacher Tag                    Student Tag
25  25          NaN        14.0  David                        Yeah, I know, and put ‘em up to there, and tha...   NaN                            4 - Making a Claim
26  26          NaN        14.0  David                        Hey, wait a minute, hey wait, maybe that’s it,...   NaN                            4 - Making a Claim
27  27          NaN        15.0  T                            Now take six of the ones                            1 - None                       NaN
28  28          NaN        15.0  T                            Which is bigger?                                    8 - Press for Accuracy         NaN
29  29          NaN        16.0  Beth                         One half                                            NaN                            4 - Making a Claim
30  30          NaN        17.0  David                        I think one half is...                              NaN                            2 - Relating to Another Student
31  31          NaN        NaN   T                            Yes, [STUDENT_0] and [STUDENT_1]?                   2 - Keeping Everyone Together  NaN
32  32          NaN        17.0  David                        What do you have?                                   NaN                            2 - Relating to Another Student
33  33          NaN        17.0  Meredith and David           Well                                                NaN                            1 - None
34  34          NaN        18.0  David                        we think                                            NaN                            1 - None

💡 Note: Nice, we can see that the text has been anonymized (e.g., row 31)!

However, the speaker names have not been anonymized. Let’s fix that.

[ ]:
df = processor.anonymize_known_names(
    df=df,
    text_column=SPEAKER_COLUMN,
    names=known_names,
    replacement_names=known_replacement_names,
    target_text_column=SPEAKER_COLUMN
)

df.iloc[25:35]
    Unnamed: 0  TimeStamp  Turn  Speaker                      Sentence                                            Teacher Tag                    Student Tag
25  25          NaN        14.0  [STUDENT_0]                  Yeah, I know, and put ‘em up to there, and tha...   NaN                            4 - Making a Claim
26  26          NaN        14.0  [STUDENT_0]                  Hey, wait a minute, hey wait, maybe that’s it,...   NaN                            4 - Making a Claim
27  27          NaN        15.0  T                            Now take six of the ones                            1 - None                       NaN
28  28          NaN        15.0  T                            Which is bigger?                                    8 - Press for Accuracy         NaN
29  29          NaN        16.0  [STUDENT_2]                  One half                                            NaN                            4 - Making a Claim
30  30          NaN        17.0  [STUDENT_0]                  I think one half is...                              NaN                            2 - Relating to Another Student
31  31          NaN        NaN   T                            Yes, [STUDENT_0] and [STUDENT_1]?                   2 - Keeping Everyone Together  NaN
32  32          NaN        17.0  [STUDENT_0]                  What do you have?                                   NaN                            2 - Relating to Another Student
33  33          NaN        17.0  [STUDENT_1] and [STUDENT_0]  Well                                                NaN                            1 - None
34  34          NaN        18.0  [STUDENT_0]                  we think                                            NaN                            1 - None

🎉 Great, now we have anonymized the speaker names as well! Some other great things are that:

  • We have a record of the original names and the anonymized names, so we can go back to the original names if we need to.

  • The anonymized names are consistent: [STUDENT_0] in the SPEAKER_COLUMN refers to the same person as [STUDENT_0] in the TEXT_COLUMN.
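The record of original and anonymized names can be kept as a plain dictionary. The sketch below is a standalone illustration: `restore_names` is a hypothetical helper, not an edu-convokit function, and it assumes the placeholders never occur in the original text.

```python
known_names = ["David", "Meredith", "Beth"]
known_replacement_names = [f"[STUDENT_{i}]" for i in range(len(known_names))]

# Placeholder -> original name. Store this separately from the anonymized
# data, since anyone holding it can re-identify the speakers.
placeholder_to_name = dict(zip(known_replacement_names, known_names))


def restore_names(text, placeholder_to_name):
    """Hypothetical helper: put the original names back into anonymized text."""
    for placeholder, name in placeholder_to_name.items():
        text = text.replace(placeholder, name)
    return text


restored = restore_names("Yes, [STUDENT_0] and [STUDENT_1]?", placeholder_to_name)
print(restored)  # Yes, David and Meredith?
```

Because each placeholder is wrapped in square brackets, `[STUDENT_1]` cannot be mistaken for the start of `[STUDENT_10]` in larger classes.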

This concludes the tutorial on anonymizing data with known names. The next section will cover anonymizing data when you do not know the names of the students and educators in your dataset.

📝 Anonymizing Data with Unknown Names

We will now anonymize the data when we do not know the names of the students and educators in the dataset. Note that the anonymization will be imperfect: we do not know the names in advance, and identifying names consistently is a hard task (cf. named entity recognition), so use this with caution! We will show some of these failure modes in the tutorial.

[ ]:
# Let's start fresh with the original data
df = utils.load_data(data_fname)
df.iloc[25:35]
    Unnamed: 0  TimeStamp  Turn  Speaker                      Sentence                                            Teacher Tag                    Student Tag
25  25          NaN        14.0  David                        Yeah, I know, and put ‘em up to there, and tha...   NaN                            4 - Making a Claim
26  26          NaN        14.0  David                        Hey, wait a minute, hey wait, maybe that’s it,...   NaN                            4 - Making a Claim
27  27          NaN        15.0  T                            Now take six of the ones                            1 - None                       NaN
28  28          NaN        15.0  T                            Which is bigger?                                    8 - Press for Accuracy         NaN
29  29          NaN        16.0  Beth                         One half                                            NaN                            4 - Making a Claim
30  30          NaN        17.0  David                        I think one half is...                              NaN                            2 - Relating to Another Student
31  31          NaN        NaN   T                            Yes, David and Meredith?                            2 - Keeping Everyone Together  NaN
32  32          NaN        17.0  David                        What do you have?                                   NaN                            2 - Relating to Another Student
33  33          NaN        17.0  Meredith and David           Well                                                NaN                            1 - None
34  34          NaN        18.0  David                        we think                                            NaN                            1 - None
[ ]:
processor = TextPreprocessor()
df, (names, replacement_names) = processor.anonymize_unknown_names(
    df=df,
    text_column=SPEAKER_COLUMN,
    target_text_column=SPEAKER_COLUMN,
    # Will return the names and replacement names that were used.
    return_names=True
)

print(f"Names: {names}")
print(f"Replacement names: {replacement_names}")
df.iloc[25:35]

Names: ['Beth', 'David']
Replacement names: ['[PERSON0]', '[PERSON1]']
    Unnamed: 0  TimeStamp  Turn  Speaker                      Sentence                                            Teacher Tag                    Student Tag
25  25          NaN        14.0  [PERSON1]                    Yeah, I know, and put ‘em up to there, and tha...   NaN                            4 - Making a Claim
26  26          NaN        14.0  [PERSON1]                    Hey, wait a minute, hey wait, maybe that’s it,...   NaN                            4 - Making a Claim
27  27          NaN        15.0  T                            Now take six of the ones                            1 - None                       NaN
28  28          NaN        15.0  T                            Which is bigger?                                    8 - Press for Accuracy         NaN
29  29          NaN        16.0  [PERSON0]                    One half                                            NaN                            4 - Making a Claim
30  30          NaN        17.0  [PERSON1]                    I think one half is...                              NaN                            2 - Relating to Another Student
31  31          NaN        NaN   T                            Yes, David and Meredith?                            2 - Keeping Everyone Together  NaN
32  32          NaN        17.0  [PERSON1]                    What do you have?                                   NaN                            2 - Relating to Another Student
33  33          NaN        17.0  Meredith and [PERSON1]       Well                                                NaN                            1 - None
34  34          NaN        18.0  [PERSON1]                    we think                                            NaN                            1 - None

💡 Note: Observe that the name “Meredith” has not been anonymized. anonymize_unknown_names currently uses spaCy’s named entity recognition model to identify names. This model is imperfect and will not identify all names, as we can see here.

There are ways we can improve this. For example:

  • We can manually add “Meredith” to the list of names to anonymize and run anonymize_known_names again.

  • We can cross-reference names from the SSA database to catch names that the model misses. However, this will lead to a high false positive rate, i.e., words that are not actually names will be identified as names.
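The false-positive risk of a name-list lookup can be seen in a small standalone sketch. The name list here is a tiny made-up stand-in for a large reference list, and `flag_possible_names` is a hypothetical helper, not part of edu-convokit.

```python
import re

# Tiny made-up stand-in for a large reference list of first names.
REFERENCE_NAMES = {"meredith", "david", "beth", "will", "mark", "may"}


def flag_possible_names(text, reference_names=REFERENCE_NAMES):
    """Return every word in `text` that appears in the reference name list."""
    words = re.findall(r"[A-Za-z]+", text)
    return [word for word in words if word.lower() in reference_names]


print(flag_possible_names("Yes, David and Meredith?"))
# ['David', 'Meredith']

print(flag_possible_names("Will you mark the half, and may I see it?"))
# ['Will', 'mark', 'may']  <- ordinary words wrongly flagged as names
```

The lookup catches "Meredith", but it also flags common classroom words, which is why a list-based approach usually needs a manual review step.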

To complete the anonymization process, we will use the names and replacement_names returned from anonymize_unknown_names to anonymize the text. This makes the anonymization consistent between the speaker and text columns.

[ ]:
df = processor.anonymize_known_names(
    df=df,
    text_column=TEXT_COLUMN,
    target_text_column=TEXT_COLUMN,
    names=names,
    replacement_names=replacement_names
)

# David is anonymized but Meredith is not (cf. row 31).
df.iloc[25:35]
    Unnamed: 0  TimeStamp  Turn  Speaker                      Sentence                                            Teacher Tag                    Student Tag
25  25          NaN        14.0  [PERSON1]                    Yeah, I know, and put ‘em up to there, and tha...   NaN                            4 - Making a Claim
26  26          NaN        14.0  [PERSON1]                    Hey, wait a minute, hey wait, maybe that’s it,...   NaN                            4 - Making a Claim
27  27          NaN        15.0  T                            Now take six of the ones                            1 - None                       NaN
28  28          NaN        15.0  T                            Which is bigger?                                    8 - Press for Accuracy         NaN
29  29          NaN        16.0  [PERSON0]                    One half                                            NaN                            4 - Making a Claim
30  30          NaN        17.0  [PERSON1]                    I think one half is...                              NaN                            2 - Relating to Another Student
31  31          NaN        NaN   T                            Yes, [PERSON1] and Meredith?                        2 - Keeping Everyone Together  NaN
32  32          NaN        17.0  [PERSON1]                    What do you have?                                   NaN                            2 - Relating to Another Student
33  33          NaN        17.0  Meredith and [PERSON1]       Well                                                NaN                            1 - None
34  34          NaN        18.0  [PERSON1]                    we think                                            NaN                            1 - None
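One way to act on the first suggestion above (manually adding the missed name) is to append it, give it the next unused placeholder, and rerun anonymize_known_names with the extended lists. The sketch below only builds those lists; `extend_names` is a hypothetical helper, and the `[PERSON{i}]` numbering simply continues the pattern seen in the output above.

```python
def extend_names(names, replacement_names, missed_names, template="[PERSON{}]"):
    """Hypothetical helper: add missed names with fresh, unused placeholders."""
    names = list(names)
    replacement_names = list(replacement_names)
    next_index = len(replacement_names)
    for name in missed_names:
        if name in names:
            continue  # already covered; keep its existing placeholder
        # Skip any index whose placeholder is already taken.
        while template.format(next_index) in replacement_names:
            next_index += 1
        names.append(name)
        replacement_names.append(template.format(next_index))
    return names, replacement_names


names, replacement_names = extend_names(
    ["Beth", "David"], ["[PERSON0]", "[PERSON1]"], missed_names=["Meredith"]
)
print(names)              # ['Beth', 'David', 'Meredith']
print(replacement_names)  # ['[PERSON0]', '[PERSON1]', '[PERSON2]']
```

The extended lists can then be passed as `names` and `replacement_names` to anonymize_known_names for both the speaker and text columns, which keeps the placeholders consistent across the two.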

📝 Standardizing Data for Downstream Annotation and Analysis

We will now standardize the data for downstream annotation and analysis. One common standardization is to group the utterances from the same speaker together. We will show how you can do this on the anonymized data.

For other standardizations, please refer to `edu-convokit’s documentation <TODO>`__, or feel free to open a feature request or pull request on our GitHub.

[ ]:
# First let's start fresh with the original data & anonymize it like we did before.
df = utils.load_data(data_fname)
processor = TextPreprocessor()

# Anonymize text
df = processor.anonymize_known_names(
    df=df,
    text_column=TEXT_COLUMN,
    names=known_names,
    replacement_names=known_replacement_names,
    target_text_column=TEXT_COLUMN
)

# Anonymize speakers
df, (names, replacement_names) = processor.anonymize_unknown_names(
    df=df,
    text_column=SPEAKER_COLUMN,
    target_text_column=SPEAKER_COLUMN,
    return_names=True
)

df.iloc[25:35]
    Unnamed: 0  TimeStamp  Turn  Speaker                      Sentence                                            Teacher Tag                    Student Tag
25  25          NaN        14.0  [PERSON1]                    Yeah, I know, and put ‘em up to there, and tha...   NaN                            4 - Making a Claim
26  26          NaN        14.0  [PERSON1]                    Hey, wait a minute, hey wait, maybe that’s it,...   NaN                            4 - Making a Claim
27  27          NaN        15.0  T                            Now take six of the ones                            1 - None                       NaN
28  28          NaN        15.0  T                            Which is bigger?                                    8 - Press for Accuracy         NaN
29  29          NaN        16.0  [PERSON0]                    One half                                            NaN                            4 - Making a Claim
30  30          NaN        17.0  [PERSON1]                    I think one half is...                              NaN                            2 - Relating to Another Student
31  31          NaN        NaN   T                            Yes, [STUDENT_0] and [STUDENT_1]?                   2 - Keeping Everyone Together  NaN
32  32          NaN        17.0  [PERSON1]                    What do you have?                                   NaN                            2 - Relating to Another Student
33  33          NaN        17.0  Meredith and [PERSON1]       Well                                                NaN                            1 - None
34  34          NaN        18.0  [PERSON1]                    we think                                            NaN                            1 - None

Now we’ll group utterances from the same speaker together.

[ ]:
df = processor.merge_utterances_from_same_speaker(
    df=df,
    text_column=TEXT_COLUMN,
    speaker_column=SPEAKER_COLUMN,
    # We're going to directly replace the text in the text column.
    target_text_column=TEXT_COLUMN
)

df.iloc[25:35]
    Sentence                                            Speaker
25  dark green                                          [PERSON1]
26  If you put it up to a whole                         Meredith
27  I’m sorry, what’s the number name for dark green    T
28  One                                                 Meredith and [PERSON1]
29  Ok.                                                 T
30  And you put six ones up to the dark green           Meredith
31  Hold on, I’m a little confused. Tell me again....   T
32  One sixth                                           Meredith
33  One sixth.                                          T
34  And then these, this would be                       [PERSON1]

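We can see that the utterances from the same speaker are now grouped together: no two consecutive rows share a speaker. Note that the output shown contains only the text and speaker columns, and that the row indices no longer correspond to those of the original transcript.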
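Conceptually, merging comes down to collapsing runs of consecutive rows that share a speaker. The sketch below shows that idea on plain Python tuples with `itertools.groupby`. It is an illustration only: `merge_consecutive` is a hypothetical helper that may differ from what merge_utterances_from_same_speaker does internally (for instance, in how it joins text or treats other columns). The example rows are taken from rows 27 to 31 of the anonymized table shown before the merge.

```python
from itertools import groupby


def merge_consecutive(rows):
    """Merge consecutive (speaker, text) rows that share the same speaker."""
    merged = []
    # groupby only groups adjacent items, which is exactly what we want:
    # a speaker who talks again later starts a new merged utterance.
    for speaker, group in groupby(rows, key=lambda row: row[0]):
        merged.append((speaker, " ".join(text for _, text in group)))
    return merged


rows = [
    ("T", "Now take six of the ones"),
    ("T", "Which is bigger?"),
    ("[PERSON0]", "One half"),
    ("[PERSON1]", "I think one half is..."),
    ("T", "Yes, [STUDENT_0] and [STUDENT_1]?"),
]

merged_rows = merge_consecutive(rows)
for speaker, text in merged_rows:
    print(f"{speaker}: {text}")
# T: Now take six of the ones Which is bigger?
# [PERSON0]: One half
# [PERSON1]: I think one half is...
# T: Yes, [STUDENT_0] and [STUDENT_1]?
```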

📝 Conclusion and Where to Go From Here

In this tutorial, we learned how to use TextPreprocessor to:

  1. Anonymize your data when you know the names of your students and educators.

  2. Anonymize your data when you do not know the names of your students and educators.

  3. Standardize your data for downstream feature annotation.

The next natural step is to annotate your data with features of interest. Here are some resources to get you started:

  • `edu-convokit’s documentation on Annotator <https://edu-convokit.readthedocs.io/en/latest/annotation.html>`__

  • `edu-convokit’s tutorial on Annotator <https://colab.research.google.com/drive/1rBwEctFtmQowZHxralH2OGT5uV0zRIQw>`__

If you have any questions, please feel free to reach out to us on `edu-convokit’s GitHub <https://github.com/rosewang2008/edu-convokit>`__.

👋 Happy exploring your data with edu-convokit!
