Pre-Processing
This page contains the documentation for the pre-processing modules. Currently, only text pre-processing is supported. However in the future, we hope to additionally support audio and video pre-processing.
Text Pre-Processing
- class edu_convokit.preprocessors.TextPreprocessor[source]
-
- anonymize_known_names(df: DataFrame, text_column: str, names: str | List[str], replacement_names: str | List[str], target_text_column: str | None = None) DataFrame [source]
Anonymize a dataframe with known names.
- Parameters:
df (pd.DataFrame) – pandas dataframe
text_column (str) – name of column containing text to anonymize
names (Union[str, List[str]]) – names to anonymize
replacement_names (Union[str, List[str]]) – replacement names
target_text_column (str) – name of column to store anonymized text. If None, will overwrite text_column.
- Returns:
dataframe with anonymized text
- Return type:
pd.DataFrame
- anonymize_unknown_names(df: DataFrame, text_column: str, target_text_column: str | None = None, return_names: bool = False) DataFrame [source]
Anonymize a dataframe with unknown names.
- Parameters:
df (pd.DataFrame) – pandas dataframe
text_column (str) – name of column containing text to anonymize
target_text_column (str) – name of column to store anonymized text. If None, will overwrite text_column.
return_names (bool) – if True, return names and replacement_names
- Returns:
dataframe with anonymized text Optional[Tuple[List[str], List[str]]]: names and replacement_names
- Return type:
pd.DataFrame
- get_speaker_text_format(df: DataFrame, text_column: str, speaker_column: str, format: str = '{speaker}: {text}') str [source]
Return a string with the speaker and text formatted according to the format string.
- Parameters:
df (pd.DataFrame) – pandas dataframe
text_column (str) – name of column containing text
speaker_column (str) – name of column containing speaker names
format (str) – format string
- Returns:
formatted string
- Return type:
str
- merge_utterances_from_same_speaker(df: DataFrame, text_column: str, speaker_column: str, target_text_column: str) DataFrame [source]
Create new dataframe where the utterances from same speaker are grouped together.
- Parameters:
df (pd.DataFrame) – pandas dataframe
text_column (str) – name of column containing text to merge utterances
speaker_column (str) – name of column containing speaker names
target_text_column (str) – name of column to store merged text
- Returns:
dataframe with merged text
- Return type:
pd.DataFrame
Token Pre-Processing
- class edu_convokit.preprocessors.TokenPreprocessor(model: str)[source]
-
- format_transcript_within_budget(df: DataFrame, text_column: str, speaker_column: str, max_token_budget: int, format_template: str = '{speaker}: {text}', add_line_numbers: bool = False, print_num_tokens: bool = False) str [source]
Format a transcript within a token budget.
- Parameters:
df (pd.DataFrame) – pandas dataframe
text_column (str) – name of column containing text
speaker_column (str) – name of column containing speaker names
max_token_budget (int) – maximum number of tokens
format_template (str) – format string
add_line_numbers (bool) – whether to add line numbers
print_num_tokens (bool) – whether to print the number of tokens
- Returns:
formatted string
- Return type:
str
- get_num_tokens_from_messages(messages: str | List[str] | List[dict]) int [source]
Return the number of tokens in a string or list of strings. Code adapted from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
- Parameters:
text (Union[str, List[str]]) – string or list of strings
- Returns:
number of tokens in text
- Return type:
Union[int, List[int]]
- get_num_tokens_from_string(string: str) int [source]
Returns the number of tokens in a text string. From https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb.