Pre-Processing

This page documents the pre-processing modules. Currently, only text pre-processing is supported; in the future, we hope to also support audio and video pre-processing.

Text Pre-Processing

class edu_convokit.preprocessors.TextPreprocessor
__init__()
anonymize_known_names(df: DataFrame, text_column: str, names: str | List[str], replacement_names: str | List[str], target_text_column: str | None = None) -> DataFrame

Anonymize a dataframe with known names.

Parameters:
  • df (pd.DataFrame) – pandas dataframe

  • text_column (str) – name of column containing text to anonymize

  • names (Union[str, List[str]]) – names to anonymize

  • replacement_names (Union[str, List[str]]) – replacement names

  • target_text_column (Optional[str]) – name of column to store anonymized text. If None, text_column is overwritten.

Returns:

dataframe with anonymized text

Return type:

pd.DataFrame
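
As a rough illustration of what replacing known names involves, the sketch below applies the same idea to a plain list of strings using only the standard library. The helper `replace_known_names` is hypothetical and is not part of edu_convokit. Whole-word, case-sensitive matching is an assumption, and the library's actual matching rules may differ.

```python
import re
from typing import List, Union


def replace_known_names(
    texts: List[str],
    names: Union[str, List[str]],
    replacement_names: Union[str, List[str]],
) -> List[str]:
    """Hypothetical helper: swap each known name for its replacement."""
    # Accept a single string or a list, mirroring the documented parameter types.
    if isinstance(names, str):
        names = [names]
    if isinstance(replacement_names, str):
        replacement_names = [replacement_names]
    if len(names) != len(replacement_names):
        raise ValueError("names and replacement_names must have the same length")

    anonymized = []
    for text in texts:
        for name, replacement in zip(names, replacement_names):
            # Assumption: match whole words only, so "Ann" does not alter "Anna".
            # A function is used as the replacement so it is inserted literally.
            pattern = rf"\b{re.escape(name)}\b"
            text = re.sub(pattern, lambda _m, r=replacement: r, text)
        anonymized.append(text)
    return anonymized


texts = ["Hi Ann, this is Anna.", "Thanks, Bob!"]
print(replace_known_names(texts, ["Ann", "Bob"], ["[STUDENT]", "[TUTOR]"]))
```

With a DataFrame, the same per-string logic would be applied to every value in text_column and the result written to target_text_column (or back to text_column when no target is given).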

anonymize_unknown_names(df: DataFrame, text_column: str, target_text_column: str | None = None, return_names: bool = False) -> DataFrame

Anonymize a dataframe with unknown names.

Parameters:
  • df (pd.DataFrame) – pandas dataframe

  • text_column (str) – name of column containing text to anonymize

  • target_text_column (Optional[str]) – name of column to store anonymized text. If None, text_column is overwritten.

  • return_names (bool) – if True, return names and replacement_names

Returns:

dataframe with anonymized text. If return_names is True, names and replacement_names (Optional[Tuple[List[str], List[str]]]) are returned as well.

Return type:

pd.DataFrame

get_speaker_text_format(df: DataFrame, text_column: str, speaker_column: str, format: str = '{speaker}: {text}') -> str

Return a string with the speaker and text formatted according to the format string.

Parameters:
  • df (pd.DataFrame) – pandas dataframe

  • text_column (str) – name of column containing text

  • speaker_column (str) – name of column containing speaker names

  • format (str) – format string

Returns:

formatted string

Return type:

str
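
To make the role of the format string concrete, here is a minimal standard-library sketch that renders rows with the default '{speaker}: {text}' template. The function `speaker_text_format` and its `template` argument (standing in for the documented `format` parameter) are hypothetical, rows are plain dictionaries instead of a DataFrame, and joining utterances with newlines is an assumption.

```python
from typing import Dict, List


def speaker_text_format(
    rows: List[Dict[str, str]],
    text_column: str,
    speaker_column: str,
    template: str = "{speaker}: {text}",
) -> str:
    """Hypothetical stand-in: render one line per row and join the lines."""
    lines = [
        template.format(speaker=row[speaker_column], text=row[text_column])
        for row in rows
    ]
    # Assumption: utterances are separated by newlines.
    return "\n".join(lines)


rows = [
    {"speaker": "Tutor", "text": "What is 2 + 2?"},
    {"speaker": "Student", "text": "4"},
]
print(speaker_text_format(rows, "text", "speaker"))
print(speaker_text_format(rows, "text", "speaker", template="[{speaker}] {text}"))
```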

merge_utterances_from_same_speaker(df: DataFrame, text_column: str, speaker_column: str, target_text_column: str) -> DataFrame

Create a new dataframe in which utterances from the same speaker are grouped together.

Parameters:
  • df (pd.DataFrame) – pandas dataframe

  • text_column (str) – name of column containing the utterance text to merge

  • speaker_column (str) – name of column containing speaker names

  • target_text_column (str) – name of column to store merged text

Returns:

dataframe with merged text

Return type:

pd.DataFrame
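
One plausible reading of this grouping is that adjacent rows with the same speaker are collapsed into a single row. The sketch below shows that idea on plain dictionaries using the standard library. The helper `merge_consecutive_utterances` is hypothetical, and both the "adjacent rows only" rule and joining with a single space are assumptions, not confirmed behavior of the library.

```python
from itertools import groupby
from typing import Dict, List


def merge_consecutive_utterances(
    rows: List[Dict[str, str]],
    text_column: str,
    speaker_column: str,
    target_text_column: str,
) -> List[Dict[str, str]]:
    """Hypothetical helper: collapse adjacent rows that share a speaker."""
    merged = []
    # groupby only groups adjacent items, so a speaker who talks again
    # later in the transcript starts a new merged row.
    for speaker, group in groupby(rows, key=lambda row: row[speaker_column]):
        texts = [row[text_column] for row in group]
        # Assumption: merged utterances are joined with a single space.
        merged.append({speaker_column: speaker, target_text_column: " ".join(texts)})
    return merged


rows = [
    {"speaker": "Tutor", "text": "Hello."},
    {"speaker": "Tutor", "text": "Ready to start?"},
    {"speaker": "Student", "text": "Yes."},
    {"speaker": "Tutor", "text": "Great."},
]
for row in merge_consecutive_utterances(rows, "text", "speaker", "merged_text"):
    print(row)
```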

Token Pre-Processing

class edu_convokit.preprocessors.TokenPreprocessor(model: str)
__init__(model: str)
format_transcript_within_budget(df: DataFrame, text_column: str, speaker_column: str, max_token_budget: int, format_template: str = '{speaker}: {text}', add_line_numbers: bool = False, print_num_tokens: bool = False) -> str

Format a transcript within a token budget.

Parameters:
  • df (pd.DataFrame) – pandas dataframe

  • text_column (str) – name of column containing text

  • speaker_column (str) – name of column containing speaker names

  • max_token_budget (int) – maximum number of tokens

  • format_template (str) – format string

  • add_line_numbers (bool) – whether to add line numbers

  • print_num_tokens (bool) – whether to print the number of tokens

Returns:

formatted string

Return type:

str
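
The page does not say how the transcript is shortened when it exceeds the budget, so the following is only a sketch of one possible strategy: keep formatted lines from the start of the transcript until the next line would exceed max_token_budget. The names `format_within_budget` and `count_tokens_whitespace` are hypothetical, and the whitespace word count is a stand-in for a real model tokenizer, so its counts will not match actual token counts.

```python
from typing import Callable, Dict, List


def count_tokens_whitespace(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated words, not model tokens."""
    return len(text.split())


def format_within_budget(
    rows: List[Dict[str, str]],
    text_column: str,
    speaker_column: str,
    max_token_budget: int,
    format_template: str = "{speaker}: {text}",
    add_line_numbers: bool = False,
    count_tokens: Callable[[str], int] = count_tokens_whitespace,
) -> str:
    """Hypothetical sketch: keep leading lines while they fit in the budget."""
    kept: List[str] = []
    used = 0
    for number, row in enumerate(rows, start=1):
        line = format_template.format(
            speaker=row[speaker_column], text=row[text_column]
        )
        if add_line_numbers:
            line = f"{number}. {line}"
        cost = count_tokens(line)
        # Assumption: stop at the first line that would exceed the budget.
        if used + cost > max_token_budget:
            break
        kept.append(line)
        used += cost
    return "\n".join(kept)


rows = [
    {"speaker": "Tutor", "text": "What is 2 + 2?"},
    {"speaker": "Student", "text": "I think it is 4."},
    {"speaker": "Tutor", "text": "Correct, well done."},
]
print(format_within_budget(rows, "text", "speaker", max_token_budget=12))
```

Passing the token counter as an argument keeps the truncation logic independent of any particular tokenizer, so a model-specific counter could be swapped in for the whitespace stand-in.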

get_num_tokens_from_messages(messages: str | List[str] | List[dict]) -> int

Return the number of tokens in a string, a list of strings, or a list of message dictionaries. Code adapted from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb.

Parameters:
  • messages (Union[str, List[str], List[dict]]) – string, list of strings, or list of message dictionaries

Returns:

number of tokens in messages

Return type:

Union[int, List[int]]

get_num_tokens_from_string(string: str) -> int

Return the number of tokens in a text string. From https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb.