Analyzers

This page documents the analyzers module. The analyzers support complementary ways of analyzing language data in education: qualitative, quantitative, lexical, temporal, and GPT-based conversation analysis.

Qualitative Analyzer

class edu_convokit.analyzers.QualitativeAnalyzer(data_dir: str | None = None, filenames: List[str] | str | None = None, dfs: List[DataFrame] | DataFrame | None = None, max_transcripts: int | None = None)

print_examples(speaker_column: str, text_column: str, feature_column: str, df: DataFrame | None = None, feature_value: List[str] | str | None = None, max_num_values: int = 2, max_num_examples: int = 3, show_k_previous_lines: int = 0, show_k_next_lines: int = 0, dropna: bool = False) -> None

Print text examples for a feature value.

Output =

    [
        (
            [(speaker, text), …],       # previous text
            (speaker, current_text),    # current text
            [(speaker, text), …],       # next text
            feature_value,
        ),
        …
    ]

Parameters:
  • speaker_column (str) – name of column containing speaker names

  • text_column (str) – name of column containing the text to show examples from

  • feature_column (str) – name of column containing feature to get examples for

  • df (pd.DataFrame) – pandas dataframe. If None, then use self.dfs from constructor

  • feature_value (Union[str, List[str]]) – if not None, only get examples for this feature value

  • show_k_previous_lines (int) – show k previous lines

  • show_k_next_lines (int) – show k next lines

  • dropna (bool) – drop rows with NaN values in feature_column

Returns:

None

report_examples(speaker_column: str, text_column: str, feature_column: str, df: DataFrame | None = None, feature_value: float | List[float] | None = None, max_num_values: int = 2, max_num_examples: int = 3, show_k_previous_lines: int = 0, show_k_next_lines: int = 0, dropna: bool = False) -> str

Return formatted text examples for a feature value.

Output =

    [
        (
            [(speaker, text), …],       # previous text
            (speaker, current_text),    # current text
            [(speaker, text), …],       # next text
            feature_value,
        ),
        …
    ]

Parameters:
  • speaker_column (str) – name of column containing speaker names

  • text_column (str) – name of column containing the text to show examples from

  • feature_column (str) – name of column containing feature to get examples for

  • df (pd.DataFrame) – pandas dataframe. If None, then use self.dfs from constructor

  • feature_value (Union[float, List[float]]) – if not None, only get examples for this feature value

  • show_k_previous_lines (int) – show k previous lines

  • show_k_next_lines (int) – show k next lines

  • dropna (bool) – drop rows with NaN values in feature_column

Returns:

formatted examples

Return type:

str
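
The example structure described above (previous lines, current line, next lines, feature value) can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: the helper collect_examples, the list-of-dicts transcript, and the asks_question column are all hypothetical, and the real methods operate on pandas dataframes.

```python
from typing import Dict, List, Tuple

Line = Tuple[str, str]  # (speaker, text)


def collect_examples(
    rows: List[Dict[str, object]],
    speaker_column: str,
    text_column: str,
    feature_column: str,
    feature_value: object,
    max_num_examples: int = 3,
    show_k_previous_lines: int = 0,
    show_k_next_lines: int = 0,
) -> List[Tuple[List[Line], Line, List[Line], object]]:
    """Hypothetical helper: build (previous, current, next, value) tuples."""

    def as_line(row: Dict[str, object]) -> Line:
        return (str(row[speaker_column]), str(row[text_column]))

    examples = []
    for i, row in enumerate(rows):
        if row.get(feature_column) != feature_value:
            continue
        # Context windows are clipped at the start and end of the transcript.
        start = max(0, i - show_k_previous_lines)
        previous = [as_line(r) for r in rows[start:i]]
        following = [as_line(r) for r in rows[i + 1 : i + 1 + show_k_next_lines]]
        examples.append((previous, as_line(row), following, feature_value))
        if len(examples) >= max_num_examples:
            break
    return examples


# Hypothetical toy transcript; "asks_question" stands in for any feature column.
transcript = [
    {"speaker": "tutor", "text": "What is 3 times 4?", "asks_question": 1},
    {"speaker": "student", "text": "Twelve.", "asks_question": 0},
    {"speaker": "tutor", "text": "Great. How did you get that?", "asks_question": 1},
    {"speaker": "student", "text": "I added 4 three times.", "asks_question": 0},
]

examples = collect_examples(
    transcript,
    "speaker",
    "text",
    "asks_question",
    feature_value=1,
    show_k_previous_lines=1,
    show_k_next_lines=1,
)
for previous, current, following, value in examples:
    print(previous, current, following, value)
```

With one line of context on each side, the first example has no previous lines because it opens the transcript, which is why the windows are clipped rather than padded.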

Quantitative Analyzer

class edu_convokit.analyzers.QuantitativeAnalyzer(data_dir: str | None = None, filenames: List[str] | str | None = None, dfs: List[DataFrame] | DataFrame | None = None, max_transcripts: int | None = None)

plot_statistics(feature_column: str, df: DataFrame | None = None, speaker_column: str | None = None, value_as: str = 'raw', dropna: bool = False, title: str | None = None, xlabel: str | None = None, ylabel: str | None = None, save_path: str | None = None, xrange: Tuple[float, float] | None = None, yrange: Tuple[float, float] | None = None, label_mapping: Dict[str, str] | None = None)

Plot statistics for a feature across all speakers.

Parameters:
  • feature_column (str) – name of column containing feature to compute statistics for

  • df (pd.DataFrame) – pandas dataframe. If None, then use self.dfs from constructor

  • speaker_column (str) – name of column containing speaker names

  • value_as (str) – one of 'raw', 'avg', 'prop', or 'all'

  • dropna (bool) – drop rows with NaN values in feature_column

  • title (str) – title of plot

  • xlabel (str) – x-axis label

  • ylabel (str) – y-axis label

  • save_path (str) – path to save plot

  • xrange (Tuple[float, float]) – x-axis range

  • yrange (Tuple[float, float]) – y-axis range

  • label_mapping (Dict[str, str]) – mapping from speaker names to labels

Returns:

None

print_statistics(feature_column: str, df: DataFrame | None = None, speaker_column: str | None = None, value_as: str = 'raw', dropna: bool = False)

Print statistics for a feature across all speakers.

Parameters:
  • feature_column (str) – name of column containing feature to compute statistics for

  • df (pd.DataFrame) – pandas dataframe. If None, then use self.dfs from constructor

  • speaker_column (str) – name of column containing speaker names

  • value_as (str) – one of 'raw', 'avg', or 'prop'

  • dropna (bool) – drop rows with NaN values in feature_column

Returns:

None

report_statistics(feature_column: str, df: DataFrame | None = None, speaker_column: str | None = None, value_as: str = 'raw', dropna: bool = False) -> str

Report statistics for a feature across all speakers.

Parameters:
  • feature_column (str) – name of column containing feature to compute statistics for

  • df (pd.DataFrame) – pandas dataframe. If None, then use self.dfs from constructor

  • speaker_column (str) – name of column containing speaker names

  • value_as (str) – one of 'raw', 'avg', 'prop', or 'all'

  • dropna (bool) – drop rows with NaN values in feature_column

Returns:

string representation of statistics

Return type:

str
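
The value_as options are not defined on this page. The sketch below shows one plausible reading, stated as an assumption rather than the library's behavior: 'raw' as the per-speaker sum of the feature, 'avg' as the per-speaker mean, and 'prop' as each speaker's share of the overall total. The helper speaker_statistics and the talktime column are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def speaker_statistics(
    rows: List[Dict[str, object]],
    speaker_column: str,
    feature_column: str,
    value_as: str = "raw",
) -> Dict[str, float]:
    """Hypothetical helper: aggregate a numeric feature per speaker."""
    values = defaultdict(list)
    for row in rows:
        values[str(row[speaker_column])].append(float(row[feature_column]))

    totals = {speaker: sum(v) for speaker, v in values.items()}
    if value_as == "raw":  # assumed meaning: per-speaker sum
        return totals
    if value_as == "avg":  # assumed meaning: per-speaker mean
        return {speaker: totals[speaker] / len(v) for speaker, v in values.items()}
    if value_as == "prop":  # assumed meaning: share of the overall total
        grand_total = sum(totals.values())
        return {
            speaker: (total / grand_total if grand_total else 0.0)
            for speaker, total in totals.items()
        }
    raise ValueError(f"unsupported value_as: {value_as!r}")


# Hypothetical data: "talktime" is the number of words in each utterance.
rows = [
    {"speaker": "tutor", "talktime": 6},
    {"speaker": "student", "talktime": 2},
    {"speaker": "tutor", "talktime": 6},
    {"speaker": "student", "talktime": 6},
]

raw = speaker_statistics(rows, "speaker", "talktime", value_as="raw")
avg = speaker_statistics(rows, "speaker", "talktime", value_as="avg")
prop = speaker_statistics(rows, "speaker", "talktime", value_as="prop")
print(raw, avg, prop)
```

Under this reading, 'prop' is useful for comparing speakers across transcripts of different lengths, since the values for one transcript sum to 1.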

Lexical Analyzer

class edu_convokit.analyzers.LexicalAnalyzer(data_dir: str | None = None, filenames: List[str] | str | None = None, dfs: List[DataFrame] | DataFrame | None = None, max_transcripts: int | None = None)

plot_log_odds(df1: DataFrame, df2: DataFrame, text_column1: str, text_column2: str, group1_name: str = 'Group 1', group2_name: str = 'Group 2', topk: int = 5, save_path: str | None = None, zscore: bool = True, logodds_factor: float = 1.0, run_text_formatting: bool = False, run_ngrams: bool = False, n: int = 0) -> None

Plot topk log-odds for each df: ([(word, log-odds), …], [(word, log-odds), …])

Parameters:
  • df1 (pd.DataFrame) – pandas dataframe

  • df2 (pd.DataFrame) – pandas dataframe

  • text_column1 (str) – name of column containing text to analyze in df1

  • text_column2 (str) – name of column containing text to analyze in df2

  • group1_name (str) – name of group 1

  • group2_name (str) – name of group 2

  • topk (int) – number of top words to return

  • save_path (str) – path to save plot

  • zscore (bool) – whether to z-score the log-odds

  • logodds_factor (float) – factor to multiply standard deviation by to determine top words

  • run_text_formatting (bool) – whether to run standard text formatting

  • run_ngrams (bool) – whether to run ngrams

  • n (int) – n for ngrams

print_log_odds(df1: DataFrame, df2: DataFrame, text_column1: str, text_column2: str, topk: int = 5, zscore: bool = True, logodds_factor: float = 1.0, run_text_formatting: bool = False, run_ngrams: bool = False, n: int = 0) -> None

Print topk log-odds for each df: ([(word, log-odds), …], [(word, log-odds), …])

Parameters:
  • df1 (pd.DataFrame) – pandas dataframe

  • df2 (pd.DataFrame) – pandas dataframe

  • text_column1 (str) – name of column containing text to analyze in df1

  • text_column2 (str) – name of column containing text to analyze in df2

  • topk (int) – number of top words to return

  • zscore (bool) – whether to z-score the log-odds

  • logodds_factor (float) – factor to multiply standard deviation by to determine top words

  • run_text_formatting (bool) – whether to run standard text formatting

  • run_ngrams (bool) – whether to run ngrams

  • n (int) – n for ngrams

print_word_frequency(text_column: str, topk: int = 5, df: DataFrame | None = None, speaker_column: str | None = None, run_text_formatting: bool = False, run_ngrams: bool = False, n: int = 0) -> str

Print word frequency for a dataframe.

Parameters:
  • df (pd.DataFrame) – pandas dataframe

  • text_column (str) – name of column containing text to analyze

  • topk (int) – number of top words to return

  • speaker_column (str) – name of column containing speaker names. If specified, it will report word frequency for each speaker.

  • run_text_formatting (bool) – whether to run standard text formatting

  • run_ngrams (bool) – whether to run ngrams

  • n (int) – n for ngrams

Returns:

word frequency

Return type:

str

report_log_odds(df1: DataFrame, df2: DataFrame, text_column1: str, text_column2: str, topk: int = 5, zscore: bool = True, logodds_factor: float = 1.0, run_text_formatting: bool = False, run_ngrams: bool = False, n: int = 0) -> str

Return formatted topk log-odds for each df: ([(word, log-odds), …], [(word, log-odds), …])

Parameters:
  • df1 (pd.DataFrame) – pandas dataframe

  • df2 (pd.DataFrame) – pandas dataframe

  • text_column1 (str) – name of column containing text to analyze in df1

  • text_column2 (str) – name of column containing text to analyze in df2

  • topk (int) – number of top words to return

  • zscore (bool) – whether to z-score the log-odds

  • logodds_factor (float) – factor to multiply standard deviation by to determine top words

  • run_text_formatting (bool) – whether to run standard text formatting

  • run_ngrams (bool) – whether to run ngrams

  • n (int) – n for ngrams

Returns:

formatted topk log-odds for each df: ([(word, log-odds), …], [(word, log-odds), …])

Return type:

str
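
To make the idea of comparing two groups by log-odds concrete, here is a minimal sketch of a smoothed, optionally z-scored log-odds ratio over whitespace tokens. It is a generic textbook formulation with a uniform prior, and the library may compute its scores differently. The function smoothed_log_odds and the parameter alpha are hypothetical, and logodds_factor is not modeled.

```python
import math
from collections import Counter
from typing import List, Tuple

Scored = List[Tuple[str, float]]


def smoothed_log_odds(
    texts1: List[str],
    texts2: List[str],
    topk: int = 5,
    zscore: bool = True,
    alpha: float = 0.5,  # hypothetical smoothing constant (uniform prior)
) -> Tuple[Scored, Scored]:
    """Return the top-k words for each group; positive scores favor group 1."""
    counts1 = Counter(w for t in texts1 for w in t.lower().split())
    counts2 = Counter(w for t in texts2 for w in t.lower().split())
    vocab = set(counts1) | set(counts2)
    if len(vocab) < 2:
        return [], []  # odds are undefined with fewer than two word types

    n1 = sum(counts1.values()) + alpha * len(vocab)
    n2 = sum(counts2.values()) + alpha * len(vocab)

    scores = {}
    for word in vocab:
        a = counts1[word] + alpha
        b = counts2[word] + alpha
        # Difference of the log-odds of the word in each group.
        score = math.log(a / (n1 - a)) - math.log(b / (n2 - b))
        if zscore:
            # Divide by an approximate standard deviation of the difference.
            score /= math.sqrt(1.0 / a + 1.0 / b)
        scores[word] = score

    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:topk], ranked[::-1][:topk]


# Hypothetical toy data standing in for two text columns.
group1_texts = ["add the numbers", "add the fractions then add again"]
group2_texts = ["i think the answer is five", "i think so"]

top_group1, top_group2 = smoothed_log_odds(group1_texts, group2_texts, topk=3)
print(top_group1)
print(top_group2)
```

The z-scoring step down-weights words with very small counts, so a word seen once in one group does not outrank a word that is consistently frequent there.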

report_word_frequency(text_column: str, topk: int = 5, df: DataFrame | None = None, speaker_column: str | None = None, run_text_formatting: bool = False, run_ngrams: bool = False, n: int = 0) -> str

Reports word frequency for a dataframe as a string.

Parameters:
  • df (pd.DataFrame) – pandas dataframe

  • text_column (str) – name of column containing text to analyze

  • topk (int) – number of top words to return

  • speaker_column (str) – name of column containing speaker names. If specified, it will report word frequency for each speaker.

  • run_text_formatting (bool) – whether to run standard text formatting

  • run_ngrams (bool) – whether to run ngrams

  • n (int) – n for ngrams

Returns:

word frequency

Return type:

str
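
As a rough illustration of top-k word and n-gram frequency, the following sketch counts whitespace-separated tokens with collections.Counter. It is an assumption-laden stand-in rather than the library's code: the helper word_frequency is hypothetical, tokenization here is a plain lowercase split, and n is taken to be at least 1.

```python
from collections import Counter
from typing import List, Tuple


def word_frequency(texts: List[str], topk: int = 5, n: int = 1) -> List[Tuple[str, int]]:
    """Hypothetical helper: count the most common n-grams across texts."""
    if n < 1:
        raise ValueError("n must be at least 1 in this sketch")
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        # Slide a window of n tokens over each utterance separately,
        # so n-grams never span two utterances.
        counts.update(
            " ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts.most_common(topk)


texts = ["the answer is four", "the answer is not five"]

unigrams = word_frequency(texts, topk=3)
trigrams = word_frequency(texts, topk=1, n=3)
print(unigrams)
print(trigrams)
```

To report frequencies per speaker, the same function could be applied to each speaker's utterances in turn.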

Temporal Analyzer

class edu_convokit.analyzers.TemporalAnalyzer(data_dir: str | None = None, filenames: List[str] | str | None = None, dfs: List[DataFrame] | DataFrame | None = None, max_transcripts: int | None = None)

plot_temporal_statistics(feature_column: str, dfs: List[DataFrame] | DataFrame | None = None, speaker_column: str | None = None, value_as: str = 'raw', num_bins: int = 10, dropna: bool = False, title: str | None = None, xlabel: str | None = None, ylabel: str | None = None, save_path: str | None = None, hue: str | None = None, xrange: Tuple[float, float] | None = None, yrange: Tuple[float, float] | None = None, label_mapping: Dict[str, str] | None = None)

Plot statistics for a feature across all speakers and across bins.

Parameters:
  • feature_column (str) – name of column containing feature to compute statistics for

  • dfs (Union[List[pd.DataFrame], pd.DataFrame]) – list of dataframes. If None, use self.dfs from constructor

  • speaker_column (str) – name of column containing speaker names

  • value_as (str) – one of 'raw', 'avg', or 'prop'

  • num_bins (int) – number of bins to split the data into

  • dropna (bool) – drop rows with NaN values in feature_column

  • title (str) – title of plot

  • xlabel (str) – x-axis label

  • ylabel (str) – y-axis label

  • save_path (str) – path to save plot

  • hue (str) – name of column to use for hue

  • xrange (Tuple[float, float]) – x-axis range

  • yrange (Tuple[float, float]) – y-axis range

  • label_mapping (Dict[str, str]) – mapping from original label to new label

Returns:

None

print_statistics(feature_column: str, speaker_column: str | None = None, value_as: str = 'raw', dropna: bool = False, dfs: List[DataFrame] | DataFrame | None = None)

Print statistics for a feature across all speakers.

Parameters:
  • feature_column (str) – name of column containing feature to compute statistics for

  • speaker_column (str) – name of column containing speaker names

  • value_as (str) – one of 'raw', 'avg', or 'prop'

  • dropna (bool) – drop rows with NaN values in feature_column

  • dfs (Union[List[pd.DataFrame], pd.DataFrame]) – list of dataframes. If None, use self.dfs from constructor

Returns:

None

report_statistics(feature_column: str, dfs: List[DataFrame] | DataFrame | None = None, speaker_column: str | None = None, num_bins: int = 10, value_as: str = 'raw', dropna: bool = False) -> str

Report statistics for a feature across all speakers across the bins.

Parameters:
  • feature_column (str) – name of column containing feature to compute statistics for

  • dfs (Union[List[pd.DataFrame], pd.DataFrame]) – list of dataframes. If None, use self.dfs from constructor

  • speaker_column (str) – name of column containing speaker names

  • num_bins (int) – number of bins to split the data into

  • value_as (str) – one of 'raw', 'avg', 'prop', or 'all'

  • dropna (bool) – drop rows with NaN values in feature_column

Returns:

string representation of statistics

Return type:

str
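
The num_bins parameter splits the data into bins. One simple way to do this, sketched below as an assumption about the general technique and not as the library's implementation, is to divide each transcript into num_bins consecutive segments of roughly equal length and sum the feature within each segment. The helpers bin_feature and bin_transcripts are hypothetical.

```python
from typing import List


def bin_feature(values: List[float], num_bins: int = 10) -> List[float]:
    """Sum feature values within num_bins consecutive, roughly equal segments."""
    totals = [0.0] * num_bins
    n = len(values)
    for i, value in enumerate(values):
        # Map utterance position i in [0, n) to a bin index in [0, num_bins).
        totals[i * num_bins // n] += value
    return totals


def bin_transcripts(transcripts: List[List[float]], num_bins: int = 10) -> List[float]:
    """Aggregate per-bin sums across transcripts of different lengths."""
    per_transcript = [bin_feature(values, num_bins) for values in transcripts]
    return [sum(column) for column in zip(*per_transcript)]


# Hypothetical binary feature values, one per utterance, in transcript order.
long_transcript = [1, 0, 1, 1, 0, 0, 1, 1]
short_transcript = [0, 1, 1, 0]

single = bin_feature(long_transcript, num_bins=4)
combined = bin_transcripts([long_transcript, short_transcript], num_bins=4)
print(single)
print(combined)
```

Binning by relative position lets transcripts of different lengths be compared on the same axis, from the start of a session to its end.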

GPT Conversation Analyzer

class edu_convokit.analyzers.GPTConversationAnalyzer(data_dir: str | None = None, filenames: List[str] | str | None = None, dfs: List[DataFrame] | DataFrame | None = None, max_transcripts: int | None = None)

preview_prompt(df: DataFrame, prompt_name: str, text_column: str, speaker_column: str, model: str = 'gpt-4', add_line_numbers: bool = False, format_template: str = '{speaker}: {text}', keep_transcript_fraction: float | None = None) -> str

Preview a prompt on a dataframe and return the prompt.

Parameters:
  • df (pd.DataFrame) – pandas dataframe

  • prompt_name (str) – name of prompt

  • text_column (str) – name of column containing text

  • speaker_column (str) – name of column containing speaker names

  • model (str) – model name

  • add_line_numbers (bool) – whether to add line numbers

  • format_template (str) – format string

  • keep_transcript_fraction (float) – fraction of transcript to keep

Returns:

prompt

Return type:

str
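
The default format_template '{speaker}: {text}' suggests that each utterance is rendered with Python string formatting before it is placed in the prompt. The sketch below illustrates that idea only. The helper format_transcript is hypothetical, and two details are assumptions: that keep_transcript_fraction keeps the leading portion of the transcript, and that line numbers are written as "1. " prefixes.

```python
from typing import Dict, List, Optional


def format_transcript(
    rows: List[Dict[str, str]],
    speaker_column: str,
    text_column: str,
    format_template: str = "{speaker}: {text}",
    add_line_numbers: bool = False,
    keep_transcript_fraction: Optional[float] = None,
) -> str:
    """Hypothetical helper: render utterances as prompt-ready text."""
    if keep_transcript_fraction is not None:
        # Assumption: keep the first fraction of the transcript, at least one row.
        keep = max(1, int(len(rows) * keep_transcript_fraction))
        rows = rows[:keep]

    lines = []
    for number, row in enumerate(rows, start=1):
        line = format_template.format(
            speaker=row[speaker_column], text=row[text_column]
        )
        if add_line_numbers:
            line = f"{number}. {line}"  # assumed numbering style
        lines.append(line)
    return "\n".join(lines)


rows = [
    {"speaker": "tutor", "text": "What is 3 times 4?"},
    {"speaker": "student", "text": "Twelve."},
    {"speaker": "tutor", "text": "Great. How did you get that?"},
    {"speaker": "student", "text": "I added 4 three times."},
]

formatted = format_transcript(
    rows,
    "speaker",
    "text",
    add_line_numbers=True,
    keep_transcript_fraction=0.5,
)
print(formatted)
```

Truncating the transcript is one way to stay within a model's context limit, and line numbers let a response refer back to specific utterances.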

run_prompt(df: DataFrame, prompt_name: str, text_column: str, speaker_column: str, model: str = 'gpt-4', add_line_numbers: bool = False, format_template: str = '{speaker}: {text}', temperature: float = 0.0, max_tokens: int | None = None, keep_transcript_fraction: float | None = None) -> Tuple[str, str]

Run a prompt on a dataframe and return the (prompt, response) pair.

Parameters:
  • df (pd.DataFrame) – pandas dataframe

  • prompt_name (str) – name of prompt

  • text_column (str) – name of column containing text

  • speaker_column (str) – name of column containing speaker names

  • model (str) – model name

  • add_line_numbers (bool) – whether to add line numbers

  • format_template (str) – format string

  • temperature (float) – temperature

  • max_tokens (int) – maximum number of tokens

  • keep_transcript_fraction (float) – fraction of transcript to keep

Returns:

(prompt, response) pair

Return type:

Tuple[str, str]