Analyzers
This page documents the analyzers module. It provides qualitative, quantitative, lexical, temporal, and GPT-based analyzers for language data in education.
Qualitative Analyzer
- class edu_convokit.analyzers.QualitativeAnalyzer(data_dir: str | None = None, filenames: List[str] | str | None = None, dfs: List[DataFrame] | DataFrame | None = None, max_transcripts: int | None = None)
- print_examples(speaker_column: str, text_column: str, feature_column: str, df: DataFrame | None = None, feature_value: List[str] | str | None = None, max_num_values: int = 2, max_num_examples: int = 3, show_k_previous_lines: int = 0, show_k_next_lines: int = 0, dropna: bool = False) -> None
Print text examples for a feature value.
- Output =
[
    ([(speaker, text), …],      # previous text
     (speaker, current_text),   # current text
     [(speaker, text), …],      # next text
     feature_value),
    …
]
- Parameters:
speaker_column (str) – name of column containing speaker names
text_column (str) – name of column containing the text to show as examples
feature_column (str) – name of column containing feature to get examples for
df (pd.DataFrame) – pandas dataframe. If None, then use self.dfs from constructor
feature_value (Union[str, List[str]]) – if not None, only get examples for this feature value
show_k_previous_lines (int) – show k previous lines
show_k_next_lines (int) – show k next lines
dropna (bool) – drop rows with NaN values in feature_column
- Returns:
None
- report_examples(speaker_column: str, text_column: str, feature_column: str, df: DataFrame | None = None, feature_value: float | List[float] | None = None, max_num_values: int = 2, max_num_examples: int = 3, show_k_previous_lines: int = 0, show_k_next_lines: int = 0, dropna: bool = False) -> str
Return formatted text examples for a feature value.
- Output =
[
    ([(speaker, text), …],      # previous text
     (speaker, current_text),   # current text
     [(speaker, text), …],      # next text
     feature_value),
    …
]
- Parameters:
speaker_column (str) – name of column containing speaker names
text_column (str) – name of column containing the text to show as examples
feature_column (str) – name of column containing feature to get examples for
df (pd.DataFrame) – pandas dataframe. If None, then use self.dfs from constructor
feature_value (Union[float, List[float]]) – if not None, only get examples for this feature value
show_k_previous_lines (int) – show k previous lines
show_k_next_lines (int) – show k next lines
dropna (bool) – drop rows with NaN values in feature_column
- Returns:
formatted examples
- Return type:
str
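The example structure described above (previous lines, current line, next lines, feature value) can be sketched in plain Python. This is an illustration of the data shape only, not the library's implementation; the helper name `collect_examples` and the `(speaker, text, value)` row layout are assumptions made for this sketch.

```python
from typing import List, Optional, Tuple

# Hypothetical row layout for this sketch: (speaker, text, feature value).
Row = Tuple[str, str, Optional[float]]


def collect_examples(rows: List[Row], feature_value: float,
                     k_prev: int = 0, k_next: int = 0,
                     max_num_examples: int = 3):
    """Return (previous, current, next, feature_value) tuples for matching rows."""
    examples = []
    for i, (speaker, text, value) in enumerate(rows):
        if value != feature_value:
            continue
        # Up to k_prev lines before and k_next lines after the matching line.
        previous = [(s, t) for s, t, _ in rows[max(0, i - k_prev):i]]
        following = [(s, t) for s, t, _ in rows[i + 1:i + 1 + k_next]]
        examples.append((previous, (speaker, text), following, value))
        if len(examples) == max_num_examples:
            break
    return examples


transcript = [
    ("tutor", "What is 3 times 4?", 0.0),
    ("student", "Twelve.", 0.0),
    ("tutor", "Great job, that is right!", 1.0),
    ("student", "Thanks!", 0.0),
]
examples = collect_examples(transcript, feature_value=1.0, k_prev=1, k_next=1)
for previous, current, following, value in examples:
    print(previous, current, following, value)
```

Here `k_prev` and `k_next` play the role of show_k_previous_lines and show_k_next_lines.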
Quantitative Analyzer
- class edu_convokit.analyzers.QuantitativeAnalyzer(data_dir: str | None = None, filenames: List[str] | str | None = None, dfs: List[DataFrame] | DataFrame | None = None, max_transcripts: int | None = None)
- plot_statistics(feature_column: str, df: DataFrame | None = None, speaker_column: str | None = None, value_as: str = 'raw', dropna: bool = False, title: str | None = None, xlabel: str | None = None, ylabel: str | None = None, save_path: str | None = None, xrange: Tuple[float, float] | None = None, yrange: Tuple[float, float] | None = None, label_mapping: Dict[str, str] | None = None)
Plot statistics for a feature across all speakers.
- Parameters:
feature_column (str) – name of column containing feature to compute statistics for
df (pd.DataFrame) – pandas dataframe. If None, then use self.dfs from constructor
speaker_column (str) – name of column containing speaker names
value_as (str) – how to report values; one of 'raw', 'avg', 'prop', or 'all'
dropna (bool) – drop rows with NaN values in feature_column
title (str) – title of plot
xlabel (str) – x-axis label
ylabel (str) – y-axis label
save_path (str) – path to save plot
xrange (Tuple[float, float]) – x-axis range
yrange (Tuple[float, float]) – y-axis range
label_mapping (Dict[str, str]) – mapping from speaker names to labels
- Returns:
None
- print_statistics(feature_column: str, df: DataFrame | None = None, speaker_column: str | None = None, value_as: str = 'raw', dropna: bool = False)
Print statistics for a feature across all speakers.
- Parameters:
feature_column (str) – name of column containing feature to compute statistics for
df (pd.DataFrame) – pandas dataframe. If None, then use self.dfs from constructor
speaker_column (str) – name of column containing speaker names
value_as (str) – how to report values; one of 'raw', 'avg', or 'prop'
dropna (bool) – drop rows with NaN values in feature_column
- Returns:
None
- report_statistics(feature_column: str, df: DataFrame | None = None, speaker_column: str | None = None, value_as: str = 'raw', dropna: bool = False) -> str
Report statistics for a feature across all speakers.
- Parameters:
feature_column (str) – name of column containing feature to compute statistics for
df (pd.DataFrame) – pandas dataframe. If None, then use self.dfs from constructor
speaker_column (str) – name of column containing speaker names
value_as (str) – how to report values; one of 'raw', 'avg', 'prop', or 'all'
dropna (bool) – drop rows with NaN values in feature_column
- Returns:
string representation of statistics
- Return type:
str
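The page does not define the value_as options. As a rough sketch, assuming 'raw' is the per-speaker total, 'avg' the per-speaker mean, and 'prop' each speaker's share of the overall total, the three views could be computed as follows. These meanings and the helper `speaker_statistics` are assumptions for illustration, not the library's confirmed behaviour.

```python
from collections import defaultdict


def speaker_statistics(rows, value_as="raw"):
    """rows: iterable of (speaker, feature_value). Returns {speaker: statistic}.

    Assumed meanings (not confirmed by the documentation):
      raw  -> sum of the feature per speaker
      avg  -> mean of the feature per speaker
      prop -> speaker's share of the total across all speakers
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for speaker, value in rows:
        totals[speaker] += value
        counts[speaker] += 1
    if value_as == "raw":
        return dict(totals)
    if value_as == "avg":
        return {s: totals[s] / counts[s] for s in totals}
    if value_as == "prop":
        grand_total = sum(totals.values())
        return {s: (totals[s] / grand_total if grand_total else 0.0) for s in totals}
    raise ValueError(f"unsupported value_as: {value_as!r}")


# Feature here is, say, the number of words in each utterance.
rows = [("tutor", 12), ("student", 3), ("tutor", 8), ("student", 2), ("tutor", 10)]
print(speaker_statistics(rows, "raw"))   # {'tutor': 30.0, 'student': 5.0}
print(speaker_statistics(rows, "avg"))   # {'tutor': 10.0, 'student': 2.5}
print(speaker_statistics(rows, "prop"))
```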
Lexical Analyzer
- class edu_convokit.analyzers.LexicalAnalyzer(data_dir: str | None = None, filenames: List[str] | str | None = None, dfs: List[DataFrame] | DataFrame | None = None, max_transcripts: int | None = None)
- plot_log_odds(df1: DataFrame, df2: DataFrame, text_column1: str, text_column2: str, group1_name: str = 'Group 1', group2_name: str = 'Group 2', topk: int = 5, save_path: str | None = None, zscore: bool = True, logodds_factor: float = 1.0, run_text_formatting: bool = False, run_ngrams: bool = False, n: int = 0) -> None
Plot topk log-odds for each df: ([(word, log-odds), …], [(word, log-odds), …])
- Parameters:
df1 (pd.DataFrame) – pandas dataframe
df2 (pd.DataFrame) – pandas dataframe
text_column1 (str) – name of column containing text to analyze in df1
text_column2 (str) – name of column containing text to analyze in df2
group1_name (str) – name of group 1
group2_name (str) – name of group 2
topk (int) – number of top words to return
save_path (str) – path to save plot
zscore (bool) – whether to z-score the log-odds
logodds_factor (float) – factor to multiply standard deviation by to determine top words
run_text_formatting (bool) – whether to run standard text formatting
run_ngrams (bool) – whether to run ngrams
n (int) – n for ngrams
- print_log_odds(df1: DataFrame, df2: DataFrame, text_column1: str, text_column2: str, topk: int = 5, zscore: bool = True, logodds_factor: float = 1.0, run_text_formatting: bool = False, run_ngrams: bool = False, n: int = 0) -> None
Print topk log-odds for each df: ([(word, log-odds), …], [(word, log-odds), …])
- Parameters:
df1 (pd.DataFrame) – pandas dataframe
df2 (pd.DataFrame) – pandas dataframe
text_column1 (str) – name of column containing text to analyze in df1
text_column2 (str) – name of column containing text to analyze in df2
topk (int) – number of top words to return
zscore (bool) – whether to z-score the log-odds
logodds_factor (float) – factor to multiply standard deviation by to determine top words
run_text_formatting (bool) – whether to run standard text formatting
run_ngrams (bool) – whether to run ngrams
n (int) – n for ngrams
- print_word_frequency(text_column: str, topk: int = 5, df: DataFrame | None = None, speaker_column: str | None = None, run_text_formatting: bool = False, run_ngrams: bool = False, n: int = 0) -> str
Print word frequency for a dataframe.
- Parameters:
df (pd.DataFrame) – pandas dataframe
text_column (str) – name of column containing text to analyze
topk (int) – number of top words to return
speaker_column (str) – name of column containing speaker names. If specified, it will report word frequency for each speaker.
run_text_formatting (bool) – whether to run standard text formatting
run_ngrams (bool) – whether to run ngrams
n (int) – n for ngrams
- Returns:
word frequency
- Return type:
str
- report_log_odds(df1: DataFrame, df2: DataFrame, text_column1: str, text_column2: str, topk: int = 5, zscore: bool = True, logodds_factor: float = 1.0, run_text_formatting: bool = False, run_ngrams: bool = False, n: int = 0) -> str
Return formatted topk log-odds for each df: ([(word, log-odds), …], [(word, log-odds), …])
- Parameters:
df1 (pd.DataFrame) – pandas dataframe
df2 (pd.DataFrame) – pandas dataframe
text_column1 (str) – name of column containing text to analyze in df1
text_column2 (str) – name of column containing text to analyze in df2
topk (int) – number of top words to return
zscore (bool) – whether to z-score the log-odds
logodds_factor (float) – factor to multiply standard deviation by to determine top words
run_text_formatting (bool) – whether to run standard text formatting
run_ngrams (bool) – whether to run ngrams
n (int) – n for ngrams
- Returns:
formatted topk log-odds for each df: ([(word, log-odds), …], [(word, log-odds), …])
- Return type:
str
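To make the log-odds output concrete, here is a minimal sketch of a smoothed log-odds ratio between two groups of texts, with optional standardization across the vocabulary. The add-one smoothing, the whitespace tokenization, and the helper name `log_odds` are assumptions for illustration; the formula the library actually uses may differ.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def log_odds(texts1, texts2, zscore=True):
    """Return {word: score}; positive scores lean toward group 1, negative toward group 2."""
    counts1 = Counter(w for t in texts1 for w in t.lower().split())
    counts2 = Counter(w for t in texts2 for w in t.lower().split())
    vocab = sorted(set(counts1) | set(counts2))
    if len(vocab) < 2:
        return {w: 0.0 for w in vocab}
    # Add-one smoothing: every word gets one extra count in each group.
    n1 = sum(counts1.values()) + len(vocab)
    n2 = sum(counts2.values()) + len(vocab)
    scores = {}
    for w in vocab:
        c1, c2 = counts1[w] + 1, counts2[w] + 1
        scores[w] = math.log(c1 / (n1 - c1)) - math.log(c2 / (n2 - c2))
    if zscore:
        # Standardize scores across the vocabulary.
        mu, sigma = mean(scores.values()), pstdev(scores.values())
        if sigma > 0:
            scores = {w: (s - mu) / sigma for w, s in scores.items()}
    return scores


tutor_texts = ["great job", "great work keep going"]
student_texts = ["i am stuck", "i need help"]
scores = log_odds(tutor_texts, student_texts)
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print("group 1 top word:", ranked[0][0])   # great
print("group 2 top word:", ranked[-1][0])  # i
```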
- report_word_frequency(text_column: str, topk: int = 5, df: DataFrame | None = None, speaker_column: str | None = None, run_text_formatting: bool = False, run_ngrams: bool = False, n: int = 0) -> str
Reports word frequency for a dataframe as a string.
- Parameters:
df (pd.DataFrame) – pandas dataframe
text_column (str) – name of column containing text to analyze
topk (int) – number of top words to return
speaker_column (str) – name of column containing speaker names. If specified, it will report word frequency for each speaker.
run_text_formatting (bool) – whether to run standard text formatting
run_ngrams (bool) – whether to run ngrams
n (int) – n for ngrams
- Returns:
word frequency
- Return type:
str
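The idea behind topk and n can be illustrated with a short frequency count over word n-grams. This is a sketch using the standard library only; the lowercase whitespace tokenization and the helper name `word_frequency` are assumptions, and the library's own text formatting may behave differently.

```python
from collections import Counter


def word_frequency(texts, topk=5, n=1):
    """Count word n-grams across texts and return the topk most common."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        # Slide a window of n tokens across each text.
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts.most_common(topk)


texts = ["the answer is four", "the answer is correct", "is the answer four"]
bigrams = word_frequency(texts, topk=2, n=2)
print(bigrams)  # [('the answer', 3), ('answer is', 2)]
```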
Temporal Analyzer
- class edu_convokit.analyzers.TemporalAnalyzer(data_dir: str | None = None, filenames: List[str] | str | None = None, dfs: List[DataFrame] | DataFrame | None = None, max_transcripts: int | None = None)
- plot_temporal_statistics(feature_column: str, dfs: List[DataFrame] | DataFrame | None = None, speaker_column: str | None = None, value_as: str = 'raw', num_bins: int = 10, dropna: bool = False, title: str | None = None, xlabel: str | None = None, ylabel: str | None = None, save_path: str | None = None, hue: str | None = None, xrange: Tuple[float, float] | None = None, yrange: Tuple[float, float] | None = None, label_mapping: Dict[str, str] | None = None)
Plot statistics for a feature across all speakers and across bins.
- Parameters:
feature_column (str) – name of column containing feature to compute statistics for
dfs (Union[List[pd.DataFrame], pd.DataFrame]) – list of dataframes. If None, use self.dfs from constructor
speaker_column (str) – name of column containing speaker names
value_as (str) – how to report values; one of 'raw', 'avg', or 'prop'
num_bins (int) – number of bins to split the data into
dropna (bool) – drop rows with NaN values in feature_column
title (str) – title of plot
xlabel (str) – x-axis label
ylabel (str) – y-axis label
save_path (str) – path to save plot
hue (str) – name of column to use for hue
xrange (Tuple[float, float]) – x-axis range
yrange (Tuple[float, float]) – y-axis range
label_mapping (Dict[str, str]) – mapping from original label to new label
- Returns:
None
- print_statistics(feature_column: str, speaker_column: str | None = None, value_as: str = 'raw', dropna: bool = False, dfs: List[DataFrame] | DataFrame | None = None)
Print statistics for a feature across all speakers.
- Parameters:
feature_column (str) – name of column containing feature to compute statistics for
speaker_column (str) – name of column containing speaker names
value_as (str) – how to report values; one of 'raw', 'avg', or 'prop'
dropna (bool) – drop rows with NaN values in feature_column
dfs (Union[List[pd.DataFrame], pd.DataFrame]) – list of dataframes. If None, use self.dfs from constructor
- Returns:
None
- report_statistics(feature_column: str, dfs: List[DataFrame] | DataFrame | None = None, speaker_column: str | None = None, num_bins: int = 10, value_as: str = 'raw', dropna: bool = False) -> str
Report statistics for a feature across all speakers across the bins.
- Parameters:
feature_column (str) – name of column containing feature to compute statistics for
dfs (Union[List[pd.DataFrame], pd.DataFrame]) – list of dataframes. If None, use self.dfs from constructor
speaker_column (str) – name of column containing speaker names
num_bins (int) – number of bins to split the data into
value_as (str) – how to report values; one of 'raw', 'avg', 'prop', or 'all'
dropna (bool) – drop rows with NaN values in feature_column
- Returns:
string representation of statistics
- Return type:
str
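As an illustration of what num_bins controls, the sketch below splits a transcript's feature values, in order, into equal-width position bins and sums each bin. How the library assigns rows to bins is not stated on this page, so the index-based binning and the helper name `bin_feature` are assumptions.

```python
def bin_feature(values, num_bins=10):
    """Sum feature values within num_bins contiguous bins over transcript position."""
    totals = [0.0] * num_bins
    n = len(values)
    for i, value in enumerate(values):
        # Map position i in [0, n) to a bin index in [0, num_bins).
        bin_index = min(i * num_bins // n, num_bins - 1)
        totals[bin_index] += value
    return totals


# One feature value per utterance, in transcript order.
values = [1, 0, 0, 1, 1, 1, 0, 1]
binned = bin_feature(values, num_bins=4)
print(binned)  # [1.0, 1.0, 2.0, 1.0]
```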
GPT Conversation Analyzer
- class edu_convokit.analyzers.GPTConversationAnalyzer(data_dir: str | None = None, filenames: List[str] | str | None = None, dfs: List[DataFrame] | DataFrame | None = None, max_transcripts: int | None = None)
- preview_prompt(df: DataFrame, prompt_name: str, text_column: str, speaker_column: str, model: str = 'gpt-4', add_line_numbers: bool = False, format_template: str = '{speaker}: {text}', keep_transcript_fraction: float | None = None) -> str
Preview a prompt on a dataframe and return the prompt.
- Parameters:
df (pd.DataFrame) – pandas dataframe
prompt_name (str) – name of prompt
text_column (str) – name of column containing text
speaker_column (str) – name of column containing speaker names
model (str) – model name
add_line_numbers (bool) – whether to add line numbers
format_template (str) – format string
keep_transcript_fraction (float) – fraction of transcript to keep
- Returns:
prompt
- Return type:
str
- run_prompt(df: DataFrame, prompt_name: str, text_column: str, speaker_column: str, model: str = 'gpt-4', add_line_numbers: bool = False, format_template: str = '{speaker}: {text}', temperature: float = 0.0, max_tokens: int | None = None, keep_transcript_fraction: float | None = None) -> Tuple[str, str]
Run a prompt on a dataframe and return the (prompt, response) pair.
- Parameters:
df (pd.DataFrame) – pandas dataframe
prompt_name (str) – name of prompt
text_column (str) – name of column containing text
speaker_column (str) – name of column containing speaker names
model (str) – model name
add_line_numbers (bool) – whether to add line numbers
format_template (str) – format string
temperature (float) – temperature
max_tokens (int) – maximum number of tokens
keep_transcript_fraction (float) – fraction of transcript to keep
- Returns:
(prompt, response) pair
- Return type:
Tuple[str, str]
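The default format_template, '{speaker}: {text}', suggests that each transcript row is rendered with Python string formatting before it is placed in the prompt. The sketch below shows one way this could work. The line-number style ("1. ") and the choice to keep the first fraction of the transcript are assumptions, and `format_transcript` is a hypothetical helper, not part of the library.

```python
def format_transcript(rows, format_template="{speaker}: {text}",
                      add_line_numbers=False, keep_transcript_fraction=None):
    """Render (speaker, text) rows as one string, one line per row."""
    if keep_transcript_fraction is not None:
        # Assumption: keep the leading fraction of the transcript.
        keep = int(len(rows) * keep_transcript_fraction)
        rows = rows[:keep]
    lines = []
    for number, (speaker, text) in enumerate(rows, start=1):
        line = format_template.format(speaker=speaker, text=text)
        if add_line_numbers:
            # Assumption: line numbers are prefixed as "1. ", "2. ", ...
            line = f"{number}. {line}"
        lines.append(line)
    return "\n".join(lines)


rows = [
    ("tutor", "What is 3 times 4?"),
    ("student", "Twelve."),
    ("tutor", "Great job!"),
    ("student", "Thanks!"),
]
preview = format_transcript(rows, add_line_numbers=True, keep_transcript_fraction=0.5)
print(preview)
```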