Analyzers

This page documents the analyzers module. The analyzers support several complementary ways of analyzing language data in education: qualitative, quantitative, lexical, temporal, and GPT-based conversation analysis.

Qualitative Analyzer

print_examples(speaker_column: str, text_column: str, feature_column: str, df: DataFrame | None = None, feature_value: List[str] | str | None = None, max_num_values: int = 2, max_num_examples: int = 3, show_k_previous_lines: int = 0, show_k_next_lines: int = 0, dropna: bool = False) → None

Print text examples for a feature value.

Output format: a list of tuples ([(speaker, text), …], (speaker, current_text), [(speaker, text), …], feature_value), holding the previous lines, the current line, the next lines, and the feature value.

Parameters:

speaker_column (str) – name of column containing speaker names
text_column (str) – name of column containing text to show examples from
feature_column (str) – name of column containing feature to get examples for
df (pd.DataFrame) – pandas dataframe. If None, then use self.dfs from constructor
feature_value (Union[str, List[str]]) – if not None, only get examples for this feature value
show_k_previous_lines (int) – show k previous lines
show_k_next_lines (int) – show k next lines
dropna (bool) – drop rows with NaN values in feature_column

Returns:

None

report_examples(speaker_column: str, text_column: str, feature_column: str, df: DataFrame | None = None, feature_value: float | List[float] | None = None, max_num_values: int = 2, max_num_examples: int = 3, show_k_previous_lines: int = 0, show_k_next_lines: int = 0, dropna: bool = False) → str

Get text examples for a feature value.

Output format: a list of tuples ([(speaker, text), …], (speaker, current_text), [(speaker, text), …], feature_value), holding the previous lines, the current line, the next lines, and the feature value.

Parameters:

speaker_column (str) – name of column containing speaker names
text_column (str) – name of column containing text to show examples from
feature_column (str) – name of column containing feature to get examples for
df (pd.DataFrame) – pandas dataframe. If None, then use self.dfs from constructor
feature_value (Union[float, List[float]]) – if not None, only get examples for this feature value
show_k_previous_lines (int) – show k previous lines
show_k_next_lines (int) – show k next lines
dropna (bool) – drop rows with NaN values in feature_column

Returns:

formatted examples

Return type:

str
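
To make the output format above concrete, here is a minimal sketch of how examples with surrounding context could be collected. It is not the library's implementation: the helper collect_examples, the plain list-of-dicts input, and the column names "speaker", "text", and "praise" are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional, Tuple

Line = Tuple[str, str]  # (speaker, text)


def collect_examples(
    rows: List[Dict[str, Any]],
    speaker_column: str,
    text_column: str,
    feature_column: str,
    feature_value: Optional[Any] = None,
    max_num_examples: int = 3,
    show_k_previous_lines: int = 0,
    show_k_next_lines: int = 0,
    dropna: bool = False,
) -> List[Tuple[List[Line], Line, List[Line], Any]]:
    """Hypothetical helper: gather (previous, current, next, value) tuples."""
    examples = []
    for i, row in enumerate(rows):
        value = row.get(feature_column)
        if value is None and dropna:
            continue  # skip rows with a missing feature value
        if feature_value is not None and value != feature_value:
            continue  # keep only the requested feature value
        # Clamp the context window to the bounds of the transcript.
        start = max(0, i - show_k_previous_lines)
        end = min(len(rows), i + 1 + show_k_next_lines)
        previous = [(r[speaker_column], r[text_column]) for r in rows[start:i]]
        following = [(r[speaker_column], r[text_column]) for r in rows[i + 1:end]]
        current = (row[speaker_column], row[text_column])
        examples.append((previous, current, following, value))
        if len(examples) >= max_num_examples:
            break
    return examples


# Made-up transcript with a hypothetical binary "praise" feature.
rows = [
    {"speaker": "teacher", "text": "What is 3 times 4?", "praise": 0},
    {"speaker": "student", "text": "Is it 12?", "praise": 0},
    {"speaker": "teacher", "text": "Great job!", "praise": 1},
    {"speaker": "student", "text": "Thanks.", "praise": 0},
]
examples = collect_examples(
    rows, "speaker", "text", "praise",
    feature_value=1, show_k_previous_lines=1, show_k_next_lines=1,
)
for previous, current, following, value in examples:
    print(previous, current, following, value)
```

With one previous and one next line requested, the single matching row is returned together with the student lines on either side of it.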

Quantitative Analyzer

Plot statistics for a feature across all speakers.

Parameters:

feature_column (str) – name of column containing feature to compute statistics for
df (pd.DataFrame) – pandas dataframe. If None, then use self.dfs from constructor
speaker_column (str) – name of column containing speaker names
value_as (str) – raw, avg, prop, all
dropna (bool) – drop rows with NaN values in feature_column
title (str) – title of plot
xlabel (str) – x-axis label
ylabel (str) – y-axis label
save_path (str) – path to save plot
xrange (Tuple[float, float]) – x-axis range
yrange (Tuple[float, float]) – y-axis range
label_mapping (Dict[str, str]) – mapping from speaker names to labels

Returns:

None

print_statistics(feature_column: str, df: DataFrame | None = None, speaker_column: str | None = None, value_as: str = 'raw', dropna: bool = False)

Print statistics for a feature across all speakers.

Parameters:

feature_column (str) – name of column containing feature to compute statistics for
df (pd.DataFrame) – pandas dataframe. If None, then use self.dfs from constructor
speaker_column (str) – name of column containing speaker names
value_as (str) – raw, avg, prop
dropna (bool) – drop rows with NaN values in feature_column

Returns:

None

report_statistics(feature_column: str, df: DataFrame | None = None, speaker_column: str | None = None, value_as: str = 'raw', dropna: bool = False) → str

Report statistics for a feature across all speakers.

Parameters:

feature_column (str) – name of column containing feature to compute statistics for
df (pd.DataFrame) – pandas dataframe. If None, then use self.dfs from constructor
speaker_column (str) – name of column containing speaker names
value_as (str) – raw, avg, prop, all
dropna (bool) – drop rows with NaN values in feature_column

Returns:

string representation of statistics

Return type:

str
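
The page does not define the value_as options, so the sketch below shows one plausible reading only: raw as the per-speaker sum of the feature, avg as the per-speaker mean over utterances, and prop as each speaker's share of the overall total. The helper speaker_statistics, the list-of-dicts input, the column names, and the handling of missing values are assumptions, not the library's code.

```python
from collections import defaultdict
from typing import Any, Dict, List


def speaker_statistics(
    rows: List[Dict[str, Any]],
    feature_column: str,
    speaker_column: str,
    value_as: str = "raw",
    dropna: bool = False,
) -> Dict[str, float]:
    """Hypothetical helper: aggregate a numeric feature per speaker."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for row in rows:
        value = row.get(feature_column)
        if value is None:
            if dropna:
                continue  # drop rows with a missing feature value
            value = 0.0  # assumption: missing values count as zero otherwise
        totals[row[speaker_column]] += value
        counts[row[speaker_column]] += 1
    if value_as == "raw":
        return dict(totals)
    if value_as == "avg":
        return {s: totals[s] / counts[s] for s in totals}
    if value_as == "prop":
        grand_total = sum(totals.values())
        return {s: (totals[s] / grand_total if grand_total else 0.0) for s in totals}
    raise ValueError(f"unsupported value_as: {value_as!r}")


# Made-up rows with a hypothetical binary "praise" feature.
rows = [
    {"speaker": "teacher", "praise": 1},
    {"speaker": "student", "praise": 0},
    {"speaker": "teacher", "praise": 0},
    {"speaker": "teacher", "praise": 1},
    {"speaker": "student", "praise": 1},
]
for mode in ("raw", "avg", "prop"):
    print(mode, speaker_statistics(rows, "praise", "speaker", value_as=mode))
```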

Lexical Analyzer

plot_log_odds(df1: DataFrame, df2: DataFrame, text_column1: str, text_column2: str, group1_name: str = 'Group 1', group2_name: str = 'Group 2', topk: int = 5, save_path: str | None = None, zscore: bool = True, logodds_factor: float = 1.0, run_text_formatting: bool = False, run_ngrams: bool = False, n: int = 0) → None

Plot topk log-odds for each df: ([(word, log-odds), …], [(word, log-odds), …])

Parameters:

df1 (pd.DataFrame) – pandas dataframe
df2 (pd.DataFrame) – pandas dataframe
text_column1 (str) – name of column containing text to analyze in df1
text_column2 (str) – name of column containing text to analyze in df2
group1_name (str) – name of group 1
group2_name (str) – name of group 2
topk (int) – number of top words to return
save_path (str) – path to save plot
zscore (bool) – whether to z-score the log-odds
logodds_factor (float) – factor to multiply standard deviation by to determine top words
run_text_formatting (bool) – whether to run standard text formatting
run_ngrams (bool) – whether to run ngrams
n (int) – n for ngrams

print_log_odds(df1: DataFrame, df2: DataFrame, text_column1: str, text_column2: str, topk: int = 5, zscore: bool = True, logodds_factor: float = 1.0, run_text_formatting: bool = False, run_ngrams: bool = False, n: int = 0) → None

Print topk log-odds for each df: ([(word, log-odds), …], [(word, log-odds), …])

Parameters:

df1 (pd.DataFrame) – pandas dataframe
df2 (pd.DataFrame) – pandas dataframe
text_column1 (str) – name of column containing text to analyze in df1
text_column2 (str) – name of column containing text to analyze in df2
topk (int) – number of top words to return
zscore (bool) – whether to z-score the log-odds
logodds_factor (float) – factor to multiply standard deviation by to determine top words
run_text_formatting (bool) – whether to run standard text formatting
run_ngrams (bool) – whether to run ngrams
n (int) – n for ngrams

print_word_frequency(text_column: str, topk: int = 5, df: DataFrame | None = None, speaker_column: str | None = None, run_text_formatting: bool = False, run_ngrams: bool = False, n: int = 0) → str

Print word frequency for a dataframe.

Parameters:

df (pd.DataFrame) – pandas dataframe
text_column (str) – name of column containing text to analyze
topk (int) – number of top words to return
speaker_column (str) – name of column containing speaker names. If specified, it will report word frequency for each speaker.
run_text_formatting (bool) – whether to run standard text formatting
run_ngrams (bool) – whether to run ngrams
n (int) – n for ngrams

Returns:

word frequency

Return type:

str

report_log_odds(df1: DataFrame, df2: DataFrame, text_column1: str, text_column2: str, topk: int = 5, zscore: bool = True, logodds_factor: float = 1.0, run_text_formatting: bool = False, run_ngrams: bool = False, n: int = 0) → str

Return formatted topk log-odds for each df: ([(word, log-odds), …], [(word, log-odds), …])

Parameters:

df1 (pd.DataFrame) – pandas dataframe
df2 (pd.DataFrame) – pandas dataframe
text_column1 (str) – name of column containing text to analyze in df1
text_column2 (str) – name of column containing text to analyze in df2
topk (int) – number of top words to return
zscore (bool) – whether to z-score the log-odds
logodds_factor (float) – factor to multiply standard deviation by to determine top words
run_text_formatting (bool) – whether to run standard text formatting
run_ngrams (bool) – whether to run ngrams
n (int) – n for ngrams

Returns:

formatted topk log-odds for each df: ([(word, log-odds), …], [(word, log-odds), …])

Return type:

str
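
For readers unfamiliar with log-odds comparisons, the sketch below computes a smoothed log-odds ratio per word between two groups of texts, optionally z-scored, and returns the topk words for each group. It illustrates the general technique under stated assumptions and is not necessarily how the library computes its scores: the helper log_odds, the uniform smoothing prior, and the whitespace tokenization are all hypothetical, and logodds_factor is not modeled.

```python
import math
from collections import Counter
from typing import List, Tuple

Scored = List[Tuple[str, float]]


def log_odds(
    texts1: List[str],
    texts2: List[str],
    topk: int = 5,
    zscore: bool = True,
    prior: float = 0.5,
) -> Tuple[Scored, Scored]:
    """Hypothetical helper: smoothed log-odds ratio of word use in two groups."""
    counts1 = Counter(w for t in texts1 for w in t.lower().split())
    counts2 = Counter(w for t in texts2 for w in t.lower().split())
    vocab = set(counts1) | set(counts2)
    if len(vocab) < 2:
        return [], []  # odds are undefined with fewer than two word types
    n1 = sum(counts1.values())
    n2 = sum(counts2.values())
    prior_total = prior * len(vocab)
    scores = {}
    for word in vocab:
        y1 = counts1[word] + prior  # smoothed count in group 1
        y2 = counts2[word] + prior  # smoothed count in group 2
        delta = math.log(y1 / (n1 + prior_total - y1)) - math.log(
            y2 / (n2 + prior_total - y2)
        )
        if zscore:
            # Divide by the approximate standard deviation of the difference.
            delta /= math.sqrt(1.0 / y1 + 1.0 / y2)
        scores[word] = delta
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    top_group1 = ranked[:topk]
    # Flip the sign so larger always means "more typical of this group".
    top_group2 = [(w, -s) for w, s in reversed(ranked[-topk:])]
    return top_group1, top_group2


# Made-up utterances for two hypothetical groups.
teacher_texts = ["great job", "great thinking", "what is the answer"]
student_texts = ["is it twelve", "i think it is ten"]
top_teacher, top_student = log_odds(teacher_texts, student_texts, topk=3)
print(top_teacher)
print(top_student)
```

Positive scores mark words used relatively more by the first group. Z-scoring dampens words whose counts are too small to support a confident difference.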

report_word_frequency(text_column: str, topk: int = 5, df: DataFrame | None = None, speaker_column: str | None = None, run_text_formatting: bool = False, run_ngrams: bool = False, n: int = 0) → str

Reports word frequency for a dataframe as a string.

Parameters:

df (pd.DataFrame) – pandas dataframe
text_column (str) – name of column containing text to analyze
topk (int) – number of top words to return
speaker_column (str) – name of column containing speaker names. If specified, it will report word frequency for each speaker.
run_text_formatting (bool) – whether to run standard text formatting
run_ngrams (bool) – whether to run ngrams
n (int) – n for ngrams

Returns:

word frequency

Return type:

str
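
The run_ngrams and n parameters above suggest that counts can be taken over n-grams rather than single words. A minimal sketch of that idea follows. The helper word_frequency, the lowercasing, and the whitespace tokenization are assumptions for illustration and may differ from the library's text formatting.

```python
from collections import Counter
from typing import List, Tuple


def word_frequency(
    texts: List[str],
    topk: int = 5,
    run_ngrams: bool = False,
    n: int = 0,
) -> List[Tuple[str, int]]:
    """Hypothetical helper: count words, or n-grams when run_ngrams is set."""
    counts: Counter = Counter()
    for text in texts:
        tokens = text.lower().split()
        if run_ngrams and n > 1:
            # Slide a window of n tokens over the utterance.
            tokens = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        counts.update(tokens)
    return counts.most_common(topk)


# Made-up utterances.
texts = ["the answer is ten", "the answer is twelve"]
unigrams = word_frequency(texts, topk=3)
bigrams = word_frequency(texts, topk=2, run_ngrams=True, n=2)
print(unigrams)
print(bigrams)
```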

Temporal Analyzer

Plot statistics for a feature for all speakers across bins.

Parameters:

feature_column (str) – name of column containing feature to compute statistics for
dfs (Union[List[pd.DataFrame], pd.DataFrame]) – list of dataframes. If None, use self.dfs from constructor
speaker_column (str) – name of column containing speaker names
value_as (str) – raw, avg, prop
num_bins (int) – number of bins to split the data into
dropna (bool) – drop rows with NaN values in feature_column
title (str) – title of plot
xlabel (str) – x-axis label
ylabel (str) – y-axis label
save_path (str) – path to save plot
hue (str) – name of column to use for hue
xrange (Tuple[float, float]) – x-axis range
yrange (Tuple[float, float]) – y-axis range
label_mapping (Dict[str, str]) – mapping from original label to new label

Returns:

None

print_statistics(feature_column: str, speaker_column: str | None = None, value_as: str = 'raw', dropna: bool = False, dfs: List[DataFrame] | DataFrame | None = None)

Print statistics for a feature across all speakers.

Parameters:

feature_column (str) – name of column containing feature to compute statistics for
speaker_column (str) – name of column containing speaker names
value_as (str) – raw, avg, prop
dropna (bool) – drop rows with NaN values in feature_column
dfs (Union[List[pd.DataFrame], pd.DataFrame]) – list of dataframes. If None, use self.dfs from constructor

Returns:

None

report_statistics(feature_column: str, dfs: List[DataFrame] | DataFrame | None = None, speaker_column: str | None = None, num_bins: int = 10, value_as: str = 'raw', dropna: bool = False) → str

Report statistics for a feature across all speakers across the bins.

Parameters:

feature_column (str) – name of column containing feature to compute statistics for
dfs (Union[List[pd.DataFrame], pd.DataFrame]) – list of dataframes. If None, use self.dfs from constructor
speaker_column (str) – name of column containing speaker names
num_bins (int) – number of bins to split the data into
value_as (str) – raw, avg, prop, all
dropna (bool) – drop rows with NaN values in feature_column

Returns:

string representation of statistics

Return type:

str
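
The num_bins parameter splits a transcript into segments so that a feature can be tracked over the course of a conversation. The page does not say how bins are formed, so the sketch below assumes consecutive, roughly equal-sized bins in row order and sums the feature per speaker within each bin. The helper bin_statistics, the list-of-dicts input, and the column names are hypothetical.

```python
from typing import Any, Dict, List


def bin_statistics(
    rows: List[Dict[str, Any]],
    feature_column: str,
    speaker_column: str,
    num_bins: int = 10,
) -> List[Dict[str, float]]:
    """Hypothetical helper: per-speaker feature sums in consecutive bins."""
    results = []
    total = len(rows)
    for b in range(num_bins):
        # Integer arithmetic keeps bin boundaries contiguous and non-overlapping.
        start = b * total // num_bins
        end = (b + 1) * total // num_bins
        sums: Dict[str, float] = {}
        for row in rows[start:end]:
            speaker = row[speaker_column]
            sums[speaker] = sums.get(speaker, 0) + row[feature_column]
        results.append(sums)
    return results


# Made-up rows with a hypothetical binary "praise" feature.
rows = [
    {"speaker": "teacher", "praise": 1},
    {"speaker": "student", "praise": 0},
    {"speaker": "teacher", "praise": 1},
    {"speaker": "teacher", "praise": 1},
]
binned = bin_statistics(rows, "praise", "speaker", num_bins=2)
for index, sums in enumerate(binned):
    print(index, sums)
```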

GPT Conversation Analyzer

preview_prompt(df: DataFrame, prompt_name: str, text_column: str, speaker_column: str, model: str = 'gpt-4', add_line_numbers: bool = False, format_template: str = '{speaker}: {text}', keep_transcript_fraction: float | None = None) → str

Preview a prompt on a dataframe and return the prompt.

Parameters:

df (pd.DataFrame) – pandas dataframe
prompt_name (str) – name of prompt
text_column (str) – name of column containing text
speaker_column (str) – name of column containing speaker names
model (str) – model name
add_line_numbers (bool) – whether to add line numbers
format_template (str) – template used to format each line, with {speaker} and {text} placeholders
keep_transcript_fraction (float) – fraction of transcript to keep

Returns:

prompt

Return type:

str
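
The sketch below illustrates how the format_template, add_line_numbers, and keep_transcript_fraction parameters might shape the transcript text that goes into a prompt. It is an assumption-laden illustration rather than the library's behavior: the helper format_transcript is hypothetical, the "1. " line-number style is invented, and keep_transcript_fraction is assumed to keep the leading fraction of the transcript.

```python
from typing import Any, Dict, List, Optional


def format_transcript(
    rows: List[Dict[str, Any]],
    text_column: str,
    speaker_column: str,
    add_line_numbers: bool = False,
    format_template: str = "{speaker}: {text}",
    keep_transcript_fraction: Optional[float] = None,
) -> str:
    """Hypothetical helper: render transcript rows as prompt text."""
    if keep_transcript_fraction is not None:
        # Assumption: keep the leading fraction of the rows.
        keep = int(len(rows) * keep_transcript_fraction)
        rows = rows[:keep]
    lines = []
    for number, row in enumerate(rows, start=1):
        line = format_template.format(
            speaker=row[speaker_column], text=row[text_column]
        )
        if add_line_numbers:
            line = f"{number}. {line}"  # assumed numbering style
        lines.append(line)
    return "\n".join(lines)


# Made-up transcript.
rows = [
    {"speaker": "teacher", "text": "What is 3 times 4?"},
    {"speaker": "student", "text": "Is it 12?"},
    {"speaker": "teacher", "text": "Great job!"},
    {"speaker": "student", "text": "Thanks."},
]
preview = format_transcript(
    rows, "text", "speaker", add_line_numbers=True, keep_transcript_fraction=0.5
)
print(preview)
```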

run_prompt(df: DataFrame, prompt_name: str, text_column: str, speaker_column: str, model: str = 'gpt-4', add_line_numbers: bool = False, format_template: str = '{speaker}: {text}', temperature: float = 0.0, max_tokens: int | None = None, keep_transcript_fraction: float | None = None) → Tuple[str, str]

Run a prompt on a dataframe and return the (prompt, response) pair.

Parameters:

df (pd.DataFrame) – pandas dataframe
prompt_name (str) – name of prompt
text_column (str) – name of column containing text
speaker_column (str) – name of column containing speaker names
model (str) – model name
add_line_numbers (bool) – whether to add line numbers
format_template (str) – template used to format each line, with {speaker} and {text} placeholders
temperature (float) – sampling temperature passed to the model
max_tokens (int) – maximum number of tokens
keep_transcript_fraction (float) – fraction of transcript to keep

Returns:

(prompt, response) pair

Return type:

Tuple[str, str]