Sample-Size Evaluation (SyntheSize)

SyNG-BTS integrates the SyntheSize methodology for evaluating how classifier performance scales with sample size. This is useful for answering the question: “How many samples do I need for reliable classification?”

The integration provides two public functions:

  • evaluate_sample_sizes() — Evaluate classifiers across candidate sample sizes using stratified cross-validation.

  • plot_sample_sizes() — Visualize inverse power-law (IPLF) learning curves fitted from evaluation metrics.

Background

The SyntheSize approach trains multiple classifiers (logistic regression, SVM, KNN, random forest, XGBoost) at varying sample sizes and fits inverse power-law curves to the resulting metrics (F1, accuracy, AUC). This reveals how classification performance scales with data volume and helps determine whether generating more synthetic samples would improve downstream analyses.
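As a rough illustration of the curve-fitting step, the sketch below fits an inverse power-law curve of the form y(n) = a - b * n^(-c) to a handful of (sample size, metric) pairs. The parameterisation, the helper names fit_inverse_power_law and predict, and the grid-plus-least-squares fitting strategy are assumptions made for this example; the library's own fitting routine may differ. The observations are synthetic and noise-free.

```python
import numpy as np


def fit_inverse_power_law(n, y, exponents=None):
    """Fit y ~ a - b * n**(-c) by scanning c and solving (a, b) linearly.

    Hypothetical helper for illustration; not part of syng_bts.
    """
    n = np.asarray(n, dtype=float)
    y = np.asarray(y, dtype=float)
    if exponents is None:
        exponents = np.linspace(0.05, 3.0, 60)  # candidate decay rates
    best = None
    for c in exponents:
        # For a fixed c the model is linear in (a, b).
        design = np.column_stack([np.ones_like(n), -(n ** -c)])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        sse = float(np.sum((design @ coef - y) ** 2))
        if best is None or sse < best[0]:
            best = (sse, float(coef[0]), float(coef[1]), float(c))
    return best[1], best[2], best[3]


def predict(n, a, b, c):
    """Evaluate the fitted curve at sample size(s) n."""
    return a - b * np.asarray(n, dtype=float) ** -c


# Synthetic metric values generated from a known curve (a=0.95, b=1.2, c=0.5).
sizes = np.arange(25, 201, 25).astype(float)
observed = 0.95 - 1.2 * sizes ** -0.5

a, b, c = fit_inverse_power_law(sizes, observed)
extrapolated = float(predict(400, a, b, c))  # value the curve predicts at n=400
```

In this parameterisation a is the plateau the metric approaches as n grows, so a fitted curve that is already close to a at the largest observed size suggests little gain from adding samples.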

For more details on the methodology, see:

Quick Start

Evaluate a DataFrame

import numpy as np
import pandas as pd
from syng_bts import evaluate_sample_sizes, plot_sample_sizes, resolve_data

# Load a bundled dataset
data, groups = resolve_data("BRCASubtypeSel_test")

# Evaluate classifiers across sample sizes
metrics = evaluate_sample_sizes(
    data=data,
    sample_sizes=np.arange(25, 201, 25),
    groups=groups,
    n_draws=5,
)
print(metrics.head())

# Plot learning curves
fig = plot_sample_sizes(metrics, n_target=200)
fig.savefig("learning_curves.png")

Evaluate a SyngResult

When you have a SyngResult with group information (e.g., from a CVAE run), you can pass it directly and groups are auto-resolved:

import numpy as np
from syng_bts import generate, evaluate_sample_sizes, plot_sample_sizes

# Generate synthetic data with a conditional model
result = generate(
    data="BRCASubtypeSel_train",
    model="CVAE1-20",
    apply_log=True,
    epoch=50,
)

# Evaluate the generated data — groups are auto-resolved from result
metrics_gen = evaluate_sample_sizes(
    data=result,
    sample_sizes=np.arange(25, 201, 25),
    which="generated",
)

# Compare real vs generated learning curves
metrics_real = evaluate_sample_sizes(
    data=result,
    sample_sizes=np.arange(25, 201, 25),
    which="original",
)

fig = plot_sample_sizes(
    metric_real=metrics_real,
    n_target=200,
    metric_generated=metrics_gen,
)
fig.savefig("real_vs_generated.png")

Workflow

  1. Generate synthetic data using generate() (or load existing data).

  2. Evaluate with evaluate_sample_sizes() on both real and generated datasets.

  3. Visualize with plot_sample_sizes() to compare learning curves side by side.

Available Classifiers

The following classifiers are available via the methods parameter:

Name    Aliases         Description
-----   -------------   -----------------------------------------------------------------
LOGIS   LOGISTIC, LR    Ridge (L2-penalised) logistic regression via LogisticRegressionCV
SVM     (none)          Support Vector Machine with probability estimates
KNN     (none)          K-Nearest Neighbors (k=5)
RF      RANDOM_FOREST   Random Forest (100 trees)
XGB     XGBOOST         XGBoost gradient-boosted trees

All classifiers are evaluated using 5-fold stratified cross-validation.
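The names and aliases listed above can be thought of as a simple lookup. The sketch below shows one way such a lookup might work; resolve_methods is a hypothetical helper, and the case-insensitive matching and de-duplication are assumptions rather than documented behaviour.

```python
CANONICAL = ("LOGIS", "SVM", "KNN", "RF", "XGB")
ALIASES = {
    "LOGISTIC": "LOGIS",
    "LR": "LOGIS",
    "RANDOM_FOREST": "RF",
    "XGBOOST": "XGB",
}


def resolve_methods(methods=None):
    """Map user-supplied classifier names to canonical names.

    Hypothetical helper for illustration; not part of syng_bts.
    """
    if methods is None:
        return list(CANONICAL)  # default: all five classifiers
    resolved = []
    for name in methods:
        key = name.strip().upper()  # assumption: matching ignores case
        key = ALIASES.get(key, key)
        if key not in CANONICAL:
            raise ValueError(f"Unknown classifier name: {name!r}")
        if key not in resolved:  # assumption: duplicates are collapsed
            resolved.append(key)
    return resolved


selected = resolve_methods(["lr", "RANDOM_FOREST", "RF"])
```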

Metrics

Each evaluation returns three metrics per classifier per sample size:

  • F1 Score (f1_score) — Macro-averaged F1

  • Accuracy (accuracy) — Overall classification accuracy

  • AUC (auc) — Area under ROC curve (one-vs-one, macro-averaged for multiclass)
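To make "macro-averaged" concrete, the sketch below computes a macro F1 score and accuracy by hand for a tiny three-class example. The helper names and the toy labels are made up for illustration; the library presumably relies on an established implementation rather than code like this.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "b", "c"]

f1 = macro_f1(y_true, y_pred)    # per-class F1: a=0.8, b=0.8, c=1.0
acc = accuracy(y_true, y_pred)   # 5 of 6 predictions correct
```

Because every class contributes equally to the macro average, a rare class that is classified poorly lowers macro F1 more than it lowers accuracy.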

Log Transform

By default, evaluate_sample_sizes() applies a log2(x + 1) transform (apply_log=True). Set apply_log=False when your input data is already log-transformed. The default behavior matches the preprocessing convention used in SyNG-BTS training.
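The effect of the transform, and of accidentally applying it twice, can be seen on a small made-up count matrix. This is a plain NumPy illustration of log2(x + 1), not the library's internal code.

```python
import numpy as np

# Made-up raw counts: two samples, three features.
counts = np.array([[0.0, 1.0, 3.0],
                   [7.0, 15.0, 1023.0]])

# What apply_log=True is documented to do: log2(x + 1).
logged = np.log2(counts + 1)  # rows become [0, 1, 2] and [3, 4, 10]

# Applying the transform again to already-logged data compresses it further,
# which is why apply_log=False is the right setting for pre-logged input.
logged_twice = np.log2(logged + 1)
```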

Verbosity

The verbose parameter of evaluate_sample_sizes() controls console output during evaluation. It accepts the same levels used by the training functions (generate(), pilot_study(), transfer()):

Level   Name         Behaviour
-----   ----------   ---------------------------------------------------------------
0       "silent"     No output.
1       "minimal"    A single, dynamically updated progress-bar line covering all sample sizes, draws, and methods, annotated with the current size (index/n), draw, and method (default).
2       "detailed"   Per-draw and per-method metric lines (the previous default behaviour).

Example:

# Detailed logging
metrics = evaluate_sample_sizes(data, sample_sizes=[50, 100],
                                groups=groups, verbose="detailed")

Sample-Size Shortcuts

sample_sizes accepts a list, numpy array, pandas Series, or a single integer. When a single integer k is provided it is interpreted as the desired number of equidistant sizes — the maximum equals the number of rows in the input data.

# Equivalent to sample_sizes=[5, 10, 15] for 15-row data
metrics = evaluate_sample_sizes(data, sample_sizes=3, groups=groups)
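The integer shortcut can be pictured as splitting the row count into k equal steps. The sketch below reproduces the documented [5, 10, 15] example; expand_sample_sizes is a hypothetical helper, and rounding down when the row count is not divisible by k is an assumption.

```python
def expand_sample_sizes(k, n_rows):
    """Return k equidistant sample sizes ending at n_rows.

    Hypothetical helper for illustration; not part of syng_bts.
    """
    if k < 1 or k > n_rows:
        raise ValueError("k must be between 1 and the number of rows")
    # Integer arithmetic keeps every size a whole number of rows
    # (assumption: fractional steps are rounded down).
    return [n_rows * i // k for i in range(1, k + 1)]


sizes_15 = expand_sample_sizes(3, 15)    # [5, 10, 15], as in the example above
sizes_100 = expand_sample_sizes(4, 100)  # [25, 50, 75, 100]
```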

API Reference

syng_bts.evaluate_sample_sizes(data: pd.DataFrame | SyngResult, sample_sizes: list[int] | np.ndarray | pd.Series | int, groups: np.ndarray | pd.Series | list | None = None, which: str = 'generated', n_draws: int = 5, apply_log: bool = True, methods: list[str] | None = None, verbose: int | str = 'minimal') -> pd.DataFrame

Evaluate classifiers across candidate sample sizes.

For each classifier and each candidate sample size, performs n_draws rounds of stratified sampling (proportional to class distribution), applies 5-fold cross-validation, and averages metrics across folds.
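One round of the stratified sampling described above could look like the following sketch, which draws row indices so that each class keeps roughly its share of the data. stratified_draw is a hypothetical helper, and the largest-remainder rule used to round class quotas is an assumption; the library may allocate rows differently.

```python
import numpy as np


def stratified_draw(groups, size, rng):
    """Pick `size` row indices with class proportions preserved.

    Hypothetical helper for illustration; not part of syng_bts.
    """
    groups = np.asarray(groups)
    labels, counts = np.unique(groups, return_counts=True)
    exact = counts / counts.sum() * size      # ideal (fractional) quota per class
    quota = np.floor(exact).astype(int)
    leftover = size - quota.sum()
    # Hand leftover slots to the classes with the largest fractional parts.
    order = np.argsort(-(exact - quota), kind="stable")
    quota[order[:leftover]] += 1
    picked = [
        rng.choice(np.flatnonzero(groups == lab), size=k, replace=False)
        for lab, k in zip(labels, quota)
    ]
    return np.sort(np.concatenate(picked))


rng = np.random.default_rng(0)
labels = np.array(["A"] * 60 + ["B"] * 40)    # 60/40 class split
idx = stratified_draw(labels, size=25, rng=rng)
n_a = int((labels[idx] == "A").sum())         # 15 of the 25 rows come from class A
```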

Parameters:
  • data (pd.DataFrame or SyngResult) – The dataset to evaluate. When a SyngResult is provided, the which parameter selects the data attribute and groups are auto-resolved from the corresponding *_groups field.

  • sample_sizes (list[int], np.ndarray, pd.Series, or int) – Candidate sample sizes to evaluate. Accepts a list, numpy array, or pandas Series of positive integers. When a single int is provided it is interpreted as the number of equidistant sizes to create — the maximum equals the number of data rows. For example, sample_sizes=3 with 15-row data produces [5, 10, 15].

  • groups (array-like or None) – Class labels corresponding to the rows of data. Required when data is a pd.DataFrame. When provided alongside a SyngResult, overrides the auto-resolved groups.

  • which (str, default "generated") – Selector when data is a SyngResult: "generated", "original", or "reconstructed".

  • n_draws (int, default 5) – Number of resampling repetitions for each sample size.

  • apply_log (bool, default True) – When True, a log2(x + 1) transform is applied to the data before evaluation.

  • methods (list[str] or None) – Classifier names to evaluate. Accepts canonical names ('LOGIS', 'SVM', 'KNN', 'RF', 'XGB') and common aliases ('LOGISTIC', 'LR', 'RANDOM_FOREST', 'XGBOOST'). Defaults to all five classifiers.

  • verbose (int or str, default "minimal") – Controls output verbosity. Accepts 0 / "silent" (no output), 1 / "minimal" (one dynamic overall progress bar across all sample sizes, draws, and methods), or 2 / "detailed" (per-draw/method metric lines).

Returns:

A long-format DataFrame with one row per sample size, draw, and method. Columns: total_size, draw, method, f1_score, accuracy, auc.

Return type:

pd.DataFrame

Raises:
  • TypeError – If data is not a pd.DataFrame or SyngResult.

  • ValueError – If groups is missing when required, which is invalid, methods contains unknown names, sample_sizes is empty or contains non-positive values, or any sample size exceeds the number of available rows.

Examples

Using a DataFrame:

>>> import pandas as pd
>>> from syng_bts import evaluate_sample_sizes
>>> df = pd.read_csv("mydata.csv")
>>> groups = df.pop("group")
>>> result = evaluate_sample_sizes(df, sample_sizes=[50, 100], groups=groups)

Using a SyngResult:

>>> from syng_bts import generate
>>> sr = generate(data="BRCASubtypeSel_test", model="CVAE1-20", epoch=10)
>>> result = evaluate_sample_sizes(sr, sample_sizes=[50], which="generated")
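Because the returned frame holds one row per draw, a common next step is to average over draws before inspecting or tabulating results. The sketch below does this with pandas on a small hand-written frame that only mimics the documented column layout; the numbers are made up and are not real evaluation results.

```python
import pandas as pd

# Made-up metrics in the documented column layout (not real results).
metrics = pd.DataFrame({
    "total_size": [50, 50, 50, 50, 100, 100, 100, 100],
    "draw":       [1, 2, 1, 2, 1, 2, 1, 2],
    "method":     ["KNN", "KNN", "RF", "RF", "KNN", "KNN", "RF", "RF"],
    "f1_score":   [0.70, 0.74, 0.80, 0.82, 0.78, 0.80, 0.88, 0.90],
    "accuracy":   [0.72, 0.76, 0.81, 0.83, 0.79, 0.81, 0.89, 0.91],
    "auc":        [0.80, 0.82, 0.88, 0.90, 0.85, 0.87, 0.93, 0.95],
})

# Average each metric over draws for every (sample size, method) pair.
summary = (
    metrics
    .groupby(["total_size", "method"])[["f1_score", "accuracy", "auc"]]
    .mean()
    .reset_index()
)

knn_50 = summary[(summary["total_size"] == 50) & (summary["method"] == "KNN")]
```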

syng_bts.plot_sample_sizes(metric_real: DataFrame, n_target: int | list, metric_generated: DataFrame | None = None, metric_name: str = 'f1_score') -> matplotlib.pyplot.Figure

Visualize IPLF learning curves fitted from evaluation metrics.

Fits inverse power-law curves to the evaluation metrics produced by evaluate_sample_sizes() and plots observed values, fitted curves, and 95% confidence intervals.

The returned figure is never displayed automatically; call fig.savefig(...) to save it or plt.show() to display it.

Parameters:
  • metric_real (pd.DataFrame) – Metrics from evaluate_sample_sizes() on real data.

  • n_target (int or list) – Target sample sizes for extrapolation reference.

  • metric_generated (pd.DataFrame or None) – Metrics from evaluate_sample_sizes() on generated data. When provided, a second column of panels is added.

  • metric_name (str, default "f1_score") – Metric to visualize ("f1_score", "accuracy", or "auc").

Returns:

The figure containing the learning-curve panels.

Return type:

matplotlib.figure.Figure

Examples

>>> metrics = evaluate_sample_sizes(df, [50, 100, 200], groups=g)
>>> fig = plot_sample_sizes(metrics, n_target=300)
>>> fig.savefig("learning_curves.png")