Sample-Size Evaluation (SyntheSize)

SyNG-BTS integrates the SyntheSize methodology for exploring how classifier performance changes across candidate subset sizes. It visualizes learning-curve behavior; it does not calculate a required or optimal sample size.

The integration provides two public functions:

evaluate_sample_sizes() — Evaluate classifiers across candidate sample sizes using stratified cross-validation or a fixed external evaluation set.
plot_sample_sizes() — Visualize inverse power-law (IPLF) learning curves fitted from evaluation metrics.

Background 

The SyntheSize approach trains multiple classifiers (logistic regression, SVM, KNN, random forest, XGBoost) at varying sample sizes and fits inverse power-law curves to the resulting metrics (F1, accuracy, AUC). This reveals how classification performance changes with data volume and supports exploratory assessment of whether generating more synthetic samples could improve downstream analyses.

For more details on the methodology, see:

SyntheSize (R): https://github.com/LXQin/SyntheSize
SyntheSize (Python): https://github.com/LXQin/SyntheSize_py
Qi Y, Wang X, Qin LX. Optimizing sample size for supervised machine learning with bulk transcriptomic sequencing: a learning curve approach. Brief Bioinform. 2025;26(2):bbaf097. https://doi.org/10.1093/bib/bbaf097

Quick Start 

Evaluate a DataFrame 

import numpy as np
import pandas as pd
from syng_bts import evaluate_sample_sizes, plot_sample_sizes, resolve_data

# Load a bundled dataset
data, groups = resolve_data("BRCASubtypeSel_test")

# Evaluate classifiers across sample sizes
metrics = evaluate_sample_sizes(
    data=data,
    sample_sizes=np.arange(25, 201, 25),
    groups=groups,
    n_draws=5,
)
print(metrics.head())

# Plot learning curves
fig = plot_sample_sizes(metrics)
fig.savefig("learning_curves.png")

Evaluate a SyngResult 

When you have a SyngResult with group information (e.g., from a CVAE run), you can pass it directly and groups are auto-resolved:

import numpy as np
from syng_bts import generate, evaluate_sample_sizes, plot_sample_sizes

# Generate synthetic data with a conditional model
result = generate(
    data="BRCASubtypeSel_train",
    model="CVAE1-20",
    apply_log=True,
    epoch=50,
)

# Evaluate the generated data — groups are auto-resolved from result
metrics_gen = evaluate_sample_sizes(
    data=result,
    sample_sizes=np.arange(25, 201, 25),
    which="generated",
)

# Compare real vs generated learning curves
metrics_real = evaluate_sample_sizes(
    data=result,
    sample_sizes=np.arange(25, 201, 25),
    which="original",
)

fig = plot_sample_sizes(
    metric_real=metrics_real,
    metric_generated=metrics_gen,
)
fig.savefig("real_vs_generated.png")

Evaluate Against a Fixed Empirical Test Set 

Pass test_data and test_groups together to train each classifier on the complete candidate subset and evaluate it once on fixed external rows. This is useful for comparing real and generated candidate data against the same empirical observations. The external observations should not have been used to train the generative model.

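# The *_candidate_* and empirical_test_* names are placeholders: supply your own
# DataFrames (with matching feature columns) and their class labels.
metrics_real = evaluate_sample_sizes(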
    data=real_candidate_data,
    sample_sizes=[50, 100, 150],
    groups=real_candidate_groups,
    test_data=empirical_test_data,
    test_groups=empirical_test_groups,
)

metrics_generated = evaluate_sample_sizes(
    data=generated_candidate_data,
    sample_sizes=[50, 100, 150],
    groups=generated_candidate_groups,
    test_data=empirical_test_data,
    test_groups=empirical_test_groups,
)

Both calls return the same one-row-per-size/draw/method table used by the internal cross-validation mode.
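Because the table has one row per size, draw, and method, its expected length is the product of those three counts. The sketch below uses only the standard library and assumes the default n_draws=5 with all five canonical classifiers; the draw index is only a counter and says nothing about how the library numbers the draw column.

```python
from itertools import product

sample_sizes = [50, 100, 150]
n_draws = 5
methods = ["LOGIS", "SVM", "KNN", "RF", "XGB"]

# One (size, draw counter, method) combination per expected row.
combinations = list(product(sample_sizes, range(n_draws), methods))
expected_rows = len(combinations)
print(expected_rows)  # 75
```

Comparing this count with len(metrics) is a quick sanity check that no size, draw, or method was silently dropped.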

Workflow 

Generate synthetic data using generate() (or load existing data).
Evaluate with evaluate_sample_sizes() on both real and generated datasets, optionally using the same fixed empirical test set.
Visualize with plot_sample_sizes() to compare learning curves side by side.

Available Classifiers 

The following classifiers are available via the methods parameter:

Name	Aliases	Description
`LOGIS`	`LOGISTIC`, `LR`	Ridge (L2-penalised) logistic regression via `LogisticRegressionCV`
`SVM`		Support Vector Machine with probability estimates
`KNN`		K-Nearest Neighbors (k=5)
`RF`	`RANDOM_FOREST`	Random Forest (100 trees)
`XGB`	`XGBOOST`	XGBoost gradient-boosted trees
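The alias mapping in the table can be pictured as a simple lookup. The helper below is hypothetical (resolve_methods is not part of syng_bts), and treating names case-insensitively and de-duplicating them are assumptions made for illustration.

```python
# Canonical names and aliases copied from the table above.
ALIASES = {
    "LOGIS": "LOGIS", "LOGISTIC": "LOGIS", "LR": "LOGIS",
    "SVM": "SVM",
    "KNN": "KNN",
    "RF": "RF", "RANDOM_FOREST": "RF",
    "XGB": "XGB", "XGBOOST": "XGB",
}


def resolve_methods(methods=None):
    """Hypothetical helper: map names or aliases to canonical names."""
    if methods is None:
        # Mirrors the documented default of all five classifiers.
        return ["LOGIS", "SVM", "KNN", "RF", "XGB"]
    resolved = []
    for name in methods:
        key = str(name).strip().upper()  # case-insensitivity is an assumption
        if key not in ALIASES:
            raise ValueError(f"Unknown classifier name: {name!r}")
        canonical = ALIASES[key]
        if canonical not in resolved:  # keep first occurrence only
            resolved.append(canonical)
    return resolved


print(resolve_methods(["LR", "RANDOM_FOREST", "XGBOOST"]))  # ['LOGIS', 'RF', 'XGB']
```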

Without an external test set, classifiers are evaluated using 5-fold stratified cross-validation. With an external test set, each classifier is trained on the complete candidate subset and evaluated once on the fixed rows.

Meaning of Candidate Size 

The total_size value and plotted x-axis represent the candidate subset size before evaluation. In internal cross-validation mode, each classifier is trained on about 80% of that subset in each fold. In external-evaluation mode, each classifier is trained on the complete candidate subset.
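The arithmetic behind the two modes can be sketched as follows. approx_training_rows is a hypothetical helper; it ignores the one-row differences between folds that stratified splitting can introduce.

```python
def approx_training_rows(total_size, external_test=False, n_folds=5):
    """Approximate rows used to fit one classifier for a candidate size."""
    if external_test:
        # External-evaluation mode: the whole candidate subset is used for training.
        return total_size
    # Internal cross-validation: one of n_folds folds is held out each time.
    return total_size * (n_folds - 1) // n_folds


for size in (25, 50, 100):
    print(size, approx_training_rows(size), approx_training_rows(size, external_test=True))
```

When comparing curves from the two modes, keep in mind that the same x-axis value therefore corresponds to different amounts of training data.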

Metrics 

Each evaluation returns three metrics per classifier per sample size:

F1 Score (f1_score) — Macro-averaged F1
Accuracy (accuracy) — Overall classification accuracy
AUC (auc) — Area under ROC curve (one-vs-one, macro-averaged for multiclass)
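To make "macro-averaged" concrete, the sketch below computes macro F1 and accuracy by hand on a toy three-class example. It is a standard-library illustration of the definitions, not the library's implementation, and the labels are made up.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "B", "B", "B", "C", "A"]
print(round(macro_f1(y_true, y_pred), 3))  # 0.656
print(round(accuracy(y_true, y_pred), 3))  # 0.667
```

Here the per-class F1 scores are 0.5, 0.8, and about 0.667; macro averaging weights them equally regardless of class size, which is why macro F1 can differ from accuracy on imbalanced data.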

Plot Scaling 

By default, plot_sample_sizes() fixes every panel’s y-axis to (0.4, 1). Pass a different two-value tuple with y_limits to choose a custom range, or pass None to let Matplotlib choose limits automatically:

# Show the full metric range automatically
fig = plot_sample_sizes(metrics, y_limits=None)

# Use a custom fixed range for every panel
fig = plot_sample_sizes(metrics, y_limits=(0.6, 1))

Log Transform 

By default, evaluate_sample_sizes() applies a log2(x + 1) transform (apply_log=True). Set apply_log=False when your input data is already log-transformed. The default behavior matches the preprocessing convention used in SyNG-BTS training. In either evaluation mode, feature standardization is fitted on the candidate training data and then applied unchanged to the corresponding fold or external evaluation data.
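The order of operations described above (log transform, then standardization fitted on training rows only) can be sketched with the standard library. The helper names are hypothetical, and using the population standard deviation with a fallback of 1.0 for constant features is an assumption about the scaler, not a documented detail.

```python
import math
from statistics import fmean, pstdev


def log2p1(rows):
    """Apply log2(x + 1) to every value."""
    return [[math.log2(value + 1.0) for value in row] for row in rows]


def fit_scaler(train_rows):
    """Learn per-feature mean and standard deviation from training rows only."""
    columns = list(zip(*train_rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # avoid dividing by zero
    return means, stds


def apply_scaler(rows, means, stds):
    """Standardize rows with statistics learned elsewhere."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


train_counts = [[0, 3], [1, 15], [3, 63]]  # toy raw counts
test_counts = [[7, 7]]

train_log = log2p1(train_counts)
means, stds = fit_scaler(train_log)
train_scaled = apply_scaler(train_log, means, stds)
# The held-out rows reuse the training statistics unchanged.
test_scaled = apply_scaler(log2p1(test_counts), means, stds)
print(means)
```

Fitting the scaler on the training rows alone keeps information from the fold or external evaluation rows out of the preprocessing step.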

Curve Fitting and Confidence Intervals 

plot_sample_sizes() displays approximate pointwise 95% confidence intervals for the fitted inverse-power-law mean curves. The bands propagate fitted-parameter covariance with the delta method; they are not prediction intervals for individual classifier results.

Three distinct sample sizes are sufficient to fit the three-parameter curve, but at least four fitted points are required to estimate parameter covariance and display a confidence band. With exactly three points, the fitted curve is shown without a band and a warning explains why.
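As an illustration of how a delta-method band is formed, the sketch below assumes a three-parameter inverse power-law form f(n) = (1 - a) - b * n^c. The functional form, the parameter values, the diagonal covariance matrix, and the normal quantile 1.96 are all illustrative assumptions rather than values taken from the library or from a real fit.

```python
import math


def iplf(n, a, b, c):
    """Assumed inverse power-law form: (1 - a) - b * n**c."""
    return (1.0 - a) - b * n ** c


def iplf_gradient(n, a, b, c):
    """Partial derivatives of iplf with respect to (a, b, c)."""
    power = n ** c
    return [-1.0, -power, -b * power * math.log(n)]


def delta_method_band(n, params, cov, z=1.96):
    """Pointwise band for the fitted mean curve: f(n) +/- z * sqrt(g' Cov g)."""
    grad = iplf_gradient(n, *params)
    variance = sum(
        grad[i] * cov[i][j] * grad[j] for i in range(3) for j in range(3)
    )
    half_width = z * math.sqrt(max(variance, 0.0))
    centre = iplf(n, *params)
    return centre - half_width, centre, centre + half_width


params = (0.05, 2.0, -0.5)  # hypothetical fitted (a, b, c)
cov = [                     # hypothetical parameter covariance
    [1e-4, 0.0, 0.0],
    [0.0, 4e-2, 0.0],
    [0.0, 0.0, 1e-3],
]
lower, centre, upper = delta_method_band(100, params, cov)
print(round(lower, 3), round(centre, 3), round(upper, 3))
```

The band reflects uncertainty in the fitted parameters only, which is why it describes the mean curve and not the spread of individual classifier results.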

The nonlinear fit uses the same increasing row weights as the R implementation. After ordering the m curve points by candidate size, their weights are 1/m, 2/m, ..., m/m, giving larger candidate sizes greater weight.
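The weighting rule is easy to reproduce by hand. curve_weights below is a hypothetical helper that pairs each distinct candidate size with its weight.

```python
def curve_weights(candidate_sizes):
    """Return (size, weight) pairs with weights 1/m, 2/m, ..., m/m by ascending size."""
    ordered = sorted(candidate_sizes)
    m = len(ordered)
    return [(size, rank / m) for rank, size in enumerate(ordered, start=1)]


print(curve_weights([100, 25, 50, 75]))
# [(25, 0.25), (50, 0.5), (75, 0.75), (100, 1.0)]
```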

Verbosity 

The verbose parameter of evaluate_sample_sizes() controls console output during evaluation. It accepts the same levels used by the training functions (generate(), pilot_study(), transfer()):

Level	Name	Behaviour
`0`	`"silent"`	No output.
`1`	`"minimal"`	A single dynamically updated progress-bar line covering all sample sizes, draws, and methods, annotated with the current size index/`n`, draw, and method (default).
`2`	`"detailed"`	Per-draw / per-method metric lines (previous default behaviour).

Example:

# Detailed logging
metrics = evaluate_sample_sizes(data, sample_sizes=[50, 100],
                                groups=groups, verbose="detailed")
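Since the integer and string forms are interchangeable, wrapper code may want to normalize them to one representation. normalize_verbosity is a hypothetical helper, and raising ValueError for unrecognized values is an assumption about sensible behaviour, not a documented guarantee.

```python
_LEVELS = {0: "silent", 1: "minimal", 2: "detailed"}


def normalize_verbosity(verbose="minimal"):
    """Hypothetical helper: return the integer level for an int or str input."""
    names = {name: level for level, name in _LEVELS.items()}
    if isinstance(verbose, str) and verbose in names:
        return names[verbose]
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(verbose, int) and not isinstance(verbose, bool) and verbose in _LEVELS:
        return verbose
    raise ValueError(f"Unsupported verbosity: {verbose!r}")


print(normalize_verbosity("detailed"))  # 2
print(normalize_verbosity(0))           # 0
```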

Reproducibility 

Set random_seed to an integer to reproduce candidate sampling, shuffled cross-validation splits, and stochastic classifier fits.

metrics = evaluate_sample_sizes(
    data,
    sample_sizes=[50, 100],
    groups=groups,
    random_seed=42,
)

Sample-Size Shortcuts 

sample_sizes accepts a list, NumPy array, pandas Series, or a single integer. A single integer k is interpreted as the number of equidistant sizes to generate, with the largest size equal to the number of rows in the input data. k cannot exceed the number of rows.

# Equivalent to sample_sizes=[5, 10, 15] for 15-row data
metrics = evaluate_sample_sizes(data, sample_sizes=3, groups=groups)
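One way to picture the expansion of an integer k into a grid is sketched below. expand_size_grid is a hypothetical helper; it reproduces the documented 15-row example, but the floor rounding used when the row count is not divisible by k is an assumption.

```python
def expand_size_grid(k, n_rows):
    """Hypothetical helper: k equidistant sizes ending at n_rows."""
    if not 1 <= k <= n_rows:
        raise ValueError("k must be between 1 and the number of rows")
    # Integer arithmetic keeps the last size exactly equal to n_rows.
    return [n_rows * i // k for i in range(1, k + 1)]


print(expand_size_grid(3, 15))  # [5, 10, 15]
```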

API Reference 

Evaluate classifiers across candidate sample sizes.

For each classifier and candidate sample size, performs n_draws rounds of stratified sampling proportional to the input class distribution. When no external test set is supplied, metrics are averaged over 5-fold stratified cross-validation. When test_data and test_groups are supplied, each classifier is trained on the complete candidate subset and evaluated once on those fixed external rows.

The returned total_size is the candidate subset size. Internal cross-validation trains each fold on about 80% of that subset; external evaluation trains on the complete subset.
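Stratified sampling proportional to the class distribution can be sketched with the standard library as follows. stratified_draw is a hypothetical helper, and the largest-remainder rule used to round per-class counts is an assumption; the library may allocate rows differently.

```python
import random
from collections import Counter, defaultdict


def stratified_draw(labels, size, seed=None):
    """Pick `size` row indices with class shares close to those in `labels`."""
    total = len(labels)
    if not 0 < size <= total:
        raise ValueError("size must be between 1 and the number of rows")
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    # Proportional quota per class, rounded down, then distribute the
    # leftover rows to the classes with the largest fractional parts.
    quotas = {label: size * len(rows) / total for label, rows in by_class.items()}
    counts = {label: int(quota) for label, quota in quotas.items()}
    leftover = size - sum(counts.values())
    by_remainder = sorted(quotas, key=lambda lab: quotas[lab] - counts[lab], reverse=True)
    for label in by_remainder[:leftover]:
        counts[label] += 1
    chosen = []
    for label, rows in by_class.items():
        chosen.extend(rng.sample(rows, counts[label]))
    return sorted(chosen)


labels = ["A"] * 60 + ["B"] * 30 + ["C"] * 10
picked = stratified_draw(labels, 20, seed=0)
print(Counter(labels[i] for i in picked))
```

With a 60/30/10 class split and a candidate size of 20, the draw contains 12, 6, and 2 rows of the three classes; repeating the draw with different seeds gives the n_draws replicates.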

Parameters:

data (pd.DataFrame or SyngResult) – The dataset to evaluate. When a SyngResult is provided, the which parameter selects the data attribute and groups are auto-resolved from the corresponding *_groups field.
sample_sizes (list[int], np.ndarray, pd.Series, or int) – Candidate sample sizes to evaluate. Accepts a list, numpy array, or pandas Series of positive integers. When a single int is provided it is interpreted as the number of equidistant sizes to create — the maximum equals the number of data rows. For example, sample_sizes=3 with 15-row data produces [5, 10, 15]. The grid count cannot exceed the number of data rows.
groups (array-like or None) – Class labels corresponding to the rows of data. Required when data is a pd.DataFrame. When provided alongside a SyngResult, overrides the auto-resolved groups.
which (str, default "generated") – Selector when data is a SyngResult: "generated", "original", or "reconstructed".
n_draws (int, default 5) – Number of resampling repetitions for each sample size.
apply_log (bool, default True) – When True, a log2(x + 1) transform is applied to the candidate and external data before evaluation.
methods (list[str] or None) – Classifier names to evaluate. Accepts canonical names ('LOGIS', 'SVM', 'KNN', 'RF', 'XGB') and common aliases ('LOGISTIC', 'LR', 'RANDOM_FOREST', 'XGBOOST'). Defaults to all five classifiers.
verbose (int or str, default "minimal") – Controls output verbosity. Accepts 0 / "silent" (no output), 1 / "minimal" (one dynamic overall progress bar across all sample sizes, draws, and methods), or 2 / "detailed" (per-draw/method metric lines).
test_data (pd.DataFrame or None) – Fixed external evaluation data. Must have the same feature columns as data. When supplied, test_groups is also required. External rows are transformed using preprocessing fitted on each candidate subset.
test_groups (array-like or None) – Class labels corresponding to the rows of test_data. Must be supplied together with test_data and use labels present in groups.
random_seed (int or None) – Seed for candidate sampling, shuffled cross-validation, and stochastic classifiers.

Returns:

A table with one row per candidate size, draw, and method. Columns: total_size, draw, method, f1_score, accuracy, auc.

Return type:

pd.DataFrame

Raises:

TypeError – If data is not a pd.DataFrame or SyngResult, or supplied test_data is not a pd.DataFrame.
ValueError – If groups is missing when required, which is invalid, methods contains unknown names, sample_sizes is empty or contains non-positive values, or any sample size exceeds the number of available rows. Also raised when only one of test_data and test_groups is supplied, when the external rows, labels, or feature columns are incompatible, or when numerical values are invalid.

Examples

Using a DataFrame:

>>> df = pd.read_csv("mydata.csv")
>>> groups = df.pop("group")
>>> result = evaluate_sample_sizes(df, sample_sizes=[50, 100], groups=groups)

Using a SyngResult:

>>> from syng_bts import generate
>>> sr = generate(data="BRCASubtypeSel_test", model="CVAE1-20", epoch=10)
>>> result = evaluate_sample_sizes(sr, sample_sizes=[50], which="generated")

Evaluating candidate data on a fixed empirical test set:

>>> result = evaluate_sample_sizes(
...     df,
...     sample_sizes=[50, 100],
...     groups=groups,
...     test_data=empirical_test,
...     test_groups=empirical_test_groups,
... )

syng_bts.plot_sample_sizes(metric_real: DataFrame, metric_generated: DataFrame | None = None, metric_name: str = 'f1_score', y_limits: tuple[float, float] | None = (0.4, 1)) → matplotlib.pyplot.Figure

Visualize IPLF learning curves fitted from evaluation metrics.

Fits weighted inverse power-law curves to the evaluation metrics produced by evaluate_sample_sizes() and plots observed values, fitted curves, and approximate pointwise 95% confidence intervals for the fitted mean curves. These bands are not prediction intervals. Three distinct sample sizes are sufficient to fit the curve, but at least four fitted points are required to estimate parameter covariance and display a confidence band.

The returned figure is never displayed automatically; call plt.show() to display it or fig.savefig(...) to save it.

Parameters:

metric_real (pd.DataFrame) – Metrics from evaluate_sample_sizes() on real data.
metric_generated (pd.DataFrame or None) – Metrics from evaluate_sample_sizes() on generated data. When provided, a second column of panels is added.
metric_name (str, default "f1_score") – Metric to visualize ("f1_score", "accuracy", or "auc").
y_limits (tuple of float or None, default (0.4, 1)) – Limits applied to the y-axis of every panel. Set to None to use Matplotlib’s automatic scaling.

Returns:

The figure containing the learning-curve panels.

Return type:

matplotlib.figure.Figure

Examples

>>> metrics = evaluate_sample_sizes(df, [50, 100, 150, 200], groups=g)
>>> fig = plot_sample_sizes(metrics)
>>> fig.savefig("learning_curves.png")