Sample-Size Evaluation (SyntheSize)
SyNG-BTS integrates the SyntheSize methodology for evaluating how classifier performance scales with sample size. This is useful for answering the question: “How many samples do I need for reliable classification?”
The integration provides two public functions:
- evaluate_sample_sizes(): Evaluate classifiers across candidate sample sizes using stratified cross-validation.
- plot_sample_sizes(): Visualize inverse power-law (IPLF) learning curves fitted from evaluation metrics.
Background
The SyntheSize approach trains multiple classifiers (logistic regression, SVM, KNN, random forest, XGBoost) at varying sample sizes and fits inverse power-law curves to the resulting metrics (F1, accuracy, AUC). This reveals how classification performance scales with data volume and helps determine whether generating more synthetic samples would improve downstream analyses.
For more details on the methodology, see:
SyntheSize (R): https://github.com/LXQin/SyntheSize
SyntheSize (Python): https://github.com/LXQin/SyntheSize_py
Qi Y, Wang X, Qin LX. Optimizing sample size for supervised machine learning with bulk transcriptomic sequencing: a learning curve approach. Brief Bioinform. 2025;26(2):bbaf097. https://doi.org/10.1093/bib/bbaf097
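The curve-fitting idea from the Background paragraph can be sketched with SciPy. This is a minimal illustration, not the SyntheSize implementation: the three-parameter form `(1 - a) - b * n**c` is one common way to write an inverse power law and is assumed here, as are the helper name `inverse_power_law`, the starting values, the bounds, and the made-up scores.

```python
import numpy as np
from scipy.optimize import curve_fit


def inverse_power_law(n, a, b, c):
    """Assumed learning-curve form: tends to (1 - a) as n grows when c < 0."""
    return (1.0 - a) - b * np.power(n, c)


# Candidate sample sizes and illustrative scores generated from known
# parameters (synthetic values, not results from any real dataset).
sizes = np.arange(25, 201, 25, dtype=float)
true_params = (0.05, 1.5, -0.6)
scores = inverse_power_law(sizes, *true_params)

# Fit the three parameters; bounds keep the exponent negative so the
# fitted curve levels off instead of growing without limit.
params, _ = curve_fit(
    inverse_power_law,
    sizes,
    scores,
    p0=(0.1, 1.0, -0.5),
    bounds=([0.0, 0.0, -5.0], [1.0, 10.0, 0.0]),
)

# Extrapolate the fitted curve to a larger, not yet observed sample size.
predicted_400 = inverse_power_law(400.0, *params)
print(f"predicted score at n=400: {predicted_400:.3f}")
```

A curve that is already flat at the largest observed size suggests that adding more samples, real or synthetic, would change the metric very little.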
Quick Start
Evaluate a DataFrame
```python
import numpy as np
import pandas as pd
from syng_bts import evaluate_sample_sizes, plot_sample_sizes, resolve_data

# Load a bundled dataset
data, groups = resolve_data("BRCASubtypeSel_test")

# Evaluate classifiers across sample sizes
metrics = evaluate_sample_sizes(
    data=data,
    sample_sizes=np.arange(25, 201, 25),
    groups=groups,
    n_draws=5,
)
print(metrics.head())

# Plot learning curves
fig = plot_sample_sizes(metrics, n_target=200)
fig.savefig("learning_curves.png")
```
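Because the returned metrics hold one row per sample size, draw, and method, it is often handy to average over draws before reading the numbers. The sketch below uses a small hand-written DataFrame with the documented column names; the values (and the draw numbering) are made up for illustration and are not output from evaluate_sample_sizes().

```python
import pandas as pd

# Made-up rows that mimic the documented columns:
# total_size, draw, method, f1_score, accuracy, auc.
metrics = pd.DataFrame(
    {
        "total_size": [50, 50, 100, 100],
        "draw": [1, 2, 1, 2],
        "method": ["SVM", "SVM", "SVM", "SVM"],
        "f1_score": [0.70, 0.74, 0.80, 0.84],
        "accuracy": [0.72, 0.76, 0.82, 0.86],
        "auc": [0.80, 0.82, 0.88, 0.90],
    }
)

# Average each metric over draws for every (method, sample size) pair.
summary = (
    metrics.groupby(["method", "total_size"])[["f1_score", "accuracy", "auc"]]
    .mean()
    .reset_index()
)
print(summary)
```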
Evaluate a SyngResult
When you have a SyngResult with group information (e.g., from a CVAE run), you can pass it directly and groups are auto-resolved:

```python
import numpy as np
from syng_bts import generate, evaluate_sample_sizes, plot_sample_sizes

# Generate synthetic data with a conditional model
result = generate(
    data="BRCASubtypeSel_train",
    model="CVAE1-20",
    apply_log=True,
    epoch=50,
)

# Evaluate the generated data; groups are auto-resolved from result
metrics_gen = evaluate_sample_sizes(
    data=result,
    sample_sizes=np.arange(25, 201, 25),
    which="generated",
)

# Evaluate the original data held by the same result, so that real and
# generated learning curves can be compared
metrics_real = evaluate_sample_sizes(
    data=result,
    sample_sizes=np.arange(25, 201, 25),
    which="original",
)

fig = plot_sample_sizes(
    metric_real=metrics_real,
    n_target=200,
    metric_generated=metrics_gen,
)
fig.savefig("real_vs_generated.png")
```
Workflow
1. Generate synthetic data using generate() (or load existing data).
2. Evaluate with evaluate_sample_sizes() on both real and generated datasets.
3. Visualize with plot_sample_sizes() to compare learning curves side by side.
Available Classifiers
The following classifiers are available via the methods parameter:
| Name | Aliases | Description |
|---|---|---|
| LOGIS | LOGISTIC, LR | Ridge (L2-penalised) logistic regression via … |
| SVM | | Support Vector Machine with probability estimates |
| KNN | | K-Nearest Neighbors (k=5) |
| RF | RANDOM_FOREST | Random Forest (100 trees) |
| XGB | XGBOOST | XGBoost gradient-boosted trees |
All classifiers are evaluated using 5-fold stratified cross-validation.
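Before cross-validation, each draw takes a stratified subsample of the requested size, proportional to the class distribution (see the API reference). The following NumPy sketch shows one way such a draw could be made. The helper name `stratified_draw` and the largest-remainder rounding rule are assumptions for illustration, not the library's implementation.

```python
import numpy as np


def stratified_draw(groups, size, rng):
    """Return row indices of a subsample whose class shares follow `groups`."""
    groups = np.asarray(groups)
    labels, counts = np.unique(groups, return_counts=True)

    # Proportional allocation, rounded down; any leftover slots go to the
    # classes with the largest fractional remainders.
    exact = counts * size / counts.sum()
    alloc = np.floor(exact).astype(int)
    leftover = size - alloc.sum()
    order = np.argsort(exact - alloc)[::-1]
    alloc[order[:leftover]] += 1

    # Sample without replacement inside each class.
    picked = [
        rng.choice(np.flatnonzero(groups == label), size=k, replace=False)
        for label, k in zip(labels, alloc)
    ]
    return np.concatenate(picked)


rng = np.random.default_rng(0)
groups = np.array(["A"] * 60 + ["B"] * 40)
idx = stratified_draw(groups, size=25, rng=rng)

# A 60/40 class split gives 15 "A" rows and 10 "B" rows in a draw of 25.
print(np.unique(groups[idx], return_counts=True))
```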
Metrics
Each evaluation returns three metrics per classifier per sample size:
- F1 Score (f1_score): Macro-averaged F1
- Accuracy (accuracy): Overall classification accuracy
- AUC (auc): Area under ROC curve (one-vs-one, macro-averaged for multiclass)
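To make the first two definitions concrete, here is a small NumPy sketch that computes macro-averaged F1 and accuracy by hand. It is illustrative only: the helper name `macro_f1_and_accuracy` and the toy labels are made up, and the library may compute these metrics through a different code path.

```python
import numpy as np


def macro_f1_and_accuracy(y_true, y_pred):
    """Macro F1 (unweighted mean of per-class F1) and overall accuracy."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    f1_per_class = []
    # Classes are taken from the true labels; this sketch assumes every
    # predicted label also occurs among them.
    for label in np.unique(y_true):
        tp = np.sum((y_pred == label) & (y_true == label))
        fp = np.sum((y_pred == label) & (y_true != label))
        fn = np.sum((y_pred != label) & (y_true == label))
        denom = 2 * tp + fp + fn
        f1_per_class.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(f1_per_class)), float(np.mean(y_true == y_pred))


macro_f1, accuracy = macro_f1_and_accuracy(
    y_true=[0, 0, 1, 1, 2, 2],
    y_pred=[0, 1, 1, 1, 2, 0],
)
print(round(macro_f1, 4), round(accuracy, 4))  # 0.6556 0.6667
```

Macro averaging gives every class the same weight regardless of its size, so a poorly classified minority subtype lowers the score as much as a poorly classified majority one.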
Log Transform
By default, evaluate_sample_sizes() applies a
log2(x + 1) transform (apply_log=True). Set apply_log=False
when your input data is already log-transformed. The default behavior matches
the preprocessing convention used in SyNG-BTS training.
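As a quick illustration of the transform itself (plain NumPy, independent of the library), the sketch below applies log2(x + 1) to a few made-up counts and inverts it again.

```python
import numpy as np

# Made-up raw counts, chosen so the transformed values are whole numbers.
counts = np.array([[0.0, 1.0, 3.0], [7.0, 15.0, 1023.0]])

# The log2(x + 1) transform: zero counts stay at zero.
logged = np.log2(counts + 1)
print(logged)  # [[0, 1, 2], [3, 4, 10]]

# Inverse transform, recovering the original counts.
restored = np.exp2(logged) - 1
```

Applying the transform twice compresses the values a second time, which is why apply_log=False matters for data that is already on the log scale.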
Verbosity
The verbose parameter of evaluate_sample_sizes() controls
console output during evaluation. It accepts the same levels used by the
training functions (generate(), pilot_study(),
transfer()):
| Level | Name | Behaviour |
|---|---|---|
| 0 | "silent" | No output. |
| 1 | "minimal" | One dynamically updated overall progress-bar line across all sample sizes, draws, and methods (default), while showing current size index/… |
| 2 | "detailed" | Per-draw / per-method metric lines (previous default behaviour). |
Example:
```python
# Detailed logging
metrics = evaluate_sample_sizes(
    data, sample_sizes=[50, 100], groups=groups, verbose="detailed"
)
```
Sample-Size Shortcuts
sample_sizes accepts a list, numpy array, pandas Series, or a
single integer. When a single integer k is provided it is interpreted as
the desired number of equidistant sizes — the maximum equals the number of
rows in the input data.
```python
# Equivalent to sample_sizes=[5, 10, 15] for 15-row data
metrics = evaluate_sample_sizes(data, sample_sizes=3, groups=groups)
```
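One way to picture the integer shortcut is the small sketch below. The helper `equidistant_sizes` is hypothetical and its rounding rule is an assumption; it only reproduces the documented example (3 sizes for 15 rows) and may differ from the library when the row count is not divisible by k.

```python
def equidistant_sizes(n_rows, k):
    """Hypothetical helper: k evenly spaced sample sizes ending at n_rows."""
    # Rounding to the nearest integer is an assumption made for this sketch.
    return [int(round(n_rows * i / k)) for i in range(1, k + 1)]


print(equidistant_sizes(15, 3))  # [5, 10, 15]
```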
API Reference
- syng_bts.evaluate_sample_sizes(data: pd.DataFrame | SyngResult, sample_sizes: list[int] | np.ndarray | pd.Series | int, groups: np.ndarray | pd.Series | list | None = None, which: str = 'generated', n_draws: int = 5, apply_log: bool = True, methods: list[str] | None = None, verbose: int | str = 'minimal') -> pd.DataFrame
Evaluate classifiers across candidate sample sizes.
For each classifier and each candidate sample size, performs n_draws rounds of stratified sampling (proportional to class distribution), applies 5-fold cross-validation, and averages metrics across folds.
- Parameters:
data (pd.DataFrame or SyngResult) – The dataset to evaluate. When a SyngResult is provided, the which parameter selects the data attribute and groups are auto-resolved from the corresponding *_groups field.
sample_sizes (list[int], np.ndarray, pd.Series, or int) – Candidate sample sizes to evaluate. Accepts a list, numpy array, or pandas Series of positive integers. When a single int is provided it is interpreted as the number of equidistant sizes to create; the maximum equals the number of data rows. For example, sample_sizes=3 with 15-row data produces [5, 10, 15].
groups (array-like or None) – Class labels corresponding to the rows of data. Required when data is a pd.DataFrame. When provided alongside a SyngResult, overrides the auto-resolved groups.
which (str, default "generated") – Selector when data is a SyngResult: "generated", "original", or "reconstructed".
n_draws (int, default 5) – Number of resampling repetitions for each sample size.
apply_log (bool, default True) – When True, a log2(x + 1) transform is applied to the data before evaluation.
methods (list[str] or None) – Classifier names to evaluate. Accepts canonical names ('LOGIS', 'SVM', 'KNN', 'RF', 'XGB') and common aliases ('LOGISTIC', 'LR', 'RANDOM_FOREST', 'XGBOOST'). Defaults to all five classifiers.
verbose (int or str, default "minimal") – Controls output verbosity. Accepts 0 / "silent" (no output), 1 / "minimal" (one dynamic overall progress bar across all sample sizes, draws, and methods), or 2 / "detailed" (per-draw/method metric lines).
- Returns:
Columns: total_size, draw, method, f1_score, accuracy, auc.
- Return type:
pd.DataFrame
- Raises:
TypeError – If data is not a pd.DataFrame or SyngResult.
ValueError – If groups is missing when required, which is invalid, methods contains unknown names, sample_sizes is empty or contains non-positive values, or any sample size exceeds the number of available rows.
Examples
Using a DataFrame:
>>> df = pd.read_csv("mydata.csv")
>>> groups = df.pop("group")
>>> result = evaluate_sample_sizes(df, sample_sizes=[50, 100], groups=groups)
Using a SyngResult:
>>> from syng_bts import generate
>>> sr = generate(data="BRCASubtypeSel_test", model="CVAE1-20", epoch=10)
>>> result = evaluate_sample_sizes(sr, sample_sizes=[50], which="generated")
- syng_bts.plot_sample_sizes(metric_real: DataFrame, n_target: int | list, metric_generated: DataFrame | None = None, metric_name: str = 'f1_score') -> matplotlib.pyplot.Figure
Visualize IPLF learning curves fitted from evaluation metrics.
Fits inverse power-law curves to the evaluation metrics produced by evaluate_sample_sizes() and plots observed values, fitted curves, and 95% confidence intervals.
The returned figure is never displayed automatically; call fig.savefig(...) or plt.show() explicitly to save or display it.
- Parameters:
metric_real (pd.DataFrame) – Metrics from evaluate_sample_sizes() on real data.
n_target (int or list) – Target sample sizes for extrapolation reference.
metric_generated (pd.DataFrame or None) – Metrics from evaluate_sample_sizes() on generated data. When provided, a second column of panels is added.
metric_name (str, default "f1_score") – Metric to visualize ("f1_score", "accuracy", or "auc").
- Returns:
The figure containing the learning-curve panels.
- Return type:
matplotlib.figure.Figure
Examples
>>> metrics = evaluate_sample_sizes(df, [50, 100, 200], groups=g)
>>> fig = plot_sample_sizes(metrics, n_target=300)
>>> fig.savefig("learning_curves.png")