Sample-Size Evaluation (SyntheSize) ==================================== SyNG-BTS integrates the `SyntheSize `_ methodology for evaluating how classifier performance scales with sample size. This is useful for answering the question: *"How many samples do I need for reliable classification?"* The integration provides two public functions: - :func:`~syng_bts.evaluate_sample_sizes` — Evaluate classifiers across candidate sample sizes using stratified cross-validation. - :func:`~syng_bts.plot_sample_sizes` — Visualize inverse power-law (IPLF) learning curves fitted from evaluation metrics. .. contents:: Table of Contents :local: :depth: 2 Background ---------- The SyntheSize approach trains multiple classifiers (logistic regression, SVM, KNN, random forest, XGBoost) at varying sample sizes and fits inverse power-law curves to the resulting metrics (F1, accuracy, AUC). This reveals how classification performance scales with data volume and helps determine whether generating more synthetic samples would improve downstream analyses. For more details on the methodology, see: - **SyntheSize (R)**: https://github.com/LXQin/SyntheSize - **SyntheSize (Python)**: https://github.com/LXQin/SyntheSize_py - Qi Y, Wang X, Qin LX. *Optimizing sample size for supervised machine learning with bulk transcriptomic sequencing: a learning curve approach.* Brief Bioinform. 2025;26(2):bbaf097. https://doi.org/10.1093/bib/bbaf097 Quick Start ----------- Evaluate a DataFrame ~~~~~~~~~~~~~~~~~~~~ .. code-block:: python import numpy as np import pandas as pd from syng_bts import evaluate_sample_sizes, plot_sample_sizes, resolve_data # Load a bundled dataset data, groups = resolve_data("BRCASubtypeSel_test") # Evaluate classifiers across sample sizes metrics = evaluate_sample_sizes( data=data, sample_sizes=np.arange(25, 201, 25), groups=groups, n_draws=5, ) print(metrics.head()) # Plot learning curves fig = plot_sample_sizes(metrics, n_target=200) fig.savefig("learning_curves.png") Evaluate a SyngResult ~~~~~~~~~~~~~~~~~~~~~ When you have a :class:`~syng_bts.SyngResult` with group information (e.g., from a CVAE run), you can pass it directly and groups are auto-resolved: .. code-block:: python import numpy as np from syng_bts import generate, evaluate_sample_sizes, plot_sample_sizes # Generate synthetic data with a conditional model result = generate( data="BRCASubtypeSel_train", model="CVAE1-20", apply_log=True, epoch=50, ) # Evaluate the generated data — groups are auto-resolved from result metrics_gen = evaluate_sample_sizes( data=result, sample_sizes=np.arange(25, 201, 25), which="generated", ) # Compare real vs generated learning curves metrics_real = evaluate_sample_sizes( data=data, sample_sizes=np.arange(25, 201, 25), which="original", ) fig = plot_sample_sizes( metric_real=metrics_real, n_target=200, metric_generated=metrics_gen, ) fig.savefig("real_vs_generated.png") Workflow -------- 1. **Generate synthetic data** using :func:`~syng_bts.generate` (or load existing data). 2. **Evaluate** with :func:`~syng_bts.evaluate_sample_sizes` on both real and generated datasets. 3. **Visualize** with :func:`~syng_bts.plot_sample_sizes` to compare learning curves side by side. Available Classifiers --------------------- The following classifiers are available via the ``methods`` parameter: .. list-table:: :header-rows: 1 :widths: 15 20 65 * - Name - Aliases - Description * - ``LOGIS`` - ``LOGISTIC``, ``LR`` - Ridge (L2-penalised) logistic regression via ``LogisticRegressionCV`` * - ``SVM`` - - Support Vector Machine with probability estimates * - ``KNN`` - - K-Nearest Neighbors (k=5) * - ``RF`` - ``RANDOM_FOREST`` - Random Forest (100 trees) * - ``XGB`` - ``XGBOOST`` - XGBoost gradient-boosted trees All classifiers are evaluated using 5-fold stratified cross-validation. Metrics ------- Each evaluation returns three metrics per classifier per sample size: - **F1 Score** (``f1_score``) — Macro-averaged F1 - **Accuracy** (``accuracy``) — Overall classification accuracy - **AUC** (``auc``) — Area under ROC curve (one-vs-one, macro-averaged for multiclass) Log Transform ------------- By default, :func:`~syng_bts.evaluate_sample_sizes` applies a ``log2(x + 1)`` transform (``apply_log=True``). Set ``apply_log=False`` when your input data is already log-transformed. The default behavior matches the preprocessing convention used in SyNG-BTS training. Verbosity --------- The ``verbose`` parameter of :func:`~syng_bts.evaluate_sample_sizes` controls console output during evaluation. It accepts the same levels used by the training functions (:func:`~syng_bts.generate`, :func:`~syng_bts.pilot_study`, :func:`~syng_bts.transfer`): .. list-table:: :header-rows: 1 :widths: 10 15 75 * - Level - Name - Behaviour * - ``0`` - ``"silent"`` - No output. * - ``1`` - ``"minimal"`` - One dynamically updated overall progress-bar line across all sample sizes, draws, and methods (default), while showing current size index/``n``, draw, and method. * - ``2`` - ``"detailed"`` - Per-draw / per-method metric lines (previous default behaviour). Example: .. code-block:: python # Detailed logging metrics = evaluate_sample_sizes(data, sample_sizes=[50, 100], groups=groups, verbose="detailed") Sample-Size Shortcuts --------------------- ``sample_sizes`` accepts a **list**, **numpy array**, **pandas Series**, or a **single integer**. When a single integer *k* is provided it is interpreted as the desired *number* of equidistant sizes — the maximum equals the number of rows in the input data. .. code-block:: python # Equivalent to sample_sizes=[5, 10, 15] for 15-row data metrics = evaluate_sample_sizes(data, sample_sizes=3, groups=groups) API Reference ------------- .. autofunction:: syng_bts.evaluate_sample_sizes :no-index: .. autofunction:: syng_bts.plot_sample_sizes :no-index: