Synthetic Data Generation

This page documents the core synthetic data generation functions in SyNG-BTS.

Overview

SyNG-BTS provides three core synthetic data generation functions. All accept data as a pandas DataFrame, a CSV file path, or the name of a bundled dataset, and return rich result objects (SyngResult or PilotResult). See Configuration Reference for all available parameters and model choices.

  • generate() — Train a model and produce synthetic samples

  • pilot_study() — Sweep over pilot sizes with replicated draws

  • transfer() — Pre-train on source data, fine-tune on target data

generate

Train a generative model on a dataset and generate synthetic samples. This is the primary entry point for single model training.

syng_bts.generate(data: DataFrame | str | Path, *, name: str | None = None, groups: Series | ndarray | None = None, new_size: int | list[int] = 500, model: str = 'VAE1-10', apply_log: bool = True, batch_frac: float = 0.1, learning_rate: float = 0.0005, epoch: int | None = None, val_ratio: float = 0.2, early_stop_patience: int | None = None, off_aug: str | None = None, AE_head_num: int = 2, Gaussian_head_num: int = 9, use_scheduler: bool = False, step_size: int = 10, gamma: float = 0.5, cap: bool = False, random_seed: int = 123, CVAE_wide_network: bool = False, output_dir: str | Path | None = None, verbose: int | str = 'minimal') -> SyngResult

Train a deep generative model and generate synthetic data.

This is the primary entry point for training a single model and generating synthetic samples. It replaces the legacy ApplyExperiment function.

Parameters:
  • data (DataFrame, str, or Path) – Input data — a pandas DataFrame, a path to a CSV file, or the name of a bundled dataset (e.g. "SKCMPositive_4").

  • name (str or None) – Short name for output filenames. Derived automatically when None.

  • groups (pd.Series, np.ndarray, or None) – Optional binary group labels. When provided, these labels take precedence over bundled dataset groups.

  • new_size (int or list[int]) –

    Generation size.

    • If int: generate exactly new_size samples.

      For grouped data, counts are split by the input group ratio and rounded to integers.

    • If list[int]: explicit grouped counts

      [n_group_0, n_group_1].

    For grouped data, group_0 is the base group used by create_labels() (first encountered group value) and group_1 is the other group.

  • model (str) – Model specification, e.g. "VAE1-10" (parsed into model type and kl_weight).

  • apply_log (bool) – Apply log2(x + 1) preprocessing.

  • batch_frac (float) – Batch size as a fraction of sample count.

  • learning_rate (float) – Optimizer learning rate.

  • epoch (int or None) –

    Fixed epoch count, or None for early stopping.

    The interaction between epoch and early_stop_patience:

    epoch   early_stop_patience   Behaviour
    None    None                  Early stopping ON, patience=30, max 1000 epochs
    None    30                    Early stopping ON, patience=30, max 1000 epochs
    500     None                  Early stopping OFF, run exactly 500 epochs
    500     30                    Early stopping ON, patience=30, max 500 epochs

  • val_ratio (float) – Validation split ratio (AE family only).

  • early_stop_patience (int or None) – Stop if loss does not improve for this many epochs. When None and epoch is also None, defaults to 30.

  • off_aug (str or None) – Offline augmentation: "AE_head", "Gaussian_head", or None.

  • AE_head_num (int) – Fold multiplier for AE-head augmentation.

  • Gaussian_head_num (int) – Fold multiplier for Gaussian-head augmentation.

  • use_scheduler (bool) – Enable learning-rate scheduler (AE family).

  • step_size (int) – Scheduler step size.

  • gamma (float) – Scheduler gamma.

  • cap (bool) – Cap generated values to observed range.

  • random_seed (int) – Random seed for reproducibility.

  • CVAE_wide_network (bool) – Use wider encoder/decoder for CVAE (512→256→128→64 instead of 256→128→64). Suitable for high-dimensional data like RNA. Ignored for non-CVAE models (default: False).

  • output_dir (str, Path, or None) – If set, automatically save results to this directory.

  • verbose (int or str) –

    Verbosity level for training output.

    • "silent" or 0 — no output during training.

    • "minimal" or 1 (default) — print only training summaries and early-stopping messages.

    • "detailed" or 2 — print per-epoch progress (epoch number, loss values, elapsed time, learning rate).

Returns:

Rich result object containing generated data, loss log, reconstructed data (AE/VAE/CVAE), model state, and metadata.

Return type:

SyngResult
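The interaction between epoch and early_stop_patience in the table above can be sketched as a small resolver. This is an illustrative reading of the table, not the library's internal code: the function name resolve_training_schedule and the two constants are hypothetical, and the handling of patience values other than 30 is an assumption extrapolated from the listed rows.

```python
from __future__ import annotations

DEFAULT_PATIENCE = 30       # patience used when both arguments are None (from the table)
DEFAULT_MAX_EPOCHS = 1000   # epoch ceiling when no fixed epoch count is given (from the table)


def resolve_training_schedule(
    epoch: int | None, early_stop_patience: int | None
) -> tuple[int, int | None]:
    """Return (max_epochs, patience); patience is None when early stopping is off.

    Hypothetical helper mirroring the documented table.
    """
    if epoch is None:
        # No fixed epoch count: early stopping is always on.
        patience = DEFAULT_PATIENCE if early_stop_patience is None else early_stop_patience
        return DEFAULT_MAX_EPOCHS, patience
    # Fixed epoch count: early stopping is on only if a patience was supplied.
    return epoch, early_stop_patience


for args in [(None, None), (None, 30), (500, None), (500, 30)]:
    print(args, "->", resolve_training_schedule(*args))
```

Each printed pair corresponds to one row of the table, with None in the second position meaning that training runs for exactly the requested number of epochs.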

Examples

from syng_bts import generate

# Generate synthetic data using a bundled dataset
result = generate(
    data="SKCMPositive_4",
    model="VAE1-10",
    new_size=500,
    batch_frac=0.1,
    learning_rate=0.0005,
)

# Access results
print(result.generated_data.shape)  # (500, n_features)
print(result.summary())

# Plot training loss (one figure per loss column)
figs = result.plot_loss()

# Save to disk
result.save("./my_output/")

import pandas as pd
from syng_bts import generate

# Use your own DataFrame
my_data = pd.read_csv("my_dataset.csv")
result = generate(
    data=my_data,
    name="my_dataset",
    model="WGANGP",
    new_size=1000,
    epoch=50,
)
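
The new_size rule for grouped data described under Parameters can be illustrated with a short, standalone sketch. The helper split_new_size is hypothetical and does not call the library. The documentation only says that counts are "rounded to integers", so giving the remainder to group_1 (so that the total stays equal to new_size) is an assumption.

```python
from collections import Counter


def split_new_size(new_size, groups):
    """Return [n_group_0, n_group_1] for binary group labels (hypothetical helper)."""
    # group_0 is the first encountered group value, group_1 is the other one.
    order = list(dict.fromkeys(groups))
    if len(order) != 2:
        raise ValueError("expected exactly two distinct group values")
    if isinstance(new_size, list):
        # Explicit per-group counts are used as given.
        return list(new_size)
    counts = Counter(groups)
    n_group_0 = round(new_size * counts[order[0]] / len(groups))
    # Assumption: group_1 receives the remainder so the total equals new_size.
    return [n_group_0, new_size - n_group_0]


labels = ["tumor"] * 30 + ["normal"] * 10
print(split_new_size(500, labels))         # [375, 125]
print(split_new_size([200, 300], labels))  # [200, 300]
```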

pilot_study

Run pilot studies to evaluate models across multiple pilot sizes. For each pilot size, n_draws random sub-samples (five by default) are drawn and a model is trained on each.

syng_bts.pilot_study(data: DataFrame | str | Path, pilot_size: list[int], *, name: str | None = None, groups: Series | ndarray | None = None, n_draws: int = 5, model: str = 'VAE1-10', apply_log: bool = True, batch_frac: float = 0.1, learning_rate: float = 0.0005, epoch: int | None = None, early_stop_patience: int | None = None, off_aug: str | None = None, AE_head_num: int = 2, Gaussian_head_num: int = 9, random_seed: int = 123, CVAE_wide_network: bool = False, output_dir: str | Path | None = None, verbose: int | str = 'minimal') -> PilotResult

Sweep over pilot sizes with replicated random draws.

For each pilot size, n_draws random sub-samples are drawn from the original data. A model is trained on each sub-sample, and the number of synthetic samples generated is n_draws times the sub-sample size.

This replaces the legacy PilotExperiment function.

Parameters:
  • data (DataFrame, str, or Path) – Input data.

  • pilot_size (list[int]) – List of pilot sizes to evaluate.

  • name (str or None) – Short name for output filenames.

  • groups (pd.Series, np.ndarray, or None) – Optional binary group labels. When provided, these labels take precedence over bundled dataset groups.

  • n_draws (int) – Number of replicated random draws per pilot size (default: 5). Must be a positive integer.

  • model (str) – Model specification (e.g. "VAE1-10").

  • apply_log (bool) – Apply log2(x + 1) preprocessing.

  • batch_frac (float) – Batch size as a fraction of sample count.

  • learning_rate (float) – Optimizer learning rate.

  • epoch (int or None) – Fixed epoch count or None for early stopping. See generate() for the full interaction table.

  • early_stop_patience (int or None) – Stop if loss does not improve for this many epochs. When None and epoch is also None, defaults to 30. See generate() for the full interaction table.

  • off_aug (str or None) – Offline augmentation mode.

  • AE_head_num (int) – Fold multiplier for AE-head augmentation.

  • Gaussian_head_num (int) – Fold multiplier for Gaussian-head augmentation.

  • random_seed (int) – Base random seed for reproducibility.

  • CVAE_wide_network (bool) – Use wider encoder/decoder for CVAE — see generate().

  • output_dir (str, Path, or None) – If set, automatically save results to this directory.

  • verbose (int or str) – Verbosity level — see generate() for details.

Returns:

Wrapper containing one SyngResult per (pilot_size, draw).

Return type:

PilotResult
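
The sweep structure (one run per pilot size and draw) can be sketched with NumPy alone. This is a minimal illustration, not the library's sampling code: draw_pilot_indices is a hypothetical helper, sampling without replacement and the 1-based draw index are assumptions, and group labels are ignored here.

```python
import numpy as np


def draw_pilot_indices(n_samples, pilot_sizes, n_draws=5, random_seed=123):
    """Map (pilot_size, draw) -> row indices of one random sub-sample."""
    rng = np.random.default_rng(random_seed)
    draws = {}
    for size in pilot_sizes:
        for draw in range(1, n_draws + 1):  # 1-based draw index (assumption)
            # Sample rows without replacement (assumption).
            draws[(size, draw)] = rng.choice(n_samples, size=size, replace=False)
    return draws


draws = draw_pilot_indices(n_samples=400, pilot_sizes=[50, 100, 200])
print(len(draws))             # 15 entries: 3 pilot sizes x 5 draws
print(draws[(100, 1)].shape)  # (100,)
```

A model would then be trained on the rows selected for each key, which is why the result holds one SyngResult per (pilot_size, draw) pair.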

Examples

from syng_bts import pilot_study

# Evaluate VAE across different pilot sizes
pilot = pilot_study(
    data="SKCMPositive_4",
    pilot_size=[50, 100, 200],
    model="VAE1-10",
    batch_frac=0.1,
    learning_rate=0.0005,
)

# Access individual run results
run = pilot.runs[(100, 1)]  # (pilot_size, draw_index)
print(run.generated_data.shape)

# All runs overlaid (one figure per loss column)
figs = pilot.plot_loss(style="overlay_runs")

# Mean ± std across runs
figs = pilot.plot_loss(style="mean_band")

# Save all results
pilot.save("./pilot_output/")

from syng_bts import pilot_study

# Using a different dataset with CVAE and a fixed epoch count
pilot = pilot_study(
    data="BRCASubtypeSel_train",
    pilot_size=[50, 100],
    model="CVAE1-20",
    epoch=100,
    output_dir="./results/",
)

transfer

Transfer learning: pre-train on a source dataset, then fine-tune and generate on a target dataset.

syng_bts.transfer(source_data: DataFrame | str | Path, target_data: DataFrame | str | Path, *, source_name: str | None = None, target_name: str | None = None, source_groups: Series | ndarray | None = None, target_groups: Series | ndarray | None = None, new_size: int | list[int] = 500, model: str = 'VAE1-10', apply_log: bool = True, batch_frac: float = 0.1, learning_rate: float = 0.0005, epoch: int | None = None, early_stop_patience: int | None = None, off_aug: str | None = None, AE_head_num: int = 2, Gaussian_head_num: int = 9, random_seed: int = 123, CVAE_wide_network: bool = False, output_dir: str | Path | None = None, verbose: int | str = 'minimal') -> SyngResult

Train on source data, then fine-tune and generate on target data.

The model is first trained on source_data and its learned state is kept in-memory, then fine-tuned on target_data. This is a single-run operation returning a SyngResult.

This replaces the legacy TransferExperiment function.

Parameters:
  • source_data (DataFrame, str, or Path) – Pre-training dataset.

  • target_data (DataFrame, str, or Path) – Fine-tuning / target dataset.

  • source_name (str or None) – Short name for the source dataset.

  • target_name (str or None) – Short name for the target dataset.

  • source_groups (pd.Series, np.ndarray, or None) – Optional binary groups for the source dataset.

  • target_groups (pd.Series, np.ndarray, or None) – Optional binary groups for the target dataset.

  • new_size (int or list[int]) –

    Generation size for the fine-tuned target model.

    • If int: generate exactly new_size samples.

      For grouped data, counts are split by the target input group ratio and rounded to integers.

    • If list[int]: explicit grouped counts

      [n_group_0, n_group_1].

    For grouped data, group_0 is the base group used by create_labels() (first encountered group value) and group_1 is the other group.

  • model (str) – Model specification.

  • apply_log (bool) – Apply log2(x + 1) preprocessing.

  • batch_frac (float) – Batch size as a fraction of sample count.

  • learning_rate (float) – Optimizer learning rate.

  • epoch (int or None) – Fixed epoch count, or None for early stopping. See generate() for the full interaction table.

  • early_stop_patience (int or None) – Stop if loss does not improve for this many epochs. When None and epoch is also None, defaults to 30. See generate() for the full interaction table.

  • off_aug (str or None) – Offline augmentation mode.

  • AE_head_num (int) – Fold multiplier for AE-head augmentation.

  • Gaussian_head_num (int) – Fold multiplier for Gaussian-head augmentation.

  • random_seed (int) – Random seed.

  • CVAE_wide_network (bool) – Use wider encoder/decoder for CVAE — see generate().

  • output_dir (str, Path, or None) – If set, save results here.

  • verbose (int or str) – Verbosity level — see generate() for details.

Returns:

Result from the fine-tuned target-phase model.

Return type:

SyngResult
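
The apply_log and batch_frac parameters can be illustrated numerically. The sketch below uses NumPy only and does not call the library. The log2(x + 1) transform is taken from the generate() parameter description, while the helper batch_size_from_frac, its rounding, and its floor of one sample are assumptions.

```python
import numpy as np

# Toy count matrix: 2 samples x 3 features.
counts = np.array([[0.0, 3.0, 7.0],
                   [1.0, 15.0, 1023.0]])

# apply_log=True corresponds to log2(x + 1): zeros stay zero, large counts are compressed.
logged = np.log2(counts + 1.0)
print(logged)


def batch_size_from_frac(n_samples, batch_frac=0.1):
    """Batch size as a fraction of the sample count (hypothetical helper)."""
    # Assumption: round to the nearest integer and never go below one sample.
    return max(1, round(n_samples * batch_frac))


print(batch_size_from_frac(470))  # 47
print(batch_size_from_frac(4))    # 1
```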

Examples

from syng_bts import transfer

# Transfer from PRAD to BRCA using MAF
result = transfer(
    source_data="PRAD",
    target_data="BRCA",
    new_size=500,
    model="maf",
    apply_log=True,
    epoch=10,
)

print(result.generated_data.shape)
result.save("./transfer_output/")

transfer() is a single-run operation and always returns a SyngResult. For pilot sweeps over target sample sizes, use pilot_study().

Choosing a Model

SyNG-BTS supports several generative models:

    Model      Description
    VAE1-10    VAE with 1:10 loss ratio
    VAE1-20    VAE with 1:20 loss ratio
    CVAE1-10   Conditional VAE (1:10)
    CVAE1-20   Conditional VAE (1:20)
    GAN        Standard GAN
    WGANGP     Wasserstein GAN-GP
    maf        Masked Autoregressive Flow

See Configuration Reference for detailed parameter descriptions.
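
The generate() reference notes that a model string such as "VAE1-10" is parsed into a model type and kl_weight. A minimal sketch of such a parser follows. The function parse_model_spec is hypothetical, and reading the number after the dash as kl_weight is an assumption based on the "1:10 loss ratio" wording, not a confirmed description of the library's parser.

```python
import re


def parse_model_spec(model):
    """Split a model string into (model_type, kl_weight); kl_weight may be None."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+)-(\d+)", model)
    if match is None:
        # Names without a ratio suffix, e.g. "GAN", "WGANGP", "maf".
        return model, None
    model_type, _left, right = match.groups()
    # Assumption: the number after the dash is the KL weight.
    return model_type, int(right)


for spec in ["VAE1-10", "CVAE1-20", "WGANGP", "maf"]:
    print(spec, "->", parse_model_spec(spec))
```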