Synthetic Data Generation

This page documents the core synthetic data generation functions in SyNG-BTS.

Overview 

SyNG-BTS provides three core synthetic data generation functions. All accept data as a pandas DataFrame, a CSV file path, or the name of a bundled dataset, and return rich result objects (SyngResult or PilotResult). See Configuration Reference for all available parameters and model choices.

generate() — Train a model and produce synthetic samples
pilot_study() — Sweep over pilot sizes with replicated draws
transfer() — Pre-train on source data, fine-tune on target data

generate 

Train a generative model on a dataset and generate synthetic samples. This is the primary entry point for single-model training.

syng_bts.generate(data: DataFrame | str | Path, *, name: str | None = None, groups: Series | ndarray | None = None, new_size: int | list[int] = 500, model: str = 'VAE1-10', apply_log: bool = True, batch_frac: float = 0.1, learning_rate: float = 0.0005, epoch: int | None = None, val_ratio: float = 0.2, early_stop_patience: int | None = None, off_aug: str | None = None, AE_head_num: int = 2, Gaussian_head_num: int = 9, use_scheduler: bool = False, step_size: int = 10, gamma: float = 0.5, cap: bool = False, random_seed: int = 123, CVAE_wide_network: bool = False, output_dir: str | Path | None = None, verbose: int | str = 'minimal') → SyngResult

Train a deep generative model and generate synthetic data.

This is the primary entry point for training a single model and generating synthetic samples. It replaces the legacy ApplyExperiment function.

Parameters:

data (DataFrame, str, or Path) – Input data — a pandas DataFrame, a path to a CSV file, or the name of a bundled dataset (e.g. "SKCMPositive_4").
name (str or None) – Short name for output filenames. Derived automatically when None.
groups (pd.Series, np.ndarray, or None) – Optional binary group labels. When provided, these labels take precedence over bundled dataset groups.
new_size (int or list[int]) –
Generation size.
- If int: generate exactly new_size samples.
  For grouped data, counts are split by the input group ratio and rounded to integers.
- If list[int]: explicit grouped counts
  [n_group_0, n_group_1].
For grouped data, group_0 is the base group used by create_labels() (first encountered group value) and group_1 is the other group.
model (str) – Model specification, e.g. "VAE1-10" (parsed into model type and kl_weight).
apply_log (bool) – Apply log2(x + 1) preprocessing.
batch_frac (float) – Batch size as a fraction of sample count.
learning_rate (float) – Optimizer learning rate.

epoch (int or None) –

Fixed epoch count, or None for early stopping.

The interaction between epoch and early_stop_patience:

`epoch`	`early_stop_patience`	Behaviour
`None`	`None`	Early stopping ON, patience=30, max 1 000 epochs
`None`	`30`	Early stopping ON, patience=30, max 1 000 epochs
`500`	`None`	Early stopping OFF, run exactly 500 epochs
`500`	`30`	Early stopping ON, patience=30, max 500 epochs

val_ratio (float) – Validation split ratio (AE family only).
early_stop_patience (int or None) – Stop if loss does not improve for this many epochs. When None and epoch is also None, defaults to 30.
off_aug (str or None) – Offline augmentation: "AE_head", "Gaussian_head", or None.
AE_head_num (int) – Fold multiplier for AE-head augmentation.
Gaussian_head_num (int) – Fold multiplier for Gaussian-head augmentation.
use_scheduler (bool) – Enable learning-rate scheduler (AE family).
step_size (int) – Scheduler step size.
gamma (float) – Scheduler gamma.
cap (bool) – Cap generated values to observed range.
random_seed (int) – Random seed for reproducibility.
CVAE_wide_network (bool) – Use wider encoder/decoder for CVAE (512→256→128→64 instead of 256→128→64). Suitable for high-dimensional data like RNA. Ignored for non-CVAE models (default: False).
output_dir (str, Path, or None) – If set, automatically save results to this directory.
verbose (int or str) –
Verbosity level for training output.
- "silent" or 0 — no output during training.
- "minimal" or 1 (default) — print only training summaries and early-stopping messages.
- "detailed" or 2 — print per-epoch progress (epoch number, loss values, elapsed time, learning rate).
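
To make the new_size rules concrete, the following sketch resolves an int or a list[int] into per-group counts. It is an illustration only: split_new_size is a hypothetical helper, not part of SyNG-BTS, and giving group_1 the remainder after rounding group_0 is an assumption about how the rounding is resolved.

```python
def split_new_size(new_size, group_counts):
    """Resolve new_size into [n_group_0, n_group_1].

    group_counts holds the observed input counts [count_group_0, count_group_1].
    Assumption: group_0 gets the rounded share and group_1 gets the remainder,
    so the two counts always sum to new_size. The library's rule may differ.
    """
    if isinstance(new_size, (list, tuple)):
        if len(new_size) != 2:
            raise ValueError("explicit counts must be [n_group_0, n_group_1]")
        return [int(new_size[0]), int(new_size[1])]
    total = sum(group_counts)
    n_group_0 = round(new_size * group_counts[0] / total)
    return [n_group_0, new_size - n_group_0]


# An int is split by the input group ratio (30:70 here).
print(split_new_size(500, [30, 70]))         # [150, 350]
# A list is taken as explicit grouped counts.
print(split_new_size([200, 300], [30, 70]))  # [200, 300]
```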

Returns:

Rich result object containing generated data, loss log, reconstructed data (AE/VAE/CVAE), model state, and metadata.

Return type:

SyngResult
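
The four rows of the epoch / early_stop_patience table in the parameter list can be written as a small decision function. This is a sketch that mirrors the table, not the library's internal code; resolve_training_schedule and the two constants are hypothetical names introduced for illustration.

```python
DEFAULT_PATIENCE = 30      # patience used when both arguments are None (from the table)
DEFAULT_MAX_EPOCHS = 1000  # epoch ceiling used when epoch is None (from the table)


def resolve_training_schedule(epoch=None, early_stop_patience=None):
    """Return (early_stopping_on, patience, max_epochs) for one table row."""
    if epoch is None:
        # Rows 1 and 2: early stopping is on, capped at the default ceiling.
        if early_stop_patience is None:
            patience = DEFAULT_PATIENCE
        else:
            patience = early_stop_patience
        return True, patience, DEFAULT_MAX_EPOCHS
    if early_stop_patience is None:
        # Row 3: a fixed epoch count with no patience disables early stopping.
        return False, None, epoch
    # Row 4: early stopping is on and epoch acts as the ceiling.
    return True, early_stop_patience, epoch


print(resolve_training_schedule())         # (True, 30, 1000)
print(resolve_training_schedule(500))      # (False, None, 500)
print(resolve_training_schedule(500, 30))  # (True, 30, 500)
```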

Examples 

from syng_bts import generate

# Generate synthetic data using a bundled dataset
result = generate(
    data="SKCMPositive_4",
    model="VAE1-10",
    new_size=500,
    batch_frac=0.1,
    learning_rate=0.0005,
)

# Access results
print(result.generated_data.shape)  # (500, n_features)
print(result.summary())

# Plot training loss (one figure per loss column)
figs = result.plot_loss()

# Save to disk
result.save("./my_output/")

import pandas as pd
from syng_bts import generate

# Use your own DataFrame
my_data = pd.read_csv("my_dataset.csv")
result = generate(
    data=my_data,
    name="my_dataset",
    model="WGANGP",
    new_size=1000,
    epoch=50,
)
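
The apply_log and cap parameters described above can be pictured with plain Python. The sketch below shows the log2(x + 1) transform and a clip to the observed range for a single feature. Both helper names are hypothetical, and whether the library caps per feature or on the transformed scale is not stated on this page, so treat that part as an assumption.

```python
import math


def log2_plus_one(values):
    """Apply the log2(x + 1) preprocessing to a list of non-negative values."""
    return [math.log2(v + 1) for v in values]


def cap_to_observed_range(generated, observed):
    """Clip generated values to [min(observed), max(observed)].

    Assumption: capping is done per feature; this handles one feature only.
    """
    low, high = min(observed), max(observed)
    return [min(max(g, low), high) for g in generated]


observed = [0, 1, 3, 7]
print(log2_plus_one(observed))                            # [0.0, 1.0, 2.0, 3.0]
print(cap_to_observed_range([-0.5, 2.0, 9.0], observed))  # [0, 2.0, 7]
```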

pilot_study 

Run pilot studies to evaluate models across multiple pilot sizes. For each pilot size, n_draws random sub-samples (five by default) are drawn and a model is trained on each.

syng_bts.pilot_study(data: DataFrame | str | Path, pilot_size: list[int], *, name: str | None = None, groups: Series | ndarray | None = None, n_draws: int = 5, model: str = 'VAE1-10', apply_log: bool = True, batch_frac: float = 0.1, learning_rate: float = 0.0005, epoch: int | None = None, early_stop_patience: int | None = None, off_aug: str | None = None, AE_head_num: int = 2, Gaussian_head_num: int = 9, random_seed: int = 123, CVAE_wide_network: bool = False, output_dir: str | Path | None = None, verbose: int | str = 'minimal') → PilotResult

Sweep over pilot sizes with replicated random draws.

For each pilot size, n_draws random sub-samples are drawn from the original data. A model is trained on each sub-sample and then used to generate n_draws times as many synthetic samples as the sub-sample contains.

This replaces the legacy PilotExperiment function.

Parameters:

data (DataFrame, str, or Path) – Input data.
pilot_size (list[int]) – List of pilot sizes to evaluate.
name (str or None) – Short name for output filenames.
groups (pd.Series, np.ndarray, or None) – Optional binary group labels. When provided, these labels take precedence over bundled dataset groups.
n_draws (int) – Number of replicated random draws per pilot size (default: 5). Must be a positive integer.
model (str) – Model specification (e.g. "VAE1-10").
apply_log (bool) – Apply log2(x + 1) preprocessing.
batch_frac (float) – Batch size as a fraction of sample count.
learning_rate (float) – Optimizer learning rate.
epoch (int or None) – Fixed epoch count or None for early stopping. See generate() for the full interaction table.
early_stop_patience (int or None) – Stop if loss does not improve for this many epochs. When None and epoch is also None, defaults to 30. See generate() for the full interaction table.
off_aug (str or None) – Offline augmentation mode.
AE_head_num (int) – Fold multiplier for AE-head augmentation.
Gaussian_head_num (int) – Fold multiplier for Gaussian-head augmentation.
random_seed (int) – Base random seed for reproducibility.
CVAE_wide_network (bool) – Use wider encoder/decoder for CVAE — see generate().
output_dir (str, Path, or None) – If set, automatically save results to this directory.
verbose (int or str) – Verbosity level — see generate() for details.

Returns:

Wrapper containing one SyngResult per (pilot_size, draw).

Return type:

PilotResult
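
The sweep described above can be laid out as a plan keyed by (pilot_size, draw_index). The sketch below only enumerates the runs and their sizes; it trains nothing. plan_pilot_runs is a hypothetical helper, and the 1-based draw index, sampling without replacement, and the use of a single seeded generator are all assumptions rather than documented behaviour.

```python
import random


def plan_pilot_runs(n_samples, pilot_size, n_draws=5, random_seed=123):
    """Map (pilot_size, draw_index) to (row indices, synthetic sample count)."""
    if n_draws < 1:
        raise ValueError("n_draws must be a positive integer")
    rng = random.Random(random_seed)
    plan = {}
    for size in pilot_size:
        for draw in range(1, n_draws + 1):  # 1-based draw index (assumption)
            # Draw a random sub-sample of row indices without replacement.
            rows = sorted(rng.sample(range(n_samples), size))
            # Each run generates n_draws times the sub-sample size.
            plan[(size, draw)] = (rows, n_draws * size)
    return plan


plan = plan_pilot_runs(n_samples=400, pilot_size=[50, 100, 200])
print(len(plan))                 # 15 runs: 3 pilot sizes x 5 draws
rows, n_synthetic = plan[(100, 1)]
print(len(rows), n_synthetic)    # 100 500
```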

Examples 

from syng_bts import pilot_study

# Evaluate VAE across different pilot sizes
pilot = pilot_study(
    data="SKCMPositive_4",
    pilot_size=[50, 100, 200],
    model="VAE1-10",
    batch_frac=0.1,
    learning_rate=0.0005,
)

# Access individual run results
run = pilot.runs[(100, 1)]  # (pilot_size, draw_index)
print(run.generated_data.shape)

# All runs overlaid (one figure per loss column)
figs = pilot.plot_loss(style="overlay_runs")

# Mean ± std across runs
figs = pilot.plot_loss(style="mean_band")

# Save all results
pilot.save("./pilot_output/")

from syng_bts import pilot_study

# Use another dataset, referenced by name, with a CVAE
pilot = pilot_study(
    data="BRCASubtypeSel_train",
    pilot_size=[50, 100],
    model="CVAE1-20",
    epoch=100,
    output_dir="./results/",
)
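
For readers who want the numbers behind a mean ± std view of several runs, the sketch below computes a per-epoch band from a list of loss curves. It is not how plot_loss is implemented: mean_band here is a hypothetical standalone function, truncating to the shortest run is an assumption (runs may stop at different epochs under early stopping), and the population standard deviation is an arbitrary choice.

```python
from statistics import mean, pstdev


def mean_band(loss_curves):
    """Return one (mean - std, mean, mean + std) tuple per epoch."""
    # Assumption: align runs by truncating to the shortest curve.
    n_epochs = min(len(curve) for curve in loss_curves)
    band = []
    for epoch in range(n_epochs):
        values = [curve[epoch] for curve in loss_curves]
        centre, spread = mean(values), pstdev(values)
        band.append((centre - spread, centre, centre + spread))
    return band


# Two toy loss curves of different lengths.
curves = [[4.0, 2.0, 1.0], [6.0, 4.0, 3.0, 2.5]]
for lower, centre, upper in mean_band(curves):
    print(f"{lower:.2f} {centre:.2f} {upper:.2f}")
```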

transfer 

Transfer learning: pre-train a model on a source dataset, then fine-tune it and generate synthetic samples on a target dataset.

The model is first trained on source_data and its learned state is kept in-memory, then fine-tuned on target_data. This is a single-run operation returning a SyngResult.

This replaces the legacy TransferExperiment function.

Parameters:

source_data (DataFrame, str, or Path) – Pre-training dataset.
target_data (DataFrame, str, or Path) – Fine-tuning / target dataset.
source_name (str or None) – Short name for the source dataset.
target_name (str or None) – Short name for the target dataset.
source_groups (pd.Series, np.ndarray, or None) – Optional binary groups for the source dataset.
target_groups (pd.Series, np.ndarray, or None) – Optional binary groups for the target dataset.
new_size (int or list[int]) –
Generation size for the fine-tuned target model.
- If int: generate exactly new_size samples.
  For grouped data, counts are split by the target input group ratio and rounded to integers.
- If list[int]: explicit grouped counts
  [n_group_0, n_group_1].
For grouped data, group_0 is the base group used by create_labels() (first encountered group value) and group_1 is the other group.
model (str) – Model specification.
apply_log (bool) – Apply log2(x + 1) preprocessing.
batch_frac (float) – Batch size as a fraction of sample count.
learning_rate (float) – Optimizer learning rate.
epoch (int or None) – Fixed epoch count, or None for early stopping. See generate() for the full interaction table.
early_stop_patience (int or None) – Stop if loss does not improve for this many epochs. When None and epoch is also None, defaults to 30. See generate() for the full interaction table.
off_aug (str or None) – Offline augmentation mode.
AE_head_num (int) – Fold multiplier for AE-head augmentation.
Gaussian_head_num (int) – Fold multiplier for Gaussian-head augmentation.
random_seed (int) – Random seed.
CVAE_wide_network (bool) – Use wider encoder/decoder for CVAE — see generate().
output_dir (str, Path, or None) – If set, save results here.
verbose (int or str) – Verbosity level — see generate() for details.

Returns:

Result from the fine-tuned target-phase model.

Return type:

SyngResult
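
The two-phase idea (train on the source, keep the learned state in memory, continue training on the target) can be shown with a deliberately tiny model. The sketch below fits a single parameter by gradient descent. It is a toy under stated assumptions and says nothing about how SyNG-BTS trains its networks; train and its state dictionary are hypothetical.

```python
def train(data, state=None, epochs=50, learning_rate=0.1):
    """Fit a one-parameter model (a mean) by gradient descent on squared error."""
    # Warm start from an earlier in-memory state when one is supplied.
    mu = 0.0 if state is None else state["mu"]
    for _ in range(epochs):
        grad = sum(2 * (mu - x) for x in data) / len(data)
        mu -= learning_rate * grad
    return {"mu": mu}


source = [9.0, 10.0, 11.0]
target = [11.5, 12.0, 12.5]

source_state = train(source)                              # phase 1: pre-train
fine_tuned = train(target, state=source_state, epochs=5)  # phase 2: fine-tune
from_scratch = train(target, epochs=5)                    # baseline without transfer

# With the same short budget, the warm-started model ends closer to the target mean.
print(abs(fine_tuned["mu"] - 12.0) < abs(from_scratch["mu"] - 12.0))  # True
```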

Examples 

from syng_bts import transfer

# Transfer from PRAD to BRCA using MAF
result = transfer(
    source_data="PRAD",
    target_data="BRCA",
    new_size=500,
    model="maf",
    apply_log=True,
    epoch=10,
)

print(result.generated_data.shape)
result.save("./transfer_output/")

transfer() is a single-run operation and always returns a SyngResult. For pilot sweeps over target sample sizes, use pilot_study().

Choosing a Model 

SyNG-BTS supports several generative models:

Model	Description
`VAE1-10`	VAE with 1:10 loss ratio
`VAE1-20`	VAE with 1:20 loss ratio
`CVAE1-10`	Conditional VAE (1:10)
`CVAE1-20`	Conditional VAE (1:20)
`GAN`	Standard GAN
`WGANGP`	Wasserstein GAN-GP
`maf`	Masked Autoregressive Flow
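
The names in the table follow a pattern: a model type, optionally followed by a ratio such as 1-10. The sketch below splits a specification into those parts. parse_model_spec is a hypothetical helper, and reading the ratio as kl_weight = right / left (so "VAE1-10" gives 10.0) is an assumption; the page only states that the string is parsed into a model type and kl_weight.

```python
import re


def parse_model_spec(spec):
    """Split a spec such as "VAE1-10" into (model_type, kl_weight)."""
    match = re.fullmatch(r"([A-Za-z]+)(?:(\d+)-(\d+))?", spec)
    if match is None:
        raise ValueError(f"unrecognised model specification: {spec!r}")
    model_type, left, right = match.groups()
    if left is None:
        # Names without a ratio suffix (GAN, WGANGP, maf) carry no kl_weight.
        return model_type, None
    # Assumption: a 1:10 ratio corresponds to a KL weight of 10.
    return model_type, int(right) / int(left)


print(parse_model_spec("VAE1-10"))   # ('VAE', 10.0)
print(parse_model_spec("CVAE1-20"))  # ('CVAE', 20.0)
print(parse_model_spec("WGANGP"))    # ('WGANGP', None)
```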

See Configuration Reference for detailed parameter descriptions.