Synthetic Data Generation
This page documents the core synthetic data generation functions in SyNG-BTS.
Overview
SyNG-BTS provides three core synthetic data generation functions. All accept
data as a pandas DataFrame, a CSV file path, or the name of a bundled dataset,
and return rich result objects (SyngResult or PilotResult).
See Configuration Reference for all available parameters and model choices.
- generate(): Train a model and produce synthetic samples
- pilot_study(): Sweep over pilot sizes with replicated draws
- transfer(): Pre-train on source data, fine-tune on target data
generate
Train a generative model on a dataset and generate synthetic samples. This is the primary entry point for single model training.
- syng_bts.generate(data: DataFrame | str | Path, *, name: str | None = None, groups: Series | ndarray | None = None, new_size: int | list[int] = 500, model: str = 'VAE1-10', apply_log: bool = True, batch_frac: float = 0.1, learning_rate: float = 0.0005, epoch: int | None = None, val_ratio: float = 0.2, early_stop_patience: int | None = None, off_aug: str | None = None, AE_head_num: int = 2, Gaussian_head_num: int = 9, use_scheduler: bool = False, step_size: int = 10, gamma: float = 0.5, cap: bool = False, random_seed: int = 123, CVAE_wide_network: bool = False, output_dir: str | Path | None = None, verbose: int | str = 'minimal') -> SyngResult
Train a deep generative model and generate synthetic data.
This is the primary entry point for training a single model and generating synthetic samples. It replaces the legacy ApplyExperiment function.
- Parameters:
data (DataFrame, str, or Path) – Input data: a pandas DataFrame, a path to a CSV file, or the name of a bundled dataset (e.g. "SKCMPositive_4").
name (str or None) – Short name for output filenames. Derived automatically when None.
groups (pd.Series, np.ndarray, or None) – Optional binary group labels. When provided, these labels take precedence over bundled dataset groups.
new_size (int or list[int]) – Generation size.
- If int: generate exactly new_size samples. For grouped data, counts are split by the input group ratio and rounded to integers.
- If list[int]: explicit grouped counts [n_group_0, n_group_1].
For grouped data, group_0 is the base group used by create_labels() (first encountered group value) and group_1 is the other group.
model (str) – Model specification, e.g. "VAE1-10" (parsed into model type and kl_weight).
apply_log (bool) – Apply log2(x + 1) preprocessing.
batch_frac (float) – Batch size as a fraction of sample count.
learning_rate (float) – Optimizer learning rate.
epoch (int or None) – Fixed epoch count, or None for early stopping.
The interaction between epoch and early_stop_patience:
| epoch | early_stop_patience | Behaviour |
|---|---|---|
| None | None | Early stopping ON, patience=30, max 1000 epochs |
| None | 30 | Early stopping ON, patience=30, max 1000 epochs |
| 500 | None | Early stopping OFF, run exactly 500 epochs |
| 500 | 30 | Early stopping ON, patience=30, max 500 epochs |
val_ratio (float) – Validation split ratio (AE family only).
early_stop_patience (int or None) – Stop if loss does not improve for this many epochs. When None and epoch is also None, defaults to 30.
off_aug (str or None) – Offline augmentation: "AE_head", "Gaussian_head", or None.
AE_head_num (int) – Fold multiplier for AE-head augmentation.
Gaussian_head_num (int) – Fold multiplier for Gaussian-head augmentation.
use_scheduler (bool) – Enable learning-rate scheduler (AE family).
step_size (int) – Scheduler step size.
gamma (float) – Scheduler gamma.
cap (bool) – Cap generated values to observed range.
random_seed (int) – Random seed for reproducibility.
CVAE_wide_network (bool) – Use wider encoder/decoder for CVAE (512→256→128→64 instead of 256→128→64). Suitable for high-dimensional data like RNA. Ignored for non-CVAE models (default: False).
output_dir (str, Path, or None) – If set, automatically save results to this directory.
verbose (int or str) – Verbosity level for training output.
- "silent" or 0: no output during training.
- "minimal" or 1 (default): print only training summaries and early-stopping messages.
- "detailed" or 2: print per-epoch progress (epoch number, loss values, elapsed time, learning rate).
- Returns:
Rich result object containing generated data, loss log, reconstructed data (AE/VAE/CVAE), model state, and metadata.
- Return type:
SyngResult
Examples
from syng_bts import generate

# Generate synthetic data using a bundled dataset
result = generate(
    data="SKCMPositive_4",
    model="VAE1-10",
    new_size=500,
    batch_frac=0.1,
    learning_rate=0.0005,
)

# Access results
print(result.generated_data.shape)  # (500, n_features)
print(result.summary())

# Plot training loss (one figure per loss column)
figs = result.plot_loss()

# Save to disk
result.save("./my_output/")
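
The new_size rules for grouped data can be illustrated with a small standalone sketch. It does not call SyNG-BTS: split_new_size is a hypothetical helper, and the rounding rule (round the base group's share, give the remainder to the other group so the total stays exactly new_size) is an assumption rather than the library's documented behaviour.

```python
def split_new_size(new_size, groups):
    """Return (n_group_0, n_group_1) for a binary group vector.

    Hypothetical helper for illustration only; not part of SyNG-BTS.
    """
    # A list is taken as explicit counts [n_group_0, n_group_1].
    if isinstance(new_size, list):
        n_group_0, n_group_1 = new_size
        return n_group_0, n_group_1

    # group_0 is the first encountered group value, as described above.
    base_group = groups[0]
    n_base = sum(1 for g in groups if g == base_group)

    # Assumption: round the base group's share and assign the remainder
    # to the other group, so the counts always sum to new_size.
    n_group_0 = round(new_size * n_base / len(groups))
    return n_group_0, new_size - n_group_0


groups = ["tumor"] * 30 + ["normal"] * 10   # 75% / 25% split
print(split_new_size(500, groups))          # (375, 125)
print(split_new_size([100, 50], groups))    # (100, 50)
```
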
import pandas as pd
from syng_bts import generate

# Use your own DataFrame
my_data = pd.read_csv("my_dataset.csv")
result = generate(
    data=my_data,
    name="my_dataset",
    model="WGANGP",
    new_size=1000,
    epoch=50,
)
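
The interaction between epoch and early_stop_patience listed in the parameter table can be restated as a small decision function. The sketch below is not SyNG-BTS code: Schedule, resolve_schedule, and stopping_epoch are hypothetical names, and the defaults (patience 30, at most 1000 epochs) are taken from the table. The stopping rule shown, which stops once the loss has failed to improve for patience consecutive epochs, is one plausible reading of "does not improve for this many epochs".

```python
from typing import NamedTuple, Optional


class Schedule(NamedTuple):
    early_stopping: bool
    patience: Optional[int]
    max_epochs: int


def resolve_schedule(epoch=None, early_stop_patience=None):
    """Mirror the four rows of the epoch / early_stop_patience table."""
    if epoch is None:
        # No fixed epoch count: early stopping is on, capped at 1000 epochs.
        patience = 30 if early_stop_patience is None else early_stop_patience
        return Schedule(True, patience, 1000)
    if early_stop_patience is None:
        # Fixed epoch count and no patience: run exactly `epoch` epochs.
        return Schedule(False, None, epoch)
    # Both given: early stopping with `epoch` as the upper bound.
    return Schedule(True, early_stop_patience, epoch)


def stopping_epoch(losses, schedule):
    """Return the 1-based epoch at which training would stop (assumed rule)."""
    best = float("inf")
    stale = 0  # consecutive epochs without improvement
    for epoch_idx, loss in enumerate(losses[: schedule.max_epochs], start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
        if schedule.early_stopping and stale >= schedule.patience:
            return epoch_idx
    return min(len(losses), schedule.max_epochs)


print(resolve_schedule())         # early stopping on, patience 30, max 1000
print(resolve_schedule(500))      # early stopping off, exactly 500 epochs
print(resolve_schedule(500, 30))  # early stopping on, patience 30, max 500
```
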
pilot_study
Run pilot studies to evaluate models across multiple pilot sizes. For each pilot size, n_draws random sub-samples (five by default) are drawn and a model is trained on each.
- syng_bts.pilot_study(data: DataFrame | str | Path, pilot_size: list[int], *, name: str | None = None, groups: Series | ndarray | None = None, n_draws: int = 5, model: str = 'VAE1-10', apply_log: bool = True, batch_frac: float = 0.1, learning_rate: float = 0.0005, epoch: int | None = None, early_stop_patience: int | None = None, off_aug: str | None = None, AE_head_num: int = 2, Gaussian_head_num: int = 9, random_seed: int = 123, CVAE_wide_network: bool = False, output_dir: str | Path | None = None, verbose: int | str = 'minimal') -> PilotResult
Sweep over pilot sizes with replicated random draws.
For each pilot size, n_draws random sub-samples are drawn from the original data. A model is trained on each sub-sample, and the number of synthetic samples generated is n_draws times the sub-sample size.
This replaces the legacy PilotExperiment function.
- Parameters:
data (DataFrame, str, or Path) – Input data.
name (str or None) – Short name for output filenames.
groups (pd.Series, np.ndarray, or None) – Optional binary group labels. When provided, these labels take precedence over bundled dataset groups.
n_draws (int) – Number of replicated random draws per pilot size (default: 5). Must be a positive integer.
model (str) – Model specification (e.g. "VAE1-10").
apply_log (bool) – Apply log2(x + 1) preprocessing.
batch_frac (float) – Batch size as a fraction of sample count.
learning_rate (float) – Optimizer learning rate.
epoch (int or None) – Fixed epoch count or None for early stopping. See generate() for the full interaction table.
early_stop_patience (int or None) – Stop if loss does not improve for this many epochs. When None and epoch is also None, defaults to 30. See generate() for the full interaction table.
off_aug (str or None) – Offline augmentation mode.
AE_head_num (int) – Fold multiplier for AE-head augmentation.
Gaussian_head_num (int) – Fold multiplier for Gaussian-head augmentation.
random_seed (int) – Base random seed for reproducibility.
CVAE_wide_network (bool) – Use wider encoder/decoder for CVAE; see generate().
output_dir (str, Path, or None) – If set, automatically save results to this directory.
verbose (int or str) – Verbosity level; see generate() for details.
- Returns:
Wrapper containing one SyngResult per (pilot_size, draw).
- Return type:
PilotResult
Examples
from syng_bts import pilot_study

# Evaluate VAE across different pilot sizes
pilot = pilot_study(
    data="SKCMPositive_4",
    pilot_size=[50, 100, 200],
    model="VAE1-10",
    batch_frac=0.1,
    learning_rate=0.0005,
)

# Access individual run results
run = pilot.runs[(100, 1)]  # (pilot_size, draw_index)
print(run.generated_data.shape)

# All runs overlaid (one figure per loss column)
figs = pilot.plot_loss(style="overlay_runs")

# Mean ± std across runs
figs = pilot.plot_loss(style="mean_band")

# Save all results
pilot.save("./pilot_output/")
from syng_bts import pilot_study

# Using another dataset with CVAE
pilot = pilot_study(
    data="BRCASubtypeSel_train",
    pilot_size=[50, 100],
    model="CVAE1-20",
    epoch=100,
    output_dir="./results/",
)
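
To make the (pilot_size, draw) bookkeeping concrete, the following standalone sketch enumerates the runs a sweep would contain and the sub-sample drawn for each. It does not use SyNG-BTS: plan_pilot_draws is a hypothetical helper, the 1-based draw index is inferred from the pilot.runs[(100, 1)] example, and seeding a single generator from random_seed is an assumption about how reproducibility could be achieved.

```python
import random


def plan_pilot_draws(n_samples, pilot_size, n_draws=5, random_seed=123):
    """Map each (pilot_size, draw) key to its sampled rows and output size.

    Hypothetical helper for illustration only; not part of SyNG-BTS.
    """
    if n_draws < 1:
        raise ValueError("n_draws must be a positive integer")

    rng = random.Random(random_seed)  # assumed seeding scheme
    plan = {}
    for size in pilot_size:
        if size > n_samples:
            raise ValueError(f"pilot size {size} exceeds {n_samples} samples")
        for draw in range(1, n_draws + 1):
            # Sample row indices without replacement for this draw.
            rows = sorted(rng.sample(range(n_samples), size))
            plan[(size, draw)] = {
                "rows": rows,
                # Synthetic sample count: n_draws times the sub-sample size.
                "n_synthetic": n_draws * size,
            }
    return plan


plan = plan_pilot_draws(n_samples=400, pilot_size=[50, 100, 200])
print(len(plan))                      # 15 runs: 3 pilot sizes x 5 draws
print(plan[(100, 1)]["n_synthetic"])  # 500
```
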
transfer
Transfer learning: pre-train on a source dataset, then fine-tune and generate on a target dataset.
- syng_bts.transfer(source_data: DataFrame | str | Path, target_data: DataFrame | str | Path, *, source_name: str | None = None, target_name: str | None = None, source_groups: Series | ndarray | None = None, target_groups: Series | ndarray | None = None, new_size: int | list[int] = 500, model: str = 'VAE1-10', apply_log: bool = True, batch_frac: float = 0.1, learning_rate: float = 0.0005, epoch: int | None = None, early_stop_patience: int | None = None, off_aug: str | None = None, AE_head_num: int = 2, Gaussian_head_num: int = 9, random_seed: int = 123, CVAE_wide_network: bool = False, output_dir: str | Path | None = None, verbose: int | str = 'minimal') -> SyngResult
Train on source data, then fine-tune and generate on target data.
The model is first trained on source_data and its learned state is kept in memory, then fine-tuned on target_data. This is a single-run operation returning a SyngResult.
This replaces the legacy TransferExperiment function.
- Parameters:
source_data (DataFrame, str, or Path) – Pre-training dataset.
target_data (DataFrame, str, or Path) – Fine-tuning / target dataset.
source_name (str or None) – Short name for the source dataset.
target_name (str or None) – Short name for the target dataset.
source_groups (pd.Series, np.ndarray, or None) – Optional binary groups for the source dataset.
target_groups (pd.Series, np.ndarray, or None) – Optional binary groups for the target dataset.
new_size (int or list[int]) – Generation size for the fine-tuned target model.
- If int: generate exactly new_size samples. For grouped data, counts are split by the target input group ratio and rounded to integers.
- If list[int]: explicit grouped counts [n_group_0, n_group_1].
For grouped data, group_0 is the base group used by create_labels() (first encountered group value) and group_1 is the other group.
model (str) – Model specification.
apply_log (bool) – Apply log2 preprocessing.
batch_frac (float) – Batch fraction.
learning_rate (float) – Learning rate.
epoch (int or None) – Fixed epoch count, or None for early stopping. See generate() for the full interaction table.
early_stop_patience (int or None) – Stop if loss does not improve for this many epochs. When None and epoch is also None, defaults to 30. See generate() for the full interaction table.
off_aug (str or None) – Offline augmentation mode.
AE_head_num (int) – Fold multiplier for AE-head augmentation.
Gaussian_head_num (int) – Fold multiplier for Gaussian-head augmentation.
random_seed (int) – Random seed.
CVAE_wide_network (bool) – Use wider encoder/decoder for CVAE; see generate().
output_dir (str, Path, or None) – If set, save results here.
verbose (int or str) – Verbosity level; see generate() for details.
- Returns:
Result from the fine-tuned target-phase model.
- Return type:
SyngResult
Examples
from syng_bts import transfer

# Transfer from PRAD to BRCA using MAF
result = transfer(
    source_data="PRAD",
    target_data="BRCA",
    new_size=500,
    model="maf",
    apply_log=True,
    epoch=10,
)

print(result.generated_data.shape)
result.save("./transfer_output/")
transfer() is a single-run operation and always returns a SyngResult. For pilot sweeps over target sample sizes, use pilot_study().
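
The idea behind pre-training and then fine-tuning can be shown with a deliberately tiny toy model. The sketch below is unrelated to the SyNG-BTS implementation: train_mean is a hypothetical one-parameter model fitted by gradient descent, and the data values are made up for illustration. It only demonstrates that starting fine-tuning from a state learned on related source data can get closer to the target optimum in the same number of steps than starting from scratch.

```python
def train_mean(data, start=0.0, learning_rate=0.1, epochs=50):
    """Fit a single parameter to `data` by gradient descent on squared error."""
    theta = start
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to theta.
        grad = sum(2 * (theta - x) for x in data) / len(data)
        theta -= learning_rate * grad
    return theta


source = [4.8, 5.1, 5.0, 4.9, 5.2]   # larger, related dataset (toy values)
target = [5.9, 6.1]                  # small target dataset (toy values)

# Phase 1: pre-train on the source data; the learned state stays in memory.
pretrained = train_mean(source, epochs=200)

# Phase 2: fine-tune on the target data, starting from the pre-trained state.
fine_tuned = train_mean(target, start=pretrained, epochs=5)

# Baseline: the same short training budget without pre-training.
from_scratch = train_mean(target, start=0.0, epochs=5)

target_mean = sum(target) / len(target)
print(abs(fine_tuned - target_mean) < abs(from_scratch - target_mean))  # True
```
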
Choosing a Model
SyNG-BTS supports several generative models:
| Model | Description |
|---|---|
| VAE1-10 | VAE with 1:10 loss ratio |
| VAE1-20 | VAE with 1:20 loss ratio |
| CVAE1-10 | Conditional VAE (1:10) |
| CVAE1-20 | Conditional VAE (1:20) |
| GAN | Standard GAN |
| WGANGP | Wasserstein GAN-GP |
| maf | Masked Autoregressive Flow |
See Configuration Reference for detailed parameter descriptions.
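
The model parameter is described as being parsed into a model type and a kl_weight. As a rough sketch of what such parsing could look like, the hypothetical parse_model_spec below splits a string such as "VAE1-10" into a name and a weight. It is not the library's parser, and reading the "1-10" suffix as a ratio whose right-hand side over left-hand side gives kl_weight is an assumption.

```python
import re


def parse_model_spec(spec):
    """Split a model string into (model_type, kl_weight).

    Hypothetical helper for illustration only; not part of SyNG-BTS.
    kl_weight is None when the string carries no ratio suffix.
    """
    match = re.fullmatch(r"([A-Za-z]+)(?:(\d+)-(\d+))?", spec)
    if match is None:
        raise ValueError(f"unrecognised model specification: {spec!r}")

    model_type, left, right = match.groups()
    # Assumption: "1-10" denotes a 1:10 loss ratio, i.e. kl_weight = 10 / 1.
    kl_weight = None if right is None else int(right) / int(left)
    return model_type, kl_weight


print(parse_model_spec("VAE1-10"))   # ('VAE', 10.0)
print(parse_model_spec("CVAE1-20"))  # ('CVAE', 20.0)
print(parse_model_spec("WGANGP"))    # ('WGANGP', None)
```
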