Configuration Reference

This page documents all configuration parameters available in SyNG-BTS.

Available Models 

SyNG-BTS supports several deep generative models for data augmentation:

Supported Models
Model Code	Description
`VAE1-10`	Variational Autoencoder with 1:10 reconstruction/KL loss ratio
`VAE1-20`	VAE with 1:20 loss ratio
`CVAE1-10`	Conditional VAE with 1:10 loss ratio
`CVAE1-20`	Conditional VAE with 1:20 loss ratio
`GAN`	Generative Adversarial Network
`WGANGP`	Wasserstein GAN with Gradient Penalty
`maf`	Masked Autoregressive Flow
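
The `VAE` and `CVAE` codes pack two things into one string: the model family and the loss ratio. As an illustration only, the sketch below splits a code into those parts and combines two loss terms with the corresponding weights. `parse_model_code` and `weighted_loss` are hypothetical helpers, not SyNG-BTS functions, and reading `1-10` as weight 1 on the reconstruction term and weight 10 on the KL term is an assumption drawn from the table above, not a confirmed detail of the library.

```python
import re


# Hypothetical helper: not part of the SyNG-BTS API.
def parse_model_code(code):
    """Split a model code into (family, recon_weight, kl_weight).

    Codes without a ratio suffix (e.g. "GAN", "maf") return None weights.
    """
    match = re.fullmatch(r"(C?VAE)(\d+)-(\d+)", code)
    if match is None:
        return code, None, None
    family, recon_w, kl_w = match.groups()
    return family, int(recon_w), int(kl_w)


# Hypothetical helper: assumed form of the weighted objective.
def weighted_loss(recon_loss, kl_loss, recon_w, kl_w):
    """Weighted sum of a reconstruction term and a KL term."""
    return recon_w * recon_loss + kl_w * kl_loss


family, recon_w, kl_w = parse_model_code("CVAE1-20")
print(family, recon_w, kl_w)
print(weighted_loss(2.0, 0.5, recon_w, kl_w))
```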

Common Parameters 

These parameters are shared across all experiment functions (`generate()`, `pilot_study()`, `transfer()`):

Data Parameters 

Parameter	Type	Description
`data`	DataFrame, str, or Path	Input data — a pandas DataFrame, a path to a CSV file, or the name of a bundled dataset (e.g. `"SKCMPositive_4"`).
`name`	str or None	Short name for output filenames. Derived automatically from `data` when `None`.
`output_dir`	str, Path, or None	If set, save results to this directory. When `None` (default), no files are written — data stays in memory.

Training Parameters 

Parameter	Type	Description
`model`	str	The generative model to use (e.g. `"VAE1-10"`, `"WGANGP"`, `"maf"`)
`batch_frac`	float	Batch size as a fraction of training data (default: 0.1)
`learning_rate`	float	Learning rate for optimizer (default: 0.0005)
`epoch`	int or None	Number of training epochs. If `None`, uses early stopping.
`early_stop_patience`	int or None	Stop if the loss does not improve for this many epochs. `None` disables early stopping (requires `epoch` to be set).
`apply_log`	bool	Apply `log2(x + 1)` preprocessing to data (default: `True`).
`random_seed`	int	Random seed for reproducibility (default: 123).
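
To make three of these parameters concrete, here is a minimal sketch in plain Python. The function names are hypothetical and none of this is SyNG-BTS source code. The rounding rule for `batch_frac` and the exact early-stopping criterion are assumptions. Only the `log2(x + 1)` formula is taken directly from the table.

```python
import math


# Hypothetical helper: assumed rounding (nearest integer, minimum of 1).
def batch_size_from_frac(n_samples, batch_frac=0.1):
    """Turn a batch fraction into an integer batch size."""
    return max(1, round(n_samples * batch_frac))


def log2_plus_one(values):
    """Apply the log2(x + 1) preprocessing described for apply_log."""
    return [math.log2(v + 1) for v in values]


# Hypothetical helper: one common way to implement a patience rule.
def should_stop(losses, patience):
    """Return True once the best loss is at least `patience` epochs old.

    `losses` holds one loss value per completed epoch. A patience of
    None means early stopping is disabled.
    """
    if patience is None or not losses:
        return False
    best_epoch = losses.index(min(losses))  # first epoch reaching the minimum
    epochs_since_best = len(losses) - 1 - best_epoch
    return epochs_since_best >= patience


print(batch_size_from_frac(200, 0.1))
print(log2_plus_one([0, 1, 3]))
print(should_stop([3.0, 2.0, 2.5, 2.6], patience=2))
```

Because `losses.index` returns the first epoch that reached the minimum, a later epoch that merely ties the best loss does not reset the counter in this sketch.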

Generation Parameters 

Parameter	Type	Description
`new_size`	int or list[int]	Generation size (default: 500). An int is the exact total sample count; for grouped data, counts are split by the input group ratio and rounded. A list[int] gives explicit grouped counts `[n_group_0, n_group_1]`, where `group_0` is the first group value encountered in the input data and `group_1` is the other group.
`pilot_size`	list[int]	Sample sizes to evaluate (only for `pilot_study()`).
`n_draws`	int	Number of replicated random draws per pilot size (default: 5). Used in `pilot_study()`.
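
The two forms of `new_size` can be illustrated with a small sketch. `split_new_size` is a hypothetical helper, not a SyNG-BTS function. It assumes that the count for `group_0` is rounded and that `group_1` receives the remainder so the total stays exact. The library's actual rounding rule may differ.

```python
# Hypothetical helper: not part of the SyNG-BTS API.
def split_new_size(new_size, groups):
    """Return [n_group_0, n_group_1] for a two-group label sequence.

    group_0 is the first label encountered in `groups`; group_1 is the other.
    """
    if not isinstance(new_size, int):
        counts = list(new_size)
        if len(counts) != 2:
            raise ValueError("expected [n_group_0, n_group_1]")
        return counts  # explicit per-group counts are used as given

    group_0 = groups[0]
    n_group_0_in = sum(1 for g in groups if g == group_0)
    # Python's round() rounds exact halves to the nearest even integer.
    n_group_0 = round(new_size * n_group_0_in / len(groups))
    return [n_group_0, new_size - n_group_0]


labels = ["tumor", "normal", "normal", "tumor", "normal"]
print(split_new_size(500, labels))
print(split_new_size([120, 80], labels))
```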

Augmentation Parameters 

Parameter	Type	Description
`off_aug`	str or None	Offline augmentation mode: `"AE_head"`, `"Gaussian_head"`, or `None` (default: `None`).
`AE_head_num`	int	Fold multiplier for AE-head augmentation (default: 2).
`Gaussian_head_num`	int	Fold multiplier for Gaussian-head augmentation (default: 9).

Advanced Parameters (`generate` only)

Parameter	Type	Description
`val_ratio`	float	Validation split ratio for AE family (default: 0.2).
`use_scheduler`	bool	Enable learning-rate scheduler for AE family (default: `False`).
`step_size`	int	Scheduler step size (default: 10).
`gamma`	float	Scheduler gamma (default: 0.5).
`cap`	bool	Cap generated values to observed range (default: `False`).
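
The names `step_size` and `gamma` match the usual step-decay schedule, in which the learning rate is multiplied by `gamma` every `step_size` epochs. The sketch below assumes that schedule and also shows one way to cap generated values to the observed range for a single feature. Both functions are hypothetical illustrations; whether SyNG-BTS caps per feature or over the whole matrix is not stated here.

```python
# Hypothetical helper: assumes a step-decay schedule.
def step_lr(base_lr, epoch, step_size=10, gamma=0.5):
    """Learning rate in effect at a given (0-indexed) epoch."""
    return base_lr * gamma ** (epoch // step_size)


# Hypothetical helper: caps one feature column to its observed range.
def cap_to_observed(generated, observed):
    """Clip generated values to [min(observed), max(observed)]."""
    low, high = min(observed), max(observed)
    return [min(max(value, low), high) for value in generated]


for epoch in (0, 10, 25):
    print(epoch, step_lr(0.0005, epoch))

print(cap_to_observed([-1.0, 2.5, 9.0], observed=[0.0, 5.0, 3.0]))
```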

Model Architecture Parameters 

Parameter	Type	Description
`CVAE_wide_network`	bool	Use a wider encoder/decoder for the CVAE model (default: `False`). When `True`, the encoder uses layers 512 → 256 → 128 → 64 instead of the standard 256 → 128 → 64. The decoder is symmetric. Suitable for high-dimensional data such as RNA expression. Ignored for non-CVAE models.
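
As a quick illustration of the two layouts, the sketch below lists the hidden-layer widths for each setting. `cvae_hidden_widths` is a hypothetical helper, and it assumes the decoder mirrors the encoder in both the standard and the wide configuration.

```python
# Hypothetical helper: not part of the SyNG-BTS API.
def cvae_hidden_widths(wide=False):
    """Return (encoder_widths, decoder_widths) for the CVAE."""
    encoder = [512, 256, 128, 64] if wide else [256, 128, 64]
    decoder = encoder[::-1]  # assumed mirror image of the encoder
    return encoder, decoder


print(cvae_hidden_widths(wide=False))
print(cvae_hidden_widths(wide=True))
```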

`generate()` Parameters 

from syng_bts import generate

result = generate(
    data="SKCMPositive_4",       # Data input (required)
    name=None,                   # Output name (auto-derived)
    new_size=500,                # Samples to generate
    model="VAE1-10",             # Model specification
    apply_log=True,              # Log-transform data
    batch_frac=0.1,              # Batch fraction
    learning_rate=0.0005,        # Learning rate
    epoch=None,                  # Epochs (None=early stopping)
    early_stop_patience=None,    # Early stopping patience
    off_aug=None,                # Offline augmentation
    AE_head_num=2,               # AE-head folds
    Gaussian_head_num=9,         # Gaussian-head folds
    use_scheduler=False,         # LR scheduler
    cap=False,                   # Cap generated values
    random_seed=123,             # Random seed
    output_dir=None,             # Output directory
)

`pilot_study()` Parameters 

from syng_bts import pilot_study

result = pilot_study(
    data="SKCMPositive_4",       # Data input (required)
    pilot_size=[50, 100],        # Pilot sizes (required)
    name=None,                   # Output name (auto-derived)
    n_draws=5,                   # Draws per pilot size
    model="VAE1-10",             # Model specification
    batch_frac=0.1,              # Batch fraction
    learning_rate=0.0005,        # Learning rate
    epoch=None,                  # Epochs (None=early stopping)
    early_stop_patience=30,      # Early stopping patience
    off_aug=None,                # Offline augmentation
    AE_head_num=2,               # AE-head folds
    Gaussian_head_num=9,         # Gaussian-head folds
    random_seed=123,             # Random seed
    output_dir=None,             # Output directory
)

`transfer()` Parameters 

from syng_bts import transfer

result = transfer(
    source_data="PRAD",          # Source dataset (required)
    target_data="BRCA",          # Target dataset (required)
    source_name=None,            # Source name (auto-derived)
    target_name=None,            # Target name (auto-derived)
    new_size=500,                # Target generation size
    model="maf",                 # Model specification
    apply_log=True,              # Log-transform data
    batch_frac=0.1,              # Batch fraction
    learning_rate=0.0005,        # Learning rate
    epoch=None,                  # Epochs (None=early stopping)
    early_stop_patience=30,      # Early stopping patience
    off_aug=None,                # Offline augmentation
    random_seed=123,             # Random seed
    output_dir=None,             # Output directory
)

Output and Saving 

In v3.0, no files are written by default. Results stay in memory as `SyngResult` or `PilotResult` objects. To persist results to disk, either:

Pass `output_dir` to the experiment function, or
Call `result.save(output_dir)` on the returned object.

from syng_bts import generate

result = generate(data="SKCMPositive_4", model="VAE1-10", epoch=5)

# Save an in-memory result later with result.save()
paths = result.save("./my_output/")
print(paths)
# {'generated': PosixPath('my_output/SKCMPositive_4_VAE1-10_generated.csv'),
#  'loss': PosixPath('my_output/SKCMPositive_4_VAE1-10_loss.csv'), ...}

# Or save automatically by passing output_dir
result = generate(
    data="SKCMPositive_4", model="VAE1-10", epoch=5,
    output_dir="./auto_output/",
)
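
The filenames in the printed example appear to follow a `{name}_{model}_{kind}.csv` pattern. The sketch below reproduces that pattern with `pathlib` so the naming is easy to predict. `build_output_paths` is a hypothetical helper, and it covers only the two keys visible in the example; the remaining keys are elided in the output above and are not guessed here.

```python
from pathlib import Path


# Hypothetical helper: naming pattern inferred from the example output.
def build_output_paths(output_dir, name, model, kinds=("generated", "loss")):
    """Map each result kind to the CSV path it would be saved under."""
    out = Path(output_dir)
    return {kind: out / f"{name}_{model}_{kind}.csv" for kind in kinds}


paths = build_output_paths("./my_output/", "SKCMPositive_4", "VAE1-10")
for kind, path in paths.items():
    print(kind, path)
```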

Bundled Datasets 

SyNG-BTS includes several bundled datasets for testing and examples:

from syng_bts import list_bundled_datasets, resolve_data

# List all available datasets
print(list_bundled_datasets())
# ['SKCMPositive_4', 'BRCA', 'PRAD', 'BRCASubtypeSel', ...]

# Load a bundled dataset as a DataFrame
data, groups = resolve_data("SKCMPositive_4")
print(f"Shape: {data.shape}")

Available bundled datasets (see Example Datasets for details):

Examples: SKCMPositive_4
Transfer Learning: BRCA, PRAD
BRCA Subtype: BRCASubtypeSel, BRCASubtypeSel_train, BRCASubtypeSel_test
LIHC Subtype: LIHCSubtypeFamInd, LIHCSubtypeFamInd_DESeq, and more