Configuration Reference ======================= This page documents all configuration parameters available in SyNG-BTS. .. contents:: Table of Contents :local: :depth: 2 Available Models ---------------- SyNG-BTS supports several deep generative models for data augmentation: .. list-table:: Supported Models :header-rows: 1 :widths: 20 50 * - Model Code - Description * - ``VAE1-10`` - Variational Autoencoder with 1:10 reconstruction/KL loss ratio * - ``VAE1-20`` - VAE with 1:20 loss ratio * - ``CVAE1-10`` - Conditional VAE with 1:10 loss ratio * - ``CVAE1-20`` - Conditional VAE with 1:20 loss ratio * - ``GAN`` - Generative Adversarial Network * - ``WGANGP`` - Wasserstein GAN with Gradient Penalty * - ``maf`` - Masked Autoregressive Flow Common Parameters ----------------- These parameters are shared across all experiment functions (``generate``, ``pilot_study``, ``transfer``): Data Parameters ~~~~~~~~~~~~~~~ .. list-table:: :header-rows: 1 :widths: 20 15 65 * - Parameter - Type - Description * - ``data`` - DataFrame, str, or Path - Input data — a pandas DataFrame, a path to a CSV file, or the name of a bundled dataset (e.g. ``"SKCMPositive_4"``). * - ``name`` - str or None - Short name for output filenames. Derived automatically from *data* when ``None``. * - ``output_dir`` - str, Path, or None - If set, save results to this directory. When ``None`` (default), no files are written — data stays in memory. Training Parameters ~~~~~~~~~~~~~~~~~~~ .. list-table:: :header-rows: 1 :widths: 20 15 65 * - Parameter - Type - Description * - ``model`` - str - The generative model to use (e.g. ``"VAE1-10"``, ``"WGANGP"``, ``"maf"``) * - ``batch_frac`` - float - Batch size as a fraction of training data (default: 0.1) * - ``learning_rate`` - float - Learning rate for optimizer (default: 0.0005) * - ``epoch`` - int or None - Number of training epochs. If ``None``, uses early stopping. * - ``early_stop_patience`` - int or None - Stop if loss does not improve for this many epochs. ``None`` disables early stopping (requires *epoch* to be set). * - ``apply_log`` - bool - Apply ``log2(x + 1)`` preprocessing to data (default: ``True``). * - ``random_seed`` - int - Random seed for reproducibility (default: 123). Generation Parameters ~~~~~~~~~~~~~~~~~~~~~ .. list-table:: :header-rows: 1 :widths: 20 15 65 * - Parameter - Type - Description * - ``new_size`` - int or list[int] - Generation size (default: 500). - ``int``: exact total sample count. For grouped data, counts are split by the input group ratio and rounded. - ``list[int]``: explicit grouped counts ``[n_group_0, n_group_1]``. ``group_0`` is the first group value encountered in input data; ``group_1`` is the other group. * - ``pilot_size`` - list[int] - Sample sizes to evaluate (only for ``pilot_study()``). * - ``n_draws`` - int - Number of replicated random draws per pilot size (default: 5). Used in ``pilot_study()``. Augmentation Parameters ~~~~~~~~~~~~~~~~~~~~~~~ .. list-table:: :header-rows: 1 :widths: 20 15 65 * - Parameter - Type - Description * - ``off_aug`` - str or None - Offline augmentation mode: ``"AE_head"``, ``"Gaussian_head"``, or ``None`` (default: ``None``). * - ``AE_head_num`` - int - Fold multiplier for AE-head augmentation (default: 2). * - ``Gaussian_head_num`` - int - Fold multiplier for Gaussian-head augmentation (default: 9). Advanced Parameters (``generate`` only) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. list-table:: :header-rows: 1 :widths: 20 15 65 * - Parameter - Type - Description * - ``val_ratio`` - float - Validation split ratio for AE family (default: 0.2). * - ``use_scheduler`` - bool - Enable learning-rate scheduler for AE family (default: ``False``). * - ``step_size`` - int - Scheduler step size (default: 10). * - ``gamma`` - float - Scheduler gamma (default: 0.5). * - ``cap`` - bool - Cap generated values to observed range (default: ``False``). Model Architecture Parameters ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. list-table:: :header-rows: 1 :widths: 20 15 65 * - Parameter - Type - Description * - ``CVAE_wide_network`` - bool - Use a wider encoder/decoder for the CVAE model (default: ``False``). When ``True``, the encoder uses layers 512 → 256 → 128 → 64 instead of the standard 256 → 128 → 64. The decoder is symmetric. Suitable for high-dimensional data such as RNA expression. Ignored for non-CVAE models. ``generate()`` Parameters ------------------------- .. code-block:: python from syng_bts import generate result = generate( data="SKCMPositive_4", # Data input (required) name=None, # Output name (auto-derived) new_size=500, # Samples to generate model="VAE1-10", # Model specification apply_log=True, # Log-transform data batch_frac=0.1, # Batch fraction learning_rate=0.0005, # Learning rate epoch=None, # Epochs (None=early stopping) early_stop_patience=None, # Early stopping patience off_aug=None, # Offline augmentation AE_head_num=2, # AE-head folds Gaussian_head_num=9, # Gaussian-head folds use_scheduler=False, # LR scheduler cap=False, # Cap generated values random_seed=123, # Random seed output_dir=None, # Output directory ) ``pilot_study()`` Parameters ----------------------------- .. code-block:: python from syng_bts import pilot_study result = pilot_study( data="SKCMPositive_4", # Data input (required) pilot_size=[50, 100], # Pilot sizes (required) name=None, # Output name (auto-derived) n_draws=5, # Draws per pilot size model="VAE1-10", # Model specification batch_frac=0.1, # Batch fraction learning_rate=0.0005, # Learning rate epoch=None, # Epochs (None=early stopping) early_stop_patience=30, # Early stopping patience off_aug=None, # Offline augmentation AE_head_num=2, # AE-head folds Gaussian_head_num=9, # Gaussian-head folds random_seed=123, # Random seed output_dir=None, # Output directory ) ``transfer()`` Parameters -------------------------- .. code-block:: python from syng_bts import transfer result = transfer( source_data="PRAD", # Source dataset (required) target_data="BRCA", # Target dataset (required) source_name=None, # Source name (auto-derived) target_name=None, # Target name (auto-derived) new_size=500, # Target generation size model="maf", # Model specification apply_log=True, # Log-transform data batch_frac=0.1, # Batch fraction learning_rate=0.0005, # Learning rate epoch=None, # Epochs (None=early stopping) early_stop_patience=30, # Early stopping patience off_aug=None, # Offline augmentation random_seed=123, # Random seed output_dir=None, # Output directory ) Output and Saving ----------------- In v3.0, **no files are written by default**. Results stay in memory as ``SyngResult`` or ``PilotResult`` objects. To persist results to disk, either: 1. Pass ``output_dir`` to the experiment function, or 2. Call ``result.save(output_dir)`` on the returned object. .. code-block:: python result = generate(data="SKCMPositive_4", model="VAE1-10", epoch=5) # Option 1: Save later paths = result.save("./my_output/") print(paths) # {'generated': PosixPath('./my_output/SKCMPositive_4_VAE1-10_generated.csv'), # 'loss': PosixPath('./my_output/SKCMPositive_4_VAE1-10_loss.csv'), ...} # Option 2: Save automatically result = generate( data="SKCMPositive_4", model="VAE1-10", epoch=5, output_dir="./auto_output/", ) Bundled Datasets ---------------- SyNG-BTS includes several bundled datasets for testing and examples: .. code-block:: python from syng_bts import list_bundled_datasets, resolve_data # List all available datasets print(list_bundled_datasets()) # ['SKCMPositive_4', 'BRCA', 'PRAD', 'BRCASubtypeSel', ...] # Load a bundled dataset as a DataFrame data, groups = resolve_data("SKCMPositive_4") print(f"Shape: {data.shape}") Available bundled datasets (see :doc:`datasets` for details): - **Examples**: ``SKCMPositive_4`` - **Transfer Learning**: ``BRCA``, ``PRAD`` - **BRCA Subtype**: ``BRCASubtypeSel``, ``BRCASubtypeSel_train``, ``BRCASubtypeSel_test`` - **LIHC Subtype**: ``LIHCSubtypeFamInd``, ``LIHCSubtypeFamInd_DESeq``, and more