Usage Guide

This guide covers installation and basic usage of SyNG-BTS.

Installation

Requirements: Python 3.10 or later.

From Source

For development or the latest features:

$ git clone https://github.com/Omics-Data-Synthesis/SyNG-BTS
$ cd SyNG-BTS
$ pip install -e .

Optional Dependencies

Install documentation dependencies (the quotes keep shells such as zsh from treating the square brackets as a glob pattern):

$ pip install "syng-bts[docs]"

Install development dependencies (testing, linting):

$ pip install "syng-bts[dev]"

Install all optional dependencies:

$ pip install "syng-bts[all]"

Quick Start

Basic Import

After installation, import SyNG-BTS in your Python code:

from syng_bts import (
    generate,
    pilot_study,
    transfer,
    list_bundled_datasets,
    resolve_data,
    SyngResult,
    PilotResult,
)

Browse Bundled Datasets

Use list_bundled_datasets() to list the bundled datasets and resolve_data() to load one:

from syng_bts import list_bundled_datasets, resolve_data

# See available datasets
print(list_bundled_datasets())
# ['SKCMPositive_4', 'BRCA', 'PRAD', 'BRCASubtypeSel', ...]

# Load a bundled dataset (returns a tuple of DataFrame and optional groups)
data, groups = resolve_data("SKCMPositive_4")
print(f"Dataset shape: {data.shape}")

Load TCGA Cohorts

To work with all 24 TCGA miRNA cohorts (real data plus CVAE-synthesized counterparts, downloaded on demand), use list_tcga_datasets() and load_tcga_dataset():

from syng_bts import list_tcga_datasets, load_tcga_dataset

# Browse available cohorts
print(list_tcga_datasets(short=True))
# ['BLCA', 'BRCA', 'COAD', ..., 'UCS']

# Load a cohort (downloads on first call, then cached locally)
ds = load_tcga_dataset("BRCA")

# Real expression data — DESeq-normalized by default
real_df, real_groups = ds.real()
print(f"Real shape: {real_df.shape}")

# CVAE-synthesized counterpart
synth_df, synth_groups = ds.synth()
print(f"Synthetic shape: {synth_df.shape}")

See TCGA Datasets for the full guide (catalog, normalizations, caching, custom mirrors) and TCGA Quick Start for a runnable end-to-end example.

Generate Synthetic Data

Use generate() to train a generative model and produce synthetic samples (see generate in Synthetic Data Generation):

from syng_bts import generate

result = generate(
    data="SKCMPositive_4",  # bundled dataset name, CSV path, or DataFrame
    model="VAE1-10",
    new_size=500,
    batch_frac=0.1,
    learning_rate=0.0005,
)

# Access results in memory
print(result.generated_data.shape)   # (500, n_features)
print(result.loss.columns.tolist())  # ['kl', 'recons']
print(result.summary())

# Plot training loss (one figure per loss column)
figs = result.plot_loss()
figs["kl"].savefig("kl_loss.png")

# Optionally save to disk
result.save("./my_output/")

# Load a previously saved result
from syng_bts import SyngResult
loaded = SyngResult.load("./my_output/")

For grouped datasets, new_size supports two forms:

  • int: exact total generated sample count. Group counts follow the input group ratio (rounded).

  • list[int]: explicit per-group counts [n_group_0, n_group_1].

Here, group_0 is the first group value encountered in the input group labels, and group_1 is the other group.
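
The int form can be pictured with a small standalone sketch. It does not call SyNG-BTS: split_by_group_ratio is a hypothetical helper, and the rounding rule (round group_0's share, give group_1 the remainder so the total stays exact) is an assumption, since the guide only says the counts are rounded.

```python
from collections import Counter


def split_by_group_ratio(labels, new_size):
    """Split an integer new_size into [n_group_0, n_group_1].

    Hypothetical helper, not part of SyNG-BTS. group_0 is the first label
    encountered; its share is rounded, and group_1 receives the remainder
    so the two counts always sum to new_size (an assumed rounding rule).
    """
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("this sketch assumes exactly two groups")
    group_0 = labels[0]  # first group value encountered
    n_group_0 = round(new_size * counts[group_0] / len(labels))
    return [n_group_0, new_size - n_group_0]


labels = ["tumor"] * 30 + ["normal"] * 10  # made-up labels with a 3:1 ratio
print(split_by_group_ratio(labels, 500))   # [375, 125]
```

Passing the list form, for example new_size=[375, 125], would request those counts explicitly instead of deriving them from the input ratio.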

Run a Pilot Study

Use pilot_study() to sweep over multiple pilot sizes with replicated random draws (see pilot_study in Synthetic Data Generation):

from syng_bts import pilot_study

pilot = pilot_study(
    data="SKCMPositive_4",
    pilot_size=[50, 100],
    model="VAE1-10",
    batch_frac=0.1,
    learning_rate=0.0005,
)

# Access individual runs
run = pilot.runs[(50, 1)]  # (pilot_size, draw_index)
print(run.generated_data.head())

# Save all runs
pilot.save("./pilot_output/")
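
To make "replicated random draws" concrete, the following standalone sketch builds a dictionary keyed the same way as pilot.runs. It is only an illustration: draw_pilot_indices, n_draws, seed, and sampling without replacement are all assumptions of this sketch, and the 1-based draw index is inferred from the (50, 1) key shown above rather than from documented behavior.

```python
import random


def draw_pilot_indices(n_samples, pilot_sizes, n_draws, seed=0):
    """Return {(pilot_size, draw_index): sorted row indices}.

    Hypothetical helper, not part of SyNG-BTS. Each draw samples
    pilot_size distinct row indices from range(n_samples).
    """
    rng = random.Random(seed)  # seeded so the draws are reproducible
    draws = {}
    for size in pilot_sizes:
        # draw_index assumed to start at 1, matching the (50, 1) key above
        for draw in range(1, n_draws + 1):
            draws[(size, draw)] = sorted(rng.sample(range(n_samples), size))
    return draws


draws = draw_pilot_indices(n_samples=200, pilot_sizes=[50, 100], n_draws=3)
print(sorted(draws))        # one key per (pilot_size, draw_index) pair
print(len(draws[(50, 1)]))  # 50
```

Each key then corresponds to one independent subsample on which a model would be trained, which is why a pilot study with several sizes and draws can take noticeably longer than a single generate() call.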

Use DataFrame Input

Pass your own data as a pandas DataFrame:

import pandas as pd
from syng_bts import generate

my_data = pd.read_csv("my_dataset.csv")
result = generate(
    data=my_data,
    name="my_dataset",  # used in output filenames
    model="WGANGP",
    new_size=1000,
    epoch=50,
)

Evaluate Results

Visualize generated data using plot_heatmap() (on the result object) or the standalone heatmap_eval() and UMAP_eval() functions (see Evaluation Functions):

from syng_bts import generate, heatmap_eval, UMAP_eval, resolve_data

result = generate(data="SKCMPositive_4", model="VAE1-10", epoch=5)

# Built-in heatmap on the result object
fig = result.plot_heatmap()

# Standalone evaluation comparing real and generated data
real_data, _groups = resolve_data("SKCMPositive_4")
heatmap_eval(real_data=real_data, generated_data=result.generated_data)
UMAP_eval(real_data=real_data, generated_data=result.generated_data)

Sample-Size Evaluation (SyntheSize)

Evaluate how classifier performance scales with sample size using evaluate_sample_sizes() and plot_sample_sizes() (see Sample-Size Evaluation (SyntheSize) for full details).

By default, evaluate_sample_sizes() applies log2(x + 1) (apply_log=True). Set apply_log=False when the input data is already log-transformed. The example below keeps the default:

from syng_bts import evaluate_sample_sizes, plot_sample_sizes, resolve_data

# Load real data with group labels
data, groups = resolve_data("BRCASubtypeSel_test")

# Evaluate classifiers at different sample sizes
metrics = evaluate_sample_sizes(
    data=data,
    sample_sizes=[50, 100, 150],
    groups=groups,
    n_draws=5,
)

# Plot inverse power-law learning curves
fig = plot_sample_sizes(metrics, n_target=200)
fig.savefig("learning_curves.png")
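
For intuition about what an inverse power-law learning curve is, here is a minimal standalone sketch. It assumes the simplest form, error(n) = a * n^(-b), fitted by least squares in log-log space; the parameterization and fitting procedure used by plot_sample_sizes() are not specified in this guide and may differ. fit_power_law and predict_error are hypothetical names, and the data points are made up.

```python
import math


def fit_power_law(sizes, errors):
    """Fit error = a * n**(-b) via linear least squares on log-log values."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    a = math.exp(y_mean - slope * x_mean)  # intercept back on the original scale
    return a, -slope                       # b is the negated log-log slope


def predict_error(a, b, n):
    """Extrapolate the fitted curve to a sample size n."""
    return a * n ** (-b)


sizes = [50, 100, 150]
errors = [2.0 * n ** -0.5 for n in sizes]  # made-up points on an exact power law
a, b = fit_power_law(sizes, errors)
print(round(a, 3), round(b, 3))            # 2.0 0.5
print(round(predict_error(a, b, 200), 4))  # 0.1414
```

Extrapolating to n = 200 here plays the same role as n_target=200 in the call above: the curve fitted on observed sample sizes is evaluated at a larger, unobserved one.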

You can also pass a SyngResult directly — groups are auto-resolved from the result object:

from syng_bts import generate, evaluate_sample_sizes

result = generate(data="BRCASubtypeSel_train", model="CVAE1-20", epoch=10)
metrics = evaluate_sample_sizes(result, sample_sizes=[50, 100], which="generated")
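
The apply_log default mentioned earlier in this section matters because log2(x + 1) should be applied only once. The standalone sketch below uses made-up count values and a hypothetical log2p1 helper to show the transform and what a second application does to already-transformed values; it does not call SyNG-BTS.

```python
import math


def log2p1(x):
    """log2(x + 1), the transform the guide describes for apply_log=True."""
    return math.log2(x + 1)


counts = [0, 3, 7, 1023]             # made-up raw counts
once = [log2p1(c) for c in counts]   # intended single transform
twice = [log2p1(v) for v in once]    # accidental second transform

print(once)   # [0.0, 2.0, 3.0, 10.0]
print(twice)  # values compressed further; the original scale is distorted
```

If your data already looks like the once list, passing apply_log=False avoids ending up with values like those in twice.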

Next Steps