API Reference

This page documents the complete public API of SyNG-BTS.

Experiment Functions

These are the main entry points for training generative models and producing synthetic data. All three functions accept data as a pandas DataFrame, a CSV file path, or a bundled dataset name, and return rich result objects.

generate

syng_bts.generate(data: DataFrame | str | Path, *, name: str | None = None, groups: Series | ndarray | None = None, new_size: int | list[int] = 500, model: str = 'VAE1-10', apply_log: bool = True, batch_frac: float = 0.1, learning_rate: float = 0.0005, epoch: int | None = None, val_ratio: float = 0.2, early_stop_patience: int | None = None, off_aug: str | None = None, AE_head_num: int = 2, Gaussian_head_num: int = 9, use_scheduler: bool = False, step_size: int = 10, gamma: float = 0.5, cap: bool = False, random_seed: int = 123, CVAE_wide_network: bool = False, output_dir: str | Path | None = None, verbose: int | str = 'minimal') → SyngResult

Train a deep generative model and generate synthetic data.

This is the primary entry point for training a single model and generating synthetic samples. It replaces the legacy ApplyExperiment function.

Parameters:
  • data (DataFrame, str, or Path) – Input data — a pandas DataFrame, a path to a CSV file, or the name of a bundled dataset (e.g. "SKCMPositive_4").

  • name (str or None) – Short name for output filenames. Derived automatically when None.

  • groups (pd.Series, np.ndarray, or None) – Optional binary group labels. When provided, these labels take precedence over bundled dataset groups.

  • new_size (int or list[int]) –

    Generation size.

    • If int: generate exactly new_size samples.

      For grouped data, counts are split by the input group ratio and rounded to integers.

    • If list[int]: explicit grouped counts [n_group_0, n_group_1].

    For grouped data, group_0 is the base group used by create_labels() (first encountered group value) and group_1 is the other group.

  • model (str) – Model specification, e.g. "VAE1-10" (parsed into model type and kl_weight).

  • apply_log (bool) – Apply log2(x + 1) preprocessing.

  • batch_frac (float) – Batch size as a fraction of sample count.

  • learning_rate (float) – Optimizer learning rate.

  • epoch (int or None) –

    Fixed epoch count, or None for early stopping.

    The interaction between epoch and early_stop_patience:

    epoch   early_stop_patience   Behaviour
    -----   -------------------   -----------------------------------------------
    None    None                  Early stopping ON, patience=30, max 1000 epochs
    None    30                    Early stopping ON, patience=30, max 1000 epochs
    500     None                  Early stopping OFF, run exactly 500 epochs
    500     30                    Early stopping ON, patience=30, max 500 epochs

  • val_ratio (float) – Validation split ratio (AE family only).

  • early_stop_patience (int or None) – Stop if loss does not improve for this many epochs. When None and epoch is also None, defaults to 30.

  • off_aug (str or None) – Offline augmentation: "AE_head", "Gaussian_head", or None.

  • AE_head_num (int) – Fold multiplier for AE-head augmentation.

  • Gaussian_head_num (int) – Fold multiplier for Gaussian-head augmentation.

  • use_scheduler (bool) – Enable learning-rate scheduler (AE family).

  • step_size (int) – Scheduler step size.

  • gamma (float) – Scheduler gamma.

  • cap (bool) – Cap generated values to observed range.

  • random_seed (int) – Random seed for reproducibility.

  • CVAE_wide_network (bool) – Use wider encoder/decoder for CVAE (512→256→128→64 instead of 256→128→64). Suitable for high-dimensional data like RNA. Ignored for non-CVAE models (default: False).

  • output_dir (str, Path, or None) – If set, automatically save results to this directory.

  • verbose (int or str) –

    Verbosity level for training output.

    • "silent" or 0 — no output during training.

    • "minimal" or 1 (default) — print only training summaries and early-stopping messages.

    • "detailed" or 2 — print per-epoch progress (epoch number, loss values, elapsed time, learning rate).

Returns:

Rich result object containing generated data, loss log, reconstructed data (AE/VAE/CVAE), model state, and metadata.

Return type:

SyngResult
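
The interaction between epoch and early_stop_patience documented above can be expressed as a small decision function. The sketch below is illustrative only: resolve_schedule, Schedule, and the two constants are hypothetical names that are not part of the syng_bts API, and the library's internal logic may be organised differently.

```python
from typing import NamedTuple, Optional

DEFAULT_PATIENCE = 30      # documented default patience
DEFAULT_MAX_EPOCHS = 1000  # documented cap when epoch is None


class Schedule(NamedTuple):
    early_stopping: bool
    patience: Optional[int]
    max_epochs: int


def resolve_schedule(epoch: Optional[int],
                     early_stop_patience: Optional[int]) -> Schedule:
    """Hypothetical helper mirroring the documented epoch/patience table."""
    if epoch is None:
        # No fixed epoch count: early stopping is always on.
        patience = (DEFAULT_PATIENCE if early_stop_patience is None
                    else early_stop_patience)
        return Schedule(True, patience, DEFAULT_MAX_EPOCHS)
    if early_stop_patience is None:
        # Fixed epoch count, no patience: run exactly `epoch` epochs.
        return Schedule(False, None, epoch)
    # Both given: early stopping with `epoch` as the upper bound.
    return Schedule(True, early_stop_patience, epoch)
```

Passing epoch=500 alone therefore gives a fixed-length run, while adding early_stop_patience=30 turns the same 500 into an upper bound.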

pilot_study

syng_bts.pilot_study(data: DataFrame | str | Path, pilot_size: list[int], *, name: str | None = None, groups: Series | ndarray | None = None, n_draws: int = 5, model: str = 'VAE1-10', apply_log: bool = True, batch_frac: float = 0.1, learning_rate: float = 0.0005, epoch: int | None = None, early_stop_patience: int | None = None, off_aug: str | None = None, AE_head_num: int = 2, Gaussian_head_num: int = 9, random_seed: int = 123, CVAE_wide_network: bool = False, output_dir: str | Path | None = None, verbose: int | str = 'minimal') → PilotResult

Sweep over pilot sizes with replicated random draws.

For each pilot size, n_draws random sub-samples are drawn from the original data. A model is trained on each sub-sample, and it generates n_draws times as many synthetic samples as the sub-sample contains.

This replaces the legacy PilotExperiment function.

Parameters:
  • data (DataFrame, str, or Path) – Input data.

  • pilot_size (list[int]) – List of pilot sizes to evaluate.

  • name (str or None) – Short name for output filenames.

  • groups (pd.Series, np.ndarray, or None) – Optional binary group labels. When provided, these labels take precedence over bundled dataset groups.

  • n_draws (int) – Number of replicated random draws per pilot size (default: 5). Must be a positive integer.

  • model (str) – Model specification (e.g. "VAE1-10").

  • apply_log (bool) – Apply log2(x + 1) preprocessing.

  • batch_frac (float) – Batch size as a fraction of sample count.

  • learning_rate (float) – Optimizer learning rate.

  • epoch (int or None) – Fixed epoch count or None for early stopping. See generate() for the full interaction table.

  • early_stop_patience (int or None) – Stop if loss does not improve for this many epochs. When None and epoch is also None, defaults to 30. See generate() for the full interaction table.

  • off_aug (str or None) – Offline augmentation mode.

  • AE_head_num (int) – Fold multiplier for AE-head augmentation.

  • Gaussian_head_num (int) – Fold multiplier for Gaussian-head augmentation.

  • random_seed (int) – Base random seed for reproducibility.

  • CVAE_wide_network (bool) – Use wider encoder/decoder for CVAE — see generate().

  • output_dir (str, Path, or None) – If set, automatically save results to this directory.

  • verbose (int or str) – Verbosity level — see generate() for details.

Returns:

Wrapper containing one SyngResult per (pilot_size, draw).

Return type:

PilotResult
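
To make the sweep concrete, the sketch below enumerates the (pilot_size, draw) keys a pilot study produces and the number of synthetic samples generated for each, following the description above (1-based draw indices, n_draws times the sub-sample size). pilot_run_plan is a hypothetical helper written for illustration; it is not part of syng_bts.

```python
from itertools import product


def pilot_run_plan(pilot_size: list[int],
                   n_draws: int = 5) -> dict[tuple[int, int], int]:
    """Hypothetical helper: map each (pilot_size, draw) key to the
    number of synthetic samples generated for that run."""
    if not isinstance(n_draws, int) or n_draws <= 0:
        raise ValueError("n_draws must be a positive integer")
    # Draw indices are 1-based, matching the PilotResult.runs keys.
    return {
        (size, draw): n_draws * size
        for size, draw in product(pilot_size, range(1, n_draws + 1))
    }


plan = pilot_run_plan([50, 100], n_draws=2)
```

With two pilot sizes and two draws, the plan holds four runs; the total number of trained models grows as len(pilot_size) × n_draws.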

transfer

syng_bts.transfer(source_data: DataFrame | str | Path, target_data: DataFrame | str | Path, *, source_name: str | None = None, target_name: str | None = None, source_groups: Series | ndarray | None = None, target_groups: Series | ndarray | None = None, new_size: int | list[int] = 500, model: str = 'VAE1-10', apply_log: bool = True, batch_frac: float = 0.1, learning_rate: float = 0.0005, epoch: int | None = None, early_stop_patience: int | None = None, off_aug: str | None = None, AE_head_num: int = 2, Gaussian_head_num: int = 9, random_seed: int = 123, CVAE_wide_network: bool = False, output_dir: str | Path | None = None, verbose: int | str = 'minimal') → SyngResult

Train on source data, then fine-tune and generate on target data.

The model is first trained on source_data and its learned state is kept in-memory, then fine-tuned on target_data. This is a single-run operation returning a SyngResult.

This replaces the legacy TransferExperiment function.

Parameters:
  • source_data (DataFrame, str, or Path) – Pre-training dataset.

  • target_data (DataFrame, str, or Path) – Fine-tuning / target dataset.

  • source_name (str or None) – Short name for the source dataset.

  • target_name (str or None) – Short name for the target dataset.

  • source_groups (pd.Series, np.ndarray, or None) – Optional binary groups for the source dataset.

  • target_groups (pd.Series, np.ndarray, or None) – Optional binary groups for the target dataset.

  • new_size (int or list[int]) –

    Generation size for the fine-tuned target model.

    • If int: generate exactly new_size samples.

      For grouped data, counts are split by the target input group ratio and rounded to integers.

    • If list[int]: explicit grouped counts [n_group_0, n_group_1].

    For grouped data, group_0 is the base group used by create_labels() (first encountered group value) and group_1 is the other group.

  • model (str) – Model specification.

  • apply_log (bool) – Apply log2(x + 1) preprocessing.

  • batch_frac (float) – Batch size as a fraction of sample count.

  • learning_rate (float) – Optimizer learning rate.

  • epoch (int or None) – Fixed epoch count, or None for early stopping. See generate() for the full interaction table.

  • early_stop_patience (int or None) – Stop if loss does not improve for this many epochs. When None and epoch is also None, defaults to 30. See generate() for the full interaction table.

  • off_aug (str or None) – Offline augmentation mode.

  • AE_head_num (int) – Fold multiplier for AE-head augmentation.

  • Gaussian_head_num (int) – Fold multiplier for Gaussian-head augmentation.

  • random_seed (int) – Random seed.

  • CVAE_wide_network (bool) – Use wider encoder/decoder for CVAE — see generate().

  • output_dir (str, Path, or None) – If set, save results here.

  • verbose (int or str) – Verbosity level — see generate() for details.

Returns:

Result from the fine-tuned target-phase model.

Return type:

SyngResult
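
The two accepted forms of new_size can be illustrated with a short sketch. It assumes one particular rounding rule (round group 0's share, give the remainder to group 1) so that the counts always add up to the requested total; the documentation only states that counts are "rounded to integers", so the library's exact rule may differ. split_new_size is a hypothetical name, not a syng_bts function.

```python
def split_new_size(new_size, n_group_0: int, n_group_1: int) -> list[int]:
    """Hypothetical helper: resolve new_size into [n_group_0, n_group_1]."""
    if isinstance(new_size, list):
        # Explicit grouped counts are used as given.
        if len(new_size) != 2:
            raise ValueError("expected [n_group_0, n_group_1]")
        return list(new_size)
    total = n_group_0 + n_group_1
    # Assumption: round group 0's share and give the remainder to
    # group 1, so the two counts sum to exactly new_size.
    count_0 = round(new_size * n_group_0 / total)
    return [count_0, new_size - count_0]
```

For example, an input with 30 samples in group 0 and 70 in group 1 and new_size=500 yields 150 and 350 generated samples under this rule.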

Result Objects

Experiment functions return result objects that carry generated data, loss logs, reconstructed data, and model state as attributes.

SyngResult

class syng_bts.SyngResult(generated_data: DataFrame, loss: DataFrame, reconstructed_data: DataFrame | None = None, original_data: DataFrame | None = None, model_state: dict[str, Any] | None = None, metadata: dict[str, Any] = <factory>, original_groups: Series | None = None, generated_groups: Series | None = None, reconstructed_groups: Series | None = None)

Bases: object

Result of a single SyNG-BTS model training and generation run.

generated_data

Synthetic samples with the original column names preserved.

Type:

pd.DataFrame

loss

Training loss log (columns depend on the model family).

Type:

pd.DataFrame

reconstructed_data

Reconstructions of the input data (AE/VAE/CVAE only).

Type:

pd.DataFrame or None

original_data

The full original input data.

Type:

pd.DataFrame or None

model_state

The state_dict() of the trained model, suitable for torch.save() / torch.load().

Type:

dict or None

metadata

Run parameters and summary statistics, e.g. model name, kl_weight, seed, epoch count, input data dimensions.

Type:

dict

original_groups

Group labels for the original input data. Populated when groups were provided or bundled with the dataset.

Type:

pd.Series or None

generated_groups

Group labels for the generated data, derived from the label column produced during generation and mapped back to the original group values.

Type:

pd.Series or None

reconstructed_groups

Group labels for the reconstructed data (AE/VAE/CVAE only), derived from the label column and mapped back to original group values.

Type:

pd.Series or None

Examples

>>> result = generate(data="SKCMPositive_4", model="VAE1-10", epoch=5)
>>> result.generated_data.head()
>>> result.save("./my_output/")
>>> figs = result.plot_loss()  # dict[str, Figure]

generate_new_samples(n: int, *, mode: str = 'new') → SyngResult

Generate new synthetic samples from the trained model.

This method reuses the same generation and post-processing path as generate(), applying the same inverse-log transform and column naming.

Parameters:
  • n (int) – Number of new samples to generate.

  • mode (str) –

    How to incorporate the new samples:

    • "new" (default): return a new SyngResult whose generated_data contains only the newly generated samples. All other fields (loss, metadata, model_state, etc.) are copied from self.

    • "overwrite": replace self.generated_data with the new samples and return self.

    • "append": append the new samples to self.generated_data and return self.

Returns:

The result containing the new samples (see mode).

Return type:

SyngResult

Raises:

ValueError – If model_state is None, arch_params is missing from metadata, or mode is not one of the accepted values.
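
The three mode values differ in whether the original object is mutated and which object is returned. The toy class below reproduces only that bookkeeping, using a plain list in place of a DataFrame; ToyResult and add_samples are hypothetical stand-ins and do not reflect how SyngResult is implemented.

```python
from dataclasses import dataclass, field, replace

VALID_MODES = ("new", "overwrite", "append")


@dataclass
class ToyResult:
    """Stand-in for SyngResult; generated_data is a plain list here."""
    generated_data: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

    def add_samples(self, samples: list, mode: str = "new") -> "ToyResult":
        if mode not in VALID_MODES:
            raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
        if mode == "new":
            # Fresh object holding only the new samples; self is untouched.
            return replace(self, generated_data=list(samples))
        if mode == "overwrite":
            self.generated_data = list(samples)
        else:  # "append"
            self.generated_data = self.generated_data + list(samples)
        return self
```

Only mode="new" leaves the original result unchanged, which makes it the safer choice when the first batch of generated samples is still needed.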

Examples

>>> result = generate(data="SKCMPositive_4", model="VAE1-10", epoch=5)
>>> new_result = result.generate_new_samples(200)
>>> new_result.generated_data.shape[0]
200
>>> # After save/load round-trip:
>>> loaded = SyngResult.load("output/")
>>> more = loaded.generate_new_samples(100, mode="append")
>>> more.generated_data.shape[0]  # original + 100

save(output_dir: str | Path, prefix: str | None = None) → dict[str, Path]

Save all non-None results to output_dir.

Files are written into a single flat directory. CSVs include column headers. Model state is saved as a .pt file. Metadata is written as a human-readable JSON file.

Parameters:
  • output_dir (str or Path) – Directory to write files into (created if it does not exist).

  • prefix (str or None) – Optional filename prefix. When None, uses metadata["dataname"] if available, otherwise "syng".

Returns:

Mapping of output type ("generated", "loss", "reconstructed", "model", "metadata") to the written file path.

Return type:

dict[str, Path]

plot_loss(running_average_window: int = 25, x_axis: str = 'epochs') → dict[str, matplotlib.pyplot.Figure]

Plot the training loss curve(s), one figure per loss column.

Each returned figure shows the raw loss series (alpha=0.4) and a running-average overlay.

Parameters:
  • running_average_window (int) – Window size for the running-average overlay. Must be > 0. Default: 25.

  • x_axis (str) – "epochs" (default) maps the x-axis to epoch space using metadata["epochs_trained"] (must be present and > 0). "iterations" numbers data points 0…N-1.

Returns:

{loss_column_name: figure} for every column in self.loss.

Return type:

dict[str, matplotlib.figure.Figure]

Raises:

ValueError – If running_average_window ≤ 0, if x_axis is not "iterations" or "epochs", if x_axis="epochs" but metadata["epochs_trained"] is missing or ≤ 0, or if the window is larger than a loss series.
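
As a rough illustration of the running-average overlay and the window checks listed under Raises, the sketch below computes a simple moving average over a loss series. The documentation does not say which averaging scheme the library uses, so treat this as one plausible reading; running_average is a hypothetical helper name.

```python
def running_average(values: list[float], window: int = 25) -> list[float]:
    """Simple moving average; returns len(values) - window + 1 points."""
    if window <= 0:
        raise ValueError("running_average_window must be > 0")
    if window > len(values):
        raise ValueError("window is larger than the loss series")
    total = sum(values[:window])
    averaged = [total / window]
    for i in range(window, len(values)):
        # Slide the window: add the newest value, drop the oldest.
        total += values[i] - values[i - window]
        averaged.append(total / window)
    return averaged
```

Short runs (for example epoch=5) have fewer loss points than the default window of 25, so a smaller running_average_window is needed in that case.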

plot_heatmap(which: str = 'generated', log_scale: bool = True) → matplotlib.pyplot.Figure

Render a seaborn heatmap of generated, reconstructed, or original data.

Parameters:
  • which (str) – "generated", "reconstructed", or "original".

  • log_scale (bool) – If True (default), apply log2(x + 1) scaling to the data before plotting. This compresses wide-ranging values and often produces more readable heatmaps.

Returns:

The heatmap figure (not shown; caller decides when to display).

Return type:

matplotlib.figure.Figure

Raises:

ValueError – If which is "reconstructed" but no reconstructed data exists, or if which is not a recognised value.

summary() → str

Return a short textual summary of this result.

Returns:

A paragraph describing the run dimensions, epoch count, and final loss values.

Return type:

str

classmethod load(directory: str | Path, prefix: str | None = None) → SyngResult

Load a previously saved SyngResult from disk.

Parameters:
  • directory (str or Path) – Directory that contains the saved files.

  • prefix (str or None) – The filename stem (everything before _generated.csv). When None, auto-detected from *_generated.csv files in the directory; exactly one match is required.

Returns:

Reconstructed result with all available artifacts.

Return type:

SyngResult

Raises:
  • FileNotFoundError – If the required *_generated.csv or *_loss.csv file is missing.

  • ValueError – If prefix is None and zero or more than one *_generated.csv file is found (ambiguous).
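
Prefix auto-detection as described above amounts to globbing for *_generated.csv and requiring exactly one match. The following sketch shows that idea with the standard library only; detect_prefix is a hypothetical helper and the file names in the demo are made up for illustration.

```python
import tempfile
from pathlib import Path

SUFFIX = "_generated.csv"


def detect_prefix(directory) -> str:
    """Hypothetical helper: infer the filename stem from *_generated.csv."""
    matches = sorted(Path(directory).glob(f"*{SUFFIX}"))
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one *{SUFFIX} file, found {len(matches)}; "
            "pass prefix explicitly"
        )
    # Everything before "_generated.csv" is the prefix.
    return matches[0].name[: -len(SUFFIX)]


# Demo with throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "run1_generated.csv").write_text("a,b\n1,2\n")
    (Path(tmp) / "run1_loss.csv").write_text("loss\n0.5\n")
    detected = detect_prefix(tmp)
```

When several runs are saved into the same directory, auto-detection is ambiguous, so passing prefix explicitly to load() is the reliable option.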

PilotResult

class syng_bts.PilotResult(runs: dict[tuple[int, int], SyngResult] = <factory>, original_data: DataFrame | None = None, metadata: dict[str, Any] = <factory>)

Bases: object

Result of a pilot study run across multiple pilot sizes and draws.

runs

Mapping of (pilot_size, draw_index) → individual run result. draw_index is 1-based (1 through n_draws).

Type:

dict[tuple[int, int], SyngResult]

original_data

The full original input data (before subsetting).

Type:

pd.DataFrame or None

metadata

Shared metadata across all runs (model, data dimensions, etc.).

Type:

dict

Examples

>>> result = pilot_study(data="SKCMPositive_4", pilot_size=[50, 100], ...)
>>> result.runs[(50, 1)].generated_data.head()
>>> result.save("./pilot_output/")

save(output_dir: str | Path, prefix: str | None = None) → dict[tuple[int, int], dict[str, Path]]

Save all individual run results to output_dir.

Each run is saved with a filename that encodes the pilot size and draw index.

Parameters:
  • output_dir (str or Path) – Directory to write files into (created if it does not exist).

  • prefix (str or None) – Optional filename prefix. Falls back to metadata["dataname"] or "syng".

Returns:

Nested mapping: (pilot_size, draw) → {output_type → path}.

Return type:

dict[tuple[int, int], dict[str, Path]]

plot_loss(style: str = 'overlay_runs', running_average_window: int = 25, x_axis: str = 'epochs', truncate: bool = True) → dict[tuple[int, int], dict[str, matplotlib.pyplot.Figure]] | dict[str, matplotlib.pyplot.Figure]

Plot loss curves for every run in the pilot study.

Parameters:
  • style (str) –

    Plotting style for loss trajectories.

    • "per_run": one figure per run per loss column, delegating to SyngResult.plot_loss().

    • "overlay_runs" (default): overlay all runs on the same plot for each loss column. Only the running-average line is drawn per run (no raw trace) to keep the plot readable.

    • "mean_band": plot the mean loss trajectory across all runs for each loss column, with a shaded ±1 std band. Mean and std are computed on raw loss values; the mean line is then optionally smoothed with a running average.

    For all styles, y-axis scaling is applied to reduce the effect of large initial spikes (analogous to SyngResult.plot_loss()).

  • running_average_window (int) – Window size for the running-average overlay. Must be > 0. Default: 25.

  • x_axis (str) – "epochs" (default) maps the x-axis to epoch space using each run’s metadata["epochs_trained"]. "iterations" numbers data points 0…N-1.

  • truncate (bool) – Only relevant for style="mean_band" and style="overlay_runs". When True (default), only epochs/iterations common to all runs are plotted (truncated to the shortest run). When False, all epochs/iterations are plotted; statistics are computed from whichever runs still have data at each point.

Returns:

style="per_run": nested dict keyed by (pilot_size, draw) → {column: Figure}. style="overlay_runs" or style="mean_band": flat dict {column: Figure}.

Return type:

dict[tuple[int, int], dict[str, Figure]] or dict[str, Figure]

Raises:

ValueError – If style is not one of the accepted values, if running_average_window ≤ 0, or if x_axis is invalid.
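
The effect of truncate on runs of unequal length (for example when early stopping ends some runs sooner) can be shown with a pointwise mean over plain lists. This is a sketch of the documented behaviour only; mean_trajectory is a hypothetical helper, and the library additionally computes a standard-deviation band and optional smoothing that are left out here.

```python
from itertools import zip_longest
from statistics import fmean


def mean_trajectory(runs: list[list[float]],
                    truncate: bool = True) -> list[float]:
    """Pointwise mean across runs of possibly different lengths."""
    if truncate:
        # zip stops at the shortest run, so only common points are kept.
        columns = zip(*runs)
    else:
        # zip_longest pads finished runs with None; drop the padding so
        # each point averages only the runs that still have data.
        columns = (
            [v for v in column if v is not None]
            for column in zip_longest(*runs)
        )
    return [fmean(column) for column in columns]
```

With truncate=False the tail of the curve is averaged over fewer runs, so it is typically noisier than the common prefix.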

Examples

>>> figs = pilot_result.plot_loss(style="overlay_runs")
>>> figs = pilot_result.plot_loss(style="mean_band", truncate=False)

summary() → str

Return an aggregate summary of all pilot runs.

Returns:

Multi-line summary with one line per run.

Return type:

str

Evaluation Functions

Functions for evaluating and visualizing generated data.

heatmap_eval

syng_bts.heatmap_eval(real_data: DataFrame, generated_data: DataFrame | None = None, *, apply_log: bool = True, cmap: str = 'YlGnBu') → matplotlib.figure.Figure

Create a heatmap visualization comparing real and generated data.

If only one dataset is provided, displays a single heatmap. If both real and generated data are provided, displays them side by side.

Parameters:
  • real_data (pd.DataFrame) – The original/real data.

  • generated_data (pd.DataFrame or None, optional) – The generated/synthetic data. If None, only real_data is plotted.

  • apply_log (bool, default True) – Whether to apply log2(x + 1) transformation to both real and generated data before visualization.

  • cmap (str, default "YlGnBu") – Colormap passed to seaborn.heatmap().

Returns:

The matplotlib Figure containing the heatmap(s).

Return type:

Figure
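
The log2(x + 1) transform referred to by apply_log here and throughout this page maps zero counts to zero and compresses large values. A minimal sketch of the transform and its mathematical inverse on scalars follows; log2p1 and inverse_log2p1 are hypothetical names, and the library applies the transform to whole DataFrames rather than single numbers.

```python
import math


def log2p1(x: float) -> float:
    """Forward transform: log2(x + 1); maps 0 to 0."""
    return math.log2(x + 1.0)


def inverse_log2p1(y: float) -> float:
    """Inverse transform: 2**y - 1."""
    return 2.0 ** y - 1.0
```

If the data passed in has already been log-transformed, apply_log=False avoids transforming it a second time.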

UMAP_eval

syng_bts.UMAP_eval(real_data: DataFrame, generated_data: DataFrame | None = None, *, apply_log: bool = True, groups_real: Series | None = None, groups_generated: Series | None = None, random_seed: int = 42, legend_pos: str = 'best') → matplotlib.figure.Figure

Create a UMAP visualization comparing real and generated data.

Uses UMAP dimensionality reduction to visualize high-dimensional data in 2D, with optional group colouring.

Parameters:
  • real_data (pd.DataFrame) – The original/real data.

  • generated_data (pd.DataFrame or None, optional) – The generated/synthetic data. If None, only real_data is visualised.

  • apply_log (bool, default True) – Whether to apply log2(x + 1) transformation to both real and generated data before dimensionality reduction.

  • groups_real (pd.Series or None, optional) – Group labels for real samples. Used for styling.

  • groups_generated (pd.Series or None, optional) – Group labels for generated samples. Used for styling.

  • random_seed (int, default 42) – Random seed for UMAP reproducibility.

  • legend_pos (str, default "best") – Legend position ("best", "upper right", "lower left", …).

Returns:

The matplotlib Figure containing the UMAP scatter plot.

Return type:

Figure

evaluation

syng_bts.evaluation(real_data: DataFrame | str | Path, generated_data: DataFrame | str | Path, *, real_groups: Series | ndarray | list | tuple | Index | None = None, generated_groups: Series | ndarray | list | tuple | Index | None = None, n_samples: int | None = 200, apply_log: bool = True, random_seed: int = 42) → dict[str, matplotlib.figure.Figure]

Preprocessing and visualization of generated vs real data.

Loads and preprocesses the input data, then creates heatmap and UMAP visualizations comparing generated and real datasets.

Parameters:
  • real_data (pd.DataFrame, str, or Path) – The original/real dataset. Accepts a DataFrame, a file path, or a bundled dataset name (resolved via resolve_data()).

  • generated_data (pd.DataFrame, str, or Path) – The generated/synthetic dataset. Same input types as real_data.

  • real_groups (pd.Series, np.ndarray, list, tuple, pd.Index, or None, optional) – Group labels for the real samples. When provided, takes precedence over any bundled groups resolved from real_data. Values are used as-is for plot labels (converted to str).

  • generated_groups (pd.Series, np.ndarray, list, tuple, pd.Index, or None, optional) – Group labels for the generated samples. When provided, takes precedence over any bundled groups resolved from generated_data. Values are used as-is for plot labels (converted to str).

  • n_samples (int or None, default 200) – Number of samples from each end of the dataset to use for visualization (to keep UMAP fast). If None, all samples are used.

  • apply_log (bool, default True) – Whether to apply log2(x + 1) transformation to both real and generated data before comparison.

  • random_seed (int, default 42) – Random seed for UMAP reproducibility.

Returns:

{"heatmap": <Figure>, "umap": <Figure>} — the two evaluation figures. Neither figure has been displayed; the caller decides when to call plt.show() or fig.savefig().

Return type:

dict[str, Figure]

Sample-Size Evaluation (SyntheSize)

Classifier-based sample-size evaluation using inverse power-law learning curves. See Sample-Size Evaluation (SyntheSize) for the full usage guide.

evaluate_sample_sizes

syng_bts.evaluate_sample_sizes(data: pd.DataFrame | SyngResult, sample_sizes: list[int] | np.ndarray | pd.Series | int, groups: np.ndarray | pd.Series | list | None = None, which: str = 'generated', n_draws: int = 5, apply_log: bool = True, methods: list[str] | None = None, verbose: int | str = 'minimal') → pd.DataFrame

Evaluate classifiers across candidate sample sizes.

For each classifier and each candidate sample size, performs n_draws rounds of stratified sampling (proportional to class distribution), applies 5-fold cross-validation, and averages metrics across folds.

Parameters:
  • data (pd.DataFrame or SyngResult) – The dataset to evaluate. When a SyngResult is provided, the which parameter selects the data attribute and groups are auto-resolved from the corresponding *_groups field.

  • sample_sizes (list[int], np.ndarray, pd.Series, or int) – Candidate sample sizes to evaluate. Accepts a list, numpy array, or pandas Series of positive integers. When a single int is provided it is interpreted as the number of equidistant sizes to create — the maximum equals the number of data rows. For example, sample_sizes=3 with 15-row data produces [5, 10, 15].

  • groups (array-like or None) – Class labels corresponding to the rows of data. Required when data is a pd.DataFrame. When provided alongside a SyngResult, overrides the auto-resolved groups.

  • which (str, default "generated") – Selector when data is a SyngResult: "generated", "original", or "reconstructed".

  • n_draws (int, default 5) – Number of resampling repetitions for each sample size.

  • apply_log (bool, default True) – When True, a log2(x + 1) transform is applied to the data before evaluation.

  • methods (list[str] or None) – Classifier names to evaluate. Accepts canonical names ('LOGIS', 'SVM', 'KNN', 'RF', 'XGB') and common aliases ('LOGISTIC', 'LR', 'RANDOM_FOREST', 'XGBOOST'). Defaults to all five classifiers.

  • verbose (int or str, default "minimal") – Controls output verbosity. Accepts 0 / "silent" (no output), 1 / "minimal" (one dynamic overall progress bar across all sample sizes, draws, and methods), or 2 / "detailed" (per-draw/method metric lines).

Returns:

Columns: total_size, draw, method, f1_score, accuracy, auc.

Return type:

pd.DataFrame

Raises:
  • TypeError – If data is not a pd.DataFrame or SyngResult.

  • ValueError – If groups is missing when required, which is invalid, methods contains unknown names, sample_sizes is empty or contains non-positive values, or any sample size exceeds the number of available rows.
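
When sample_sizes is a single int, it is expanded into that many equidistant sizes ending at the number of rows. The sketch below reproduces the documented example (3 sizes over 15 rows gives [5, 10, 15]). How the library rounds when the row count is not an exact multiple is not documented, so the integer division used here is an assumption, and equidistant_sizes is a hypothetical helper name.

```python
def equidistant_sizes(n_sizes: int, n_rows: int) -> list[int]:
    """Hypothetical helper: n_sizes evenly spaced sizes ending at n_rows."""
    if n_sizes <= 0:
        raise ValueError("n_sizes must be positive")
    if n_sizes > n_rows:
        raise ValueError("cannot create more sizes than there are rows")
    # Assumption: integer division keeps sizes whole when n_rows is not
    # a multiple of n_sizes; the last size is always n_rows.
    return [(i * n_rows) // n_sizes for i in range(1, n_sizes + 1)]
```

Passing an explicit list instead gives full control over the candidate sizes, for instance to sample the small-n region more densely.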

Examples

Using a DataFrame:

>>> df = pd.read_csv("mydata.csv")
>>> groups = df.pop("group")
>>> result = evaluate_sample_sizes(df, sample_sizes=[50, 100], groups=groups)

Using a SyngResult:

>>> from syng_bts import generate
>>> sr = generate(data="BRCASubtypeSel_test", model="CVAE1-20", epoch=10)
>>> result = evaluate_sample_sizes(sr, sample_sizes=[50], which="generated")

plot_sample_sizes

syng_bts.plot_sample_sizes(metric_real: DataFrame, n_target: int | list, metric_generated: DataFrame | None = None, metric_name: str = 'f1_score') → matplotlib.pyplot.Figure

Visualize inverse power-law function (IPLF) learning curves fitted from evaluation metrics.

Fits inverse power-law curves to the evaluation metrics produced by evaluate_sample_sizes() and plots observed values, fitted curves, and 95% confidence intervals.

The returned figure is never displayed automatically — call fig.savefig(...) or plt.show() explicitly to display or save.

Parameters:
  • metric_real (pd.DataFrame) – Metrics from evaluate_sample_sizes() on real data.

  • n_target (int or list) – Target sample sizes for extrapolation reference.

  • metric_generated (pd.DataFrame or None) – Metrics from evaluate_sample_sizes() on generated data. When provided, a second column of panels is added.

  • metric_name (str, default "f1_score") – Metric to visualize ("f1_score", "accuracy", or "auc").

Returns:

The figure containing the learning-curve panels.

Return type:

matplotlib.figure.Figure
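
For intuition about what an inverse power-law learning curve is, the sketch below fits a deliberately simplified two-parameter form, error(n) = b · n^(−c) with the metric taken as 1 − error, by linear least squares in log-log space, and extrapolates to a target size. This is not the fitting routine used by plot_sample_sizes(): the parameterisation, the fixed asymptote of 1, the function names, and the synthetic input values are all assumptions made for illustration, and no confidence intervals are computed.

```python
import math


def fit_inverse_power_law(sizes, errors):
    """Fit error(n) = b * n**(-c) by least squares in log-log space."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    intercept = y_mean - slope * x_mean
    # log(error) = log(b) - c * log(n)  =>  b = exp(intercept), c = -slope
    return math.exp(intercept), -slope


def predict_metric(n, b, c):
    """Extrapolated metric (1 - error) at sample size n."""
    return 1.0 - b * n ** (-c)


# Synthetic, illustrative values that follow 1 - 2 / sqrt(n) exactly.
sizes = [50, 100, 200]
metrics = [1 - 2 / math.sqrt(n) for n in sizes]
b, c = fit_inverse_power_law(sizes, [1 - m for m in metrics])
predicted = predict_metric(300, b, c)
```

The fitted curve is only as trustworthy as the range of sizes it was fitted on, so extrapolating far beyond the largest evaluated size should be treated with caution.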

Examples

>>> metrics = evaluate_sample_sizes(df, [50, 100, 200], groups=g)
>>> fig = plot_sample_sizes(metrics, n_target=300)
>>> fig.savefig("learning_curves.png")

Data Utilities

Functions for loading and managing datasets.

resolve_data

syng_bts.resolve_data(data: DataFrame | str | Path) → tuple[DataFrame, Series | None]

Resolve a flexible data input to a pandas DataFrame and optional groups.

Accepts a DataFrame (returned as-is with None groups), a file path (loaded via pd.read_csv / pd.read_parquet), or the name of a bundled dataset.

Parameters:

data (pd.DataFrame, str, or Path) –

One of:

  • A pd.DataFrame — returned directly with groups None.

  • A str or Path pointing to an existing CSV or Parquet file (must include an extension such as .csv or .parquet).

  • A plain name (no extension, no path separators) of a bundled dataset, e.g. "SKCMPositive_4".

Returns:

(features_df, groups_or_none). Groups are a pd.Series only when the input is a bundled dataset that ships with a groups sidecar. For user-provided files and DataFrames, groups are always None.

Return type:

tuple[pd.DataFrame, pd.Series | None]

Raises:
  • ValueError – If data looks like a bundled-dataset name but is not found in the registry. The error message lists all available bundled datasets.

  • FileNotFoundError – If data looks like a file path but the file does not exist.

  • TypeError – If data is not a DataFrame, str, or Path.
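
The rule that separates a bundled-dataset name from a file path (no extension and no path separators) can be sketched with pathlib. classify_input is a hypothetical helper that covers only str and Path inputs; the DataFrame pass-through and the actual loading are omitted, and the real resolver may apply additional checks.

```python
from pathlib import Path


def classify_input(data) -> str:
    """Hypothetical helper: label a str/Path input as 'bundled' or 'file'."""
    if not isinstance(data, (str, Path)):
        raise TypeError("this sketch only handles str or Path inputs")
    text = str(data)
    has_separator = "/" in text or "\\" in text
    has_extension = Path(text).suffix != ""
    # A plain name (no extension, no path separators) is treated as a
    # bundled-dataset name; anything else is treated as a file path.
    return "file" if (has_separator or has_extension) else "bundled"
```

A consequence of this rule is that a local file must be referenced with its extension (or a path separator), otherwise it is looked up in the bundled-dataset registry.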

Examples

>>> from syng_bts.data_utils import resolve_data
>>> df, groups = resolve_data("SKCMPositive_4")          # bundled
>>> df, groups = resolve_data("./my_data/custom.csv")    # file path
>>> df, groups = resolve_data(existing_dataframe)         # pass-through

list_bundled_datasets

syng_bts.list_bundled_datasets() → list

List all available bundled datasets.

Returns:

List of dataset names that can be loaded with resolve_data().

Return type:

list

TCGA Datasets

The TCGA loader downloads, caches, and exposes 24 packaged TCGA miRNA cohorts (raw + normalized + CVAE-synthesized). See TCGA Datasets for the narrative guide and TCGA Quick Start for a runnable example.

load_tcga_dataset

syng_bts.load_tcga_dataset(name: str, *, force: bool = False, manifest_url: str | None = None) → TCGADataset

Download (if needed) and return a TCGA cohort as a TCGADataset.

On first call for a given dataset, the loader fetches the manifest, downloads the corresponding HDF5 file, verifies its sha256, and caches the file under tcga_cache_dir() / <version>. Subsequent calls reuse the cached file.

Parameters:
  • name – Cohort code (e.g. "BRCA") or full dataset name from the manifest. Cancer-type prefixes resolve to the canonical entry when unambiguous.

  • force – If True, redownload even when a cached file exists.

  • manifest_url – Override the published manifest URL (useful for staging mirrors). Defaults to the data-v1.0 release manifest.

Returns:

A TCGADataset exposing TCGADataset.real() and TCGADataset.synth() accessors.

Raises:
  • ValueError – If name does not match any cohort in the manifest, if the cached HDF5 file is corrupt (pass force=True to redownload), or if the sha256 checksum fails twice after retry.

  • OSError – If the manifest or HDF5 file cannot be downloaded due to a network failure.

Example

>>> from syng_bts import load_tcga_dataset
>>> ds = load_tcga_dataset("BRCA")
>>> real_df, real_groups = ds.real()
>>> real_df.shape
(1144, 570)
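
The sha256 check mentioned above is a standard integrity test: hash the downloaded file and compare the digest with the value recorded in the manifest. The sketch below shows the technique with hashlib on a throwaway file; sha256_of and verify_sha256 are hypothetical helpers, and the manifest format and retry logic of the real loader are not reproduced.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Hex sha256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256(path, expected_hex: str) -> bool:
    """True when the file's digest matches the expected value."""
    return sha256_of(path) == expected_hex.lower()


# Demo on a temporary file rather than a real TCGA download.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"abc")
    expected = hashlib.sha256(b"abc").hexdigest()
    ok = verify_sha256(sample, expected)
    tampered = verify_sha256(sample, "0" * 64)
```

A mismatch usually indicates an interrupted or corrupted download, which is the situation force=True is meant to recover from.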

list_tcga_datasets

syng_bts.list_tcga_datasets(*, short: bool = False, manifest_url: str | None = None) → list[str]

Return the names of all TCGA cohorts in the published manifest.

Parameters:
  • short – If True, return short cohort codes (e.g. "BRCA"). Otherwise return the full manifest dataset names.

  • manifest_url – Override the published manifest URL. Defaults to the data-v1.0 release manifest.

Returns:

A list of dataset names. With short=False (default), names are the full manifest entries; with short=True, the leading cohort code.

Example

>>> from syng_bts import list_tcga_datasets
>>> list_tcga_datasets(short=True)[:3]
['ACC', 'BLCA', 'BRCA']
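Given a list of full names such as the one returned here, the prefix resolution described for the name parameter of load_tcga_dataset() could work roughly as follows. This is one plausible reading, not the library's implementation: resolve_cohort is a hypothetical function, and the example names are invented placeholders because the format of real manifest entries is not shown on this page.

```python
def resolve_cohort(query: str, names: list[str]) -> str:
    """Resolve query to one name: exact match first, then a unique prefix."""
    if query in names:
        return query
    matches = [n for n in names if n.startswith(query)]
    if len(matches) == 1:
        return matches[0]
    reason = "no match" if not matches else f"ambiguous: {matches}"
    raise ValueError(f"cannot resolve {query!r} ({reason})")


# Hypothetical full names; real manifest entries may be formatted differently.
names = ["ACC_demo", "BLCA_demo", "BRCA_demo"]
print(resolve_cohort("BRCA", names))  # BRCA_demo
```

In this sketch a query such as "B" matches two names and raises ValueError, which mirrors the "when unambiguous" condition.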

tcga_cache_dir

syng_bts.tcga_cache_dir() → Path

Return the active TCGA cache directory (without the version subdirectory).

Honors the SYNG_BTS_CACHE_DIR environment variable if set; otherwise returns ~/.cache/syng-bts/tcga. The directory is not created by this call.

Returns:

The cache root for TCGA datasets. Versioned dataset files live under tcga_cache_dir() / <manifest-version>.

Example

>>> from syng_bts import tcga_cache_dir
>>> tcga_cache_dir()
PosixPath('/home/alice/.cache/syng-bts/tcga')
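The lookup order described above (environment variable first, default path otherwise) can be sketched with the standard library. This is an illustration only: resolve_cache_dir and its env parameter are hypothetical, and the sketch assumes the value of SYNG_BTS_CACHE_DIR is used as the cache root as-is.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the cache root without creating it."""
    # `env` is injectable for testing; it defaults to the process environment.
    env = os.environ if env is None else env
    override = env.get("SYNG_BTS_CACHE_DIR")
    if override:
        # Assumption: the override is taken verbatim as the cache root.
        return Path(override).expanduser()
    return Path.home() / ".cache" / "syng-bts" / "tcga"
```

Setting SYNG_BTS_CACHE_DIR before the first load is a convenient way to keep downloaded cohorts on a scratch disk or inside a project directory.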

clear_tcga_cache

syng_bts.clear_tcga_cache() → None

Remove the entire TCGA cache directory.

Deletes tcga_cache_dir() recursively. The next call to load_tcga_dataset() will redownload from the manifest. Use this for full cleanup; for per-dataset redownload, prefer load_tcga_dataset() with force=True.

Example

>>> from syng_bts import clear_tcga_cache
>>> clear_tcga_cache()

TCGADataset

class syng_bts.TCGADataset(*, name: str, cancer_type: str, clinical_variable: str, group_labels: list[str], n_raw_samples: int, n_filtered_samples: int, n_raw_features: int, n_filtered_features: int, schema_version: str, creation_date: str, syng_bts_version: str, raw: Subset, processed: dict[str, Subset], synthetic: dict[str, dict[str, Subset]])

Bases: object

A loaded TCGA cohort with real and synthetic accessors.

Returned by load_tcga_dataset(). Wraps a single HDF5 file containing the raw expression matrix, three normalizations (raw_norm, TC, DESeq), and nine synthetic groups (three CVAE models × three normalizations).

Use real() to access real expression data and synth() to access a synthetic counterpart.

real(normalization: str = 'DESeq') → tuple[DataFrame, Series]

Return (expression, groups) for one processed normalization.

Parameters:

normalization – One of "raw_norm", "TC", or "DESeq" (default). See syng_bts.tcga.VALID_NORMALIZATIONS.

Returns:

A (expression, groups) tuple where expression is a pandas.DataFrame of shape (n_samples, n_features) and groups is a pandas.Series of group labels aligned to expression.index. To access the per-slice metadata dict, use the underlying Subset directly via dataset.processed[normalization].

Raises:

ValueError – If normalization is not in syng_bts.tcga.VALID_NORMALIZATIONS.

Example

>>> ds = load_tcga_dataset("BRCA")
>>> real_df, real_groups = ds.real("TC")
>>> real_df.shape
(1144, 570)

synth(normalization: str = 'DESeq', model: str = 'CVAE1_5') → tuple[DataFrame, Series]

Return (expression, groups) for one synthetic configuration.

Parameters:
  • normalization – One of "raw_norm", "TC", or "DESeq" (default). See syng_bts.tcga.VALID_NORMALIZATIONS.

  • model – One of "CVAE1_5" (default), "CVAE1_10", or "CVAE1_20". See syng_bts.tcga.VALID_MODELS.

Returns:

A (expression, groups) tuple where expression is a pandas.DataFrame of shape (n_samples, n_features) and groups is a pandas.Series of group labels aligned to expression.index. To access the per-slice metadata dict (KL weight, epochs trained, etc.), use the underlying Subset directly via dataset.synthetic[normalization][model].

Raises:

ValueError – If normalization is not in syng_bts.tcga.VALID_NORMALIZATIONS or model is not in syng_bts.tcga.VALID_MODELS.

Example

>>> ds = load_tcga_dataset("BRCA")
>>> synth_df, synth_groups = ds.synth("TC", "CVAE1_5")
>>> synth_df.shape
(1000, 570)
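The nine synthetic configurations are the cross product of the three normalizations and the three CVAE models. The sketch below enumerates them with local copies of the documented values; the names NORMALIZATIONS, MODELS, and configs are illustrative, and the library's own constants live in syng_bts.tcga.

```python
from itertools import product

# Local copies of the values listed in the parameter descriptions above.
NORMALIZATIONS = ("raw_norm", "TC", "DESeq")
MODELS = ("CVAE1_5", "CVAE1_10", "CVAE1_20")

# Every (normalization, model) pair accepted by synth().
configs = list(product(NORMALIZATIONS, MODELS))
print(len(configs))  # 9

# With a loaded dataset `ds`, each pair selects one synthetic table:
# for normalization, model in configs:
#     synth_df, synth_groups = ds.synth(normalization, model)
```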

Subset

class syng_bts.Subset(expression: DataFrame, groups: Series, metadata: dict[str, Any])

Bases: object

An expression subset returned by TCGADataset accessors.

expression

A pandas.DataFrame of shape (n_samples, n_features). Rows are TCGA samples (indexed by sample barcode); columns are miRNA features.

Type:

pandas.DataFrame

groups

A pandas.Series aligned to expression.index with categorical group labels (e.g. tumor vs. normal).

Type:

pandas.Series

metadata

Dict of HDF5 attributes captured at dataset assembly time (e.g. version, normalization, source).

Type:

dict[str, Any]

__init__(expression: DataFrame, groups: Series, metadata: dict[str, Any]) → None

Model Classes

Advanced users can access the model classes directly.

Note

These classes are for advanced usage. Most users should use the experiment functions (generate, pilot_study, transfer) which handle model creation and training automatically.

AE (Autoencoder)

class syng_bts.AE(*args: Any, **kwargs: Any)

Bases: Module

__init__(num_features)
forward(x)

VAE (Variational Autoencoder)

class syng_bts.VAE(*args: Any, **kwargs: Any)

Bases: Module

__init__(num_features)
reparameterize(z_mu, z_log_var, deterministic=False)
encoding_fn(x, deterministic=False)
forward(x, deterministic=False)
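The reparameterize signature follows the standard VAE reparameterization trick, z = mu + eps * exp(0.5 * log_var) with eps drawn from a standard normal. The sketch below shows that formula on plain Python lists rather than tensors. It is not the class's code: the rng parameter is hypothetical, and treating deterministic=True as "return the mean without noise" is an assumption.

```python
import math
import random


def reparameterize(z_mu, z_log_var, deterministic=False, rng=random):
    """Sample z = mu + eps * sigma elementwise, with eps ~ N(0, 1)."""
    if deterministic:
        # Assumption: deterministic mode skips the noise and returns the mean.
        return list(z_mu)
    return [
        # sigma = exp(0.5 * log_var) converts log-variance to std deviation.
        mu + rng.gauss(0.0, 1.0) * math.exp(0.5 * log_var)
        for mu, log_var in zip(z_mu, z_log_var)
    ]
```

Passing rng=random.Random(123) makes the sampled values reproducible across runs.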

CVAE (Conditional VAE)

class syng_bts.CVAE(*args: Any, **kwargs: Any)

Bases: Module

__init__(num_features, num_classes, wide_network=False)
reparameterize(z_mu, z_log_var, deterministic=False)
encoding_fn(x, y, deterministic=False)
decoding_fn(encoded, y)
forward(x, y, deterministic=False)
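The extra y argument and the num_classes parameter reflect how a conditional VAE receives the class label alongside the features. A common way to do this is to append a one-hot encoding of the label to the input vector; whether this class conditions in exactly that way is not stated on this page, so the sketch below is an illustration of the general technique with hypothetical helper names.

```python
def one_hot(label: int, num_classes: int) -> list[float]:
    """Encode an integer class label as a one-hot vector."""
    if not 0 <= label < num_classes:
        raise ValueError(f"label {label} outside [0, {num_classes})")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def condition(x: list[float], y: int, num_classes: int) -> list[float]:
    """Concatenate features with a one-hot encoding of the label."""
    return list(x) + one_hot(y, num_classes)


print(condition([0.5, 1.5], 1, 2))  # [0.5, 1.5, 0.0, 1.0]
```

Under this scheme the network input has num_features + num_classes entries, and the same label vector can be appended to the latent code before decoding.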

GAN (Generative Adversarial Network)

class syng_bts.GAN(*args: Any, **kwargs: Any)

Bases: Module

__init__(num_features, latent_dim=32)
generator_forward(z)
discriminator_forward(img)

Package Information

Version and metadata information.

syng_bts.__version__

The current version of SyNG-BTS.

syng_bts.__author__

The package authors.

syng_bts.__license__

The package license (AGPL-3.0).