Usage Guide
===========

This guide covers installation and basic usage of SyNG-BTS.

.. contents:: Table of Contents
   :local:
   :depth: 2

.. _installation:

Installation
------------

**Requirements:** Python 3.10 or later.

From PyPI (Recommended)
~~~~~~~~~~~~~~~~~~~~~~~

Install SyNG-BTS using pip:

.. code-block:: console

    $ pip install syng-bts

From Source
~~~~~~~~~~~

For development or the latest features:

.. code-block:: console

    $ git clone https://github.com/Omics-Data-Synthesis/SyNG-BTS
    $ cd SyNG-BTS
    $ pip install -e .

Optional Dependencies
~~~~~~~~~~~~~~~~~~~~~

Install documentation dependencies:

.. code-block:: console

    $ pip install syng-bts[docs]

Install development dependencies (testing, linting):

.. code-block:: console

    $ pip install syng-bts[dev]

Install all optional dependencies:

.. code-block:: console

    $ pip install syng-bts[all]

Quick Start
-----------

Basic Import
~~~~~~~~~~~~

After installation, import SyNG-BTS in your Python code:

.. code-block:: python

    from syng_bts import (
        generate,
        pilot_study,
        transfer,
        list_bundled_datasets,
        resolve_data,
        SyngResult,
        PilotResult,
    )

Browse Bundled Datasets
~~~~~~~~~~~~~~~~~~~~~~~

Use :func:`~syng_bts.list_bundled_datasets` to browse available bundled
datasets, and :func:`~syng_bts.resolve_data` to load them:

.. code-block:: python

    from syng_bts import list_bundled_datasets, resolve_data

    # See available datasets
    print(list_bundled_datasets())
    # ['SKCMPositive_4', 'BRCA', 'PRAD', 'BRCASubtypeSel', ...]

    # Load a bundled dataset (returns a tuple of DataFrame and optional groups)
    data, groups = resolve_data("SKCMPositive_4")
    print(f"Dataset shape: {data.shape}")

Load TCGA Cohorts
~~~~~~~~~~~~~~~~~

For the full 24 TCGA miRNA cohorts (real + CVAE-synthesized, downloaded on
demand), use :func:`~syng_bts.list_tcga_datasets` and
:func:`~syng_bts.load_tcga_dataset`:

.. code-block:: python

    from syng_bts import list_tcga_datasets, load_tcga_dataset

    # Browse available cohorts
    print(list_tcga_datasets(short=True))
    # ['BLCA', 'BRCA', 'COAD', ..., 'UCS']

    # Load a cohort (downloads on first call, then cached locally)
    ds = load_tcga_dataset("BRCA")

    # Real expression data (DESeq-normalized by default)
    real_df, real_groups = ds.real()
    print(f"Real shape: {real_df.shape}")

    # CVAE-synthesized counterpart
    synth_df, synth_groups = ds.synth()
    print(f"Synthetic shape: {synth_df.shape}")

See :doc:`tcga` for the full guide (catalog, normalizations, caching, custom
mirrors) and :doc:`notebooks/tcga_quickstart` for a runnable end-to-end
example.

Generate Synthetic Data
~~~~~~~~~~~~~~~~~~~~~~~

Use :func:`~syng_bts.generate` to train a generative model and produce
synthetic samples (see :ref:`generate` in :doc:`methods`):

.. code-block:: python

    from syng_bts import generate

    result = generate(
        data="SKCMPositive_4",  # bundled dataset name, CSV path, or DataFrame
        model="VAE1-10",
        new_size=500,
        batch_frac=0.1,
        learning_rate=0.0005,
    )

    # Access results in memory
    print(result.generated_data.shape)   # (500, n_features)
    print(result.loss.columns.tolist())  # ['kl', 'recons']
    print(result.summary())

    # Plot training loss (one figure per loss column)
    figs = result.plot_loss()
    figs["kl"].savefig("kl_loss.png")

    # Optionally save to disk
    result.save("./my_output/")

    # Load a previously saved result
    from syng_bts import SyngResult
    loaded = SyngResult.load("./my_output/")

For grouped datasets, ``new_size`` supports two forms:

- ``int``: exact total generated sample count. Group counts follow the input
  group ratio (rounded).
- ``list[int]``: explicit grouped counts ``[n_group_0, n_group_1]``. Here,
  ``group_0`` is the first group value encountered in the input group labels,
  and ``group_1`` is the other group.
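
As a rough illustration of the ``int`` form, the sketch below splits a total
across two groups in proportion to the input labels. The helper name
``split_new_size`` and the example labels are hypothetical and not part of
SyNG-BTS, and the rounding rule (round the first group's share, give the
remainder to the second) is an assumption; the library's exact rounding may
differ.

.. code-block:: python

    from collections import Counter


    def split_new_size(labels, new_size):
        """Hypothetical helper: split ``new_size`` across two groups by ratio."""
        counts = Counter(labels)
        # Counter keeps insertion order, so the first key is the first label seen.
        group_0, group_1 = list(counts)  # assumes exactly two distinct labels
        n_group_0 = round(new_size * counts[group_0] / len(labels))
        # Assumed rule: the remainder goes to group_1 so the total stays exact.
        return [n_group_0, new_size - n_group_0]


    # Illustrative labels only: three "tumor" samples and one "normal" sample.
    labels = ["tumor", "normal", "tumor", "tumor"]
    print(split_new_size(labels, 500))  # [375, 125]

The result has the same ``[n_group_0, n_group_1]`` layout as the explicit
``list[int]`` form described above.
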

Run a Pilot Study
~~~~~~~~~~~~~~~~~

Use :func:`~syng_bts.pilot_study` to sweep over multiple pilot sizes with
replicated random draws (see :ref:`pilot` in :doc:`methods`):

.. code-block:: python

    from syng_bts import pilot_study

    pilot = pilot_study(
        data="SKCMPositive_4",
        pilot_size=[50, 100],
        model="VAE1-10",
        batch_frac=0.1,
        learning_rate=0.0005,
    )

    # Access individual runs
    run = pilot.runs[(50, 1)]  # (pilot_size, draw_index)
    print(run.generated_data.head())

    # Save all runs
    pilot.save("./pilot_output/")

Use DataFrame Input
~~~~~~~~~~~~~~~~~~~

Pass your own data as a pandas DataFrame:

.. code-block:: python

    import pandas as pd
    from syng_bts import generate

    my_data = pd.read_csv("my_dataset.csv")

    result = generate(
        data=my_data,
        name="my_dataset",  # used in output filenames
        model="WGANGP",
        new_size=1000,
        epoch=50,
    )

Evaluate Results
~~~~~~~~~~~~~~~~

Visualize generated data using :meth:`~syng_bts.SyngResult.plot_heatmap` (on
the result object) or the standalone :func:`~syng_bts.heatmap_eval` and
:func:`~syng_bts.UMAP_eval` functions (see :doc:`evals`):

.. code-block:: python

    from syng_bts import generate, heatmap_eval, UMAP_eval, resolve_data

    result = generate(data="SKCMPositive_4", model="VAE1-10", epoch=5)

    # Built-in heatmap on the result object
    fig = result.plot_heatmap()

    # Standalone evaluation comparing real and generated data
    real_data, _groups = resolve_data("SKCMPositive_4")
    heatmap_eval(real_data=real_data, generated_data=result.generated_data)
    UMAP_eval(real_data=real_data, generated_data=result.generated_data)

Sample-Size Evaluation (SyntheSize)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Evaluate how classifier performance scales with sample size using
:func:`~syng_bts.evaluate_sample_sizes` and
:func:`~syng_bts.plot_sample_sizes` (see :doc:`synthesize` for full details).

By default, :func:`~syng_bts.evaluate_sample_sizes` applies ``log2(x + 1)``
(``apply_log=True``). Set ``apply_log=False`` when input data is already
log-transformed.

.. code-block:: python

    from syng_bts import evaluate_sample_sizes, plot_sample_sizes, resolve_data

    # Load real data with group labels
    data, groups = resolve_data("BRCASubtypeSel_test")

    # Evaluate classifiers at different sample sizes
    metrics = evaluate_sample_sizes(
        data=data,
        sample_sizes=[50, 100, 150],
        groups=groups,
        n_draws=5,
    )

    # Plot inverse power-law learning curves
    fig = plot_sample_sizes(metrics, n_target=200)
    fig.savefig("learning_curves.png")

You can also pass a :class:`~syng_bts.SyngResult` directly; groups are
auto-resolved from the result object:

.. code-block:: python

    from syng_bts import generate, evaluate_sample_sizes

    result = generate(data="BRCASubtypeSel_train", model="CVAE1-20", epoch=10)
    metrics = evaluate_sample_sizes(result, sample_sizes=[50, 100], which="generated")

Next Steps
----------

- See :doc:`methods` for all synthetic data generation methods
- See :doc:`synthesize` for sample-size evaluation with SyntheSize
- See :doc:`configuration` for all available parameters
- See :doc:`api` for the complete API reference
- See :doc:`datasets` for information about bundled datasets
- See :doc:`migration` for upgrading from v2.x