Privacy-preserving, high-dimensional, synthetic medical data

ZKI-PH_PhD2025_04 (ZKI-PH1 & ZKI-PH3)

Date:  06/03/2025

Background:

Medical data are increasingly used for research, policy, and public information. Wider access to such data increases the risk of exposing information about individual patients. Current research explores synthetic datasets as a technical measure to limit exposure of real data. Progress is being made both in generating structured data that resemble the original data and in assessing the remaining privacy risks.

The Zentrum für Krebsdaten at the Robert Koch Institute aggregates observations from all cancer registries in Germany. With more than 13 million cases, the epidemiological data are presented in tabular form. The new data submitted by the registries now include clinical data, opening up new opportunities for cancer research. This requires methods to ensure patient privacy in the more complex case of relational databases.

The Forschungsdatenzentrum of the BfArM plans to provide secure access to billing data of a large part of the German population for research purposes. The inclusion of electronic health records and prescriptions results in a complex data schema.

Aim/s:

This PhD project aims to research AI methods for synthesizing complex medical data and for assessing privacy guarantees in the general case where the data model is expressed as graph data.

AI methods:

Traditional anonymization methods such as aggregation and k-anonymity suffer from limited utility, susceptibility to linkage attacks, or inapplicability to high-dimensional data. Differential privacy is a rigorous mathematical framework that provides provable privacy guarantees, but it is complex to apply and its use outside research remains limited.
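To make the two notions above concrete, the following is a minimal Python sketch and not part of the project itself. It computes the k of a toy table (the size of the smallest group of records sharing the same quasi-identifier values) and releases a count under epsilon-differential privacy using the Laplace mechanism. The records, the column names, and the function names k_anonymity and laplace_count are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())

def laplace_count(true_count, epsilon, rng=random):
    """Release a count with Laplace noise of scale 1/epsilon (a count has sensitivity 1)."""
    # The difference of two i.i.d. exponential draws with rate epsilon
    # follows a Laplace distribution with mean 0 and scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise

# Hypothetical toy records; all values are invented for illustration.
rows = [
    {"age_band": "60-69", "region": "North", "diagnosis": "C50"},
    {"age_band": "60-69", "region": "North", "diagnosis": "C34"},
    {"age_band": "70-79", "region": "South", "diagnosis": "C18"},
    {"age_band": "70-79", "region": "South", "diagnosis": "C50"},
    {"age_band": "70-79", "region": "South", "diagnosis": "C61"},
]

k = k_anonymity(rows, ["age_band", "region"])  # smallest group has 2 records
true_count = sum(1 for row in rows if row["diagnosis"] == "C50")
noisy_count = laplace_count(true_count, epsilon=1.0, rng=random.Random(0))
```

A smaller epsilon gives a stronger guarantee but noisier counts, and the number of quasi-identifier combinations grows quickly with the number of columns, which illustrates why k-anonymity becomes hard to maintain for high-dimensional data.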

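Traditional approaches to synthetic data generation, such as classification and regression trees (CART) and Bayesian networks, remain effective. Neural networks are the state of the art for synthetic tabular data, with approaches such as generative adversarial networks (GANs), diffusion models, and large language models (LLMs).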
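As an illustration of the Bayesian network idea only, the sketch below fits the simplest possible network, a single edge from one categorical column to another, by counting, and then samples synthetic records from the fitted distributions. The column names, the toy records, and the functions fit_chain and sample_chain are hypothetical. The sketch gives no privacy guarantee on its own and is not the method proposed by the project.

```python
import random
from collections import Counter, defaultdict

def fit_chain(records, parent, child):
    """Estimate P(parent) and P(child | parent) from observed counts."""
    parent_counts = Counter(r[parent] for r in records)
    child_counts = defaultdict(Counter)
    for r in records:
        child_counts[r[parent]][r[child]] += 1
    return parent_counts, child_counts

def sample_chain(model, parent, child, n, rng):
    """Draw n synthetic records: sample the parent first, then the child given the parent."""
    parent_counts, child_counts = model
    synthetic = []
    for _ in range(n):
        p = rng.choices(list(parent_counts), weights=list(parent_counts.values()))[0]
        options = child_counts[p]
        c = rng.choices(list(options), weights=list(options.values()))[0]
        synthetic.append({parent: p, child: c})
    return synthetic

# Hypothetical toy records; all values are invented for illustration.
records = [
    {"age_band": "60-69", "stage": "I"},
    {"age_band": "60-69", "stage": "II"},
    {"age_band": "70-79", "stage": "II"},
    {"age_band": "70-79", "stage": "III"},
    {"age_band": "70-79", "stage": "III"},
]

model = fit_chain(records, "age_band", "stage")
synthetic = sample_chain(model, "age_band", "stage", n=5, rng=random.Random(0))
```

Because the sampler draws only from combinations observed in the input, rare combinations can be reproduced in the synthetic output, which is one reason synthetic data still require a separate privacy assessment.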

Keywords:

synthetic data, medical data, data privacy, generative AI, graph data