
In my experience working with National Health Service (NHS) data, one of the biggest challenges is balancing the enormous potential of NHS patient data with strict privacy constraints. The NHS holds a wealth of longitudinal data covering patients’ entire lifetimes across primary, secondary and tertiary care. These data could fuel powerful AI models (for example in diagnostics or operations), but patient confidentiality and GDPR mean we cannot use the raw records for open experimentation. Synthetic data offers a way forward: by training generative models on real data, we can produce “fake” patient datasets that preserve aggregate patterns and relationships without including any actual individuals. In this article I describe how to build a synthetic data lake in a modern cloud environment, enabling scalable AI training pipelines that respect NHS privacy rules. I draw on NHS projects and published guidance to outline a realistic architecture, generation techniques, and an illustrative pipeline example.
The privacy challenge in NHS AI
Accessing raw NHS data requires complex approvals and is often slow. Even when data are pseudonymised, public sensitivities (recall the aborted care.data initiative) and legal duties of confidentiality restrict how widely the data can be shared. Synthetic data can sidestep these issues. The NHS defines synthetic data as “data generated through sophisticated algorithms that mimic the statistical properties of real-world datasets without containing any actual patient information”. Crucially, if truly synthetic data contain no link to real patients, they are not considered personal data under GDPR or NHS confidentiality rules. Analyses of such synthetic data yield results very similar to those on the original (since the distributions are matched), but no individual can be re-identified from them. Of course, the process of generating high-fidelity synthetic data must itself be secured (much like anonymisation), but once that is done we gain a new dataset that can be shared and used far more openly.
In practice, this means a synthetic data lake can let data scientists develop and test machine-learning models without accessing real patient records. For example, synthetic Hospital Episode Statistics (HES) created by NHS Digital allow analysts to explore data schemas, build queries, and prototype analyses. In production use, models (such as diagnostic classifiers or survival models) could be trained on synthetic data before being fine-tuned on limited real data in approved settings. The key point is that the synthetic data carry the statistical “essence” of NHS records (helping models learn genuine patterns) while fully protecting identities.
Synthetic data generation techniques
There are several ways to create synthetic health records, ranging from simple rule-based methods to advanced deep learning models. The NHS Analytics Unit and AI Lab have experimented with a Variational Autoencoder (VAE) approach called SynthVAE. In brief, SynthVAE trains on a tabular patient dataset by compressing the inputs into a latent space and then reconstructing them. Once trained, we can sample new points in the latent space and decode them into synthetic patient records. This captures complex relationships in the data (numerical values, categorical diagnoses, dates) without any one patient’s data being in the output. In one project, we processed the public MIMIC-III ICU dataset to simulate hospital patient records and successfully trained SynthVAE to output millions of synthetic entries. The synthetic set reproduced distributions of age, diagnoses, comorbidities, etc., while passing privacy checks (no record was exactly copied from the real data).
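To make the encode, sample, decode idea concrete, here is a minimal sketch of a tabular VAE in PyTorch. This is not the published SynthVAE code: the toy two-column dataset, network sizes, and loss weighting are all illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy "real" tabular data: a scaled numeric column and a binary flag, 200 rows.
real = torch.cat([
    torch.rand(200, 1),                 # e.g. age scaled to [0, 1]
    (torch.rand(200, 1) > 0.5).float()  # e.g. a binary sex flag
], dim=1)

class TabularVAE(nn.Module):
    def __init__(self, n_features=2, latent_dim=2):
        super().__init__()
        self.enc = nn.Linear(n_features, 8)
        self.mu = nn.Linear(8, latent_dim)
        self.logvar = nn.Linear(8, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 8), nn.ReLU(),
                                 nn.Linear(8, n_features), nn.Sigmoid())

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterisation trick: sample latent z while keeping gradients.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

model = TabularVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):
    recon, mu, logvar = model(real)
    recon_loss = nn.functional.mse_loss(recon, real)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    loss = recon_loss + 0.1 * kl  # KL weight is an arbitrary toy choice
    opt.zero_grad()
    loss.backward()
    opt.step()

# Generation: sample fresh latent points and decode them into synthetic rows.
with torch.no_grad():
    z = torch.randn(1000, 2)
    synthetic = model.dec(z)
print(synthetic.shape)  # 1000 synthetic records, 2 features each
```

No real record appears in the output: the decoder only ever sees random latent samples, which is what makes the sampled records synthetic rather than copies.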
Other approaches can be used depending on the use case. Generative Adversarial Networks (GANs) are popular in research: a generator network creates fake data and a discriminator network learns to distinguish real from fake, pushing the generator to improve over time. GANs can produce very realistic synthetic data but must be tuned carefully to avoid memorising real records. For simpler use cases, rule-based or probabilistic simulators can work: for example, NHS Digital’s artificial HES uses two steps – first producing aggregate statistics from real data (counts of patients by age, sex, outcome, etc.), then randomly sampling from those aggregates to build individual records. This yields structural synthetic datasets that match real data formats and marginal distributions, which is useful for testing pipelines.
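The two-step aggregate-then-sample approach can be sketched in a few lines of pandas. The column names, categories, and proportions below are invented for illustration and are not the actual HES schema.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Step 1: "metadata scraper" - reduce (simulated) real data to aggregate
# proportions. Only these aggregates are needed to generate records.
real = pd.DataFrame({
    "age_band": rng.choice(["0-17", "18-64", "65+"], size=500, p=[0.2, 0.5, 0.3]),
    "sex": rng.choice(["M", "F"], size=500),
})
counts = real.value_counts(["age_band", "sex"], normalize=True)

# Step 2: generator - sample whole records from the aggregate distribution.
idx = rng.choice(len(counts), size=1000, p=counts.values)
synthetic = pd.DataFrame(list(counts.index[idx]), columns=["age_band", "sex"])
print(synthetic.head())
```

Because records are drawn from joint counts rather than copied, the synthetic set matches the real marginal and joint category frequencies without reproducing any individual row ordering or identifiers.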
These methods sit on a fidelity spectrum. At one end are structural synthetic sets that only match the schema (useful for code development). At the other end are replica datasets that preserve joint distributions so closely that statistical analyses on synthetic data closely mirror those on real data. Higher fidelity gives more utility but also increases re-identification risk. As noted in recent NHS and academic reviews, maintaining the right balance is crucial: synthetic data must “be high fidelity with the original data to preserve utility, but sufficiently different as to protect against… re-identification”. That trade-off underpins all architecture and governance choices.
Architecture of a synthetic data lake
An example architecture for a synthetic data lake in the NHS would use modern cloud services to integrate ingestion, anonymisation, generation, validation, and AI training (see figure below). In a typical workflow, raw data from multiple NHS sources (e.g. hospital EHRs, pathology databases, imaging archives) are ingested into a secure data lake (for example Azure Data Lake Storage or AWS S3) via batch processes or API feeds. The raw data lake serves as a transient zone. A de-identification step (using off-the-shelf tools or custom scripts) then anonymises or tokenises personally identifiable information (PII) and generates aggregate metadata. This occurs entirely within a trusted environment (such as a locked-down Azure subscription or an NHS Trusted Research Environment, TRE) so that no sensitive information ever leaves.
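As a toy illustration of the tokenisation part of that de-identification step, direct identifiers can be replaced with keyed-hash tokens before anything leaves the trusted zone. The column names, example values, and secret handling here are illustrative assumptions, not a production scheme (real deployments would use a managed key service and a full de-identification toolchain).

```python
import hashlib
import hmac
import pandas as pd

# Assumed to be held only inside the trusted environment (e.g. a key vault).
SECRET_KEY = b"replace-with-a-managed-secret"

def tokenise(value: str) -> str:
    """Replace a direct identifier with a stable, non-reversible token."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

raw = pd.DataFrame({
    "nhs_number": ["9434765919", "9434765920"],  # made-up example values
    "age": [71, 34],
})

# Swap the identifier column for its token; the raw identifier never leaves.
deidentified = (raw.assign(patient_token=raw["nhs_number"].map(tokenise))
                   .drop(columns="nhs_number"))
print(deidentified)
```

A keyed hash (rather than a plain hash) means tokens cannot be reproduced by anyone without the secret, while still being stable enough to link records within the trusted zone.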
Next, we train the synthetic generator model within a secure analytics environment (for example an Azure Databricks or AWS SageMaker workspace configured for sensitive data). Here, services like Azure Machine Learning or AWS EMR provide the scalable compute needed to train deep models (a VAE, GAN, or similar). Indeed, generating large-scale synthetic datasets requires elastic cloud compute and storage – traditional on-premises systems simply cannot handle the scale or the need to spin up GPUs on demand. Once the model is trained, it produces a new synthetic dataset. Before releasing this data beyond the secure zone, the system runs a validation pipeline: using tools such as the Synthetic Data Vault (SDV), it computes metrics comparing the synthetic set to the original in terms of feature distributions, correlations, and re-identification risk.
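As a hand-rolled stand-in for an SDV-style report (this is not the SDV API; the function, threshold, and toy data are illustrative assumptions), a minimal validation check might compare the marginal distribution of each categorical column and count exact row copies:

```python
import pandas as pd

def validate(real: pd.DataFrame, synth: pd.DataFrame, tol: float = 0.15) -> dict:
    """Toy fidelity/privacy report for categorical columns."""
    report = {}
    for col in real.columns:
        p = real[col].value_counts(normalize=True)
        q = synth[col].value_counts(normalize=True).reindex(p.index, fill_value=0)
        # Total variation distance between marginals; 0 = identical
        # (categories appearing only in synth are ignored in this toy).
        report[col] = float((p - q).abs().sum() / 2)
    # Privacy smoke test: count synthetic rows that exactly copy a real row.
    report["exact_copies"] = len(synth.merge(real.drop_duplicates(), how="inner"))
    report["passed"] = all(report[c] <= tol for c in real.columns)
    return report

real = pd.DataFrame({"sex": ["M", "F", "M", "F"], "dx": ["a", "b", "a", "a"]})
synth = pd.DataFrame({"sex": ["M", "F", "F", "M"], "dx": ["a", "a", "b", "a"]})
print(validate(real, synth))
```

Note that for low-cardinality toy data like this, exact matches are trivially common; in practice the copy check runs over full, high-dimensional records, and real pipelines add correlation and membership-inference metrics on top of marginals.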
Valid synthetic data are then stored in a “Synthetic Data Lake”, separate from the raw one. This synthetic lake can reside in a broader data platform because it carries no real patient identifiers. Researchers and developers access it through standard AI pipelines. For instance, an AI training process in AWS SageMaker or Azure ML can pull from the synthetic lake via APIs or direct query. Because the data are synthetic, access controls can be looser: developers, tooling, or even external (public) teams can use them for development and testing without breaching privacy. Importantly, cloud infrastructure can embed additional governance: for example, compliance checks, bias auditing, and logging can be integrated into the synthetic pipeline so that all uses are tracked and evaluated. In this way we build a self-contained architecture that flows from raw NHS data to fully anonymised synthetic outputs and into ML training, all on the cloud.
Example pipeline for synthetic EHR data
To illustrate concretely, here is a simple example of how a synthetic EHR pipeline might look in code. This toy pipeline ingests a small clinical dataset, generates synthetic patient records, and then trains an AI model on the synthetic data. (In a real system one would use a full generative library, but this pseudocode shows the structure.)
import pandas as pd
from faker import Faker
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder
# Step 1: Ingest (simulated) real EHR data
df_real = pd.DataFrame({
    'age': [71, 34, 80, 40, 43],
    'sex': ['M','F','M','M','F'],
    'diagnosis': ['healthy','hypertension','healthy','hypertension','healthy'],
    'outcome': [0,1,0,1,0]
})
# Step 2: Generate synthetic data (simple sampling example)
fake = Faker()
synthetic_records = []
for _ in range(5):
    record = {
        'age': fake.random_int(20, 90),
        'sex': fake.random_element(['M','F']),
        'diagnosis': fake.random_element(['healthy','hypertension','diabetes'])
    }
    # Define outcome based on diagnosis (toy rule)
    record['outcome'] = 0 if record['diagnosis']=='healthy' else 1
    synthetic_records.append(record)
df_synth = pd.DataFrame(synthetic_records)
# Step 3: Train AI model on synthetic data
features = ['age','sex','diagnosis']
ohe = OneHotEncoder(sparse_output=False)  # 'sparse' was renamed in scikit-learn 1.2
X = ohe.fit_transform(df_synth[features])
y = df_synth['outcome']
model = RandomForestClassifier().fit(X, y)
print("Trained model on synthetic data:", model)
In this example, Faker is used to randomly sample realistic values for age, sex, and diagnosis, and a trivial rule then sets the outcome. We then train a Random Forest on the synthetic set. Of course, real pipelines would use actual generative models (for example, SDV’s CTGAN or the NHS’s SynthVAE) trained on the full real dataset, and the validation step would compute metrics to ensure the synthetic sample is useful. But even this toy code shows the flow: real data → synthetic data → AI model training. One could plug in any ML model at the end (e.g. logistic regression, neural net) and the rest of the code would be unchanged, because the synthetic data “look like” the real data for modelling purposes.
NHS initiatives and pilots
Several NHS and UK-wide initiatives are already moving in this direction. NHS England’s Artificial Data Pilot provides synthetic versions of HES (hospital statistics) data for approved users. These datasets share the structure and fields of real data (e.g. age, episode dates, ICD codes) but contain no actual patient records. The service even publishes the code used to generate the data: first a “metadata scraper” aggregates anonymised summary statistics, then a generator samples from those aggregates to build full records. By design, the artificial data are fully “fictitious” under GDPR and can be shared widely for testing pipelines, teaching, and initial tool development. For example, a new analyst can use the HES artificial sample to explore data fields and write queries before ever requesting the real HES dataset. This has already reduced the bottleneck for some analytics teams and will be expanded as the pilot progresses.
The NHS AI Lab and its Skunkworks team have also published work on synthetic data. Their open-source SynthVAE pipeline (described above) is available as sample code, and they emphasise a robust end-to-end workflow: ingest, model training, data generation, and output checking. They use Kedro to orchestrate the pipeline steps, so that a user can run one command and go from raw input data to evaluated synthetic output. This approach is intended to be reusable by any trust or R&D team: by following the same pattern, analysts could train a local SynthVAE on their own (de-identified) data and validate the result.
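The one-command, end-to-end pattern can be mimicked in plain Python without Kedro (Kedro adds data catalogs, nodes, and a CLI on top of this idea). The step names and toy stage functions below are hypothetical stand-ins, not the AI Lab's actual pipeline code.

```python
def run_pipeline(steps, payload=None):
    """Run each named step in order, threading each output into the next step."""
    for name, fn in steps:
        payload = fn(payload)
        print(f"completed step: {name}")
    return payload

# Hypothetical stand-ins for the real ingest/train/generate/validate stages.
def ingest(_):
    return [1, 2, 3, 4]  # de-identified input rows

def train(rows):
    return {"mean": sum(rows) / len(rows)}  # trivially "trained" model

def generate(model):
    return [model["mean"]] * 5  # synthetic rows sampled from the "model"

def validate(synth):
    return {"rows": synth, "ok": len(synth) > 0}  # output checking

result = run_pipeline([("ingest", ingest), ("train", train),
                       ("generate", generate), ("validate", validate)])
print(result)
```

The value of the pattern is that each stage has a single input and output, so any stage (say, swapping the toy generator for a trained SynthVAE) can be replaced without touching the rest of the pipeline.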
On the infrastructure side, the NHS Federated Data Platform (FDP) is being built to enable system-wide analytics. In its procurement documents, bidders are provided with synthetic health datasets covering multiple Integrated Care Systems, specifically for validating their federated solution. This shows that FDP plans to leverage synthetic data both for testing and potentially for safe analytics. Similarly, Health Data Research UK (HDR UK) has convened workshops and a special interest group on synthetic data. HDR UK notes that synthetic datasets can “speed up access to UK healthcare datasets” by letting researchers prototype queries and models before applying for the real data. They even envision a national synthetic cohort hosted on the Health Data Gateway for benchmarking and training.
Finally, governance bodies are developing frameworks for this. NHS guidance reminds us that synthetic data without real records is outside personal data law, but the generation process is regulated like anonymisation. Ongoing projects (for example in digital regulation case studies) are examining how to test synthetic model privacy (e.g. membership inference attacks on generators) and how to communicate synthetic uses to the public. In short, there is growing convergence: technology pilots from NHS Digital and AI Lab, national strategies (NHS Long Term Plan, AI strategy) promoting safe data innovation, and research consortia (HDR UK, UKRI) exploring synthetic solutions.
Conclusion
In summary, synthetic data lakes offer a practical solution to a hard problem in the NHS: enabling large-scale AI model development while fully preserving patient privacy. The architecture is straightforward in concept: use cloud data lakes and compute to ingest NHS data, run de-identification and synthetic generation in a secure zone, and publish only synthetic outputs for broader use. We already have all the pieces: generative modelling methods (VAEs, GANs, probabilistic samplers), cloud platforms for elastic compute and storage, synthetic-data toolkits for evaluation, and UK initiatives that encourage experimentation. The remaining task is integrating these into NHS workflows and governance.
By building standardised pipelines and validation checks, we can trust synthetic datasets to be “fit for purpose” while carrying no identifying information. This will let NHS data scientists and clinicians iterate quickly: they can prototype on synthetic twins of NHS records, then refine models on minimal real data. Already, NHS pilots show that sharing synthetic HES and using generative models (like SynthVAE) is feasible. Looking ahead, I expect more AI tools in the NHS will be developed and tested first on synthetic lakes. In doing so, we can unlock the full potential of NHS data for research and innovation, without compromising the confidentiality of patients’ records.
Sources: This discussion is informed by NHS England and NHS Digital publications, recent UK healthcare AI research, and industry perspectives. Key references include the NHS AI Lab’s synthetic data pipeline case study, NHS Artificial Data pilot documentation, HDR UK synthetic data reports, and recent papers on synthetic health data. All cited materials are UK-based and relevant to NHS data strategy and AI development.