A Blueprint for a Health Data Organization

Mar 1, 2021 · 4 min read · artificial intelligence biosecurity ·

This is some scratchpadding that happened after talking with a friend of mine, which I'm making public because why not? (Famous last words.) This is a fantasy-league document (think Etz, the fantasy AI lab) and shouldn't be taken seriously.

The future of healthcare relies on a few foundational technologies. These technologies are advanced through machine learning: algorithms that learn from data.

Data

Data useful for the future of healthcare include:

Genomic sequences and their relation to disease or other traits
Molecules and their drug-like properties, including toxicity, absorption, manufacturability, and bioactivity against targets of interest
The appearance of symptoms or known diseases in a human population, including the time and location in which they appear

Representations

The commonality between these datasets is that they are pairs or mappings. A genomic sequence by itself is not so useful in machine learning prediction tasks, which is typically what we want to do. Therefore we need a mapping from genome sequences to properties of interest.

A common-sense way to represent this data is as a graph. There are several reasons for this. Human DNA is highly similar across individuals, so instead of storing whole genome reads, we can store only the differences from a template. This takes far less storage, at the cost of a small amount of additional compute to reconstitute an original sample. Further, it enables a whole universe of analytics that can be run on the data without significant pre-processing.
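As a rough illustration of storing differences from a template, here is a minimal Python sketch. The reference string, the function names, and the restriction to single-base substitutions are all assumptions made for brevity; real genomes also contain insertions and deletions, which this toy version does not handle.

```python
from typing import Dict

# Hypothetical toy template; a real one would be a full reference genome.
REFERENCE = "ACGTACGTAC"


def encode_diff(sample: str, reference: str = REFERENCE) -> Dict[int, str]:
    """Keep only the positions where the sample differs from the template.

    Simplifying assumption: substitutions only, so lengths must match.
    """
    if len(sample) != len(reference):
        raise ValueError("this sketch handles substitutions only")
    return {i: s for i, (s, r) in enumerate(zip(sample, reference)) if s != r}


def reconstitute(diff: Dict[int, str], reference: str = REFERENCE) -> str:
    """Rebuild the original sample by applying the stored differences."""
    bases = list(reference)
    for position, base in diff.items():
        bases[position] = base
    return "".join(bases)


sample = "ACGTTCGTAA"
diff = encode_diff(sample)         # two entries instead of ten bases
restored = reconstitute(diff)      # the small extra compute step
```

The stored object grows with the number of differences rather than with the length of the genome, which is the storage-for-compute trade described above.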

We can directly add phenotypes to this. For example, we can create an edge from a specific part of the genome to a phenotype node, edges from a collection of parts, or even a probability mixture of parts. These kinds of data can be ingested directly by graph algorithms or graphical models. The data can also be transformed into tabular, structured data for other models. This kind of data structure is often called a knowledge graph.
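To make the graph idea a little more concrete, here is a small sketch using only plain Python data structures. The node names, the edge weights, and the helper functions are hypothetical. The weights on edges into the phenotype node stand in for a probability mixture, and `to_table` shows one possible way to flatten the graph into tabular rows.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[str, str, float]  # (source node, target node, weight)

# Hypothetical nodes and weights, for illustration only.
edges: List[Edge] = [
    ("sample:1", "variant:chr1:100:A>T", 1.0),
    ("sample:2", "variant:chr1:100:A>T", 1.0),
    ("sample:2", "variant:chr2:250:G>C", 1.0),
    # Two genome parts pointing at one phenotype, weighted as a mixture.
    ("variant:chr1:100:A>T", "phenotype:trait_x", 0.7),
    ("variant:chr2:250:G>C", "phenotype:trait_x", 0.3),
]


def adjacency(edge_list: List[Edge]) -> Dict[str, Dict[str, float]]:
    """Build a source -> {target: weight} adjacency map."""
    graph: Dict[str, Dict[str, float]] = defaultdict(dict)
    for source, target, weight in edge_list:
        graph[source][target] = weight
    return graph


def to_table(edge_list: List[Edge]) -> List[Dict[str, object]]:
    """Flatten sample -> variant edges into one row of 0/1 features per sample."""
    graph = adjacency(edge_list)
    variants = sorted({t for _, t, _ in edge_list if t.startswith("variant:")})
    samples = sorted(n for n in graph if n.startswith("sample:"))
    return [
        {"sample": s, **{v: int(v in graph[s]) for v in variants}}
        for s in samples
    ]


table = to_table(edges)
```

Each row of `table` could be fed to a conventional tabular model, while the edge list itself stays available to graph algorithms.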

Technology layer

The dataset will be very large and will require a lot of privacy protection even when anonymized. Researchers should bring their code to the data instead of the data going to the researchers' facilities. Therefore a technology stack must be created in-house for running analytics against the data.

Organization

The organization should be a government-chartered 501(c)(3). This will give it the flexibility to take grants from private foundations and an operating flexibility not available to a government agency.

To offset the costs of data and compute management, fees will be charged. These fees are necessary because compute costs are highly elastic. An organization that is funded by grants would not be able to predict these costs in advance. The fees are not intended to be matched to the actual value of the data.

Commercial usage is permitted for an additional nominal fee. Commercial usage is defined as the usage of machine learning or other statistical models generated from the data for a commercial purpose. For example, a startup using the data to create a diagnosis mobile app would be charged a Commercial Usage Fee. Small businesses (startups), defined as businesses without public stock and with fewer than 500 employees, can pay the Commercial Usage Fee with a SAFE (Simple Agreement for Future Equity) Note.

Access is granted in three-year increments. After three years, a new application must be submitted and new fees are charged.

Here is reference pricing:

Billable	Cost
Application Fee	$25
Minimum Usage Fee	$1,000
Commercial Usage Fee	$15,000
CPU Usage Fee	$0.15/CPU/Hour
GPU Usage Fee	$3/GPU/Hour

A nominal application fee is charged largely to reduce spam submissions. One billable CPU is a modern CPU core with 1 GB of RAM. For example, if you want to run an analysis on a virtual computer with 8 CPU cores, 8 GB of RAM, and an attached GPU for four hours: (0.15*8 + 3) * 4 = $16.80
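The same arithmetic can be written as a small Python helper. This is only a sketch built on the reference rates in the table; the function and constant names are made up here, and it does not model the application, minimum usage, or commercial usage fees. It also leaves open how RAM beyond 1 GB per core would be billed, since that is not specified.

```python
# Reference rates taken from the pricing table above.
CPU_RATE_PER_HOUR = 0.15  # dollars per CPU (one core with 1 GB of RAM) per hour
GPU_RATE_PER_HOUR = 3.00  # dollars per GPU per hour


def compute_cost(cpus: int, gpus: int, hours: float) -> float:
    """Return the usage charge in dollars for one job.

    Hypothetical helper: covers CPU and GPU usage fees only.
    """
    hourly = CPU_RATE_PER_HOUR * cpus + GPU_RATE_PER_HOUR * gpus
    return round(hourly * hours, 2)


# The worked example: 8 CPUs and 1 GPU for four hours.
example_cost = compute_cost(cpus=8, gpus=1, hours=4)
```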

Student Fellowship

A student fellowship should be created for graduate students. It will include lectures and trainings, like any similar fellowship, but it will also provide a free quota of data access and compute for three years. The cost of the Fellowship will be offset by the revenue generated from the commercial usage fees.