Wednesday, April 20, 2016

FHIR - Input Validation

Updated: Vadim Peretokin advises on the FHIR chat: "You're better off in the world if you know about this stuff though." The material he points to lists some XML-related vulnerabilities and is pretty easy to learn from.
It has happened again. This time Michael Lawley reported that the HAPI reference implementation was susceptible to XXE attack -- Grahame's email to the FHIR list:
Yesterday, Michael Lawley reported that the HAPI reference implementation had a security flaw in that it was susceptible to the XXE attack. Those of you interested in details about XXE can see here:
The various XML parsers in the various reference implementations are variably affected by this; we are releasing patches for them now.

Specifically, with regard to the java reference implementation, it has always ignored DTD definitions, so is immune. Any newly released versions will change to stop ignoring DTD definitions, and instead report an error.

The current validator is susceptible to the attack; I am still investigating older versions, and will advise. Once I've done that, I'll check the pascal reference implementation.

Other reference implementers can advise with regard to HAPI, the DotNet reference implementation, and the various other RIs (swift, javascript, python...)
Note that this is an XML issue - your parsers have to be correctly configured. So this is equally likely to be an issue for anyone processing CDA, and even anyone using v2.xml
With regard to the FHIR spec, since the standard recommended mitigation is to turn off DTD processing altogether, I've created a task that proposes making the appearance of DTDs in the instance illegal (#9842)
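As a sketch of that mitigation (in Python, with an invented function name), a receiver can simply refuse any instance that carries a DTD at all, before it ever reaches the XML parser:

```python
import re
import xml.etree.ElementTree as ET

# Illustrative pre-check implementing the proposed rule that DTDs are
# illegal in instances: reject rather than risk the parser following them.
DOCTYPE = re.compile(r"<!DOCTYPE", re.IGNORECASE)

def parse_untrusted_xml(text: str) -> ET.Element:
    if DOCTYPE.search(text):
        raise ValueError("DTDs are not permitted in instances")
    return ET.fromstring(text)
```

Whatever parser you use, the point is the same: verify its entity- and DTD-handling settings rather than trusting the defaults.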

This issue is not unlike the embedded SQL Injection that Josh found two years ago (almost to the day), for which I decided Josh needed recognition and gave him my Murky Research Award. After that we updated the FHIR specification with a section on being robust against narrative sections. We likely need to update this section to cover Input Validation more broadly, with SQL injection and now XXE as examples.

There has been some 'discussion' following this where people want to put forward that this XXE example is further proof that XML is inferior to JSON. They should note that the embedded SQL injection problem exists for XML, JSON, or any other encoding format. There are sure to be JSON-specific issues.

Input Validation

The solution to both of them is the same mantra from the CyberSecurity community – Input Validation. (Note this is the same answer that the Safety community (e.g. FDA) will give you.) You must inspect any input you receive from elsewhere, no matter how much you trust the sender. This even applies to receiving data from your own system's components (e.g. reading an object from persistent storage, even in the case where you wrote it there). All CyberSecurity frameworks (e.g. NIST, OWASP, ISO 27000, Common Criteria, etc.) have a specific section on Input Validation.

Input Validation is really nothing more than a specific side of Postel's Law – be conservative in what you send, liberal in what you accept. It is the 'liberal' part that is the focus here. In order to be liberal, you should expect wide variation in what the other guy is going to send you, including simple garbage and carefully crafted malicious attacks. Both are possible, and although Hanlon's razor would have you attribute the bad input to stupidity, it still must be defended against.

Input Validation means you need to do some extra homework. Much of it is already done by the FHIR specification, but further 'profiling' is often needed. Where FHIR Profiling is defined, it is just as valuable for Input Validation as it is for use-case clarification. But FHIR-based Profiling is not enough. It doesn't cover things like:
1. String length boundaries
2. String character encoding restrictions
3. Permitted characters vs not permitted characters
4. Element range expectations

What you want is to understand well what the data SHOULD be. An approach that looks only for BAD data will be fragile. There is an infinite set of bad data, so any approach that specifically codes to detect bad data will only be good until tomorrow, when some hacker has identified a new kind of bad data.
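A minimal sketch of that allow-list approach, describing what good data looks like for a single field; the length limit and character pattern here are invented for illustration, not FHIR rules:

```python
import re

# Hypothetical allow-list rules for a human-name string field.
NAME_MAX_LEN = 64
NAME_ALLOWED = re.compile(r"^[A-Za-z][A-Za-z .'\-]*$")

def is_valid_name(value: str) -> bool:
    # Anything that does not match the description of GOOD data is
    # rejected; no enumeration of bad inputs is needed.
    return 0 < len(value) <= NAME_MAX_LEN and bool(NAME_ALLOWED.match(value))
```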

The Input Validation sub-system often can't reject a transaction, but it can neutralize data that is not good. It can eliminate that data, it can translate the characters, it can encapsulate them, it can tag the bad data, etc.
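One hypothetical way to neutralize rather than reject – translating markup-significant characters so they become inert – could be as simple as:

```python
import html

def neutralize(value: str) -> str:
    # Rather than rejecting the whole transaction, render suspect
    # characters inert. Escaping is one illustrative translation;
    # eliminating, encapsulating, or tagging the data are others.
    return html.escape(value)
```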

The main difference between XML and JSON is that the tooling for XML is likely to be more generous, as with the DTD problem: the default behavior of XML tooling is to follow DTDs, because a typical beginner's programming project likely wants that. However you must look carefully at your tooling for Input Validation – Robustness – settings.

Performance vs Robustness

Many will balk at the Input Validation need, saying that doing tight input validation – while being liberal – will cause their interface to be too slow. I agree, it is likely to do that. This is where a mature product will be intelligent. It will start out communications with a new sender in a very defensive mode; as it gains experience with that sender it can relax some of the Input Validation. Note that this is only possible when you have strong Authentication of the sender, so that you can be sure it is indeed that sender sending you data, and that no entity can be injecting content. Never would all input validation be eliminated; you must always expect that the sending system could get compromised and thus start sending you garbage that it never sent before. Thus the really mature systems have a sliding scale of robustness, backed by historic patterns from that sender, and tested occasionally. Static rules are no better than never having Input Validation rules.
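The sliding scale could be sketched like this; the threshold and mode names are invented for illustration:

```python
# Validation stays strict for new or misbehaving senders, relaxes only
# after a clean history, and is never eliminated entirely.
class SenderProfile:
    def __init__(self):
        self.clean = 0  # consecutive messages that passed full validation

    def record(self, passed_full_validation: bool):
        # Any failure resets trust back to zero.
        self.clean = self.clean + 1 if passed_full_validation else 0

    def validation_mode(self) -> str:
        if self.clean < 100:
            return "full"     # defensive mode for new/suspect senders
        return "sampled"      # spot-check an established sender; never "none"
```

Even in "sampled" mode, occasional full validation keeps the historic pattern honest.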

References to various Security Framework guidance – this is not new to the CyberSecurity community

Postscript from Rob Horn

Rob wrote this fine email at the same time I wrote mine. His perspective is very complementary so I asked if I could add it to my article. He agreed.
The problem is not XML per se. The problem is present for any approach that requires a public facing tool. XML is impenetrable without extensive tooling, so it is indirectly responsible. But any and all public facing tools are a risk.

We are not in the golden days of idyllic safety on the Internet.

Healthcare is under direct intelligent attack by malicious actors. All tools are under attack. There is no exception for "it's just educational", or "it's just for standards", or "there's nothing of value to steal". These are not pimply faced dweebs living in their parents' basements. These are teams of organized and skilled experts, supported by large bodies of helpers. They include organized crime, hostile nations, etc.

It's good practice to treat all public facing tools with the same care that you give to the tools for patient access, operational use, etc. It's going to become necessary as the attack intensity escalates. We're in the business of providing this kind of product for our customers, so we should all have the skills and ability to maintain this level of protection and quality. If you can't do it, you shouldn't be in this industry. It's more work than we might like. But bad habits spread and the attackers are increasingly working to find twisty trails through secondary and tertiary access points. Penetrating HL7 and HL7 members is a great way to indirectly penetrate the rest of healthcare.

Most of the present active attacks are only described under non-disclosure. But the publicly disclosed attack by Iran on an obscure little dam in New York state indicates the extent of attacks. This little dam was about as harmless as they get. You could blow it up and the worst that would happen is some wet basements. It didn't generate electricity. All it did was maintain the steady flow of a little river. So why did Iran take over the industrial control system for that dam?

My guess is a combination of practice for operators and intrusion normalization. As a practice target it was great. Nobody would notice the penetration. Nobody would get hurt. This is good for advanced training practice. Normalization is something that I worry about regularly for audit protections. A lot of current audit analysis looks for the abnormal. If penetration indications can be made normal then looking for the abnormal becomes less effective. Intelligent attackers know and understand the defensive methods and do take actions to make them less effective. The kid in a basement might not think this way. The professionals certainly do.

Kind Regards,
Robert Horn | Agfa HealthCare
Interoperability Architect | HE/Technology Office

Monday, April 18, 2016

Consent given to authorized representative

I asked for use-cases on the FHIR chat so that I could model them. This effort to model use-cases is useful as it helps test the theory with reality. So the next one I got was from Andrew Torres who asks about an authorized representative.
Have you thought about the authorized representative use-case? I think we spoke about this briefly at the WGM in January. Patient facing applications will need the ability to understand who is accessing data, and the app will need to understand what data that person can see. An example use-case is a child going to a children's hospital doesn't grant consent. The parent will be his authorized representative to view patient health information. So in a patient portal or an application the parent would be able to view data because they are authorized to do so. The authorization, or contract as we have modeled it in FHIR, would expire, while the relationship will always exist.

There are a couple of things worthy of modeling. There are also some things that I will stay away from so as to not step into a space beyond what the Privacy Consent Directive is intended to solve.

First I am going to move this from a child/parent relationship up to the more general case of ‘authorized representative’. In doing this I want to distance this use-case from the difficult topic of how one gains authorized ‘consent’ from someone who is potentially not legally able to give consent. So all I want to do is focus on what the consent would look like, and ignore how we got to that point. The process of getting to that point varies by region and individual circumstances, and is not important to the model.

Second, I am only going to focus on giving this authorized representative the ability to access the data about the patient. Meaning I am not going to address other authorizations, such as power-of-attorney or alternative-decision-maker.

So what would a Privacy Consent Directive – Contract – look like that shows that consent was given to a specified authorized representative? Note that in this case the representative does not gain more rights than the patient would have. In my case I am going to indicate that the representative should be limited to only view/read. Meaning that where a Patient has a “Privacy Right” to amendment or correction; in this case the representative is not going to be granted those rights.

The Basics

These elements are not too special. Note that I am back to using the LOINC code 57016-8, as I know how this is used. I am not clear on how 64292-6 is used, and it also has some “Method” constraint that I don’t understand. Maybe someone can explain these to me so that I can better represent these codes.  I know that some don't like that 57016-8 has the name "Privacy policy acknowledgment". This name was chosen by IHE when we asked for this code of LOINC. The name was chosen to cover the broadest possible use-cases. At the time the word "consent" was considered too limiting. The description tries to make this clear "A document showing patient acknowledgement/consent/dissent with respect to the privacy policies of an organization. This document is specific to an individual patient."

Andrew also indicated that this authorization should expire, so I will put in an expiration date. We really haven’t fully modeled expiration, but we do have an ‘applies’ element that can indicate the timeframe in which the Contract applies. After that timeframe it is implied that the Contract is no longer applicable.
  • Contract.identifier --- everything needs unique identifiers
  • Contract.issued --- date and time that the consent was captured
  • Contract.applies --- date range that this consent is valid. Often a start date is indicated, sometimes an end date 
  • Contract.subject --- pointer to the Patient resource. This sets the context of the consent, it is ‘about’ this patient.
  • Contract.type – {"system": "", "code": "57016-8"} --- This is a Privacy Consent Directive
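Put together as a hypothetical FHIR JSON fragment (the identifier value, dates, and patient reference are invented; the LOINC system URI is my assumption, since the system was left blank above):

```python
# Sketch of the basic Contract elements for this consent.
contract = {
    "resourceType": "Contract",
    "identifier": {"value": "consent-example-001"},          # invented
    "issued": "2016-04-18",                                   # invented
    "applies": {"start": "2016-04-18", "end": "2017-04-18"},  # invented
    "subject": [{"reference": "Patient/example"}],
    "type": {"coding": [{"system": "http://loinc.org", "code": "57016-8"}]},
}
```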

The top level instruction:

The subType needs to indicate that this is an authorization applying only to the one individual. I don’t find a code specific to that, but I think that OPTIN will be okay for now.

The authorized actions:

So, this authorized representative should only be able to read data, not create or update. And they would be using a PurposeOfUse indicating that the Patient Requested (PATRQT). Since this is all that is being authorized, I am modeling these at the root level of the Contract. I could have left these blank and modeled these in a Contract.term. I am not sure why one would do one over the other, however having both options seems like it is going to create a problem.
  • Contract.action – {"system": "", "code": "read"} – allow all read operations
  • Contract.actionReason – {"system": "", "code": "PATRQT"} – allowed for patient requested purpose
Seems we need a broader term than RESTful 'read'. There is some work ongoing on designing some higher level vocabulary that could group all of these read operations including query.

The authorized representative:

The person being given the access rights of the patient – the person who has been granted the authority – is considered the “Grantor”. The problem we have is that although today the model for Contract.agent seems to be the place where we could identify this party, there is not yet a vocabulary. So I will steal from the Contract.signature vocabulary…
  • Contract.agent.actor --- the pointer to a RelatedPerson resource describing the individual being given authorization
  • Contract.agent.type – {"system": "", "code": "GRANTOR"}
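The remaining pieces could look like this sketch; element names follow the text above, system URIs are omitted because they were left blank, and the RelatedPerson reference is invented:

```python
# Sketch of the authorized-representative-specific Contract elements:
# the permitted action, the reason, and the representative as an agent.
authorized_rep = {
    "action": [{"coding": [{"code": "read"}]}],        # view/read only
    "actionReason": [{"coding": [{"code": "PATRQT"}]}],  # patient requested
    "agent": [{
        "actor": {"reference": "RelatedPerson/parent-example"},  # invented
        "type": [{"coding": [{"code": "GRANTOR"}]}],
    }],
}
```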

Other considerations

In the use-case given, this authorization does not state anything about an organization or location. So I would expect to find Contract.authority and Contract.domain empty. I would expect empty to mean unconstrained. Thus this Consent should be considered not specifically endorsed by an organization or location, but also not restricted to an organization or location. Clearly if there is an organization or location scope, then these could be specified. I am not sure if this is the right interpretation.

Variation – consent to an organization with exception of an individual at that organization

There is already an example in the Privacy Consent Directive IG for the explicit opposite of this. One where a Patient is explicitly indicating an individual that shall never have access. One of the specific differences is that this is otherwise an authorization (consent) for an organization to have access/use/disclosure, with an exception to not allow access to a specific individual. This drives a different encoding.
Patient "P. van de Heuvel" ex-spouse, Bill T Lookafter is a Nurse that is employed by Good Health Hospital. P. realizes that she may receive treatment at GHH, however she does not want her ex-spouse to have any access to her IIHI. She indicates that she would like to withdraw/withhold consent to disclose any instance of her health information to her ex-spouse as a result of his employment at Good Health Clinic.


I am mostly using this as an exercise to test what we have today. So, this might not be where the Privacy Consent Directive and Contract end up. Please participate in the sub-workgroup meeting held Fridays.

HL7 v2 vs FHIR

I was in a discussion about HL7 v2 vs FHIR lately. The technical merits of FHIR were not being debated; they were understood and agreed. The only outstanding difference is network bandwidth efficiency and processing expense. There is really nothing one can do in FHIR that one can't do in HL7 v2. Yet the question persisted: what is the reason why FHIR will win over a more compact solution like HL7 v2?

Simply: The people that know HL7 v2 are retiring. 

At best there is a small fixed pool of people today that understand HL7 v2, while the need to enable Interoperability in Healthcare is expanding fast. The new developers come to the healthcare world ready to use the technologies that FHIR is based on. They can be productive right away, yet to re-train them on HL7 v2 is expensive and time consuming. The efficiency of the network bandwidth is not critical: networks get faster, CPUs get faster; the technology will overcome.

The productivity of new developers is critical.

Wednesday, April 13, 2016

Patient ID is critical to Enabling Privacy

A very short article this week really brings the problem of Patient Identity to a point. Specifically this:
Dr. Charles Jaffe, CEO of standards development organization Health Level Seven International, said Tuesday at the 13th annual World Health Care Congress in Washington that Kaiser Permanente Southern California had records of 10,000 people named Maria Gonzales. Ten thousand.
That is 10,000 opportunities for a FALSE match, aka a false-positive: a case where the data of the wrong person is used to treat someone. From a Medical Practice and Medical Safety perspective this scares me to no end!

But that is not my focus in this article. Privacy Enabling is.

I know that the people in Healthcare really want this problem resolved. However in the USA we are up against a prohibition on USA Government funding of a national patient ID. There were concerns that it would present a Privacy risk. I however think that by not having a national patient ID we have a much worse Privacy risk, as today we are forced to expose all the demographics that we know about a patient in the hope that a match can be made. That should be enough of a Privacy violation to change the attitude. With a strong identifier, we would need to communicate only that identifier (plus some other demographics for safety reasons).

But there are more Privacy violations: given that we don't have a solid identifier, we can't have solid Privacy Consent Directives. We can within a realm that has a solid identifier, but that breaks as soon as one moves out of that one controlled environment.

More Privacy violations: we can't give patients deterministic access to their own data, or control of their own data, or even an accounting of uses and disclosures of their data.

Privacy Principles would be enabled by a strong national patient identifier.

We are reverse engineering a national patient identifier by correlating poor-quality but highly sensitive demographics. We have made a central database of stuff that is very valuable to the black-market. I point out that our Patient Matching problem uses the same methods the black-market uses when they re-identify a de-identified dataset (Patient Matching as a Science). We have the worst of all worlds.

Note however, we will still have false-positives, false-negatives, and john-doe cases; but the problem shrinks significantly.

I covered this very topic back in 2012. Universal Health ID -- Enable Privacy. In this article I go much deeper into the Privacy ‘risk’ and the Privacy ‘solution’. We can’t have stalemate.

Patient Privacy is enabled when we have strongly assured Identifiers. We don't even need to invent a new system. We just need to use the identifiers that we have already. It would not hurt to have a new system of trustable opaque identifiers that support federation.

See my blog topics on:

Monday, April 11, 2016

Consent to grant read access to specific types of FHIR Resource

Grahame got this question on FHIR Consent, and forwarded it to me to answer.
Question: I am using the FHIR Contract resource to convey the patient consent for a provider to access specific FHIR resources (Ex: Observation, MedicationOrder, DiagnosticReport…). Which field in the Contract resource can be used to specify the list of consented FHIR resources?
The short answer is that today this is unclear as there are many ways to do it. This is a problem that I struggle with and intend to use this blog article to help narrow the solution space so as to make progress in the modeling. The Privacy Consent Directive (PCD) Implementation Guide is where the CBCC and Security workgroups are building the solution. We are making progress, but not as much as I would like. We tend to spend far too much time re-arranging the chairs, and too little time making solutions. I like the Question, as it gives a concrete thing to focus on.


I cover the background in electronic Privacy Consent -- Patient choice which speaks to more than just FHIR Consent.

The PCD implementation guide does include a use-case that is very similar to the one in this question. The specific use-case originally comes from our Canadian participants. It is the first use-case: to not disclose any lab-results. The unfortunate thing is that this logical use-case is very difficult to execute given the FHIR modeling based on “Resource” design. In the FHIR data-model design, Resources are defined where the various types of Resources are a logical grouping of similar data, or data needed to achieve a goal. It is not laid out like a Healthcare Clinic or Hospital is laid out, according to clinical specialty or department. Thus there are common structures like an “Order” or "Observation" that are used by all departments, and thus there really isn’t a type of data that is specific to the “Laboratory”. This problem is not part of the Question that was asked.

The Question that was asked is purely about using the FHIR Resource model; presuming there is not a problem with the dissonance with how people think about the data vs how FHIR chose to organize the FHIR Resources.

Overall need to encode

So overall we have all the usual stuff that is needed to record that a consent was captured from and applying to a specific patient, covering a specific set of organizations, for a specific timeframe, locations, etc.

What is unique about the Question is that they want to say that the consent is granted only for a set of FHIR Resource types (e.g. Observation, MedicationOrder, DiagnosticReport, etc…). So the exercise is to figure out where would one say that the consent is ONLY for these specific types of FHIR Resources.

The solution is to use Contract.term, which is a 0..* element where the specific terms of the consent can be itemized. I would indicate a Contract.term for each Resource type that is to be allowed access. This fits nicely into Contract.term.subType. What we don’t have is an obviously selected vocabulary to say: “Allow access to any data with the FHIR Resource type listed in subType”. So what I use below is the RESTful actions, thus allowing the ‘read’ action upon the type of resource for the purpose of treatment.

Contract - Basic of a Consent

In this case, since we want to identify specific rules that ALLOW access, we must start with the default deny rule.
  • Contract.identifier --- everything needs unique identifiers
  • Contract.issued --- date and time that the consent was captured
  • Contract.applies --- date range that this consent is valid. Often a start date is indicated, sometimes an end date 
  • Contract.subject – pointer to the Patient resource. This sets the context of the consent, it is ‘about’ this patient.
  • Contract.authority – what is the organization(s) that is covered by this consent. 
  • Contract.domain – what locations are covered
  • Contract.type – {"system": "", "code": "64292-6"} --- This is a Privacy Consent Directive
  • Contract.subType – {"system": "", "code": "OPTOUT"} --- Forbid access except as indicated in terms
There are other things one can include in the basics, but that is not the specific topic of this blog article.

Magic for this Question

We then can just list all the FHIR Resource types that we allow, and for what action we are allowing (treatment in this case).
  • Contract.term.type – {"system": "", "code": "read"} – allow all read operations
  • Contract.term.subType – {"system": "", "code": "Observation"}
  • Contract.term.action – {"system": "", "code": "TREAT"} – allowed for Treatment purpose
  • Contract.term.type – {"system": "", "code": "read"} – allow all read operations
  • Contract.term.subType – {"system": "", "code": "MedicationOrder"}
  • Contract.term.action – {"system": "", "code": "TREAT"} – allowed for Treatment purpose
  • Contract.term.type – {"system": "", "code": "read"} – allow all read operations
  • Contract.term.subType – {"system": "", "code": "DiagnosticReport"}
  • Contract.term.action – {"system": "", "code": "TREAT"} – allowed for Treatment purpose
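Since the three terms differ only in the Resource type, they can be generated. A sketch (system URIs omitted because they were left blank in the text; element shapes follow the Contract model as used above and may differ in later FHIR versions):

```python
# Build one Contract.term per allowed FHIR Resource type.
ALLOWED_TYPES = ["Observation", "MedicationOrder", "DiagnosticReport"]

terms = [
    {
        "type": {"coding": [{"code": "read"}]},          # allow read operations
        "subType": {"coding": [{"code": resource_type}]},  # the Resource type
        "action": [{"coding": [{"code": "TREAT"}]}],       # Treatment purpose
    }
    for resource_type in ALLOWED_TYPES
]
```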

Conclusion and Discussion
This is just my current view. Unfortunately this space is slowly evolving. I encourage participation by those that have specific problems, as specific problems can be used as a priority driving force.

Thursday, April 7, 2016

Patient Matching as a Science

A critical science in healthcare that has many dimensions and use-cases or misuse-cases.

De-Identification -- Break the binding:

I have been involved lately with a few De-Identification projects – to be complete: De-Identification, Anonymization, and Pseudonymization. The goal is to end up with a set of data that is useful for some research project, yet carries as low a Privacy risk as possible to the individuals the data is about.

These efforts go through great lengths to remove Direct Identifiers, those values that are publicly known to uniquely identify a single individual. For example a Driver’s License number, Passport number, Medical Records Number, Email Address, Personal Phone Number, etc.

These efforts then struggle with the Indirect Identifiers, also known as Quasi-Identifiers. These are values that are not unique to that individual, but do describe a narrow aspect about the individual. For example a birth day, gender, postal/zip code, etc. There is also the 'little' issue about free-text fields.

The struggle with De-Identification is that these Indirect Identifiers are often needed by the research project. Researchers very often need to know the gender, age, and region where individuals live. Thus these efforts often leave some residual risk.

The concern is that with some risk left in a de-identified dataset there is a possibility that someone who has legitimate (or illegitimate) access might try to re-identify the individuals and thus violate privacy. This is an ‘attack’ upon the de-identified dataset.

Patient Identity Matching -- Make the binding:

I have also been involved lately with a few Patient Matching projects, where the goal is to end up with a cross-reference between many different Patient Identifiers, that is, to identify when two different Patient Identifiers are actually about the same human. This is often referred to as De-Duplication, though you are not actually removing the duplication, just assertively acknowledging it.

These Patient Matching projects are most prevalent in the USA, where our government has forbidden funding to even discuss a national Patient Identity project. Thus in the USA, Patient Identity Matching is the only choice. This is not really true: the private sector could solve the problem, but the healthcare private sector is far too fragmented to work together on this… Kind of true, more to come on that… My view is that a good Patient Identifier enhances Privacy.

Binding Methodology:

I see these as two sides of the same coin. In the one case we are struggling to break any identification linkage, whereas in the other we are trying to use any fragment of truth to create linkages. The motivations are very different, the outcomes are very different; but the methods are very much the same.

Correlations between direct identifiers gives a positive match. Correlations between indirect identifiers gives evidence of a possible match. Each possible match has a strength based on that specific indirect identifier population characteristic (gender only gives a 50% confidence). Some threshold of ‘possible’ matches is considered sufficient to indicate an actual match. Any dissonance breaks any matches, or indicates dirty data. 
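A toy sketch of that weighted-evidence approach; the weights and threshold here are invented, where a real system would derive them from population statistics (a rare surname carries more weight than gender):

```python
# Hypothetical per-field weights for indirect (quasi-) identifiers.
WEIGHTS = {"gender": 1.0, "birth_date": 8.0, "postal_code": 5.0, "last_name": 6.0}
THRESHOLD = 12.0  # invented cut-off for declaring a probable match

def match_score(a: dict, b: dict) -> float:
    # Sum the weight of every field that is present and agrees.
    return sum(w for field, w in WEIGHTS.items()
               if a.get(field) and a.get(field) == b.get(field))

def is_probable_match(a: dict, b: dict) -> bool:
    return match_score(a, b) >= THRESHOLD
```

The same scoring machinery, run against a de-identified dataset and an identified one, is exactly the re-identification attack described above.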

Data is often sub-optimal, aka dirty. Dealing with False-Positives, and False-Negatives turns into more art than science. 

Risk... There is always risk, no matter how you slice it.

My other blog articles on these topics can be found at De-Identification, Anonymization, Pseudonymization, and Patient Identity.