The GDPR for data teams

April 4, 2018

The General Data Protection Regulation (GDPR) is a regulation in EU law on data protection and privacy for all individuals within the European Union. It aims to give control back to citizens over their personal data and to simplify the regulatory environment. It is enforceable from 25 May 2018.

I worked on the GDPR as part of my job at Dataiku. Here I share some general guidelines and good practices on how data teams (data scientists and data analysts) can work with it.

Caution: this is basically a tl;dr post, so it is deliberately incomplete. Also, I am not a lawyer or a legal expert, so it may contain inaccuracies.

To whom does the GDPR apply?

The regulation applies if the organization collects, directly or on behalf of other companies, personal data from EU residents. The regulation also applies to organizations based outside the EU if they collect or process personal data of individuals located inside the EU.

Definition of personal data

‘Personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person. - Extract of Article 4

Personal data is any information relating to an individual, whether it relates to his or her private, professional or public life. It can be anything from a name, a home address, a photo, an email address, bank details, posts on social networking websites, medical information, or a computer’s IP address.

The GDPR’s definition of personal data is very general and includes many kinds of information which may seem non-personal at first sight. For example, a rare first name can be enough to identify someone, so it could be personal data.

The GDPR has “Special Categories” for sensitive data (racial or ethnic origin, political opinion or affiliation, genetic or biometric data, health data, sex life or sexual orientation, etc.) with additional obligations, such as the requirement of explicit consent.

Main principles of the GDPR

The GDPR comes with a number of data protection principles which drive compliance (set out in Article 5).

Lawfulness, fairness and transparency

Personal data shall be processed lawfully, fairly and in a transparent manner.

Purpose limitation: an important one

Personal data shall be collected for specified, explicit and legitimate purposes and not further processed in a manner that is incompatible with those purposes.

Data minimisation

Personal data shall be adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed.

Accuracy

Personal data shall be accurate and, where necessary, kept up to date.

Storage limitation

Personal data shall be kept in a form which permits identification of data subjects for no longer than is necessary for the purposes for which the personal data are processed.

Integrity and confidentiality

Personal data shall be processed in a manner that ensures appropriate security of the personal data, including protection against unauthorized or unlawful processing and against accidental loss, destruction or damage, using appropriate technical or organizational measures.

Accountability

The controller (roughly, the company that determines the purposes and means of the data processing) shall be responsible for, and be able to demonstrate compliance with, the GDPR.

Lawfulness of processing (article 6)

Before processing any personal data, you need to determine the purpose of the processing and make sure it has a lawful basis. One lawful basis is the consent of the individual (via a check-box in a form, for example). Otherwise, the processing must rely on one of the other bases: a legal contract (including privacy policies), compliance with a legal obligation, legitimate interest, vital interest of the individual, or public interest.

This article from Postmark gives accessible additional information about privacy policy and explicit consent.

Data protection by design and by default

Article 25 sets out requirements for data protection by design and by default. Here is an extract:

Taking into account the state of the art, the cost of implementation and the nature (...) of processing (...), the controller shall, both at the time of the determination of the means for processing and at the time of the processing itself, implement appropriate technical and organisational measures, such as pseudonymisation, which are designed to implement data-protection principles, such as data minimisation, in an effective manner and to integrate the necessary safeguards into the processing in order to meet the requirements of this Regulation and protect the rights of data subjects. The controller shall implement appropriate technical and organisational measures for ensuring that, by default, only personal data which are necessary for each specific purpose of the processing are processed. That obligation applies to the amount of personal data collected, the extent of their processing, the period of their storage and their accessibility. (...)

In short, you should protect personal data (with pseudonymisation, for example) and minimize the personal data used for a given purpose. Minimize in terms of:

amount: only necessary for the given purpose,
extent: only relevant data,
period: there is a retention duration policy,
accessibility: data is accessible to a limited number of people.
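As an illustration of the amount, extent and period criteria above, here is a minimal Python sketch. The records, the field names, the purpose (counting customers per country) and the two-year retention period are all assumptions made up for the example; they do not come from the GDPR.

```python
from datetime import date, timedelta

# Hypothetical customer records; field names are illustrative only.
records = [
    {"email": "a@example.com", "country": "FR", "birth_date": "1980-02-01",
     "collected_on": date(2018, 1, 10)},
    {"email": "b@example.com", "country": "DE", "birth_date": "1975-07-23",
     "collected_on": date(2015, 3, 5)},
]

# Assumed purpose: counting customers per country, so only "country" is needed.
NEEDED_FIELDS = {"country"}
# Assumed retention policy of two years.
RETENTION = timedelta(days=730)


def minimise(rows, needed_fields, retention, today):
    """Keep only rows within the retention period, and only the needed fields."""
    kept = []
    for row in rows:
        if today - row["collected_on"] > retention:
            continue  # past retention: the row should not be processed
        kept.append({k: v for k, v in row.items() if k in needed_fields})
    return kept


minimised = minimise(records, NEEDED_FIELDS, RETENTION, today=date(2018, 4, 4))
print(minimised)
```

In a real project the expired rows would also have to be deleted at the source, not just skipped in the analysis.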

Pseudonymisation

Pseudonymisation is when “personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately (...)” (extract of article 4).

Encryption with a secret key is one example of a pseudonymisation technique.
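A minimal sketch of the idea in Python follows. Note that it uses a keyed hash (HMAC-SHA256) instead of encryption, only because that is available in the standard library; the function name `pseudonymise` and the normalisation step are assumptions for the example. The secret key plays the role of the "additional information" and must be kept separately from the pseudonymised data.

```python
import hashlib
import hmac
import secrets

# The secret key is the "additional information": store it separately
# from the pseudonymised dataset (in a key vault, for example).
SECRET_KEY = secrets.token_bytes(32)


def pseudonymise(identifier: str, key: bytes) -> str:
    """Return a stable pseudonym for an identifier using HMAC-SHA256."""
    # Assumed normalisation so that the same email always maps to the same token.
    normalised = identifier.strip().lower().encode("utf-8")
    return hmac.new(key, normalised, hashlib.sha256).hexdigest()


token = pseudonymise("Jane.Doe@example.com", SECRET_KEY)
same = pseudonymise("jane.doe@example.com ", SECRET_KEY)
other_key = pseudonymise("jane.doe@example.com", secrets.token_bytes(32))

print(token == same)       # same person, same key: same pseudonym
print(token == other_key)  # different key: different pseudonym
```

Unlike encryption, a keyed hash cannot be decrypted, but whoever holds the key can recompute the pseudonym of a known identifier and link it back to a person. The output is therefore still personal data.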

Anonymisation

Whereas pseudonymised data is still considered personal data, the GDPR does not apply to anonymous data (see recital 26). This means, among other things, that you do not need to get the consent of individuals when working on anonymous data.

However, transforming personal data into truly anonymous data can be very difficult. Remember the Netflix Prize case, in which researchers re-identified users in a supposedly anonymised dataset of movie ratings. There is no ideal technique.

Some techniques:

Hash function
Aggregation, K-anonymity (~categorization)
L-diversity
Differential privacy

The optimal solution should be decided on a case-by-case basis, possibly by using a combination of different techniques. This resource written by an independent European advisor is interesting to learn more about the above techniques.
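To make k-anonymity and l-diversity more tangible, here is a small Python sketch that measures both on a toy dataset, before and after generalising the quasi-identifiers. The rows, the column names and the generalisation rules (10-year age bands, 2-digit postcode prefix) are assumptions chosen for the example. It is a way to check a dataset, not a guarantee of anonymity.

```python
from collections import Counter, defaultdict


def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


def l_diversity(rows, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values found within any group."""
    groups = defaultdict(set)
    for row in rows:
        groups[tuple(row[q] for q in quasi_identifiers)].add(row[sensitive])
    return min(len(values) for values in groups.values()) if groups else 0


def generalise_age(age, width=10):
    """Replace an exact age with a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


# Hypothetical rows: "age" and "postcode" are quasi-identifiers, "plan" is sensitive.
rows = [
    {"age": 31, "postcode": "75011", "plan": "basic"},
    {"age": 34, "postcode": "75011", "plan": "premium"},
    {"age": 38, "postcode": "75012", "plan": "basic"},
    {"age": 52, "postcode": "69001", "plan": "premium"},
    {"age": 57, "postcode": "69003", "plan": "basic"},
]

raw_k = k_anonymity(rows, ["age", "postcode"])

generalised = [
    {"age": generalise_age(r["age"]),
     "postcode": r["postcode"][:2] + "***",
     "plan": r["plan"]}
    for r in rows
]
generalised_k = k_anonymity(generalised, ["age", "postcode"])
generalised_l = l_diversity(generalised, ["age", "postcode"], "plan")

print(raw_k, generalised_k, generalised_l)
```

On the raw rows every person is unique (k = 1). After generalisation each person shares their quasi-identifiers with at least one other person, and each group contains at least two different values of the sensitive column.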

A more concrete example: an enterprise collects information about its customers, with their consent, for communication purposes (emailing, etc.). The data can be anonymised (with aggregation techniques, for example). The anonymised dataset can then be given to a data analyst, who will be able to run some general analytics on it (the number of customers per country, for example).
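The aggregation step of this example could look like the following Python sketch. The customer rows and the `min_count` threshold are assumptions for illustration: suppressing very small groups is a common precaution, since a count of one can point to a single person, but the right threshold depends on the case.

```python
from collections import Counter

# Hypothetical customer table; only the country is used for the aggregate.
customers = [
    {"email": "a@example.com", "country": "FR"},
    {"email": "b@example.com", "country": "FR"},
    {"email": "c@example.com", "country": "FR"},
    {"email": "d@example.com", "country": "FR"},
    {"email": "e@example.com", "country": "DE"},
    {"email": "f@example.com", "country": "DE"},
    {"email": "g@example.com", "country": "DE"},
    {"email": "h@example.com", "country": "LU"},
]


def customers_per_country(rows, min_count=3):
    """Count customers per country and drop groups smaller than min_count."""
    counts = Counter(row["country"] for row in rows)
    return {country: n for country, n in counts.items() if n >= min_count}


report = customers_per_country(customers)
print(report)
```

Only the aggregated report, with no emails or other identifiers, is handed to the analyst.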

Rights of individuals

In addition to the principles of the GDPR mentioned above, individuals have a number of rights regarding their personal data:

Right to be informed (art 12, 13, 14)
Right of access (art 15)
Right to object (art 21)
Right to data portability (art 20)
Right to rectification (art 16)
Right to erasure (art 17)
Right to restriction of processing (art 18)
Right to withdraw consent (art 7)
Right not to be subject to a decision based solely on automated processing (art 22)

Some of these rights overlap. The most challenging one from a data professional's point of view is the right to erasure. A clear process should be decided within your team and organization.
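As a starting point for such a process, here is a minimal Python sketch of an erasure routine. It assumes, for the sake of the example, that every dataset is an in-memory list of rows keyed by a hypothetical `customer_id` field. Real systems would also need to cover databases, backups and derived datasets.

```python
def erase_subject(datasets, subject_id, id_field="customer_id"):
    """Remove all rows of one data subject from every dataset.

    Returns the number of rows removed per dataset, which can be kept
    as evidence that the request was handled.
    """
    report = {}
    for name, rows in datasets.items():
        before = len(rows)
        # Modify the list in place so that other references see the deletion.
        rows[:] = [row for row in rows if row.get(id_field) != subject_id]
        report[name] = before - len(rows)
    return report


# Hypothetical datasets spread over several projects.
datasets = {
    "crm": [{"customer_id": 1, "name": "Jane"}, {"customer_id": 2, "name": "John"}],
    "newsletter": [{"customer_id": 1}, {"customer_id": 3}],
    "web_logs": [{"customer_id": 2, "page": "/home"}],
}

erasure_report = erase_subject(datasets, subject_id=1)
print(erasure_report)
```

The key design point is that the routine walks over a single inventory of datasets, which is only possible if every dataset containing personal data has been documented.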

Some guidelines for a data project

Document every dataset containing personal data: consent, purpose, retention policy
Split your data into several projects (one project per purpose) and limit access
Pseudonymize/encrypt as much as possible
Anonymisation is hard, but anonymous data falls outside the scope of the GDPR
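The first two guidelines can be sketched as a small dataset registry in Python. The class `DatasetRecord`, its fields and the role names are hypothetical; the point is only that purpose, lawful basis, retention and access can be written down next to the data and checked by code.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class DatasetRecord:
    """Documentation attached to one dataset containing personal data."""
    name: str
    purpose: str
    lawful_basis: str          # "consent", for example
    retention: timedelta       # how long the data may be kept
    collected_on: date
    allowed_roles: frozenset   # who may access the dataset

    def is_expired(self, today: date) -> bool:
        """True once the retention period is over."""
        return today > self.collected_on + self.retention

    def can_access(self, role: str, purpose: str) -> bool:
        """Allow access only to listed roles, and only for the documented purpose."""
        return role in self.allowed_roles and purpose == self.purpose


record = DatasetRecord(
    name="newsletter_subscribers",
    purpose="email communication",
    lawful_basis="consent",
    retention=timedelta(days=365),
    collected_on=date(2018, 1, 1),
    allowed_roles=frozenset({"marketing_analyst"}),
)

print(record.is_expired(date(2018, 4, 4)))
print(record.can_access("marketing_analyst", "email communication"))
print(record.can_access("marketing_analyst", "churn modelling"))
```

Reusing the same dataset for a new purpose then requires a new record, which is a natural moment to check whether the new purpose is compatible with the original one.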

Also, I recommend the following:

Read articles 1 to 7, 15, 17, and 25 of the GDPR
Review your company's data audit
Review processes for data extraction and deletion

A talk by Dataiku with more details will be given at PAIOS.io on April 5, 2018.