Google Fonts and what constitutes personal data

Yesterday I wrote a brief response to the judgment handed down by the Landgericht Muenchen on the legality of using Google Fonts under the GDPR. The issue is the finding that the use of Google Fonts on its own constitutes a data breach that needs to be fined (in this case €100, but going up to €250,000 for future contraventions). In this post – directed at the legal, regulatory and law-making community – I am making the case why this judgment is wrong. Or rather why it must be wrong, lest the Digital Europe Programme become a farce.

A short detour into the Cookies Directive

Before going into the details of this case and the GDPR I would like to lead with another European regulation in the digital space, the infamous Cookies Directive, also known as the Privacy and Electronic Communications Directive 2002/58/EC. It was a good idea at the time: privacy was starting to become a concern, people were (rightly) concerned about being tracked on the Internet, and the EU decided that informed consent was the way to go. So far so good.

Now, 20 years later, we see that this did not quite turn out the way it was intended. Privacy on the Internet is a greater concern than ever, and cookies are still one part of the story. So what happened? The Cookies Directive did not forbid the use of cookies – reasonably so – but required a consent banner on every site that uses cookies, so that people can decide. It turned out that (a) people are not really able to make this decision because it is not clear what the implications of accepting cookies are, and when those implications would manifest themselves. So people by and large just hit Accept, and this in turn led to (b) no one really having a choice: there was no competition between cookie-using and non-cookie-using sites that people could choose from. The only choices were, and are, (1) use the Internet and accept cookies, or (2) do not use the Internet.

Now this sounds like a bit of a nuisance for users and a bit of an incremental cost every time someone builds a website, but otherwise no harm done. This, in my view, is very far from the truth: the Cookies Directive in its current manifestation is actively detrimental to the security of the Internet, and should be scrapped. Why? Because it monkey trains all of us into very bad behaviour. Whenever we go onto a website we know there will be a banner and we know we have to press Accept because anything else will lead to eternal pain. So now whenever we see any pop-up on a website that says Accept or Allow we have a primal urge to just press the button so as not to prolong the pain. Now I am a reasonably sophisticated Internet user and spend way too much time online – and even I have repeatedly Allowed sites to send me notifications. In fact – even now I still have to stop myself pressing this button all the time. Now site notifications are not too bad and most of us will only spend 15 minutes of our lives that we’ll never get back trying to figure out how to turn them off (or an hour on the phone with our parents). But as more and more things happen over the Internet, the Allow boxes on the Internet become more and more important. And there are some edgy sites out there, so instead of accepting cookies you may well end up selling your 50 ETH Bored Ape Yacht Club ape for a song.

What is the key learning here? Privacy is complex, and a well-meant regulation may not work as well as intended and even have outright negative side effects. Also there is a natural life cycle for regulations, and the cookie banners are way beyond their useful life.

A short review of the judgment

Before I go further I want to quickly recall what the judgment was about. In this case a website operator was using Google Fonts. The operator was sued by a website visitor (not a paying user if I understand it correctly, just someone who went to the website) because the operator had transmitted the visitor's personal data to the US, which was not legal (at least not without consent and/or without a good reason) under GDPR and Schrems.

The Landgericht Muenchen accepted this, and the website operator was ordered to pay €100 to the claimant, with the amount increasing to up to €250,000, and even the threat of jail time, in case of future violations. The key arguments in the judgment were the following:

The IP address of a visitor is personal data; this is a key consequence of Schrems, and this is the first area where in my view the regulatory and legislative community must make changes; this classification does not hold up under a differentiated consideration of the context, and will do significant harm
By sending the visitor to fetch a resource (a font, in this case) the operator transmits private data to Google, and thereby outside the EU boundaries, without consent; there are multiple points to be made here, let’s start with the fact that it is remarkable that the judge did not even mention that it is not the website operator that sends the data to Google, but the visitor, or rather their browser
The hosting of the font at Google is not necessary because the font could be hosted locally; again a lot to say here but I would have expected a discussion of the benefits of hosting common resources in a way that (a) reduces web traffic overall, and (b) lowers the cost of hosting for medium to high traffic sites

What is wrong with the judgment?

Below I want to go through a number of key points of what was wrong with the judgment. Now to be clear, I am not a lawyer and the judgment may have been correct, even though my impression from reading it is that the judge would have benefited from a more thorough technical briefing. But if this judgment is correct, then the law on which it is based is not fit for purpose.

An IP address is not personal data!

An IP address can – but does not necessarily – identify an individual

The first point to make – and this is a key point – is that an IP address is not automatically personal data, for a number of reasons. Firstly, an IP address does not usually identify a person. In some cases it does, for example when using a mobile network where every device will have its unique IP address. However, in most cases an IP address identifies a group of people and, increasingly, their devices. I am only half joking when I say that nowadays it may well have been my TV, fridge or dishwasher that requested Google Fonts.

In a household setting, all members of the household will share a single IP address, so with the IP address alone it will not be possible to identify the person in question. However, within a household the number of people is usually very limited, so it is true that with a little additional information beyond the IP address a person may be identified. A corporate setting is similar – except that now 100s or even 1000s of people may share a single IP address. The judgment pointed out that it does not matter whether the recipient of the data can actually identify the person in question; it only matters whether they can be identified in principle. This may or may not be the case in a corporate context, depending on the logging and log retention policy of the company.

The summary here is that

An IP address usually identifies not an individual but a group of individuals. Whether or not it is possible to identify an actual individual depends on the circumstances, as well as on additional data that may be available

An IP address, like a name, depends on context

An IP address is a necessary identifier when communicating on the Internet as it determines where the traffic is routed. As discussed above, it may not be a complete identifier, but in this case, whilst the connection is open, the routers involved will retain the additional information needed to ensure that the data reaches the person for whom it is intended. This information may however never be shared and is not persistent so it is not usually suitable for identifying the person in question.
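To make the shared-address point concrete, here is a minimal Python sketch of the kind of translation table a home router keeps while connections are open. The class name, addresses and ports are all made up for illustration (the addresses come from ranges reserved for private networks and documentation), and real routers are considerably more involved.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNatRouter:
    """Hypothetical, heavily simplified model of a home router doing address translation."""

    public_ip: str
    next_port: int = 40000
    # Maps public port -> (private address, private port); exists only while connections are open.
    table: dict = field(default_factory=dict)

    def outbound(self, private_ip: str, private_port: int) -> tuple:
        """Register an outgoing connection and return what the remote server sees."""
        public_port = self.next_port
        self.next_port += 1
        self.table[public_port] = (private_ip, private_port)
        return (self.public_ip, public_port)

    def inbound(self, public_port: int) -> tuple:
        """Route a reply back to the device that opened the connection."""
        return self.table[public_port]


router = ToyNatRouter(public_ip="203.0.113.7")
laptop = router.outbound("192.168.0.10", 51000)
tv = router.outbound("192.168.0.11", 51000)

# Both devices show up at the remote server under the same IP address.
print(laptop[0] == tv[0])
# Only the router knows which device sits behind which port.
print(router.inbound(tv[1]))
```

In this toy model the remote server only ever sees the shared public address; the mapping back to a specific device lives in the router's table and nowhere else.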

In this sense an IP address is like a name. And whilst names certainly are private information in some respect, this really depends on context. For example, if I order a coffee at Starbucks I tell the barista my name and when the coffee is ready they call out for “Stefan”. In other words: they are divulging my private information to the public. Nevertheless I do not expect Starbucks to protect my privacy here, eg by introducing a number-based system. The information they divulge is reasonably trivial (my name is Stefan, and I am drinking a skinny latte with no sugar) and in any case I could have told them my name was John. Similarly, when my daughter started school there was a board with her name on it, indicating to which class she was meant to go, we have similar lists for after school clubs, and at the recent birthday of a friend my name appeared on the list of tables, as well as on the sign on the table itself.

Of course there are other examples. If I am at the doctor and have just had an STD test I would hope the nurse does not shout my test results across the room, or put them up on the Internet. In fact – if it was a clinic specializing in STDs I may not even want the fact that I took a test made public, regardless of the result.

What am I trying to say with this? Firstly, my name on its own is not personal information. “Stefan Loesch” without context does not contain any meaningful information. It only becomes meaningful within a context, and we need to make a judgment whether the amount of personal information disclosed is commensurate with the necessity of achieving the goal. Which of course the judge in this case did, under point (3) in the summary. I think he got it wrong here, but it seems that the principle itself is well established.

IP addresses, like names, are not personal data on their own, but they become important in context; in any case a judgment call must be made between the necessity of divulging personal information, and the benefits obtained.

What data was sent?

We have established that the IP address alone cannot reasonably be considered personal data, but that an IP address in context can. So now we have to establish the context, so let’s have a look at what can possibly be sent.

First party data, sent to the headline recipient

First party data is sent to the website that you are visiting, so if you navigate to topaze.blue then this is the data you provide to topaze.blue:

URL. The URL is the name of the resource you are requesting. This may be /blog/22/02/02, or it may be /your-std-result or /living-with-aids. In other words: URLs contain very important information when they can be correlated with an individual
Cookie. The Cookie is a collection of data that is transmitted to the website every time a user requests a page. It is used in particular for user identification (user id) and authentication (access token)
ETag. An ETag (short for entity tag) is a unique resource identifier, designed for caching. It may refer to a generic resource where the ETag is the same for everyone who requests the resource. Servers however can also choose to allocate more granular ETags, allowing them to be used for user identification.
User agent string. The user agent string identifies the browser as well as the system on which the browser is running. It contains a plethora of information, and when correlated with other information it often allows a user to be uniquely identified ("fingerprinting")
Headers. Headers is a generic term for data that is sent from the browser to the server. Cookies, ETags, user agent etc are all part of the headers. A website can also choose to send arbitrary custom headers to a server
X-Client-Data. A specific header sent by Google Chrome when communicating with Google sites. According to Google it carries information about the feature experiments active in that browser installation rather than about the user, but combined with other data it can nevertheless help to single out the user in question
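To give a feel for how these items travel together, the following Python sketch builds (but never sends) a request of the kind a browser would make to a first party site. The header values, including the browser name, the cookie contents and the ETag, are invented for illustration; a real browser sends more headers than these.

```python
from urllib.request import Request

# Hypothetical request; it is only constructed here, nothing goes over the network.
request = Request(
    "https://topaze.blue/blog/22/02/02",
    headers={
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
        "Cookie": "user_id=12345; access_token=abcdef",
        # Sent when the browser revalidates a cached copy that carried this ETag.
        "If-None-Match": '"33a64df5"',
    },
)

# The method and path, followed by the headers the server would receive.
# (urllib normalises the capitalisation of header names.)
print(request.get_method(), request.selector)
for name, value in request.header_items():
    print(f"{name}: {value}")
```

Every one of these values arrives at the server together with the IP address of the connection, which is what makes the combination more telling than any single item.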

Third party data, sent to retrieve embedded 3rd party resources

Above we talked about first party data, ie data sent to the website with which you interact. There can be some privacy concerns around this – say the site tracks returning visitors who repeatedly look at the same holiday trip and adjusts the price accordingly. However, by and large there is an understanding that the data collected by a website about a visitor is not necessarily a privacy issue because (a) it only provides a very limited window into the visitor’s data, and (b) there is a good chance that this data is useful to provide a better service.

The main area of concern nowadays is third party data, ie data that goes not to the visited website, but to other servers, typically owned by Google, Facebook, Amazon or some of the big advertising networks. The issue here is that those servers, because of their ubiquitous presence on the Internet, see a much more complete picture of an individual, provided they can correlate the different viewings of the same individual whenever they encounter them on one of their servers.

I will go through the list above again here, but need to mention two additional concepts in this context

Third party cookie. A third party cookie is a cookie sent to a resource that is not the website you are visiting. Eg if you are on topaze.blue and topaze.blue embeds a YouTube video, then the cookie sent to YouTube.com is a third party cookie. What is important is that those cookies are not restricted by the primary domain, ie the same person visiting YouTube via idcap.org would send the exact same cookie to YouTube
Referrer. The referrer is yet another header, and it indicates the primary resource that triggered the access to the third party resource. There are two possibilities for referrer headers (three if you count the empty one): either only the website is sent (https://topaze.blue) or the entire URL (https://topaze.blue/blog/22/01?token=1233). Note that in the latter case the URL parameters are also forwarded.
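The three referrer variants just described can be sketched in a few lines of Python. The function name and the mode labels "full", "origin" and "none" are made up for this sketch; browsers implement a richer set of policies.

```python
from urllib.parse import urlsplit


def referrer_value(page_url: str, mode: str) -> str:
    """Return the referrer a browser would send under a simplified policy.

    mode is a made-up label for this sketch: "full", "origin" or "none".
    Note that the HTTP header itself is spelled "Referer".
    """
    if mode == "none":
        return ""
    parts = urlsplit(page_url)
    if mode == "origin":
        # Only scheme and host survive; browsers append a trailing slash.
        return f"{parts.scheme}://{parts.netloc}/"
    if mode == "full":
        # Path and query parameters are forwarded, the #fragment is not.
        full = f"{parts.scheme}://{parts.netloc}{parts.path}"
        return f"{full}?{parts.query}" if parts.query else full
    raise ValueError(f"unknown mode: {mode}")


page = "https://topaze.blue/blog/22/01?token=1233"
for mode in ("full", "origin", "none"):
    print(mode, "->", repr(referrer_value(page, mode)))
```

The difference between the first two modes is exactly the difference between a third party learning which page was read and learning only which site was visited.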

Of the other items mentioned above, the 1st party cookie is not sent to the third party. However, the user agent string is, the ETag (of the cached resource) is, and last but not least, from Chrome to Google real estate, the X-Client-Data header is.
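How an ETag can double as an identifier is easiest to see in code. The sketch below models a hypothetical server that hands every new visitor a unique ETag and recognises them when the browser revalidates its cached copy; the function name and the in-memory dictionary are assumptions of this example, not a description of what any particular provider does.

```python
import secrets

# Hypothetical in-memory state of a server that uses ETags to recognise visitors.
etag_to_visitor = {}


def serve_font(if_none_match=None):
    """Return (status, etag, visitor_id) for a request to a cacheable resource."""
    if if_none_match in etag_to_visitor:
        # The browser revalidates its cached copy and thereby reveals the tag it was given.
        return 304, if_none_match, etag_to_visitor[if_none_match]
    # Unknown visitor: issue a fresh, unique tag together with the resource.
    etag = f'"{secrets.token_hex(8)}"'
    visitor_id = len(etag_to_visitor) + 1
    etag_to_visitor[etag] = visitor_id
    return 200, etag, visitor_id


status_first, etag, visitor = serve_font()         # first visit: fresh tag
status_later, _, recognised = serve_font(etag)     # later visit: the tag comes back
print(status_first, status_later, visitor == recognised)
```

A server that gives everyone the same ETag for the same file learns nothing from this exchange; the identification only works if the server chooses to make the tags unique.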

Modern browsers typically allow users to block the sending of 3rd party cookies, and often enable this setting by default. That is a reasonable precaution anyone should probably take. However, given the above this does not necessarily solve the issue, as the other data, together with the IP address, may be sufficient to uniquely identify the user. But we now need to take a step back, and ask ourselves what the personal data here really is.
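As a rough illustration of why blocking cookies is not the whole story, the Python sketch below combines a few individually weak signals into one stable identifier. The choice of signals (the Accept-Language value is an addition of this example), the function name and the sample values are assumptions; real fingerprinting uses many more inputs.

```python
import hashlib


def fingerprint(ip: str, user_agent: str, accept_language: str) -> str:
    """Combine individually weak signals into one stable pseudonymous identifier."""
    material = "|".join([ip, user_agent, accept_language])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()[:16]


# Made-up sample values; no cookie is involved at any point.
monday = fingerprint("203.0.113.7", "ExampleBrowser/1.0 (Linux)", "de-DE")
friday = fingerprint("203.0.113.7", "ExampleBrowser/1.0 (Linux)", "de-DE")
flatmate = fingerprint("203.0.113.7", "OtherBrowser/7.2 (Phone)", "en-GB")

print(monday == friday)    # same device: same identifier on every visit
print(monday == flatmate)  # same IP address, different device
```

The sketch also shows the flip side: two devices behind the same IP address are told apart by the other signals, not by the address.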

As we’ve laid out above – the fact that Stefan Loesch accesses a generic font on a Google server is not really personal data. It is too boring to be interesting. The fact that Stefan Loesch accesses a specific video on YouTube may be a different story – but in this case YouTube is arguably the primary server, and tracking visitors on your site only is a much less nefarious activity than tracking them across the entire web. Where this all becomes a major privacy issue is with the referrer header. Because of this header, Google is able to track me wherever I go on the Internet, provided the site uses Google Fonts, or Google Analytics, or YouTube, or any other service that embeds resources from Google. However, and I want to stress this again, the personal data is not “Stefan Loesch downloaded Google Fonts”, the personal data is “Stefan Loesch downloaded Google Fonts whilst looking at https://webmd.com/what-are-the-symptoms-of...”.

In other words: if we were to not send the referrer header, a lot of this issue would be defused. Only sending the domain, eg https://webmd.com, may not go all the way but it would already be a significant improvement. The conclusion here is

The privacy issue with 3rd party domains is not so much that they are able to identify visitors, but that they are being provided the source of the visitor via the referrer header.

The data is sent by the user, not the website; not sending referrers would be a great privacy choice

As I am not a lawyer I am not certain to which extent this is ultimately important. However, a layman’s reading of GDPR suggests that the model envisaged there is that of a data controller, who is typically the primary contact of the user and holds their data, and of data processors who are subcontracted by the data controller. GDPR stipulates certain conditions (eg consent, necessity) that must be fulfilled before data is passed on from the controller to a processor, or between processors.

This picture does not apply here: when a website is using Google Fonts it is not sending any data to Google. Instead it suggests to the user – specifically their browser – to fetch a certain resource. The user – or rather their browser – is then free to follow this suggestion and fetch the resource, or not. If the resource is not fetched it may be that the website stops working, or that the user experience degrades, but other than that there is no obligation to send that data.

Importantly the user – or rather their browser – is in full control of the process. For example, blocking requests to fonts.googleapis.com would prevent contacting Google for fonts, and blocking requests to googletagmanager.com would prevent Google Analytics from working. More importantly, choosing to not send a referrer header would shut down a lot of the information leakage that occurs.
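What such blocking amounts to on the browser side can be sketched in Python as a simple host check that runs before a request is made. The two hosts are the ones named above; the function name is made up, and real content blockers use much longer lists and more elaborate matching rules.

```python
from urllib.parse import urlsplit

# Hosts taken from the text above; a real blocklist would be far longer.
BLOCKED_HOSTS = {"fonts.googleapis.com", "googletagmanager.com"}


def is_blocked(url: str) -> bool:
    """True if the URL's host is a blocked host or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == blocked or host.endswith("." + blocked) for blocked in BLOCKED_HOSTS)


for url in (
    "https://fonts.googleapis.com/css?family=Roboto",
    "https://www.googletagmanager.com/gtag/js",
    "https://topaze.blue/blog/22/02/02",
):
    print("skip " if is_blocked(url) else "fetch", url)
```

The decision is taken entirely on the visitor's machine: the website can only suggest the URL, and a request that is skipped never reaches the third party at all.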

This last point requires a more differentiated discussion. If I stop sending referrer headers when downloading Google Fonts, then the information Google receives from me downloading the fonts becomes essentially meaningless. If I stop sending referrer headers whilst using Google Analytics however, this has no effect: Google Analytics executes JavaScript code, and that code communicates the current URL (and more) to Google. As a general rule – when downloading passive resources (images, fonts, audio or video files), cutting off the referrer header provides pretty decent privacy. However, when downloading scripts this is a different story and it depends on what the scripts in question are doing. In particular, downloading tracking scripts like Google Tags or Facebook Pixel will enable user tracking, pretty much independently of browser settings.

To summarize the first point about control

The private data is not sent to the 3rd party by the website, but by the visitor (or rather, their browser); the visitor (or rather, their browser) is in full control of what data is sent where.

and the second one about referrer headers

Not sending referrer headers would greatly improve data privacy, even though it would only help with passive content (eg images, fonts, audio and video) but not help with tracking scripts

The Munich judgment is overly restrictive and ultimately targets the wrong level

We have seen above that users – or rather their browsers – are perfectly capable of controlling what Google sees when they download a passive resource (downloading the Google Analytics code is a different story). If the user / their browser acts in a privacy conscious manner, all that Google will ever learn is that a person, whom they can probably identify, was one of the thousands or even millions of people who downloaded say a font, so unless that font is very unusual Google has no way of identifying the website the request originated from, let alone the URL. As argued above, considering this personal data is a stretch.

Website operators can have very good reasons to link to 3rd party resources, rather than downloading them and hosting them themselves.

They may be legally prevented from downloading them because of the licensing terms (this is not the case for Google Fonts however)
Linking to centrally managed copies ensures that they can be updated if need be (usually more important for scripts than for passive resources however)
Web traffic is not free, and hosting images or fonts or other relatively heavy data can come at a substantial cost

Point 3 here is the most important. It is true that for the long tail of small websites, traffic numbers are so low that bandwidth costs are absorbed into the per-month hosting fees. The reason for this is of course that most websites see very few visitors, so aggregate bandwidth costs are low. However, quite a few webmasters have seen their costs skyrocket after their site or a post thereon went viral – and not hosting heavy content on the site can save 1,000s or even 10,000s of Euros if a site happens to take off.
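A back-of-the-envelope calculation shows how quickly this scales. In the Python sketch below all three inputs (page views, bytes of self-hosted heavy content per view, and price per gigabyte) are assumptions picked purely for illustration, not measurements or actual hosting prices.

```python
def bandwidth_cost_eur(page_views: int, bytes_per_view: int, eur_per_gb: float) -> float:
    """Cost of serving self-hosted content, counting 1 GB as 10**9 bytes."""
    gigabytes = page_views * bytes_per_view / 1e9
    return gigabytes * eur_per_gb


# Hypothetical inputs: 2 MB of heavy content per view at 0.08 EUR per GB.
quiet_month = bandwidth_cost_eur(1_000, 2_000_000, 0.08)
viral_month = bandwidth_cost_eur(10_000_000, 2_000_000, 0.08)

print(f"quiet month: {quiet_month:.2f} EUR")
print(f"viral month: {viral_month:.2f} EUR")
```

With these made-up numbers a quiet month costs a few cents while a viral one costs about 1,600 Euros, and heavier content or higher prices move the second figure up accordingly.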

Policy suggestions

[TO COME; SUGGESTION WELCOME; PING ME AT @ODTORSON]