Do Neural Networks Ever Forget? 🧠

How machine learning throws a wrench in the 'right to be forgotten.' Bringing in some of the latest computational research on privacy, this post examines how the principles of GDPR collide with the realities of neural networks.

Politics

Jun. 4, 2020

Unintended Feature Leakage from Gender Classification. Source: [Melis, Luca, et al.](https://ieeexplore.ieee.org/abstract/document/8835269/?casa_token=xWJF2Qn5p04AAAAA:8onczj50twpsKTaybecxy-CIAIgSRoWJ5NeJ9p0hMw53pP3t5JHJjkjpeF7wd4FLRZzd9XgnoFw)

As the usage of data evolves, so should its regulation. Faster and faster, the digital world is embedding itself in our lives to remove friction. Tech removes friction by learning about us and how we behave as a collective, anticipating and reacting accordingly. Think Starbucks sending you a push notification whenever you come close to one of their stores: one ad for a latte if it's cold out, one for an iced coffee if it's hot. This has made firms like Facebook, Amazon, Apple, Netflix, and Google some of the most valuable in history (the most valuable, bar none, if you consider how few employees they have), giving them an outsized influence on our lives. So it is important to ask: who are these firms accountable to? Or more importantly, what are the market forces that affect how we, their users, are treated? Facebook's misuse of data with Cambridge Analytica, and Google's rogue engineer who adapted a fleet of Street View cars to siphon often-sensitive data from private WiFi networks, have led to reasonable concerns about how much regulation is needed in tech. Unfortunately, when it comes to protecting our data, privacy legislation fails to take into account artificial intelligence (AI). Instead, legislation like the EU's General Data Protection Regulation (GDPR) focuses on the explicit collection and transfer of personal information. This ignores what makes data useful to tech firms: how it can be generalized and modeled to commodify everyday behaviour. In this way, machine learning (ML) undermines traditional privacy legislation twice over: it complicates our right to access and appeal how organizations use our personal information, and it ignores how ML makes implicit use of personal data.

This argument is a little more nuanced than pointing out the consequences of a world where training data can be reverse engineered, though that is also a concern. Instead, I want to focus on what privacy legislation attempts to protect: our ability to know how companies use our data, and our ability to maintain control of it. In doing so we'll see that ML makes it harder to interrogate how companies use our data. We'll also see that correcting how our data is used in these systems is much harder than correcting the data that GDPR explicitly protects. Finally, I'll make the argument that if our aim is to give people greater control over how their data is used, then the right to be forgotten must also apply to ML. Otherwise, we will be ignoring what Shoshana Zuboff calls tech's new "logic of accumulation."² If you aren't a fan of Zuboff, or the term "logic of accumulation" is foreign or off-putting, hold on; that's where we'll start. To cap things off I'll put a spotlight on some of the latest research that aims to address these problems.

What Makes Our Data Useful?

Before we can understand how GDPR fails to protect the use of our data, we need a better understanding of the connection between tech firms and our personal privacy. The rapid rise of connectivity and the proliferation of internet use have brought about what Shoshana Zuboff considers a new technological logic of accumulation, where big data "organizes perception and shapes the expression of technological affordances at their roots."² This is an academic way of saying that big data has changed how we view the world, and as a result the way firms like Google operate is fundamentally different from non-data-oriented firms. Private organizations are able to gain a deep knowledge of our online interactions "from above", anonymously monitoring everyday behaviour to model and exploit whatever information they can glean². Through continuous data mining and analysis, the Googles and Facebooks of our world are able to understand how we behave at a tremendously granular level⁸. The digital bread crumbs we leave behind are collected, stored, aggregated, and modeled to better target, personalize, and enforce. This is what researchers refer to as "the commodification of everyday behaviour."² Tech firms act as indifferent observers who spread their "free" products as widely as possible, modelling our behaviour for the benefit of advertisers, insurers, and so on. This data-first process has produced relatively small firms, with few fixed costs, that generate tremendous amounts of wealth. And thanks to the unique corporate structures of Facebook and Google, the ability to leverage those assets is often directed by one or two people.

The onus lies on policy makers to ensure that technology's advances are brought about in an equitable way. A holistic privacy policy is necessary; legislation must allow for the fair, transparent collection of data, as well as ensure that data are processed and utilized in an equitable way. While tech firms require near-ubiquitous monitoring to produce the lakes of data they feed on, their true value comes from the ability to process data and make it useful. Given the sheer volume of data involved, this is only made possible through ML. Theoretical and practical advances in ML let tech firms search, sort, cluster, and make decisions based on subterranean patterns in data. Collection and utilization are therefore inextricably linked. However, this is not how our privacy legislation has viewed data collection. Instead, policymakers have generally focused on the former, without recognizing how collected data is exploited implicitly in its utilization.

What GDPR Does (and Doesn’t) Do

Private firms, leveraging largely public datasets, fundamentally altered the United Kingdom's referendum to leave the EU and the 2016 US election of Donald Trump³. In my opinion these major events are what brought questions about how our data is used into the public consciousness. Between the weeks starting April 10, 2016 and April 10, 2019, Google search interest saw increases of 119%⁴, 1,566%⁵ and 81%⁶ for the search terms data privacy, AI ethics, and privacy software respectively. In the same timeframe, Google search interest in artificial intelligence and machine learning also saw a steep uptick, with corresponding increases of 43% and 200%⁷. So it is not surprising that the largest piece of privacy legislation born in this political landscape, the EU's 2016 GDPR, has been the subject of popular debate and scrutiny. GDPR regulates the processing and free movement of data, and affords individuals the "protection of [their] personal data" through three core sections: the right to informed consent, the right to access personal data, and the right to rectification and erasure⁸. GDPR gives increased protections to individuals, letting you appeal decisions made by automated systems. Unfortunately, by focusing on the explicit collection and movement of data it falls prey to flaws similar to those in earlier pieces of privacy legislation like Canada's PIPEDA⁹. These flaws, which we'll go into in depth, are that it can be very hard to access our data once it has been processed to train a neural network, and just as hard to appeal its uses given the often opaque nature of ML. As well, the right to rectification and erasure fails to take into account that ML models are structured by the data they're trained on. This allows companies to profit off of our data long after we've requested it to be erased.

How do you fight an algorithm?

GDPR (Article 16 specifically😉) gives us the right to appeal inaccurate collection or use of our personal data. But while the "purposes of processing" must be taken into account when rectifying inaccurate uses of data, GDPR fails to establish a litmus test for what constitutes inaccurate usage⁸. Knowing what needs to be fixed is much easier when the data in question relates to some concrete characteristic. It's easy to fix someone's name or birthday in a database. However, in cases where an ML system makes some decision about us, like inferring our political orientation, sexuality, or risk of recidivism, how can you appeal to a neural network? This is key because the data that ML is trained on are produced in an unjust world, and there is often little reason to believe that such models will do anything but replicate preexisting inequalities¹⁰. This is what researchers refer to as algorithmic bias (which is different from statistical bias). It was on display when researchers from Microsoft and Boston University demonstrated that word embeddings can exhibit gender stereotypes to a disturbing extent¹⁰. However, since even supervised ML is left to its own devices to figure out how best to approximate some regression or classification function, it is less straightforward to argue that you have been discriminated against¹¹. By placing the burden on individuals to meet this vague standard for what inaccurate usage may mean, GDPR ignores the structural biases that are easily replicated and amplified in ML¹¹.
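To see what that kind of algorithmic bias looks like in practice, here is a minimal sketch in the spirit of Bolukbasi et al.'s probes¹⁰. It assumes the pretrained Google News word2vec vectors available through gensim's downloader (a large download) and that the phrase token computer_programmer is in that vocabulary; treat it as an illustration rather than a faithful reproduction of the paper.

```python
# A quick probe for gendered associations in pretrained word embeddings,
# in the spirit of Bolukbasi et al. (2016). Illustrative only.
import gensim.downloader as api

# ~1.6 GB download the first time; cached locally afterwards.
kv = api.load("word2vec-google-news-300")

# Analogy query: "man" is to "computer_programmer" as "woman" is to ...?
print(kv.most_similar(positive=["woman", "computer_programmer"],
                      negative=["man"], topn=3))

# How strongly do occupation words lean toward "she" versus "he"?
for job in ["nurse", "engineer", "receptionist", "architect"]:
    gap = kv.similarity(job, "she") - kv.similarity(job, "he")
    print(f"{job:>14}: she-minus-he similarity = {gap:+.3f}")
```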

Do Neural Networks Ever Forget?

GDPR deviates from previous attempts at privacy legislation by giving individuals the "right to be forgotten." This means that if you make a request to a company that has possession of your data, they are obligated to erase it. However, this doesn't extend to the ML models that have been trained on your data. This is because GDPR fundamentally views data as an input to a machine that makes some decision, when actually, data shapes the decision-making system itself. To me, this is a fundamental flaw: by allowing companies to continuously profit off of our data regardless of individual preferences, GDPR ignores tech's logic of accumulation.
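To make that distinction concrete, here is a small, self-contained sketch (scikit-learn, synthetic data, every name hypothetical): deleting someone's record from storage satisfies erasure as GDPR frames it, yet it leaves the parameters of a model already trained on that record completely untouched.

```python
# Erasing a record from the data store does not erase its influence on a
# model that has already been trained on it. Synthetic data, toy model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))               # 500 "users", 5 features each
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X, y)
weights_before = model.coef_.copy()

# Honour an erasure request the naive way: drop user 42 from the database.
X_erased, y_erased = np.delete(X, 42, axis=0), np.delete(y, 42)

# The stored data changed, but the deployed model did not.
assert np.allclose(model.coef_, weights_before)
print("User 42's row is gone, yet the model's weights are identical.")

# Only retraining on the remaining data actually removes the influence.
retrained = LogisticRegression().fit(X_erased, y_erased)
print("Max weight shift after retraining:",
      np.abs(retrained.coef_ - weights_before).max())
```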

GDPR (Articles 17 through 20 now💃) doesn't recognize that if your data has been used to train a neural network, you are forever imprinted on it¹². Even if you submit an erasure request, and your information no longer appears in any of Facebook's databases, your information is still implicitly being processed every time Facebook decides what ad to show someone. This is what brings us back to our header image. Researchers at Cornell, UCL and the Alan Turing Institute recently demonstrated that collaborative learning models can "leak unintended information about participants' training data," allowing malicious actors to "infer the presence of exact data points—for example, specific locations [… as well as] properties that hold only for a subset of the training data and are independent of the properties that the joint model aims to capture."¹ This, hopefully, drives home the fact that ML is not separate from us, and there is a growing body of literature arguing that our data shapes the fundamental structure of these models. In some cases, this literally means adding or dropping nodes from the layers of a neural network¹³. By framing erasure in such concrete terms, GDPR fails to remedy tech's more exploitative characteristics and refuses to acknowledge the true utility of data: that it "records, modifies, and commodifies everyday experience."²
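Melis et al. mount their attack against the gradient updates exchanged during collaborative training, which is well beyond the scope of a blog post. The sketch below is a much simpler member of the same family of attacks, a confidence-threshold membership inference test against an overfit model on synthetic data; every name in it is hypothetical, and it stands in for, rather than reproduces, the paper's attack.

```python
# Toy membership inference: an overfit model is systematically more confident
# on the points it was trained on, and an observer can exploit that gap.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 20))
y = (X[:, :5].sum(axis=1) > 0).astype(int)
X_members, y_members = X[:200], y[:200]     # used for training
X_outside, y_outside = X[200:], y[200:]     # never seen by the model

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_members, y_members)

def true_label_confidence(model, X, y):
    """Probability the model assigns to each point's true label."""
    proba = model.predict_proba(X)
    return proba[np.arange(len(y)), y]

# Attack rule: guess "this person was in the training data" when the
# model's confidence on their record exceeds a threshold.
threshold = 0.9
hit_rate_members = (true_label_confidence(model, X_members, y_members) > threshold).mean()
hit_rate_outside = (true_label_confidence(model, X_outside, y_outside) > threshold).mean()
print(f"Flagged as members: {hit_rate_members:.0%} of actual members")
print(f"Flagged as members: {hit_rate_outside:.0%} of non-members")
```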

Improving the ‘right to be forgotten’

GDPR gives us the right to challenge companies when they use ML to make decisions about us (what price to offer, whether to insure, risk of recidivism). This is a huge step forward. Unfortunately, ML's defining quality is its ability to disappear into the background, embedding itself in our digital world. That is to say, there is rarely a big sign saying: "Watch out! A neural network is deciding whether you're too risky to insure!" Given the embedded nature of ML, its implementation can subtly shape the online world in ways that, while technically consensual, individuals are not fully aware of. This puts the onus on individuals to comb through their online world for inaccurate or biased systems, in ways that are often far from feasible. It also ensures that only those who have the means to educate themselves on how tech and ML operate will have full control over their data. Over the past sixteen years, surprisingly little has been done to fully address the issue of clear and informed consent.

The most overlooked aspect of privacy legislation is that there are no protections that let individuals remove themselves from models that make inferences from user data¹⁴. GDPR does not address the consequences of allowing tech to profit off of models, trained on our data, after we've invoked our "right to be forgotten." This requires a conceptual shift in how privacy is viewed. At the core of tech firms is their ability to cheaply capture data, the raw material, and model it to achieve various ends. Privacy legislation cannot stop at the collection of data and then treat the neural network built from it as something wholly different. Privacy legislation should extend to ML as well. As of now this problem is only addressed superficially in GDPR. Thankfully, researchers at the University of Cambridge and Queen Mary University of London, among others, are proposing technical solutions to these problems. In their 2019 paper Making Machine Learning Forget¹⁵, Shintre et al. proposed solutions that allow individual data points to be removed from artificial neural networks. What this demonstrates is that there are few technical obstacles to fully realizing systems where we can truly have the right to be forgotten. First, however, there must be an understanding of how our data can be used and misused, and the political will to hold tech accountable.
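Several routes to "forgetting" are being explored. The sketch below is not Shintre et al.'s method; it illustrates one well-known idea from the broader unlearning literature, sharded training, in which the training set is split into shards with one model per shard, so honouring an erasure request only requires retraining the single shard that held that person's record. All names are hypothetical and the data is synthetic.

```python
# Sharded-ensemble unlearning sketch: one model per data shard, majority-vote
# predictions. Deleting a user only forces a retrain of their shard.
# (A simplified idea from the unlearning literature, not Shintre et al.'s method.)
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 8))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

N_SHARDS = 6
shards = np.array_split(np.arange(len(X)), N_SHARDS)    # record ids per shard
models = [LogisticRegression().fit(X[idx], y[idx]) for idx in shards]

def predict(models, X):
    """Majority vote over the per-shard models."""
    votes = np.stack([m.predict(X) for m in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)

def forget(record_id):
    """Honour an erasure request by retraining only the affected shard."""
    for s, idx in enumerate(shards):
        if record_id in idx:
            shards[s] = idx[idx != record_id]
            models[s] = LogisticRegression().fit(X[shards[s]], y[shards[s]])
            return s

retrained_shard = forget(record_id=123)
print(f"Retrained shard {retrained_shard}; the other {N_SHARDS - 1} models are untouched.")
print("Ensemble accuracy after forgetting:", (predict(models, X) == y).mean())
```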

Moving Forward

Technology and ML have undoubtedly made our lives better. However, that doesn't mean we shouldn't be critical when tech firms unnecessarily impinge on our rights. Bezos would still be rich if we addressed these issues. In the past 20 years, the tech industry has accumulated massive amounts of user data that legislatures have subsequently had to grapple with. ML undermines existing privacy legislation in two ways. It subverts the grounds from which we can appeal inaccurate or biased uses of our personal information. As well, problems arise when users are afforded the right to erasure without acknowledgement of the embedded nature of data in ML. As a result, we need a conceptual shift in how we view privacy.

Our data isn't useful by itself. Given that fact, we need to focus less on the explicit collection and transfer of data, and more on how our data is used. Our data leaves fingerprints on the neural networks trained on it. It's important to remember that those fingerprints are ours as well, and as a result the right to be forgotten should extend to ML.

[1]: Melis, Luca, Congzheng Song, Emiliano De Cristofaro, and Vitaly Shmatikov. “Exploiting unintended feature leakage in collaborative learning.” In 2019 IEEE Symposium on Security and Privacy (SP), pp. 691–706. IEEE, 2019.

[2]: Zuboff, Shoshana. “Big other: surveillance capitalism and the prospects of an information civilization.” Journal of Information Technology 30, no. 1 (2015): 75–89.

[3]: Isaak, Jim, and Mina J. Hanna. “User data privacy: Facebook, Cambridge Analytica, and privacy protection.” Computer 51, no. 8 (2018): 57.

[4]: Google Trends, “Data Privacy Search Interest (2016–2019).” Accessed on April 10, 2020. https://trends.google.com/trends/explore?date=2016-04-10%202020-04-10&q=Data%20Privacy.

[5]: Google Trends, “AI Ethics Search Interest (2016–2019).” Accessed on April 10, 2020. https://trends.google.com/trends/explore?date=2016-04-10%202019-04-10&q=AI%20Ethics.

[6]: Google Trends, “Privacy Software Search Interest (2016–2019).” Accessed on April 10, 2020. https://trends.google.com/trends/explore?date=2016-04-10%202019-04-10&q=Privacy%20Software.

[7]: Google Trends, “AI and ML Search Interest (2016–2019).” Accessed on April 10, 2020. https://trends.google.com/trends/explore?date=2016-04-10%202019-04-10&q=Machine%20Learning,%2Fm%2F0mkz.

[8]: General Data Protection Regulation, European Parliament, 2016, 1–77.

[9]: Personal Information Protection and Electronic Documents Act, Revised Statutes of Canada 2000, 4–39.

[10]: Bolukbasi, Tolga, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. “Man is to computer programmer as woman is to homemaker? Debiasing word embeddings.” In Advances in Neural Information Processing Systems, pp. 4350. 2016.

[11]: Waldman, Ari Ezra. “Power, Process, and Automated Decision-Making.” Fordham L. Rev. 88 (2019): 613.

[12]: Kamarinou, Dimitra, Christopher Millard, and Jatinder Singh. “Machine Learning with Personal Data: Profiling, Decisions and the EU General Data Protection Regulation.” Journal of Machine Learning Research (2017): 1–7.

[13]: Golea, Mostefa, and Mario Marchand. “A growth algorithm for neural network decision trees.” EPL (Europhysics Letters) 12, no. 3 (1990): 205.

[14]: Kamarinou, Dimitra, Christopher Millard, and Jatinder Singh. “Machine Learning with Personal Data: Profiling, Decisions and the EU General Data Protection Regulation.” Journal of Machine Learning Research (2017): 1–7.

[15]: Shintre, Saurabh, Kevin A. Roundy, and Jasjeet Dhaliwal. “Making Machine Learning Forget.” In Annual Privacy Forum, pp. 72–83. Springer, Cham, 2019.