s12e12: It's still about Ethics in Dataset Collection, Attribution and Licensing, but also Governance
0.0 Context Setting
It's Tuesday, May 31, 2022, and a sunny day in Portland, Oregon. Highs are forecast to hit 25 Celsius / 77 Fahrenheit with no rain.
Listening to: TRON: Legacy Reconfigured (Spotify, Apple Music).
1.0 Some Things That Caught My Attention
It's still about Ethics in Dataset Collection, Attribution and Licensing, but also Governance
I wrote in s12e09 about ethics in dataset collection, attribution, and licensing in machine learning, in that particular case about the Berkeley Crossword Solver and how it was trained on roughly six million scraped question/answer pairs, and do I have some updates.
First: the question/answer pairs are not protected by copyright. The authors note this:
Unfortunately, complete crossword puzzles are protected under copyright agreements; however, their individual question-answer pairs are free-to-use.
To which there's an entirely separate point, outside the scope of this particular paper, I realize, which is: yes, that is unfortunate. All "unfortunate" means here is that it would not be easy to use the complete crossword puzzles. The implication is that obtaining consent is a bother, and the follow-on is that it is much easier (and faster) to create and use a dataset for which the authors did not need to obtain consent.
Move fast and train models, as it were.
Yes sure, it's out of the scope of the paper, but I just want to cover this next part:
2.1 Collecting Question-Answer Pairs
We collected a dataset of over six million question-answer pairs from top online publishers such as The New York Times, The LA Times, and USA Today.
There's just so much missing in the "we collected" part. Sure, you don't want to repeat yourself in a paper ("we wrote a bunch of scraping software and then cleaned up the data", yadda yadda), and yet the actual mechanics are: "we scraped a bunch of data".
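To be concrete about what "we collected" expands to in practice, here's a minimal sketch. To be clear, this is not the Berkeley team's code: the URL, the CSS selectors, and the cleanup rules are all invented stand-ins, and every real publisher's markup differs.

```python
# A sketch only: the selectors and cleanup rules below are invented
# stand-ins, not the Berkeley Crossword Solver's actual pipeline.
import re

import requests
from bs4 import BeautifulSoup

def scrape_clue_answer_pairs(puzzle_url: str) -> list[tuple[str, str]]:
    """Fetch one puzzle page and pull out its clue/answer pairs."""
    html = requests.get(puzzle_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    # Pretend clues and answers sit in tidy adjacent elements; in reality
    # every publisher's markup is different, and that difference is work.
    for row in soup.select(".clue-row"):
        clue = row.select_one(".clue-text")
        answer = row.select_one(".answer-text")
        if clue and answer:
            pairs.append((clue.get_text(strip=True),
                          answer.get_text(strip=True).upper()))
    return pairs

def clean(pairs):
    """The 'cleaned up the data' part: normalize answers and dedupe."""
    seen = set()
    for clue, answer in pairs:
        answer = re.sub(r"[^A-Z]", "", answer)  # crossword answers: letters only
        if answer and (clue, answer) not in seen:
            seen.add((clue, answer))
            yield clue, answer
```

Every one of those invented selectors stands in for a real decision about someone else's content, and none of those decisions survive into "we collected".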
Leigh Dodds wrote about a related issue in 2020 in The importance of tracking dataset retractions and updates. What happens to AI research and the use of models when the datasets they are trained on are removed due to ethical concerns?
I'm not surprised that the answer is "they're still used, because once data is out there, it's out there", and it's not like there's a DMCA takedown process for retracted datasets. I mean, it's not like they're music.
Leigh links to a post on Freedom to Tinker about how a dataset of videos was taken down, but was still used in at least 135 papers published after the dataset was retracted.
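If you wanted something takedown-shaped for datasets, the minimum viable version might look like the sketch below: content-hash the dataset and check it against a retraction registry before training. The registry here is entirely hypothetical (the URL and the response format are invented), which is rather the point: nothing like it exists.

```python
# Sketch of a pre-training retraction check. The registry URL and its
# response shape are hypothetical; no such service exists today.
import hashlib
import json
import urllib.request

RETRACTION_REGISTRY = "https://example.org/dataset-retractions.json"  # invented

def dataset_fingerprint(path: str) -> str:
    """Content-hash the dataset so the check survives renames and mirrors."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def is_retracted(path: str) -> bool:
    """Ask the (imaginary) registry whether this exact dataset was pulled."""
    with urllib.request.urlopen(RETRACTION_REGISTRY) as resp:
        retracted = set(json.load(resp)["retracted_sha256"])
    return dataset_fingerprint(path) in retracted
```

And even then, as the Freedom to Tinker post shows, a check like this only matters if anyone bothers to run it.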
I bet the web3 people will say that web3 would fix this, and I will say again: no, it wouldn't, not unless you outlaw general-purpose computing. Also, if I keep going, I will invoke Cory Doctorow again.
The point in general is that yes, licensing is a big deal, and I recognize that licensing could, in theory, stop some misuse at the source, and could also enable use by carving out explicit permissions. But ultimately the whole problem is one of governance (thanks, Leigh!), which is another way of saying "what are the consequences if you break the rules".

It's hard to do anything but roll my eyes at this point, because apparently the mere concept of a consequence for a transgressive, socially unacceptable, or plain illegal action is, for certain powerful actors in society, just laughable. See: Clearview AI being fined GBP 7.5 million, which is technically a consequence, but is it supposed to be a deterrent? Thanks to the way we think about economics these days, it's merely a cost to be factored in. And, you know, we can always agree to disagree, or just point-blank ignore the spirit of the consequence, by saying something like the UK Information Commissioner's Office has "misinterpreted my technology and intentions"[2].
I don't know what to say here other than the usual standard protests, so just assume I have written eloquently along the lines of:
- insufficient understanding and expertise in government to properly and deliberately regulate the use of data and its application, never mind in AI or other technologies that operate at scale
- something about asymmetrical operation at scale, which is that tech lets you map:reduce on society while undo operations on individuals are incredibly costly in comparison (the marginal cost of copy-pasting an action across millions of people is tiny; the cost of fixing any mistakes is off the charts; there's a toy sketch of this in the next section)
- what feel like young/immature (not in a bad sense, just... not old? Not experienced?) methods for engaging society in these issues and advocating a societal/governmental position, beyond the usual suspects like the EFF and (relatively) newer ones like, off the top of my head, Data & Society and the UK's doteveryone, who are building relationships in government and the civil service
- potentially the lack of experience and capability in political representatives' offices (but see also Senators like Ron Wyden, who is in the tiny minority of US Senators with fantastic staffing in this area, and who appears to give a shit)
- the lack of regulation and attention in this area, which betrays what I suspect is government's attitude in general: "does not give a shit provided it's not too painful", rather than anything proactive
- Oh, I don't know, something I read in some sort of UK Conservative party manifesto about "leveling up" that probably talked about the UK being a powerhouse for tech post-Brexit, of which I can't, as it were, even.
Asymmetrical operation at scale
A placeholder to remind me to write later about asymmetrical operation at scale, which is what happens when I write these newsletters as thinking-out-loud, stream-of-consciousness thought.
Another thought: what happens when tech allows you to map:reduce on society? Or, rather: bureaucracy is the implementation of the impersonal, and politics is always personal.
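In the meantime, here's the toy version I promised in the list above. Everything below is invented (the scoring, the flag, the records); the only real thing is the shape of the cost.

```python
# Toy illustration of asymmetrical operation at scale. All names and
# numbers are invented; the only real thing is the cost asymmetry.

def apply_policy_at_scale(people: list[dict]) -> None:
    """The map: one decision, copy-pasted across millions of records."""
    for person in people:
        person["flagged"] = person["score"] < 0.5  # marginal cost ~ zero

def undo_one_mistake(person: dict) -> None:
    """The undo: there is no bulk operation for a wrongly flagged person.
    Each case means review, appeal, correction, apology: manual, slow,
    and handled one individual at a time."""
    person["flagged"] = False
    person["note"] = "weeks of manual review and back-and-forth"

people = [{"id": i, "score": i / 10} for i in range(10)]
apply_policy_at_scale(people)  # O(n) in code, trivially cheap per person
undo_one_mistake(people[3])    # O(1) in code, enormous in real life
```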
Three quick shorts
- 40 Years of Black Hole Imaging (1) by Jean-Pierre Luminet is what it says, and covers how astrophysicists started developing ideas of what black holes look like back in the 1970s. Caught my attention because: you can barely go the distance light travels in ~3.34×10⁻⁹ seconds (about one meter) without seeing the visualization of a black hole that hit popular culture in Interstellar. But in the very second diagram in that blog post, you can see how James Bardeen and C.T. Cunningham had figured out the sort of glowing fried egg shape.
- The BBC Micro 🦉 Bot by Dominic Pajak is a Twitter bot that replies to your BBC BASIC code with a screen recording of that code running. You could tweet wonderful computer art (here's a Mandelbrot), or you could be like me and simply get it to print one of my favorite spaceship names. Caught my attention because: it feels like it's next to Will It Run Doom?[1]
- Jeff Gothelf's article on the anti-pattern of creating OKRs to fit your backlog wandered into one of my feeds and caught my attention: not least because it's another reason to go back to Christina Wodtke's work on OKRs, but also because it's an illustration of how hard people will work to avoid making difficult decisions, namely: we will have to not do some things, which will probably involve disappointing someone.
I continue to be tired.
Thank you to everyone who's replied, from the "hi"s to the longer notes.
How are you doing?
Best,
Dan
[1] Making things that have processors in them run Doom is, like, step 2 for some people after "Hello, World". Or maybe step 3, if your steps are (1) Bring-up, (2) Hello, World, (3) Make it run Doom. There is, as always, a subreddit, r/ItRunsDoom.
[2] The walls are closing in on Clearview AI, Melissa Heikkilä, May 24, 2022, MIT Technology Review