It’s Tuesday, May 31, 2022, and a sunny day in Portland, Oregon. Highs are forecast to hit 25 Celsius / 77 Fahrenheit with no rain.
I wrote in s12e09 about ethics in dataset collection, attribution, and licensing in machine learning, in that particular case about the Berkeley Crossword Solver and how it was trained on roughly six million scraped question/answer pairs. And do I have some updates.
First: the question/answer pairs are not protected by copyright. The authors note this:
Unfortunately, complete crossword puzzles are protected under copyright agreements; however, their individual question-answer pairs are free-to-use.
There’s an entirely separate point here, outside the scope of this particular paper, I realize, which is: yes, that is unfortunate. All it means, though, is that it would not be easy to use the complete crossword puzzles. The implication is that it’s a bother to obtain consent, and the follow-on is that it’s much easier (and faster) to create and use a dataset for which the authors did not need to obtain consent.
Move fast and train models, as it were.
Yes sure, it’s out of the scope of the paper, but I just want to cover this next part:
2.1 Collecting Question-Answer Pairs
We collected a dataset of over six million question-answer pairs from top online publishers such as The New York Times, The LA Times, and USA Today.
There’s just so much missing in the “we collected” part. You don’t want to repeat yourself, sure (i.e., “we wrote a bunch of scraping software and then cleaned up the data”, yadda yadda), and yet the actual mechanics are “we scraped a bunch of data”.
Leigh Dodds wrote about a related issue in 2020 in The importance of tracking dataset retractions and updates. What happens to AI research and the use of models when the datasets they are trained on are removed due to ethical concerns?
I’m not surprised that the answer is “they’re still used, because once data is out there, it’s out there”, and it’s not like there’s a DMCA takedown process for retracted datasets. I mean, it’s not like they’re music.
Leigh links to a post on Freedom to Tinker about how a dataset of videos was taken down, but was still used in at least 135 papers published after the dataset was retracted.
I bet the Web3 people will say that Web3 would fix this, and I will say again: no, it wouldn’t, not unless you outlaw general-purpose computing. Also, if I keep going, I will invoke Cory Doctorow again.
The general point is that yes, licensing matters, and I recognize that licensing could, in theory, stop some misuse at the source, and could also enable use by carving out explicit permissions. But ultimately the whole problem is one of governance (thanks, Leigh!), which is another way of saying “what are the consequences if you break the rules”.

Now, it’s hard to do anything but roll my eyes at this point, because apparently the mere concept of a consequence for a transgressive, socially unacceptable, or plain illegal action is, for certain powerful actors in society, just laughable. See: Clearview AI being fined GBP 7.5 million, which is technically a consequence, but is it supposed to be a deterrent? Thanks to the way we think about economics these days, it’s merely a cost to be factored in. And, you know, we can always agree to disagree, or just point-blank ignore the spirit of the consequence, by saying something like the UK Information Commissioner’s Office has “misinterpreted my technology and intentions”2.
I don’t know what to say here other than the usual standard protests, so just assume I have written eloquently along the lines of:
A placeholder for me to remember to write later about asymmetrical operation at scale, which is what happens when I write these newsletters as thinking-out-loud and stream of conscious thought.
Another thought: what happens when tech allows you to map/reduce on society? Or, rather: bureaucracy is the implementation of the impersonal, and politics is always personal.
40 Years of Black Hole Imaging (1) by Jean-Pierre Luminet is what it says, and covers how astrophysicists started developing ideas of what black holes look like back in 1970. Caught my attention because: you can barely go the distance light travels in ~3.34×10⁻⁹ seconds (about one meter) without seeing the visualization of a black hole that hit popular culture in Interstellar. But in the very second diagram in that blog post, you can see how James Bardeen and C.T. Cunningham had already figured out the sort of glowing fried egg shape.
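(That 3.34×10⁻⁹ seconds figure is just the light-travel time for roughly one meter; a quick back-of-the-envelope check, if you want to verify my arithmetic:)

```python
# Distance light covers in ~3.34e-9 seconds: about one meter.
c = 299_792_458  # speed of light in m/s (exact, by definition of the meter)
t = 3.34e-9      # seconds
print(round(c * t, 3))  # roughly 1.001 meters
```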
The BBC Micro 🦉 Bot by Dominic Pajak is a Twitter bot that replies to your BBC BASIC code with a screen recording of that code running. You could tweet wonderful computer art (here’s a Mandelbrot), or you could be like me, and simply get it to print one of my favorite space ship names. Caught my attention because: feels like it’s next to Will It Run Doom?1
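(The Mandelbrot tweets are the kind of thing you can fit in 280 characters of BBC BASIC; here’s a rough sketch of the same idea in Python, rendered as ASCII rather than the Micro’s graphics modes, just to show how little code it takes:)

```python
# A crude ASCII Mandelbrot set: for each grid point c, iterate z = z*z + c
# and mark the points that stay bounded. Sketch only; the bot runs BBC BASIC.
for y in range(20):
    row = ""
    for x in range(60):
        c = complex(x / 20 - 2, y / 10 - 1)
        z = 0
        for _ in range(30):
            z = z * z + c
            if abs(z) > 2:  # escaped: not in the set
                break
        row += " " if abs(z) > 2 else "*"
    print(row)
```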
I continue to be tired.
Thank you to everyone who’s replied, from the “hi”s to the longer notes.
How are you doing?
Making things that have processors in them run Doom is, like, step 2 for some people after “Hello, World”. Or maybe step 3, if your steps are (1) Bring-up, (2) Hello, World, (3) Make it run Doom. There is, as always, a subreddit, r/ItRunsDoom. ↩