s07e10: Tireless five-year-olds
0.0 Sitrep
Lunchtime, Wednesday October 2, and honestly the less said about what I’ve been able to wolf down for lunch the better, but I guess in principle you can definitely identify protein, carbs and some sort of vegetable in there, so, shrug ¯\_(ツ)_/¯
This week is a Portland week - so a remote work week - and has involved a bunch of sitting down and concentrating and reading contracts, and then scribbling in the margins with comments and tracked changes. It feels like being a lawyer doing contract review again! (This is because, he reflects, he is actually doing contract review.)
And: I don’t know if it’s a recent thing, but it feels like ever since Word went subscription it’s been crashing more. I type at roughly 160 words a minute, so when Word crashed and “helpfully” loaded its recovered version, I’d lost ten minutes’ worth of work because… that’s what autosave defaults to in 2019? Which, although it wasn’t quite a thousand words’ worth, certainly felt like a lot of words and made me Quite Angry.
Software is dumb and on balance, should probably not be allowed as a thing.
(One of you is going to tell me I should’ve just been doing the whole thing in vim or emacs or whatever).
Anyway! Ten episode celebration! Double figures! I guess if you count the subscriber-only episode last week, I already hit that. But anyway, arbitrary milestone based on body morphology! Woo!
Here’s the requisite subscribe button:
… and I’m pleased to announce as a major innovation in button technology, a new button, and this is one where you can give a gift subscription! Huh!
It is also a red button, but I guess you can’t have everything and maybe this is just a Substack branding thing.
1.0 Some Things That Caught My Attention
OK, I’ll fess up. Many times, the things that catch my attention are actually Annoying Things and they make me annoyed enough to animate me into writing, almost as if I were some crazed early 2000s era warblogger. (I was not. But, speaking of which, who remembers people like Steven den Beste!) Huh. I was not aware he had passed away. Awkward.
Anyway, two things this episode:
1.1 Streetscore, or: how I learned to stop worrying and love machine learning
Spoiler: I do not end up stopping worrying and loving machine learning.
Via Dr. S. A. Applin, I saw a tweet from @hypervisible about Streetscore, an algorithm from MIT Media Lab. Streetscore “assigns a score to a street view based on how safe it looks to a human — but using a computer”.
Now, you may be thinking you already know what I’m going to write here, but let’s just go along for the ride and see what happens.
The Streetscore team at the Media Lab has an FAQ, and from there we learn these facts (or, I guess, “answers” to frequently asked “questions”):
Streetscore was trained on “3,000 street views from New York and Boston and their rankings for perceived safety obtained from Place Pulse — a crowdsourced survey.”
Place Pulse itself is a Macro Connections / MIT Media Lab project that asks people “which place looks safer?” Their website says they’ve collected over 1.5m clicks.
The team page for Place Pulse lists a current team of 3 and a past team of 8. None of the 11 researchers’ biographies list an interest in subjects such as anthropology or sociology. Or in the ethics of building such systems.
Why did the team build Streetscore? Because “it allows us to scale up the evaluation of street views for perceived safety by several orders of magnitude when compared to a crowdsourced survey.” In other words, Streetscore would let the team produce perceived safety maps where there is no crowdsourced data.
The team uses computer vision and machine learning to “extend the idea of predicting emotional response from images to predicting perceived safety” and to “generate map visualizations [of] the perceived safety of thousands of street view images from the same city.” A rationale used for this is that “these visualizations are highly useful for urban science research”.
The team says Streetscore is useful for research in urban science because it would let people “study the dynamics of urban improvement and decay.” It can also be used to “study the spatial segregation of the urban environment (a form of urban inequality),” and also to “explore the determinants of urban perception.” It can also “empower research groups working on connecting urban perception with social and economic outcomes by providing high resolution data on urban perception.”
The team says Streetscore is 84% accurate at sorting images that score less than 4.5 from those that score more than 5.5 (there’s a rough sketch of what that kind of evaluation looks like just after this list).
Streetscore gives “low rankings” for images that might look “quite safe to me” because it “can predict low rankings when evaluating images with visual elements it has not encountered in the training dataset, which consisted of only 3,000 images from Boston and New York City”. The team says these errors emerge in part “because our training dataset is not comprehensive enough for Streetscore to learn all of the visual variation found in urban environments.”
There is a last answer, to a question about whether the perception of safety is connected to crime, which includes (to my great non-surprisal) a reference to Broken Windows theory.
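(If it helps to see the shape of the thing: below is a rough sketch of the pattern the FAQ describes: fit a model on a few thousand crowd-scored images, use it to score everything else, and grade accuracy only on the clear-cut extremes. It uses synthetic stand-in data, a generic off-the-shelf regressor and made-up variable names; it is emphatically not the team’s actual features, model or code.)

```python
# Sketch of the "train on a few thousand crowd-scored images, score everything
# else" pattern. Synthetic stand-in data throughout -- not Streetscore's
# actual features, model or scores.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins: 3,000 "images" as feature vectors, with crowd-derived safety
# scores on a 0-10 scale. In reality these would be visual features extracted
# from street views, plus Place Pulse-derived scores.
X = rng.normal(size=(3000, 64))
y = np.clip(5 + X[:, :8].sum(axis=1), 0, 10)  # fake scores loosely tied to the features

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# The quoted 84% number is this flavor of metric: only images whose true score
# is clearly low (< 4.5) or clearly high (> 5.5) get graded; the ambiguous
# middle band is left out entirely.
extremes = (y_test < 4.5) | (y_test > 5.5)
predicted_high = model.predict(X_test[extremes]) > 5.0
actually_high = y_test[extremes] > 5.5
print("accuracy, extremes only:", (predicted_high == actually_high).mean())

# And then the same model gets pointed at street views from cities it has
# never seen -- here just more random vectors standing in for ~1M images.
everywhere_else = rng.normal(size=(100_000, 64))
scores_everywhere_else = model.predict(everywhere_else)
```

The point of the sketch isn’t the numbers, it’s the shape: everything downstream, including that tidy-sounding accuracy figure, inherits whatever the 3,000 crowd-labeled images did and did not contain.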
There’s a lot going on here!
First: the team used Streetscore to create high-resolution maps based on perceived safety data, and the first great honking alarm klaxon that should be going off is the perceived safety part; in particular, may I suggest one starts wondering: well, whose perceived safety? Because it’s only a hop, skip and a jump to finding out that the data was from Place Pulse… a crowdsourced set of data. Which… means there are inherent biases in the dataset from Place Pulse, right? It’s the perceived (reported perceived!) safety from whoever happened to be driving by Place Pulse and decided to have a go at rating perceived street scene safety.
Now, I’m all up for crowdsourcing, me. It’s great for things like, I dunno, finding out more about what people are actually charged for the healthcare they receive. Or for looking through a bunch of documents. Or for spotting coronal mass ejections from the sun and classifying sunspots.
But, you know, when we’re talking about street scene safety, and especially perceived safety, then… maybe the people doing the perceiving matter?
Anyway! Here is the paper from 2014: Streetscore - Predicting the Perceived Safety of One Million Streetscapes and let me just say that the paper presents a set of “high resolution maps of perceived safety for 21 cities in the Northeast and Midwest of the United States at a resolution of 200 images/square mile, scoring ∼1 million images from Google Streetview”.
High resolution maps from a training set of 3,000 images. From New York and Boston.
OK, great. The FAQ pointed out that Streetscore sometimes produces low rankings when it encounters visual elements not in the training dataset, “because our training dataset is not comprehensive enough for Streetscore to learn all of the visual variation found in urban environments”. Yes. Because your training set was 3,000 images.
But, you know, that’s OK! Because the authors point out that the high resolution maps produced by Streetscore “should be useful for urban planners, economists and social scientists looking to explain the social and economic consequences of urban perception.”
I mean, should they be? They’re from an extremely limited dataset! How homogeneous are American cities, anyway?
Look. Let me do a thing that I have done before, which is to count citations and try to categorize them, which is not a thing I should really do because I am not a professional researcher and I do not know much about coding (no, not that kind of coding, the other kind of coding).
Streetscore’s paper has 27 references. Let’s break them down:
14 of the references are to journals or proceedings with “computer vision” in the title, e.g. IEEE Computer Vision and Pattern Recognition, European Conference on Computer Vision and IEEE International Conference on Computer Vision and so on. These make up over half of the references.
3 of the references are to machine learning journals, e.g. Advances in Neural Information Processing Systems and Neural computation.
Only 9 references in the paper have anything to do with urban planning, sociology or public health, and they are:
D. A. Cohen, K. Mason, A. Bedimo, R. Scribner, V. Basolo, and T. A. Farley. Neighborhood physical conditions and health. American Journal of Public Health, 93(3):467–471, 2003.
A. Dulin-Keita, H. K. Thind, O. Affuso, and M. L. Baskin. The associations of perceived neighborhood disorder and physical activity with obesity among African American adolescents. BMC Public Health, 13(1):440, 2013.
P. Griew, M. Hillsdon, C. Foster, E. Coombes, A. Jones, and P. Wilkinson. Developing and testing a street audit tool using Google Street View to measure environmental supportiveness for physical activity. International Journal of Behavioral Nutrition and Physical Activity, 10(1):103, 2013.
K. Keizer, S. Lindenberg, and L. Steg. The spreading of disorder. Science, 322(5908):1681–1685, 2008.
M. A. Kuipers, M. N. van Poppel, W. van den Brink, M. Wingen, and A. E. Kunst. The association between neighborhood disorder, social cohesion and hazardous alcohol use: A national multilevel study. Drug and Alcohol Dependence, 126(1):27–34, 2012.
K. Lynch. The image of the city, volume 11. The MIT Press, 1960.
J. L. Nasar. The evaluative image of the city. Sage Publications, Thousand Oaks, CA, 1998.
P. Salesses, K. Schechtner, and C. A. Hidalgo. The collaborative image of the city: mapping the inequality of urban perception. PLoS ONE, 8(7):e68400, 2013.
and, I shit you not, the fucking Broken Windows citation:
J. Q. Wilson and G. L. Kelling. Broken windows. Atlantic Monthly, 249(3):29–38, 1982.
I get it. This paper was from 2014, an, um, altogether different time in human history when we were unable to comprehend exactly how automated systems might otherwise encourage, perpetuate or support systems of inequality. But it’s striking to me that the paper doesn’t critically evaluate the utility of crowdsourced data, saying instead that: “The good news about crowdsourced studies is that they provide an ideal training dataset for machine learning methods building on scene understanding literature in computer vision.”
I mean, sure, the paper does touch on potential bias. The authors cite The collaborative image of the city: mapping the inequality of urban perception in their defense, arguing that subjective perception of safety is not “driven by biases in age, gender or location of the participants, but by differences in the visual attributes of images.” Is it just me, or is race missing from that sentence?
Here is the thing: imagine a giant thinking-face-emoji on the sole of my boot, stamping on things like this.
I am just not sure that it is responsible to create a dataset for 21 cities based on 3,000 images from two of them, while simultaneously admitting that the model under-scores images whose features are missing from that admittedly anemic training set, and still claiming the resulting datasets should be useful. Forgive me for being a bit weird, but would it not be easier to ask people how they feel about the perceived safety of their own neighborhoods?
There is a bit about scaling here. In the abstract, the paper talks about the need for data about the appearance of a neighborhood, and says that while crowdsourcing is good, it only has a limited throughput. A phrase like a limited throughput is roughly equivalent, in my mind, to it does not scale, and one thing I feel like I’ve learned lately is that it does not scale means something like this is hard and we would like to cheat, or for it to be easier, or for it to be less expensive. It feels like there is less rigor involved, and that scaling or increased throughput via the novel application of machine learning and image classification techniques is just code for creating more data.
I am disappointed that the paper doesn’t go into more detail about any need to diversify the scope of its crowdsourced data. I am frustrated that the conclusion to the paper has a sentence like “a machine learning method trained using image features from a small dataset of images from New York and Boston labeled by a crowd, can be used to create ‘perception maps’ of 21 cities from United States at a resolution of 200 images/square mile,” because it does not feel like there is any rigor in examining what “create ‘perception maps’” means.
I start thinking about things like this: you’re using a machine learning method to create a perception map that is supposed to simulate what a people-sourced perception map would look like. So how do you evaluate the performance of the predicted perception maps for cities that are not New York or Boston? The authors think that - intuitively - “the external validity of the predictor would depend on the similarity between architectural styles and urban planning of the cities being evaluated” but admit that there is no such quantitative data on those axes. So instead what do they use? The average median income of a city as a naive metric!
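(For what it’s worth, here’s my guess at what using median income as a similarity proxy might look like in practice. This is emphatically not the paper’s actual procedure; the city names are just plausible examples, not necessarily the paper’s 21, and the income figures are invented placeholders.)

```python
# My guess at a "median income as naive similarity proxy" check -- NOT the
# paper's procedure. City names are illustrative and the incomes are invented
# placeholders, not real figures.
TRAINING_CITIES = {"New York": 60_000, "Boston": 62_000}   # where the labels came from
TARGET_CITIES = {"Detroit": 31_000, "Chicago": 55_000, "Minneapolis": 58_000}

baseline = sum(TRAINING_CITIES.values()) / len(TRAINING_CITIES)

# Rank target cities by how far their median income sits from the training
# cities' average, and flag the far-off ones as shakier ground.
for city, income in sorted(TARGET_CITIES.items(),
                           key=lambda kv: abs(kv[1] - baseline), reverse=True):
    gap = abs(income - baseline) / baseline
    caution = "a lot of" if gap > 0.25 else "some"
    print(f"{city}: median income is {gap:.0%} away from the training cities' average; "
          f"treat its 'perception map' with {caution} caution")
```

Which, even as a sanity check, says nothing about whether the visual grammar of “safe” transfers between cities; it mostly tells you how rich they are.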
Again. I am not a research scientist. I did not go to MIT. I have, in all likelihood, not read the paper or understood it properly, but let me have this reaction which is: what the fuck?
Ugh.
1.2 Tireless five-year-old children
Here’s an analogy about machine learning that might be helpful based on the above poorly structured rant.
I have two young children: one six-and-a-half-year-old child, and one three-year-old. Even before I had children, I was interested in how children learn. Now that I have two to experiment on, it’s pretty awesome and I get to run a super interesting longitudinal study. But I digress.
Streetscore, above, is based on a training set of around 3,000 images. It can purportedly predict the safety of arbitrary street images.
When my kid was five, he could totally tell you whether things were vehicles or not. He’s pretty good at makes of cars now, which he uses to full effect telling us exactly which brand we should buy, because we are obviously made of money and buy cars all the time. (We do not.)
Anyway. My kid is good at recognizing cars because he has been exposed to pictures of cars and vehicles and we have helped him recognize cars and vehicles and he pays attention when we talk. He may have seen 3,000 pictures of cars, who knows! But, he is pretty good at recognizing them.
I feel like many applications of machine learning and artificial intelligence right now, mostly in image classification (is this a dog, is this a hotdog, is this person a good fit for this job, is this person about to steal from your front door) are the equivalent of chaining up a bunch of tireless five-year-old kids in front of webcams and asking them to tell you what they see.
Only, of course, instead of letting my kid tell me that we should buy a Land Rover, daddy, we’re hooking these artificially intelligent five-year-old-kid image classifiers up to the systems that actually buy cars, and I feel like, if you put it this way: what are you, insane?
I mean, if I told you I was outsourcing much of my hiring decision-making to asking my six-year-old whether a candidate’s headshot looked smart or not, what would you tell me to do? And my six-year-old can do a better job of explaining why!
Kids are also super great because they do not get confused if a couple of pixels are slightly off in shade and tell you that this car is actually a banana, or that this bus actually is a person and you should brake now.
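(A toy illustration of what I mean, because it still surprises me: a bare-bones linear scorer over a fake image, and the size of the per-pixel nudge it takes to drag it onto its decision boundary. This is pure NumPy with made-up weights, nothing to do with Streetscore or any production model; real image classifiers are nonlinear, but the high-dimensional version of this effect is what adversarial examples exploit.)

```python
# A toy, self-contained illustration (not any real model): a linear scorer
# over a fake 28x28 "image", and how small a per-pixel nudge it takes to
# reach its decision boundary.
import numpy as np

rng = np.random.default_rng(42)
n_pixels = 28 * 28
w = rng.normal(size=n_pixels)             # made-up weights: score > 0 means "car"
x = rng.uniform(0.0, 1.0, size=n_pixels)  # a fake image, pixel values in [0, 1]

score = w @ x
label = "car" if score > 0 else "not a car"

# Smallest uniform per-pixel change, each pixel moved against the sign of its
# weight, that drags the score to the decision boundary; anything more than
# this and the label flips.
nudge = abs(score) / np.abs(w).sum()

print(f"label: {label}, score: {score:.2f}")
print(f"per-pixel nudge needed to flip it: {nudge:.4f} of the 0-1 range "
      f"(roughly {nudge * 255:.1f} shades out of 255)")
```

My kid, for comparison, does not stop recognizing cars because the lighting shifted by a few shades.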
And look, yes, I get it. It’s not quite like how kids learn, but let me tell you a story about how in some cases it is quite frighteningly like how kids learn.
We live in Portland, Oregon, in a historically black neighborhood that is being gradually gentrified. It was being gentrified before we got here nearly 10 years ago; it is significantly more gentrified now. Our kids, Portland being Portland, hardly see people who aren’t white. Portland turns out to be pretty racially segregated for some weird reason. But our eldest knows one of the African American men who lives on our street and comes round to do yard work for our neighbors, and he is very friendly and happy to see… let’s call him Mr. Mike.
So we’re out having dim sum one day, and we happen to walk past a Gamestop and there is a big poster on the front door for CALL OF DUTY BAD GOOSE SIX or MEDAL OF GOOSE HONOR only it’s not those things, it’s actually MEN SHOOT EACH OTHER IN THE FACE, and eldest says, Hey look! It’s Mr. Mike!
Only it’s not Mr. Mike, is it, because it’s just a black man on the poster and he’s only seen significantly fewer than 3,000 black men, and my partner and I look at each other and realize, again, that we’ve got to do something different and I say to eldest: OK, look: that’s not Mr. Mike. “But he looks like Mr. Mike!”. “Yeah. About that…”
Yes: transparency in where machine learning is being used, and what it’s being used for, and maybe, just maybe, in whether it can explain its decisions (good luck getting a bureaucracy to explain its decisions), but for the love of god, transparency in training datasets, please.
2.0 And Finally
A few small things:
Lívia Labate wrote a great thread the other day for people who work with (against? alongside? grudgingly? in a sense of uneasy détente with?) Jira, about stuff her team does that makes using Jira bearable. I am using Jira in the day consulting job, and… well, I have opinions. One quick opinion is wanting someone to explain to me whether Enterprise Software means “slow” and “gets irritating when you want to type things into it”.
Via @EmilyGorcenski’s observation that “Half of my fucking exercise these days is from marching to oppose some goddamn fascist bullshit or some other”, I had the thought that someone would inevitably do some positioning like 10,000 steps a day to fight fascism, which @kateallday immediately translated into Scots. I know there are still ad agency, planning and strategy people reading this, so I’M WATCHING YOU.
I completely forgot about Tweetdeck’s existence (though I do remember when it first came out, as its own thing), because I’m not really a big Twitter user (really?!), but was reminded about it recently (Warren Ellis mentioned it in his newsletter, Tom Coates mentioned it to me in some conversation), so I tried it out again the other day. I had the random thought that Tweetdeck offers so much more control and customization than Regular Twitter Dot Com that… maybe more people could benefit from using it? Many complaints about Twitter (and they are valid, I feel) are about its primary interfaces, and I feel that some of them are probably addressed by what Tweetdeck lets you do. I think it’s a sign of my exposure to branding and advertising that I’m wondering whether a name change to Twitter Pro would actually be net helpful. (I know! It doesn’t actually fix any underlying problems! Those still need to be addressed! And yet!)
Nature has a 40-year review of The Hitchhiker’s Guide to the Galaxy, on which: a) I am always up for H2G2 references; b) I am aware this is a rather unrepresentative and non-diverse pov to keep coming back to; and c) I did just mention Genuine People Personalities the other day.
This TouchID bug that still exists in iOS 13.1.2 is a fucking disaster and Apple should be ashamed.
The general observation that if CNN and other news networks can decide to a) not cover a Trump rally live, and b) cut away from a live Trump press conference or interview because he’s just… being batshit, then there is no excuse, and it is as clear as ever that Twitter is choosing to keep Trump’s account alive. “We can’t get rid of his account because he’s newsworthy” does not fly when established news sources (or at least outlets perceived as, and operating as, established news sources in society, without commenting on their journalistic integrity, etc.) have already decided they don’t need to give him a platform all the time. In summary: fuck Jack Dorsey.
Untitled Goose Game is still amazing; here’s an interview with its creators.
Look, I get these are long. But… they also don’t take that much time to write? And also, if you’ve been reading a while, you’ll know that I’m not so much writing these for you, or an audience, but writing for myself to think and to organize thoughts. You, uh, just happen to be collateral damage. (Which apparently some of you are very happy to be!)
End of lunch. I even remembered to drink some water. Have you drunk any water in the last three hours? There you go, health tips and everything. My, aren’t we growing responsible in our… older-than-before age.
Hope you’re well, and do send me notes because I do read them and reply to them.
Best,
Dan
PS. Oh right, a button - one more episode this week for subscribers only, and I’ll be back next week.