I’ve been thinking about the paper and movie linked in the previous post. Have a look at that if you haven’t, because it’s neat.
Here’s what you might think about the movie. You might look at the clip on the left and the movie on the right, and think “Wow! Those look pretty similar!” You might further think “Gee, they must look similar because visual areas contain a remarkably accurate map of what people are seeing, and these guys have figured out a way to show me that mapping! That’s cool!” (If you tend towards Fodorian crankiness, you might also think “Who cares? We knew this already! I’ve learned nothing!”).
Watch a little closer, though. Why do the elephants at 0:12 look like the inkblots at 0:06? Why does what appears to be a mattress and some text show up at 0:07 when there’s nothing like that on the left? Why does the African-American dude with the stethoscope at 0:20 suddenly turn into a distinctly un-stethoscoped white woman at 0:21?
There’s a good answer to these questions. Here’s the simple version: the movies on the right are not brain data. They are a bunch of YouTube clips superimposed on top of each other. The choice of clips is based on the brain data: they are the 100 clips from the test set that the model judged most likely. Since the model is quite good at picking clips that subjects had seen, there’s a lot of overlap. But you’re not seeing the “movie in your head” in any important sense.
Here is my best guess at the details, having puzzled through the methods section and the SOMs. (Some caveats: I haven’t had time to work through the details of the modeling. The motion-energy part of it would be beyond my pay grade anyway. It’s late, I’m on a diet, and so I’m ornery. So this is a pretty bird’s-eye view; if I’ve gotten anything obviously wrong, please let me know.)
Step 1: Take your movie clips, split them into short chunks, downgrade the heck out of them, and throw out chromatic information.
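For concreteness, here’s roughly what that preprocessing might look like, assuming the frames arrive as RGB numpy arrays. The chunk length and target resolution below are my illustrative guesses, not the paper’s actual numbers.

```python
import numpy as np

def preprocess(frames, target=(96, 96), chunk_len=15):
    """frames: (T, H, W, 3) RGB video -> list of small grayscale chunks."""
    # Throw out chromatic information with a standard luminance weighting.
    gray = frames @ np.array([0.299, 0.587, 0.114])          # (T, H, W)
    # Downgrade spatial resolution by crude block-averaging.
    T, H, W = gray.shape
    fh, fw = H // target[0], W // target[1]
    small = gray[:, :fh * target[0], :fw * target[1]]
    small = small.reshape(T, target[0], fh, target[1], fw).mean(axis=(2, 4))
    # Split into short fixed-length chunks, dropping any remainder.
    return [small[i * chunk_len:(i + 1) * chunk_len]
            for i in range(T // chunk_len)]
```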
Step 2: Feed those movie-chunks into a fancy model designed to extract motion components.
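I won’t pretend to reproduce their model, but the core trick in motion-energy models is a bank of spatiotemporal Gabor filters in quadrature pairs: squaring and summing a sine-phase and a cosine-phase filter gives a response tuned to a particular direction and speed but insensitive to exact position. A toy, single-filter version, with made-up parameters, might look like this:

```python
import numpy as np

def gabor_energy(chunk, sf=0.05, tf=0.2, theta=0.0, sigma=8.0, tau=3.0):
    """chunk: (T, H, W) grayscale movie chunk -> one motion-energy value."""
    T, H, W = chunk.shape
    t = np.arange(T) - T / 2
    y, x = np.mgrid[0:H, 0:W]
    y, x = y - H / 2, x - W / 2
    xr = x * np.cos(theta) + y * np.sin(theta)       # preferred orientation
    env = np.exp(-(x**2 + y**2) / (2 * sigma**2))    # spatial envelope
    tenv = np.exp(-t**2 / (2 * tau**2))              # temporal envelope
    # A drifting grating: spatial frequency sf, temporal frequency tf.
    phase = 2 * np.pi * (sf * xr[None] + tf * t[:, None, None])
    even = env[None] * tenv[:, None, None] * np.cos(phase)
    odd = env[None] * tenv[:, None, None] * np.sin(phase)
    # Quadrature pair: sum of squared responses is phase-invariant "energy".
    return np.sum(chunk * even) ** 2 + np.sum(chunk * odd) ** 2
```

The real model runs a great many of these, at many positions, scales, orientations, and speeds, and the vector of their outputs is the feature representation of a chunk.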
Step 3: Take the results of step 2 and convolve them with a hemodynamic response function for each movie-chunk. That gives you a prediction for the BOLD response in a voxel if the subject were watching that chunk.
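If you want the flavor of it: take each motion-energy output over time and convolve it with a canonical HRF. The double-gamma shape below is a common off-the-shelf choice, not necessarily what the authors used.

```python
import numpy as np
from scipy.stats import gamma

def hrf(tr=1.0, duration=32.0):
    """A common double-gamma approximation to the hemodynamic response."""
    t = np.arange(0, duration, tr)
    peak = gamma.pdf(t, 6)            # positive lobe, peaking ~5-6 s
    undershoot = gamma.pdf(t, 16)     # late negative lobe
    h = peak - 0.35 * undershoot
    return h / h.sum()

def predict_bold(feature_timecourse, tr=1.0):
    """Model output over time -> predicted BOLD time course for a voxel."""
    full = np.convolve(feature_timecourse, hrf(tr))
    return full[:len(feature_timecourse)]   # trim the convolution tail
```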
Step 4: In each voxel, determine the goodness of fit for each predicted response you got in step 3. That gives you posterior probabilities for each chunk at each voxel. Then take the most discriminating voxels and combine their predictions into a single guess about which chunk was being viewed at each time point.
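Under a simplifying assumption I’m making purely for illustration (isotropic Gaussian noise around each chunk’s predicted response, and a flat prior over chunks), the posterior computation reduces to normalizing the likelihoods:

```python
import numpy as np

def chunk_posteriors(observed, predicted, sigma=1.0):
    """observed: (V,) responses in the selected voxels;
    predicted: (C, V) step-3 predictions for C candidate chunks over the
    same voxels -> (C,) posterior probability of each chunk."""
    # Log-likelihood of the observed pattern under each candidate chunk.
    sq_err = np.sum((predicted - observed[None, :]) ** 2, axis=1)
    log_lik = -sq_err / (2 * sigma ** 2)
    # Flat prior: the posterior is just the normalized likelihood.
    log_post = log_lik - log_lik.max()        # subtract max for stability
    post = np.exp(log_post)
    return post / post.sum()
```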
It turns out that those predictions are impressively good. That is a cool feature of this paper: the BOLD signal is usually thought to be too sluggish to get any kind of fine temporal information out of it, especially with this sort of stimulus design. So I’m not downplaying what they did. It’s neat. I wouldn’t have thought it would work, but there you go. But where do the movies come from? So far as I can tell, each voxel is making a prediction about whole movie chunks. Well:
Step 5: Then take the movie clips—the original clips, not the brain data—and average them together, weighted by the posterior probabilities you got in step 4.
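That last step is worth seeing in code, because it makes the point vivid: nothing from the scanner appears in it except the weights. The top-100 cutoff comes from the paper’s description; the rest is my sketch.

```python
import numpy as np

def reconstruct(clips, posteriors, top_k=100):
    """clips: (C, T, H, W) the original candidate movie chunks;
    posteriors: (C,) step-4 probabilities -> one averaged (T, H, W) chunk."""
    top = np.argsort(posteriors)[-top_k:]          # most probable chunks
    w = posteriors[top] / posteriors[top].sum()    # renormalized weights
    # Note what's being averaged: actual clips, not any brain signal.
    return np.tensordot(w, clips[top], axes=1)     # ghostly weighted average
```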
That’s why the movies on the right are all eerie and ghostly: each bit you’re seeing is an average of a whole bunch of YouTube clips. That’s why there are lots of odd artifacts: sometimes the model misfires, and so clips that have nothing at all to do with the original movie get included. That’s why it seems to work pretty well with talking heads: those are plentiful in their data set, so the model’s mistakes tend to be other talking heads. That’s why it doesn’t work so well with elephants and inkblots—the misfires tend to pick out unrelated chunks. That’s why the text bits are unintelligible but text-y. They’re using movie trailers, and the bits of text in each are (I’m guessing) pretty hard to tell apart using this kind of model, so it’s just kind of glomming all of them together.
In short, this is not reading movies off of your brain. This is using a complicated model to predict what movies someone saw, and then averaging together the original clips corresponding to the best predictions. The movies on the right are a snazzy, misleading way of representing the posterior probability distributions over the set of possible clips. Since many of the wrong guesses still match up well with the original stimulus, it partly validates the model’s ability to pick out clips with similar visual features at a relatively fine time scale. (More cynically, if they’d used only the top prediction of the model, which was usually pretty accurate, it would be obvious that they were just showing you YouTube clips.)
There’s a lot of cool stuff in the paper aside from the methods, especially about selectivity in V1. It’s an impressive proof of concept. But we’re still pretty far away from reading minds.