Tuesday, March 12, 2013

On Big Data

Just read an article called "How a Computer Program Can Learn All About You From Just Your Facebook Likes." Slashdot linked to it with a slightly more daring title: "Facebook Knows If You're Gay, Use Drugs, Or Are a Republican." The actual paper that these articles talk about is here, and while it's very readable for a paper, its title is "Private traits and attributes are predictable from digital records of human behavior." So...I'm going to go ahead and summarize it yet again here.

The basic gist of the paper is that a bunch of dudes from Cambridge and Microsoft wanted to see whether they could find things out about people based solely on what they like on Facebook. Turns out, they can find out a ton. This is kind of scary because Facebook likes are public by default, visible even to people you haven't friended. So companies and strangers can find out a ton about you, pretty much for free.

Here's the relevant chart that shows how strongly Facebook likes correlate with certain traits:

For some traits, like gender and race, Facebook likes are nearly conclusive evidence. It's not quite so strong for things like drug use or lesbianism, but it's still relatively strong. Remember that the numbers in the chart are AUC scores (area under the ROC curve) rather than correlations, and random guessing should theoretically score 0.5. So while Facebook likes can point strongly in a direction for all of these traits, they're hardly a smoking gun for anything below 0.8.
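To see why 0.5 is the baseline for random guessing, here's a minimal sketch, assuming the chart's numbers are AUC scores (area under the ROC curve). The data is entirely synthetic and the `auc` helper is my own illustration, not the paper's code: AUC is just the probability that a randomly chosen person with the trait gets a higher score than a randomly chosen person without it.

```python
import random

def auc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic people: 1 = has the trait, 0 = doesn't (made-up data).
labels = [rng.randint(0, 1) for _ in range(1000)]

# Three hypothetical "predictors" of the trait.
random_scores = [rng.random() for _ in labels]              # pure guessing
informative_scores = [l + rng.gauss(0, 1) for l in labels]  # noisy but useful signal
perfect_scores = [float(l) for l in labels]                 # knows the answer

auc_random = auc(labels, random_scores)
auc_informative = auc(labels, informative_scores)
auc_perfect = auc(labels, perfect_scores)

print(f"random guessing: {auc_random:.2f}")
print(f"noisy signal:    {auc_informative:.2f}")
print(f"perfect:         {auc_perfect:.2f}")
```

The guessing predictor should land close to 0.5, the perfect one at exactly 1.0, and the noisy one somewhere in between, which is roughly where the weaker traits in the chart sit.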

The paper also correlated other things like intelligence, age, number of friends, and personality traits like extroversion with Facebook likes, with varying degrees of success. I'm actually not quite sure how powerful the correlations are, but the paper assures me that the findings are statistically significant. So there's that.

The fun part is seeing which "likes" correlate with which traits. It's not exactly what you might think: after all, only 5% of actual gay people "like" things that are explicitly homosexual on Facebook. Also, if there's anything that Nate Silver and grad school have taught me, it's that we tend to overfit theories to data. So a lot of these might just be the result of bad training data. But some of them...well, they seem to make sense. Here are a few excerpts:

  • Higher intelligence correlates with liking "The Godfather," "The Lord of the Rings," "The Colbert Report," "The Daily Show," and "Science." Makes sense. But it also correlates with liking "Thunderstorms" and "Curly Fries." You never know, maybe there's something here...
  • Lower intelligence correlates with "Tyler Perry," "Harley-Davidson," "Sephora," and "Lady Antebellum." I'm not going to make any comments here.
  • Neurotic people like "Emo," "Dot Dot Curve," and "So So Happy." I don't even know what those are and it still seems to make sense.
  • Homosexual males like "No H8 Campaign," "Mac Cosmetics," and "Human Rights Campaign." Makes sense. And "Wicked." Crap.
  • Heterosexual males like "Wu-Tang Clan," "Shaq," and "Being Confused After Waking Up From Naps." I kind of love that that last one's actually a thing.
  • Homosexual females like "Not Being Pregnant," "The L Word," and "Sometimes I Just Lay In Bed And Think About Life."
And so it goes. I dunno, I think I'm supposed to be scared by these articles into wanting more privacy, but honestly, I find this kind of stuff more fascinating than anything. Some people want to hide from Google or whatever...me, I want to write the program that crunches all this stuff.

-Tim
