Oct 11, 2008

Scraping facebook email addresses

Last night I had a dilemma - I came to realize that I don't have even a fraction of my friends' email addresses in my contact book, which is a very bad thing by any means. Of course there's Facebook for ya! But it's still no substitute for some good ol' emails.

So I thought maybe I could simply get them off of facebook - no go!

Why? Facebook doesn't provide plain-text email addresses, which presents a bit of a problem. After a little research, it became clear that FB uses one of those string-to-image scripts. Hah! Easy, I thought, I'll just decode the Base64 string and voila... As it happens, it's not that easy. It's not a Base64 string, and to be honest I couldn't figure out what it was. So that left me with the other option - OCR.

This didn't prove too difficult at all. For the most part all I had to do was go through all my friends' profile pages, extract the string_image hash, and pass that to

http://m.facebook.com/string_image.php?ct=XXXXXXX&fp=8.7&state=0

where ct takes the hash, and fp is a float that controls the size of the output image. 8.7 is the standard value. You can crank that up to improve the OCR detection rate. I found 35 to be the best trade-off between size and clarity.
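To make the fp tweak concrete, here is a minimal sketch of swapping the size value in such a URL. The helper name set_fp is made up for illustration, and the ct value is the same placeholder as above. It anchors on the parameter name, so digits that happen to appear inside the ct hash are left alone.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): set the fp (size)
# parameter of a string_image.php URL.
sub set_fp {
    my ($url, $size) = @_;
    # Match "fp=" right after "?" or "&" so nothing inside ct is touched.
    $url =~ s/([?&]fp=)[0-9.]+/$1$size/;
    return $url;
}

# ct=XXXXXXX is a placeholder, exactly as in the URL above.
print set_fp('http://m.facebook.com/string_image.php?ct=XXXXXXX&fp=8.7&state=0', 35), "\n";
```

A bare pattern like 8.7 would also match "8x7" or similar runs elsewhere in the URL, since the dot matches any character.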

Based on this, I was able to whip up a quick bash script that takes in a list of user IDs (just a bunch of numbers that correspond to a given user. Do what you will to grab them), grabs the email image and runs OCR on it. I used OCRAD to do the OCR, and ImageMagick for conversion.

EDIT: It saddens me that some people have been making money off the code that I wrote. I helped you guys out in good faith. Really sucks that you took advantage of it. Anyway, I've decided to re-post the code here so the lamesters can be exposed for what they are. I'm posting the rewritten perl code here, since the original bash thing didn't work anyway.

NOTE: I have made some deliberate omissions here. Modifications are needed before the code will be functional. You WILL GET BANNED by Facebook if you overdo it.

here's the bash stuff:

BOING BOING!!! Where did the code go? Sorry guys, I had to remove it.

and in perl: (the xxxx's should be easy to figure out if you see my other scripts)

#!/usr/bin/perl
use strict;
use xxxxx;
use xxxxx;
use Image::Magick;
use Shell qw[ocrad];

my $username = $ARGV[0];
my $password = $ARGV[1];
my $iurl;#temp var
my $id;  #temp var
my $x;   #temp var
my $uids="uids"; #path of uid list file
my $idlist="idlist"; #path of output file
my $size=35;  #size of email image to download

my $mech = xxxxxxx->new();
my $image = Image::Magick->new();

$mech->cookie_jar(xxxxxxxxx->new());

#login
$mech->post("https://login.facebook.com/login.php?m&next=http://m.facebook.com/inbox",{email=>$username,pass=>$password});

#start processing uids
open(UIDS,$uids) or die "cannot open $uids: $!";
open(IDLS,">>$idlist") or die "cannot open $idlist: $!";
foreach $id (<UIDS>)
{
  chomp($id);
  $mech->get("http://m.facebook.com/profile.php?id=".$id."&v=info&refid=17");
  if(defined ($iurl=$mech->find_image( url_regex => qr/string_image.php/ )))
  {
    #swap the default size for ours; anchor on fp= so the ct hash is not touched
    ($iurl=$iurl->url_abs())=~s/fp=8\.7/fp=$size/;
    chomp($iurl);
    $x = $image->Read($iurl);
    $x = $image->Write(gamma=>0.3,colorspace=>'rgb',filename=>$id.".ppm");
    print IDLS "$id,".ocrad("$id.ppm")."\n";
    @$image = (); #clear the image object before the next uid
  }
  else
  {
    print IDLS "$id,undefined\n";
  }
}

close(UIDS);close(IDLS);

This works remarkably well for the most part, although ocrad did confuse some 1's for l's. I had better results with tesseract - but I had to convert all the images to bi-tonal graymaps first. Otherwise it's simply useless.
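Since the OCR output isn't always clean, a quick sanity check on each result helps decide which lines need a second look by hand. This is only a sketch: looks_like_email is a made-up helper, the pattern is a rough shape test rather than a full address validator, and it can't catch a 1 read as an l, since both are valid characters.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): returns 1 when the
# OCR text has the rough shape of an email address, 0 otherwise.
sub looks_like_email {
    my ($text) = @_;
    $text =~ s/^\s+|\s+$//g;   # trim whitespace and the newline left by the OCR tool
    return $text =~ /^[A-Za-z0-9._%+-]+\@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$/ ? 1 : 0;
}

# Made-up sample OCR lines: one clean, one with stray spaces and a lost dot.
for my $line ("someone\@example.com\n", "some one\@example com\n") {
    (my $shown = $line) =~ s/\s+$//;
    print "$shown => ", (looks_like_email($line) ? "ok" : "review by hand"), "\n";
}
```

Anything flagged for review can be compared against the saved image for that ID.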

11 Comments:

Anonymous said...

isn't it sort of a bad idea posting this? script-kiddies, spam, end-of-the-world. Get my idea?

Anonymous said...

Bad...BAAD!

andhu said...

niiiicce... hmmm but i do agree with anon up there...but u could remove the script and post the article just not the entire thing???

Anonymous said...

thanks. I used this with some modifications to backup my contacts emails and then make a new facebook account and invite everybody to it. Very helpful. I wish facebook would provide this information by default.

Anonymous said...

they got to you huh? removing the code?

Anonymous said...

the thing that is interesting is ct.

ct is double base64.

the first part, is a 128bit checksum/md5 hash. followed by a double byte length. then the final bit is the length/8 blocks of code book encrypted data. "code book is like a lookup table, one block doesn't effect the next". However I don't know how to determine the encryption algorithm.

However, the iPhone interface uses plain text to specify email addresses.

Anonymous said...

This comment has been removed by a blog administrator.

Stachendrath said...

dam script kiddie , please post the code ! ...

Unknown said...

sprnch.com has a scraper that does this as well

Anonymous said...

Hi there, can I peek at the script? I'm trying to learn this stuff.

Anonymous said...

My email addy is highrider778@gmail.com Let me know if you can or can't either way let me look at the script, thanks!
