What Matt Mullenweg (WordPress Author) Knows About You (WordPress & Akismet Plugin User)

By Angsuman Chakraborty, Gaea News Network
Saturday, April 8, 2006

I recently started using Akismet, a WordPress plugin for comment spam protection from WordPress author Matt Mullenweg, and took a look at the data it sends to the Akismet server for each comment submitted on a blog that uses it. I have to say I was surprised at the copious amount of data, some of it sensitive, being sent to Matt’s server for every single comment.

Tons of information that is useless for spam protection is sent with every comment, and most of it rarely, if ever, changes on a server.

Here is the data that was sent to the Akismet server for a single test comment on my blog. I have commented on it inline.

comment_post_ID=1128 // Why does he need this?
comment_author=Angsuman+Chakraborty
comment_author_email=angsuman%40taragana.com
comment_author_url=http%3A%2F%2Fblog.taragana.com%2F
comment_content=[Actual comment]
comment_type=
user_ID=1 // Why does he need this?
user_ip=59.93.245.60
user_agent=[Truncated]
referrer=[Truncated - Post url]
blog=http%3A%2F%2Fblog.taragana.com
CONTENT_LENGTH=98

// Isn’t it obvious? Why send it? Does it ever change?
CONTENT_TYPE=application%2Fx-www-form-urlencoded

// What is he doing with it? This information is useless for spam protection.
DOCUMENT_ROOT=[File system path]

// Why does he need this? Yet more useless junk.
HTTP_ACCEPT=[Truncated]

// Why does he need this?
HTTP_ACCEPT_CHARSET=[Truncated]
HTTP_ACCEPT_LANGUAGE=en-us%2Cen%3Bq%3D0.5

// Why does he need this?
HTTP_CONNECTION=keep-alive
HTTP_HOST=blog.taragana.com

// Why does he need this?
HTTP_KEEP_ALIVE=300
HTTP_REFERER=[Truncated]
HTTP_USER_AGENT=[Truncated]

// Why does he have to have my PATH information?
PATH=[PATH environment variable]
REMOTE_ADDR=59.93.245.60
REMOTE_PORT=1567

// How many times does it change on a server? Why does he need it?
// It contains file system information
SCRIPT_FILENAME=[Truncated]

// How many times does it change on a server?
SERVER_ADDR=69.36.187.98

// How many times does it change on a server? Why does he need it?
SERVER_ADMIN=Postmaster%40taragana.com
SERVER_NAME=blog.taragana.com

// How many times does it change on a server? What does he need it for?
SERVER_PORT=80

// How many times does it change on a server? What does he need it for?
SERVER_SIGNATURE=[Truncated]

// How many times does it change on a server? What does he need it for?
SERVER_SOFTWARE=[Truncated]

// How many times does it change on a server? What does he need it for?
GATEWAY_INTERFACE=CGI%2F1.1

// How many times does it change on a server? What does he need it for?
SERVER_PROTOCOL=HTTP%2F1.1

// How many times does it change on a server? What does he need it for?
// This is always POST!
REQUEST_METHOD=POST

// How many times does it change on a server? What does he need it for?
QUERY_STRING=

// How many times does it change on a server? What does he need it for?
REQUEST_URI=%2Fwp-comments-post.php

// How many times does it change on a server? What does he need it for?
SCRIPT_NAME=%2Fwp-comments-post.php

// Why does he need to know where I installed WordPress on my server?
PATH_TRANSLATED=[Truncated]

// How many times does it change on a server? What does he need it for?
PHP_SELF=%2Fwp-comments-post.php

// This is inane
argv=Array

// This is inane
argc=0

This huge amount of data (considering it is sent for every comment) can consume a significant portion of your bandwidth quota if you get lots of spam.
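To put a rough number on that, here is a minimal sketch in Python (not the plugin's own PHP) that measures the size of a form-encoded request body and multiplies it by a daily comment count. The function names and the comment volume are assumptions for illustration. The sample fields are only a few of the non-truncated values from the list above, so the real request is larger.

```python
from urllib.parse import urlencode


def payload_bytes(fields):
    """Return the size in bytes of the form-encoded body for one comment check."""
    # urlencode percent-escapes everything outside ASCII, so the result is pure ASCII.
    return len(urlencode(fields).encode("ascii"))


def daily_bytes(fields, comments_per_day):
    """Estimate the bytes sent per day if every comment carries the same fields."""
    return payload_bytes(fields) * comments_per_day


# A few of the fields shown above that never (or almost never) change.
static_fields = {
    "CONTENT_TYPE": "application/x-www-form-urlencoded",
    "REQUEST_METHOD": "POST",
    "SERVER_PORT": "80",
    "SERVER_PROTOCOL": "HTTP/1.1",
    "GATEWAY_INTERFACE": "CGI/1.1",
}

if __name__ == "__main__":
    # 1000 comments per day is a made-up volume, not a measurement.
    print(payload_bytes(static_fields), "bytes per comment for these fields alone")
    print(daily_bytes(static_fields, 1000), "bytes per day at 1000 comments")
```

Running the same measurement on the full set of fields from a real request would give the actual per-comment cost.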

It is clear that Matt & Co. haven’t made the effort to filter out the unnecessary information, even though they could easily do so.
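Filtering of that kind does not take much. The following is a minimal Python sketch of the idea, not Akismet's actual PHP code: it drops a hypothetical list of keys before the data would be sent. The choice of keys in `EXCLUDED_KEYS` is an assumption, based on the fields questioned above that reveal file system paths or server details, and the sample path is a placeholder.

```python
# Hypothetical exclusion list: fields from the dump above that expose
# file system paths or server configuration. Adjust as needed.
EXCLUDED_KEYS = frozenset({
    "DOCUMENT_ROOT",
    "PATH",
    "PATH_TRANSLATED",
    "SCRIPT_FILENAME",
    "SERVER_ADMIN",
    "SERVER_SIGNATURE",
    "SERVER_SOFTWARE",
})


def filter_server_vars(server_vars, excluded=EXCLUDED_KEYS):
    """Return a copy of server_vars without the excluded keys."""
    return {key: value for key, value in server_vars.items() if key not in excluded}


if __name__ == "__main__":
    sample = {
        "comment_author": "Angsuman Chakraborty",
        "user_ip": "59.93.245.60",
        "DOCUMENT_ROOT": "/placeholder/path",  # placeholder, not a real path
        "SERVER_ADMIN": "Postmaster@taragana.com",
    }
    # Only comment_author and user_ip survive the filter.
    print(filter_server_vars(sample))
```

An allowlist of fields known to help spam detection would be stricter, but it requires knowing what the service actually uses.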

Some of this information may also be used by hackers (the bad ones). Remember that all of it is submitted over the internet in cleartext.

Kind of makes you feel warm and fuzzy, doesn’t it?

Discussion
March 21, 2009: 12:45 am

Most of the things you said weren’t useful are actually very useful for spam blocking. Spam bots usually send different headers than a browser does, which is why the user agent, HTTP accept, content type, charset, language, keep-alive, etc. are sent.

August 1, 2007: 5:53 pm

Don’t forget that Akismet is integrated into other tools too, such as the CakePHP framework, so some of that info will be relevant there.

I’m with you on the server path type of thing, but the actual calling script is probably important for identifying the weak points (or high traffic points) on a site. More for future development than current spam detection.

I wouldn’t be blogging today if it wasn’t for Akismet and Bad Behavior - as it is I have all comments on moderation anyway… it’s that bad!

January 16, 2007: 8:50 am

An addition to my previous post. I’m saying this to Matt not to Angsuman. :)

January 16, 2007: 8:47 am

In my - maybe simple - view, this information is required for analyzing spam:

comment_content # Yeah, sure… ;)
comment_author* # All three together

blog_url (a splogger can easily remove that URL, so you still have his server’s IP number). But what about a splog like spammer-blog.wordpress.com? Got it? The IP is useless, too!

And even the client’s IP and user-agent string are useless because of open proxies. Yeah, you can blacklist those IP numbers, but how many open proxies exist in the wide world? 100,000?

Well, I’ll remove all information which you really don’t need to know from my blog (like absolute paths and such). Only I need to know where my scripts are installed, not you.

I know you can blacklist my ID number so move on. I have more anti-spam plug-ins left to replace with Akismet. :) And Akismet isn’t the ultimate death for spam comments, either.

I’m not against Matt and all the other people behind Akismet, but I really need to know why, why, why you need to know so much useless information from my blog. Why the comment ID? Why the absolute path of my script installation?

So long and all the best,
Roland

April 11, 2006: 12:14 pm

[...] In addition, over at Simple Thoughts, Angsuman Chakraborty wrote an interesting post entitled, “What Matt Mullenweg (WordPress Author) Knows About You (WordPress & Akismet Plugin User).” There, he figured out what kind of info Akismet sends back to interpret comments as spam / not spam. All this was very interesting, but it got me no further toward my goal of getting out of Akismet jail. My identity had been taken by a black box for unknown reasons, and there was no way to get it back. Granted, on the net it is very easy to change your identity, but I had been writing as myself for quite a while. Why would I want to give up what little, if any, reputation I have? Especially to the black box? [...]

April 10, 2006: 9:37 am

James,
I guess I reached him faster this way :)
Thanks for your suggestions.

Best,
Angsuman

April 10, 2006: 9:36 am

Matt,

Thanks for the clarifications. However, I can’t understand why you need data which never changes for any user, like:
CONTENT_TYPE=application%2Fx-www-form-urlencoded
REQUEST_METHOD=POST
SERVER_PORT=80 // May very rarely change
SERVER_PROTOCOL=HTTP%2F1.1
GATEWAY_INTERFACE=CGI%2F1.1
etc.

Also, there are several pieces of data, like my server’s SCRIPT_FILENAME or PATH_TRANSLATED, for which I cannot see how they can help in analysing spam (irrespective of the algorithm you are using, which I personally think is a variant of naive Bayesian with manual blacklisting :) ).

I can see you have a provision in the code to filter out certain data from the list. Why not use it to get only the data that you need?

Looking forward to your response.

Best,
Angsuman

April 9, 2006: 6:00 pm

We do strip out potentially sensitive data, like your login cookie. The rest is entirely harmless, and actually quite useful in identifying spam. You can exclude it, but the effectiveness of Akismet will go down.

April 8, 2006: 11:03 pm

Akismet’s privacy policy is available to the public here (legal translation coming soon):

https://akismet.com/privacy/

Matt would [probably] be glad if you were to contact him with your privacy/security concerns. If you send your inquiry through the Akismet contact form, he’ll usually respond within the week.
