RSS Push Model (e.g. PubSub) is not viable in the long run

By Angsuman Chakraborty, Gaea News Network
Wednesday, March 23, 2005

PubSub is a tool that pushes feed items to your desktop. You register at PubSub, identify topics of interest, and news starts arriving on your desktop automatically and in near real time.

This model shares the same weakness as email: both push content to your desktop. Email now also pushes spam to your desktop without your consent (around 80% of email is spam, as per the NYT). The problem with the push model is that it is easy to exploit to feed spam. Anyone can submit a feed to PubSub, so it is easy to create and submit hundreds of spam feeds that pollute the results. In fact, any programmer can easily automate the process.

However, a simple pull model, where you subscribe to feeds of interest such as the New York Times or Wired News, is much less likely to go wrong. You choose the quality feeds and then use an online aggregator (Bloglines) or an offline one to view them.
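A minimal sketch of the pull model, assuming RSS 2.0 feeds. The feed URLs, the `pull_items` function and the `FAKE_WEB` stand-in for the network are all hypothetical and exist only for illustration; a real aggregator would fetch each URL over HTTP.

```python
import xml.etree.ElementTree as ET

# Hypothetical subscription list: the reader, not the publisher, decides what is in it.
SUBSCRIPTIONS = [
    "https://example.com/nyt.xml",
    "https://example.com/wired.xml",
]

def pull_items(subscriptions, fetch):
    """Fetch each subscribed feed and return (feed_url, title, link) tuples.

    `fetch` is any callable that maps a URL to RSS 2.0 XML text.
    """
    items = []
    for url in subscriptions:
        root = ET.fromstring(fetch(url))
        for item in root.iter("item"):
            title = item.findtext("title", default="")
            link = item.findtext("link", default="")
            items.append((url, title, link))
    return items

def _feed(title, link):
    """Build a one-item RSS document for the demo."""
    return (
        "<rss version='2.0'><channel><item>"
        f"<title>{title}</title><link>{link}</link>"
        "</item></channel></rss>"
    )

# Stand-in for the network (assumption): canned feed documents keyed by URL.
FAKE_WEB = {
    "https://example.com/nyt.xml": _feed("Election news", "https://example.com/nyt/1"),
    "https://example.com/wired.xml": _feed("New gadget review", "https://example.com/wired/1"),
    "https://example.com/spam.xml": _feed("Cheap pills", "https://example.com/spam/1"),
}

items = pull_items(SUBSCRIPTIONS, FAKE_WEB.__getitem__)
for feed_url, title, link in items:
    print(feed_url, title, link)
```

The spam feed exists on the "web" but is never read, because nothing is fetched unless the reader has put it on the subscription list.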

Update: It appears I was correct. The PubSub site went offline as of 15 January 2007 due to financial trouble.
BTW: PubSub’s assets were bought by another company, which plans to re-launch the site, but that is another story.

Discussion

Eugene Jen
April 5, 2005: 10:49 am

I assume that the graph structure of the blogosphere can help us tell spamming blogs from regular blogs. A pubsub system can therefore build a reputation system on clusters, so that publishers’ reputation becomes the criterion for subscribers to receive their publications. I think it is possible to build such a system to weed out spammers. If this can be achieved, spammers will at the same time be discouraged in the blogosphere, which may create a positive feedback dynamic. A content-based pubsub system would then be a better model than a pull-based aggregator model.
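A rough sketch of how reputation-gated delivery could look. The cluster names, scores, threshold and the `deliver` function are all made-up assumptions; the comment does not say how reputations would actually be computed, so the scores are simply given as input here.

```python
def deliver(posts, reputation, threshold=0.5):
    """Keep only posts whose publisher's reputation meets the threshold.

    Publishers with no known reputation default to 0.0 (untrusted).
    """
    return [p for p in posts if reputation.get(p["publisher"], 0.0) >= threshold]

# Hypothetical per-cluster scores, propagated to every blog in the cluster.
cluster_score = {"java-devs": 0.9, "pill-farm": 0.05}
membership = {"blog-a": "java-devs", "blog-b": "java-devs", "blog-x": "pill-farm"}
reputation = {blog: cluster_score[cluster] for blog, cluster in membership.items()}

posts = [
    {"publisher": "blog-a", "text": "Java generics explained"},
    {"publisher": "blog-x", "text": "java java java cheap pills"},
    {"publisher": "blog-b", "text": "Profiling a Java web app"},
    {"publisher": "blog-new", "text": "My first Java post"},  # unknown publisher
]

delivered = deliver(posts, reputation)
for post in delivered:
    print(post["publisher"], "->", post["text"])
```

Note the trade-off the default of 0.0 makes: a brand-new honest blog is filtered out along with the spam until it earns a score.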


Eugene Jen
April 5, 2005: 10:11 am

I believe that the random incoming links to spammers’ blogs inserted by Blogger will be insignificant in the blogosphere. It is also easy to ignore the “next blog” feature, since it is the same for all Blogger users. Nor is it hard to ignore adwords from Google, Yahoo, MSN and so on inside blogs.

There exist algorithms to cluster random graphs; for example, see https://micans.org/mcl/lit/index.html . It is quite possible to cluster the whole blogosphere and then apply a pattern recognition algorithm to identify spammer clusters, which I believe will work better than email spam detectors.

Also, if you have ever played around with the PageRank algorithm, you should know that PageRank is just a special case of clustering a random graph. But the PageRank algorithm does not consider clustering, which is how the Google bomb was invented. But I believe Google has started to identify those clusters set up by spammers.

April 4, 2005: 5:56 am

@Per I think PubSub is actually pushing content to my desktop. This is in contrast to the normal model with RSS, where I subscribe to selected feeds and fetch content from their sites, directly or through Bloglines etc.

We may argue over the nomenclature but the core problem with the PubSub model remains.

I am not aware of a SOURCE option to restrict data to certain feeds only. Can you please elaborate?

Even if it is available and I do use it, PubSub then loses its value, because I can just as easily fetch my selected feeds directly or through Bloglines. PubSub doesn’t have any strategic advantage in this scenario.

April 4, 2005: 5:53 am

@Eugene Would you please elaborate further on your cluster theory and how it specifically affects this situation?

Many spammers have Blogger accounts and so do decent people. In Blogger there is a link that randomly connects you to any other blog, a model which directly breaks your clustering idea.

I may subscribe to high-quality blogs only, thereby lending some credence to your cluster theory.

However, PubSub doesn’t care about any clusters. So as soon as I subscribe to the keyword “Java”, I immediately subject myself to any spam that contains the keyword Java. Not only that, it now comes to my desktop and I have no control over it, other than to disable or uninstall PubSub.

Hope that clarifies…


Per Hamnqvist
April 3, 2005: 1:01 pm

The case you are making is not one of push versus pull, but one of only listening to a set of “trusted” feeds. I thought you could limit the scope of your PubSub subscriptions to attain the same effect, using SOURCE: .. except that you get the information faster.


Eugene Jen
April 3, 2005: 8:28 am

Well, it seems that spamming on weblogs has different features from email spamming. It is hard to prevent anyone with an email address from sending spam, since email was designed for easy communication between two parties. It is, however, possible to filter out blog spam using the link structure among blogs and websites. A group of spammers may even form a cluster, but I believe the percentage of outsiders linking into that cluster is almost zero, while a non-spamming cluster of blogs may have a nice balance of incoming and outgoing links from outsiders.
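The link-structure idea in this comment can be sketched as a single measure: of all the links pointing into a cluster, what share comes from outside it? This is only an illustration under stated assumptions. The blog names, the link list and the `outsider_inbound_share` function are hypothetical, and the clusters are given by hand rather than discovered by a clustering algorithm.

```python
def outsider_inbound_share(links, cluster):
    """Share of links pointing into `cluster` that originate outside it.

    `links` is an iterable of (source, target) pairs and `cluster` is a set
    of blog names. Returns 0.0 when nothing links into the cluster at all.
    """
    inbound = [(src, dst) for src, dst in links if dst in cluster]
    if not inbound:
        return 0.0
    from_outside = sum(1 for src, _ in inbound if src not in cluster)
    return from_outside / len(inbound)

# Made-up link graph: spam blogs link to each other, nobody links to them.
links = [
    ("spam1", "spam2"), ("spam2", "spam3"), ("spam3", "spam1"),
    ("spam1", "blogA"),                      # spammers may still link outward
    ("blogA", "blogB"), ("blogB", "blogA"),
    ("news", "blogA"), ("wiki", "blogB"), ("blogB", "news"),
]

spam_cluster = {"spam1", "spam2", "spam3"}
regular_cluster = {"blogA", "blogB"}

spam_share = outsider_inbound_share(links, spam_cluster)
regular_share = outsider_inbound_share(links, regular_cluster)
print("spam cluster:", spam_share)
print("regular cluster:", regular_share)
```

On this toy graph the spam cluster receives no links from outsiders, while most links into the regular cluster come from outside it, which is the contrast the comment predicts.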

March 24, 2005: 3:39 am

Richard,

I am aware of the publish-subscribe paradigm and I am very much aware of the technologies involved. In fact, if you want, I can build the core pubsub engine in probably a couple of days.

The challenge here is not related to the paradigm but to its specific implementation in PubSub. Suppose I subscribe for “java”. Any post that mentions the word “java” will be displayed to me in real-time. And as you can guess I would not be the only one to have an interest in Java.

It would be trivial for spammers to create free blogs (they already do) which are auto-populated (they already do) with spam posts by a bot (using, say, the Blogger API). Additionally, the posts contain keywords of interest to a large segment of the population. The blogs are then registered in PubSub.

Such bots can easily generate thousands of posts in a very short time and virtually flood PubSub with spam posts laden with keywords people have subscribed to. So instead of seeing relevant Java posts now I (and others) will see info about organ enhancement pills. You get the picture. And it isn’t pretty.
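A toy simulation of the flooding scenario described above. It assumes a naive whole-word keyword match, which is not necessarily how PubSub matched posts; the post texts, the `matches` and `spam_share` functions and the figure of 1000 bot posts are all invented for illustration.

```python
def matches(post_text, keyword):
    """Naive keyword match: case-insensitive, whole words only."""
    return keyword.lower() in post_text.lower().split()

def spam_share(posts, keyword):
    """Fraction of the posts matching `keyword` that are spam."""
    matched = [p for p in posts if matches(p["text"], keyword)]
    if not matched:
        return 0.0
    return sum(1 for p in matched if p["spam"]) / len(matched)

honest = [
    {"text": "Java generics explained", "spam": False},
    {"text": "Tuning the Java garbage collector", "spam": False},
    {"text": "Gardening tips for spring", "spam": False},
]

# A bot stuffs each post with popular keywords so it matches many subscriptions.
bot_posts = [
    {"text": f"java python linux cheap pills offer {i}", "spam": True}
    for i in range(1000)
]

share = spam_share(honest + bot_posts, "java")
print(f"spam share of 'java' matches: {share:.3f}")
```

With two honest matches and a thousand bot posts, nearly everything a "java" subscriber receives is spam, and the matcher has nothing to distinguish the two kinds of publisher.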

The problem with PubSub is that it has no way of telling honest bloggers from spammers, similar to the email system.

Publish-Subscribe works great when publishers are trusted. Unfortunately the reality in this scenario is otherwise.

Let me know if you are clear about the problem space.

Angsuman
PS. I have a solution too but it’s getting late and I am tired :)


Richard Treadway
March 24, 2005: 3:09 am

PubSub is a Prospective Search engine that matches your stored queries against data in the network as it changes. The matches are done in real time. Prospective search uses the publish-subscribe protocol. In publish-subscribe you subscribe to things you are interested in, and when those things are published you get them. If you own information you want others to see, then you publish it. As a subscriber you own your subscriptions. If you no longer find your subscription useful (you think it is spam), then delete it. When you publish you do not need to know who is subscribed to your information. In this way publish-subscribe is very loosely coupled. Blogging is inherently publish-subscribe.
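The publish-subscribe flow described in this comment can be sketched as a toy broker. This is not PubSub's actual engine; the `Broker` class, its method names and the keyword-based stored queries are assumptions made only to show the loose coupling: the subscriber owns and deletes subscriptions, and the publisher never learns who receives a post.

```python
from collections import defaultdict

class Broker:
    """Toy publish-subscribe broker with stored keyword queries."""

    def __init__(self):
        self.subscriptions = defaultdict(set)  # subscriber -> stored keywords
        self.inboxes = defaultdict(list)       # subscriber -> delivered posts

    def subscribe(self, subscriber, keyword):
        self.subscriptions[subscriber].add(keyword.lower())

    def unsubscribe(self, subscriber, keyword):
        # The subscriber deletes the subscription; no publisher is involved.
        self.subscriptions[subscriber].discard(keyword.lower())

    def publish(self, text):
        """Match a new post against every stored query as it arrives.

        The publisher supplies text only and never learns who receives it.
        """
        words = set(text.lower().split())
        for subscriber, keywords in self.subscriptions.items():
            if keywords & words:
                self.inboxes[subscriber].append(text)

broker = Broker()
broker.subscribe("alice", "java")
broker.publish("New Java release notes")
broker.publish("Gardening tips for spring")
broker.unsubscribe("alice", "java")
broker.publish("Java conference announced")
print(broker.inboxes["alice"])
```

Only the first post reaches the subscriber: the second does not match the stored query, and the third arrives after the subscription was deleted.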

This is very different from email. In email, the owner of the information decides who to send it to, and they must know who you are. To delete a subscription, you have to ask the sender (publisher) to stop sending you email.

Richard Treadway
