My analysis of the actual problem with JavaBlogs Aggregator
By Angsuman Chakraborty, Gaea News Network
Friday, March 18, 2005
Disclaimer
This analysis is based on observing my blog’s interaction with JavaBlogs. JavaBlogs, as you know, is a popular aggregator for Java feeds.
Overview
Many of us often see old posts from our blogs keep popping up in JavaBlogs.
Details
RSS versions before 2.0 did not have a GUID, so preventing duplicate posts is slightly harder than with RSS 2.0 compliant feeds. My feed is RSS 2.0 compliant. Specifically, it sends a GUID as an element of each item. A GUID is supposed to be globally unique, so if I change my feed URL but keep my GUIDs the same, it shouldn’t matter.
What does WordPress send as the GUID? It sends the permalink to the post, for example https://blog.taragana.com/index.php/archive/whats-up-with-republican-java-geeks/.
Technically these are globally unique, unless I change my site structure. So if I start using .htaccess and change the permalink format to https://blog.taragana.com/archive/whats-up-with-republican-java-geeks/ then I can expect reposting to happen, right? Yes, it does happen in JavaBlogs, and it has happened to me once or twice. However, it can still be prevented. More on that in a later post.
In any case, WordPress could also improve this situation by using an alphanumeric GUID value instead of permalinks, which may not be so permanent after all.
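One way to read this suggestion is to compute the GUID once, when the post is created, from values that do not depend on the URL layout, and then store it with the post. Here is a minimal Python sketch, assuming an MD5 hash over the title and the original publication time. This is one possible choice, not what WordPress does, and `make_guid` is a hypothetical helper name.

```python
import hashlib
from datetime import datetime, timezone


def make_guid(title: str, published: datetime) -> str:
    """Return an opaque alphanumeric GUID for a post.

    Hypothetical helper: hashes the title and the original publication
    time, so the value does not depend on the permalink format.
    """
    # Normalise to UTC so the same instant always hashes the same way.
    stamp = published.astimezone(timezone.utc).isoformat()
    payload = f"{title}\n{stamp}".encode("utf-8")
    return hashlib.md5(payload).hexdigest()


published = datetime(2005, 3, 18, 2, 0, tzinfo=timezone.utc)
guid = make_guid("What's up with Republican Java geeks", published)
print(guid)  # 32 hexadecimal characters
```

The value has to be stored rather than recomputed on every request, otherwise editing the title later would change the GUID. In an RSS 2.0 feed such a value would be sent with isPermaLink="false" on the guid element, since it is no longer a URL.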
The more common problem is something much simpler. Suppose you normally syndicate the 20 latest items in your feed, and then you decide to syndicate more, say 30. Suddenly a lot of the old posts are republished! Neither the GUID nor the date has changed; only the item count in the feed has changed. The reverse (reducing the number of items in a feed) probably has the same effect, though I cannot remember for sure.
It appears JavaBlogs is maintaining a database of past feed items. So it shouldn’t be hard to identify that the post is not new.
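To illustrate why this should not be hard, here is a small Python sketch of GUID-based duplicate detection. It is not JavaBlogs code: the in-memory set stands in for the aggregator's database, and `find_new_items` and the sample posts are made up for the example.

```python
def find_new_items(feed_items, seen_guids):
    """Return the items whose GUID has not been seen, and record them."""
    new_items = []
    for item in feed_items:
        guid = item["guid"]
        if guid in seen_guids:
            continue  # already aggregated, whatever its position in the feed
        seen_guids.add(guid)
        new_items.append(item)
    return new_items


seen = set()
all_posts = [{"guid": f"post-{n}", "title": f"Post {n}"} for n in range(1, 31)]

# Poll after each new post, with the feed limited to the 20 latest items.
for published in range(1, 31):
    window = all_posts[max(0, published - 20):published]
    find_new_items(window, seen)

# The feed is then widened to 30 items: nothing should count as new.
widened = find_new_items(all_posts, seen)
print(len(widened))  # 0
```

Because the check is keyed on the GUID alone, the number of items in the feed has no effect on what is treated as new.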
It looks like a simple bug. Hopefully it will be fixed soon.
This article was initiated by a comment from Mr. Charles Miller, developer at JavaBlogs.
PS. On a different note, I think the policy of displaying a feed item again when its date has been updated is the correct implementation by JavaBlogs.
April 3, 2005: 8:09 pm
It’s not just a problem with JavaBlogs! Every time I ping Technorati that my blog has been updated, it takes every previous entry and spams the Technorati tags (i.e. the Java tag) as well! I do use RSS 2.0 and Rome 0.5 from Sun Microsystems to generate my own feeds, and I do use the and tags. I have used the permalink system, but since I control the code and can put anything in there, maybe I’ll start generating my own MD5 hash as suggested. If anyone wants to know if that works, check out my website in about a week. Otherwise, enjoy reading my entries from March 2005 for the ninetieth time.
March 18, 2005: 2:56 am
Jason, the MD5 of title and timestamp sounds good; I cannot think of anything against it. A 304 response would be a good solution to reduce the bandwidth clog and will ultimately benefit the bloggers.
March 18, 2005: 2:19 am
Tracking duplicates is a nightmare with all the various RSS flavors and buggy RSS feeds out there. My code for javacrawl.com currently does the following query to check for a duplicate post: “…where (guid = ? OR (link = ? and title = ?))”. This works reasonably well, but is still susceptible to the changing link problem you mention here. I agree that using links for the GUID is probably not the best idea unless they’re stable. An MD5 hash of the title plus the timestamp would be a reasonable way to go. Another suggestion I would make to RSS producers is to please, please implement responding with 304 to the If-Modified-Since header. This saves a huge amount of CPU, disk and bandwidth resources on both ends.
Bruce Werner
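The If-Modified-Since suggestion in the comment above can be sketched in a few lines of Python. This only shows the decision a feed producer would make, under the assumption that the server knows when the feed last changed; `conditional_status` is a hypothetical helper, and wiring it into a real web server is left out.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_status(if_modified_since, last_modified):
    """Return 304 if the feed is unchanged since the client's copy, else 200."""
    if not if_modified_since:
        return 200  # no conditional header: send the full feed
    try:
        client_time = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable header: send the full feed
    if client_time.tzinfo is None:
        client_time = client_time.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds.
    unchanged = last_modified.replace(microsecond=0) <= client_time
    return 304 if unchanged else 200


last_modified = datetime(2005, 3, 18, 2, 19, tzinfo=timezone.utc)
header = format_datetime(last_modified, usegmt=True)
print(conditional_status(header, last_modified))  # 304
print(conditional_status(None, last_modified))    # 200
```

With a 304 the server sends no feed body at all, which is where the CPU, disk and bandwidth savings come from.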