News Feed Scraper Application
March 27, 2008
Intro to the Feed Manager Application
There are plenty of feed readers, but there don’t seem to be any feed scrapers readily available. By a “feed scraper” I mean a program that downloads and stores the actual text of the items gathered from RSS (or Atom) feeds.
News Feed Scraper
I am releasing my news feed scraper. The development environment for the release is C++ under Microsoft Visual Studio 2005 using Qt Commercial Edition from Trolltech. If you want to just run the executables you do not need Visual Studio or Qt. The executables have been tested on Vista and XP.
The release consists of three downloads:
- A set of development libraries here;
- A generic template for creating plugins here;
- A feed server and feed client here.
In order to run the software you will need to:
- Download the feed server and client;
- Unzip the download. This will create a FeedManager and a FeedClient directory;
- Run FeedClient/Executable/FeedClient.exe;
- Run FeedManager/Executable/FeedManager.exe;
- Drag the URL of a feed into the FeedManager…
In order to compile the software you will need to:
- Download the three downloads;
- Follow the directions here to install the development libraries (in the course of which you will define an environment variable $(DIR4APPS), which contains the name of your preferred installation directory);
- Install the plugin solution in the $(DIR4APPS) directory. This should create a $(DIR4APPS)/ModifyUrlPlugins/ directory;
- Install the FeedManager solution in the $(DIR4APPS) directory;
- You should now be able to load the solution file $(DIR4APPS)/FeedManager/FeedManager.sln.
The ModifyUrlPlugins directory illustrates how to write .dll plugins under Qt that can be explicitly linked at run-time with a statically-linked executable. See here for a write-up on this.
Function of the Feed Manager Application
The news feed scraper consists of two separate applications: a Feed Client, and a Feed Server.
The Feed Server maintains a database containing the URLs of a series of RSS feeds. Periodically, the Feed Server checks whether any new articles are mentioned in those feeds by downloading a new version of the stalest feed in its database. When the server encounters a new article in a new version of a feed, it puts the URL of the article into a table in its database.
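To make the server's cycle concrete, here is a minimal sketch of the two decisions described above: picking the stalest feed, and working out which article URLs are new. This is not the released code. The `Feed` struct, the `lastFetched` timestamp, and both function names are assumptions made up for illustration, and a `std::set` stands in for the database table of known articles.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <vector>

// Hypothetical record for one feed in the server's database.
struct Feed {
    std::string url;
    long lastFetched;  // seconds since some epoch; smaller means staler
};

// Returns the index of the stalest feed (oldest lastFetched), or -1 if there are none.
int stalestFeedIndex(const std::vector<Feed>& feeds) {
    if (feeds.empty()) return -1;
    auto it = std::min_element(
        feeds.begin(), feeds.end(),
        [](const Feed& a, const Feed& b) { return a.lastFetched < b.lastFetched; });
    return static_cast<int>(it - feeds.begin());
}

// Adds unseen article URLs to 'known' and returns only the ones that were new.
std::vector<std::string> recordNewArticles(std::set<std::string>& known,
                                           const std::vector<std::string>& urlsInFeed) {
    std::vector<std::string> fresh;
    for (const std::string& url : urlsInFeed) {
        // insert() reports whether the URL was actually added.
        if (known.insert(url).second) fresh.push_back(url);
    }
    return fresh;
}
```

After each download the server would update the `lastFetched` value of the feed it just refreshed, so a different feed becomes the stalest on the next cycle.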
The Feed Client periodically checks with the Feed Server to see if it knows of any new articles. If it does, the Feed Client downloads the article and stores it, uncompressed, in a directory called Articles. If the article is unavailable, for any reason, it is skipped and not retried.
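The skip-and-don't-retry behaviour can be sketched as follows. Again this is an illustration under assumptions, not the actual client: `PassResult`, `Fetcher`, and `downloadOnce` are hypothetical names, the download itself is passed in as a function so the logic can be shown without any network code, and a map stands in for the Articles directory.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// Outcome of one pass over the article URLs handed out by the server.
struct PassResult {
    std::map<std::string, std::string> stored;  // url -> article text (stand-in for Articles/)
    std::vector<std::string> skipped;           // urls that failed and were dropped
};

// Stand-in for the real download: returns true and fills 'body' on success.
typedef std::function<bool(const std::string& url, std::string& body)> Fetcher;

PassResult downloadOnce(const std::vector<std::string>& urls, const Fetcher& fetch) {
    PassResult result;
    for (const std::string& url : urls) {
        std::string body;
        if (fetch(url, body)) {
            result.stored[url] = body;
        } else {
            result.skipped.push_back(url);  // exactly one attempt, no retry
        }
    }
    return result;
}
```

Each URL gets exactly one call to `fetch`, so an unavailable article costs a single failed request and is then forgotten.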
The Feed Server and Feed Client communicate with one another via the UDP protocol. They can be run on the same machine or on different machines visible to one another on the same network. A maximum of 3 Feed Clients can communicate with one Feed Server. The Feed Server downloads feeds; the Feed Client downloads items (articles) mentioned in those feeds.
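The three-client limit could be enforced with something as simple as the sketch below. It shows only the bookkeeping, not the UDP transport, and it is an assumption about how such a cap might work rather than a description of the released server. `ClientRegistry` and the address strings are hypothetical.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Keeps track of which clients the server is willing to talk to.
class ClientRegistry {
public:
    explicit ClientRegistry(std::size_t maxClients = 3) : maxClients_(maxClients) {}

    // Returns true if the client is already known or there is still room for it.
    bool admit(const std::string& clientAddress) {
        if (clients_.count(clientAddress) > 0) return true;   // already registered
        if (clients_.size() >= maxClients_) return false;     // server is full
        clients_.insert(clientAddress);
        return true;
    }

    std::size_t size() const { return clients_.size(); }

private:
    std::size_t maxClients_;
    std::set<std::string> clients_;
};
```

With a registry like this, a fourth client's datagrams would simply be ignored until one of the first three slots is freed; how (or whether) slots are freed is not covered by this sketch.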
Screen Shots
The Feed Server
Here’s a screen shot of the Feed Server. The application has two modes: server mode and database mode. In server mode the application displays the articles being served and their status. In database mode the application allows you to edit the database of feeds the application is managing. The application continues to download feeds and manage communication with the Feed Client in the background while in database mode.
The Feed Client
Here’s a screen shot of the Feed Client. The application has only one mode: client mode. In client mode the application displays the articles being downloaded and their status.
Status of the Application
I have used the application to download about a gigabyte (so far) of text from news feeds. The application is a reasonably well-behaved net citizen: it has a substantial delay between repeat visits to the same website, for instance. However, it does not use the HTTP “If-Modified-Since” feature. It probably should.
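For the curious, here is a rough sketch of both ideas: a per-host delay, and the request header that If-Modified-Since support would add. The `HostThrottle` class, its 60-second gap in the usage below, and the `ifModifiedSinceHeader` helper are all hypothetical; the application's real delay and data structures are not shown here. With that header, a server whose content has not changed can answer with a short 304 Not Modified instead of resending the whole feed.

```cpp
#include <map>
#include <string>

// Tracks the last visit time per host and enforces a minimum gap between visits.
class HostThrottle {
public:
    explicit HostThrottle(long minGapSeconds) : minGap_(minGapSeconds) {}

    // Returns true and records the visit if 'host' may be contacted at time 'now'.
    bool tryVisit(const std::string& host, long now) {
        std::map<std::string, long>::iterator it = lastVisit_.find(host);
        if (it != lastVisit_.end() && now - it->second < minGap_) return false;
        lastVisit_[host] = now;
        return true;
    }

private:
    long minGap_;
    std::map<std::string, long> lastVisit_;
};

// Builds the conditional request header from the Last-Modified value of an
// earlier response; returns an empty string if there is no earlier response.
std::string ifModifiedSinceHeader(const std::string& lastModified) {
    return lastModified.empty() ? std::string()
                                : "If-Modified-Since: " + lastModified;
}
```

A caller would construct `HostThrottle throttle(60);` and only issue a request when `throttle.tryVisit(host, now)` returns true, postponing the URL otherwise.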
The application runs pretty smoothly for an application that hasn’t been banged on too much, but I’m sure there are still a few bugs in there somewhere. Never forget the General Rule of Debugging:
- Bugs occur with all frequencies;
- The time it takes to identify and fix a bug is proportional to the inverse of its frequency.
Therefore, if you’ve only banged on an application for a few weeks, you are going to miss the bugs that only occur on a monthly or yearly schedule. And just because you haven’t seen such infrequent bugs doesn’t mean they aren’t there.


Hi,
I would like to try out your scraper, but the directory of the Feed Manager exe is missing from the compressed file. Could you make the Feed Manager executable/directory available, please?
Thanks,
Dave Chan
Dave,
Sorry about the glitch. I will upload a new zip file when I get back to the US early next week. Check back in a few days. I’ll get it done by Wednesday 10 Sept at the latest.
I’ve used the app to download a few dozen gigabytes of news articles in the last couple months. If you are looking to create a text database, it seems to be working pretty well…
Thanks Brainrack,
I’ll be honest, I’m looking to start some article marketing and was looking for a tool like yours, as a quicker way to obtain material to rewrite myself, or to get rewritten for me. I also wasn’t expecting a reply so it was a nice surprise to see one, thanks.
I look forward to trying your tool.
Best Regards,
David
David,
I’m not sure this tool is what you want, but you are welcome to give it a spin. I uploaded (just) the executables to here: http:\\www.LibertyReach.com\Distributions\Feed.zip
You must run both the FeedManager.exe and FeedClient.exe. In order to start scraping, you drag-and-drop an RSS or ATOM feed pointer.
Beware that some feeds are ill-formed, etc., so the application will complain when it sees a bad one, but you can just ignore the complaints.
Good luck.
BTW,
I’ve been unbelievably busy this last couple months, so I didn’t have a chance to test-run the exe’s. They *should* be fine, but I haven’t run them in a few weeks so I can’t promise. Sorry.
Thanks again for the download, I’ll try it out asap!
Regards,
Dave.