Crawling The Web For Fun And Profit

Posted By: SecurityTube_Bot
Posted On: Mon 21 Feb 2011
Views: 9843

Description: With more than a couple of billion web pages on the Internet, it is tempting to see how much of this information one can mine for fun or for profit. In this video, I walk you through programming a web crawler that fetches pages and parses their content so that it can be converted into a useful format.

The web crawler we create in this tutorial consists mainly of 2 parts:

  1. Document fetching engine: This fetches the raw HTML page data from a website.

  2. Document parsing engine: This uses an HTML DOM parser to parse the page and extract useful information from it.

Once you have learned how to parse the data, the next step is to store it in a database. This will allow you to run further analysis on the data and derive interesting insights.
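
The storage step is not shown in this description, so the following is only a minimal sketch of one way it could look, assuming SQLite (via Python's built-in sqlite3 module) as the database. The table layout and the helper names open_store, save_page and most_linked are hypothetical and chosen purely for illustration.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a small SQLite store for crawled pages.

    The schema is an assumption made for this sketch: one table for pages,
    one table for the links found on each page.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS links ("
        "source TEXT, target TEXT, UNIQUE (source, target))"
    )
    return conn


def save_page(conn, url, title, links):
    """Store one parsed page and its outgoing links."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)",
            (url, title),
        )
        conn.executemany(
            "INSERT OR IGNORE INTO links (source, target) VALUES (?, ?)",
            [(url, target) for target in links],
        )


def most_linked(conn, limit=5):
    """Example analysis: which URLs are linked to most often?"""
    return conn.execute(
        "SELECT target, COUNT(*) AS n FROM links "
        "GROUP BY target ORDER BY n DESC, target LIMIT ?",
        (limit,),
    ).fetchall()
```

Once pages are stored this way, a query such as most_linked(conn) is one example of the kind of further analysis that becomes possible.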

We shall use the Python language and the BeautifulSoup DOM parser to pull this off. The video is very interactive, and I use a "type as you go" methodology to help you understand the programming techniques.
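
To give a rough idea of the two engines in Python, here is a minimal sketch. It is not the code from the video: it assumes Python 3 with the beautifulsoup4 package installed, and the function names fetch_page and parse_page, the User-Agent string, and the choice to extract only the page title and links are all assumptions made for illustration.

```python
from urllib.parse import urljoin
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup  # third-party package: beautifulsoup4


def fetch_page(url, timeout=10):
    """Document fetching engine: return the raw HTML of `url` as text."""
    # The User-Agent value is a placeholder chosen for this sketch.
    request = Request(url, headers={"User-Agent": "example-crawler/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def parse_page(html, base_url):
    """Document parsing engine: pull the title and all links out of a page."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    # Resolve relative hrefs against the page URL; skip <a> tags without href.
    links = [
        urljoin(base_url, anchor["href"])
        for anchor in soup.find_all("a", href=True)
    ]
    return {"url": base_url, "title": title, "links": links}
```

A crawl loop would call parse_page(fetch_page(url), url) for each URL and then queue the returned links for fetching in turn.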

The code for this tutorial is available for download. 

Tags: programming

