Using Hpricot

Hpricot is an HTML parser for the Ruby programming language. With Hpricot you can scan and scrape an HTML document. To illustrate how to use Hpricot, I'll list the code of a short script I recently wrote. The script grabs all the links for the past week from A Rubyist Railstastic Adventure, a tumblelog.

The general structure of the HTML used by the web page that I will be scraping is something like the following.

<div class="post">
  <div class="date">
  </div>
  <div class="link">
    <a href="" class="link">Juixe TechKnow</a>
  </div>
</div>


One thing to note about the HTML produced by the site we will scrape is that the date is optional in a post. The date is displayed only once per day, so some posts don't have a date. There are also several other types of posts, such as quotes and images, but we are only interested in posts with links. Again, the Ruby/Hpricot script gathers only the links for the past week.
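
Because the date appears only on the first post of each day, the scraper has to remember the last date it saw and apply it to the undated posts that follow. Here is a minimal sketch of that idea, using plain hashes in place of parsed HTML. The `posts` data, the `links_since` helper, and the example.com URLs are all hypothetical and not taken from the site.

```ruby
# Hypothetical stand-in for parsed posts: only the first post of each day
# carries a :date, and quote or image posts have no :link.
def links_since(posts, cutoff, start_date)
  curr_date = start_date
  links = []
  posts.each do |post|
    # A dated post updates the running date; undated posts inherit it.
    curr_date = post[:date] if post[:date]
    # Posts are assumed newest first, so everything after this is older.
    break if curr_date < cutoff
    links << post[:link] if post[:link]
  end
  links
end

now = Time.local(2008, 1, 10)
week_ago = now - 7 * 24 * 60 * 60
posts = [
  { :date => Time.local(2008, 1, 9), :link => "http://example.com/a" },
  { :link => "http://example.com/b" },      # same day, no date shown
  { :date => Time.local(2008, 1, 8) },      # a quote post, no link
  { :date => Time.local(2008, 1, 1), :link => "http://example.com/old" }
]
recent = links_since(posts, week_ago, now)
puts recent
```

The second post has no date of its own but is still collected, because it inherits the date of the post before it. The last post falls outside the week, so the loop stops there.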

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'parsedate'

# Convert a number of days to seconds
def days_to_sec(days)
  secs = days.to_i
  secs *= 24
  secs *= 60
  secs *= 60
end

# Pretty print the link as a list item
def print_link(link)
  print " <li>",
        "<a href='#{link.attributes['href']}'>",
        link.inner_html, "</a></li>\n" # closing markup is assumed
end

def get_links(doc)
  curr_date = Time.now # assumed starting value: the newest posts are from today
  (doc/"").each do |post|
    post_date_elem = (post/"")
    date = post_date_elem.inner_html.strip

    # Parse the date of the post
    if date != ""
      date_day = (post_date_elem/"big").text
      date_mon = nil
      # The first line of the date block holds the month
      date.each_line do |line|
        date_mon = line.strip if date_mon.nil?
        break unless date_mon.nil?
      end
      date_str = "#{date_mon} #{date_day}, #{$week_ago.year}"
      data = ParseDate.parsedate date_str
      curr_date = Time.local data[0], data[1], data[2]
    end

    # Stop if already looking past one week
    break if curr_date < $week_ago

    # Handle all links in post
    (post/"").each do |link|
      print_link link
    end
  end

  # Still within the week, so continue on the next page
  if curr_date > $week_ago
    next_page = Hpricot(open($rubyist_next))
    get_links next_page
  end
end

# The Rubyist home page to be scraped
$rubyist_home = ""
$rubyist_next = ""
# Scrape one week's worth of links (Time.now is assumed as the starting point)
$week_ago = Time.now - days_to_sec(7)

# Run the script
doc = Hpricot(open($rubyist_home))
get_links doc
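
The `parsedate` library used above shipped with Ruby 1.8 and was removed from the standard library in Ruby 1.9. On a newer Ruby, a rough equivalent of the date handling might look like the sketch below. It assumes the month appears as a full name such as "January" and the day as a number, which the listing above does not confirm, and `post_date` is a hypothetical helper name.

```ruby
require 'time'

# Convert a number of days to seconds, as in the script above.
def days_to_sec(days)
  days.to_i * 24 * 60 * 60
end

# Hypothetical helper: build a Time from a month name and a day number,
# borrowing the year from a reference time as the script does.
def post_date(month, day, reference)
  Time.strptime("#{month} #{day}, #{reference.year}", "%B %d, %Y")
end

week_ago = Time.now - days_to_sec(7)
date = post_date("January", "5", Time.local(2008, 6, 1))
puts date.strftime("%Y-%m-%d")
puts "older than a week" if date < week_ago
```

One caveat with borrowing the year this way: when the week being scraped spans New Year, a post from early January gets the previous year and looks almost a year old, so the scan stops early.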
