Generate a Google Sitemap in Rails

A while back, we produced a Capistrano task for automatically crawling your site to pre-cache static content (and find any broken links, exceptions, etc.). We’ve extended the task to also generate a Google Sitemap for use with the Google Webmaster Tools.

Our needs were simple: create an XML sitemap without having to add any new controllers, model attributes, and so on. Since we already had a fully functional, deployed Web app, pointing our crawler to it and generating a map containing only valid, internal links made a lot of sense.

Thanks to Alastair’s code, we learned how a Google Sitemap is structured and how to construct one with the Builder::XmlMarkup class that ships with Rails.

Here’s the code for the revised crawler. Please refer to the original article for configuration details.

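require 'mechanize'
require 'builder'   # provides Builder::XmlMarkup
require 'net/http'  # used to ping Google
require 'uri'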
 
class Crawler
 
  EXTENSIONS_IGNORED = %w[.csv .doc .docx .gif .jpg .jpeg .js .mp3 
    .mp4 .mpg .mpeg .pdf .png .ppt .rss .swf .txt .xls .xlsx .xml]
 
  PROTOCOLS_IGNORED = %w[feed ftp itms javascript mailto]
 
  def initialize(starting_url, credentials = nil, quiet_mode = false, sitemap = false, debug = false)
    @bad_pages = []  
    @agent = WWW::Mechanize.new
    @sitemap = sitemap
    @debug = debug
    @visited_pages = []
 
    if credentials
      creds = credentials.split(':')
      @agent.basic_auth(creds[0], creds[1])
    end
 
    @quiet_mode = quiet_mode
    @starting_url = starting_url
    @starting_url_domain = starting_url[/([a-z0-9-]+)\.([a-z.]+)/i]
    puts "domain: #{@starting_url_domain}" if @debug
    extract_and_call_urls(starting_url)
    generate_sitemap if @sitemap
  end
 
  def extract_and_call_urls(url)            
    #get page
    puts "#{@visited_pages.size+1} #{url}" unless @quiet_mode
    begin
      page = @agent.get(url)
    rescue => exception
      @bad_pages << url
      puts "error: #{url}, #{exception.message}"
      return
    end
 
    #for any content types we may have missed above, exit if content type is missing or not html
    return if page.instance_of?(WWW::Mechanize::File) || page.content_type.to_s.index('text/html').nil?
 
    #add to array
    @visited_pages << url
 
    #get links found on page
    links = page.links
 
    #for each link, call the url if not in history
    links.each{ |link| extract_and_call_urls(link.href) unless 
      ignore_url?(link.href) || @visited_pages.include?(link.href) }
  end
 
  private
 
  def ignore_url?(url)
    # Skip empty links, external sites, known-bad pages, and non-HTML targets.
    # The host comes from the starting URL instead of being hard-coded.
    ignored = url.nil? ||
              (url.include?('http') && !url.include?(URI.parse(@starting_url).host)) ||
              @bad_pages.include?(url) ||
              PROTOCOLS_IGNORED.any? { |prt| url =~ /\A#{prt}:/i } ||
              EXTENSIONS_IGNORED.any? { |ext| url =~ /#{Regexp.escape(ext)}\z/i }
    puts "ignored: #{url}" if ignored && @debug
    ignored
  end
 
  def generate_sitemap
    xml_str = ""
    xml = Builder::XmlMarkup.new(:target => xml_str, :indent => 2)

    xml.instruct!
    xml.urlset(:xmlns => 'http://www.sitemaps.org/schemas/sitemap/0.9') {
      @visited_pages.each do |url|
        unless @starting_url == url
          xml.url {
            xml.loc(@starting_url + url)
            xml.lastmod(Time.now.utc.strftime("%Y-%m-%dT%H:%M:%S+00:00"))
            xml.changefreq('weekly')
          }
        end
      end
    }

    save_file(xml_str)
    update_google
  end

  # Saves the xml file to disc. This could also be used to ping the webmaster tools
  def save_file(xml)
    File.open(RAILS_ROOT + '/public/sitemap.xml', "w+") do |f|
      f.write(xml)
    end
  end

  # Notify google of the new sitemap
  def update_google
    sitemap_uri = @starting_url + '/sitemap.xml'
    escaped_sitemap_uri = URI.escape(sitemap_uri)
    Net::HTTP.get('www.google.com',
                  '/webmasters/sitemaps/ping?sitemap=' +
                  escaped_sitemap_uri)
  end
 
end
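
The link filtering in ignore_url? relies on substring and regex matching. As a rough alternative, here is a minimal sketch that parses each href with Ruby's standard URI library instead. The method name skip_href?, the site_host parameter, and the two constants are our own inventions for illustration, and unlike the crawler above it compares hosts exactly, so subdomains count as external.

```ruby
require 'uri'

# Hypothetical lists mirroring the crawler's constants (extensions without the dot).
IGNORED_EXTENSIONS = %w[csv doc docx gif jpg jpeg js mp3 mp4 mpg mpeg pdf png
                        ppt rss swf txt xls xlsx xml].freeze
IGNORED_SCHEMES = %w[feed ftp itms javascript mailto].freeze

# Returns true when a crawler should skip this href.
# `site_host` is an assumed parameter: the host the crawl started from.
def skip_href?(href, site_host)
  return true if href.nil? || href.strip.empty?

  uri = URI.parse(href)
  return true if IGNORED_SCHEMES.include?(uri.scheme.to_s.downcase)
  return true if uri.host && uri.host.downcase != site_host.downcase

  ext = File.extname(uri.path.to_s).delete('.').downcase
  IGNORED_EXTENSIONS.include?(ext)
rescue URI::InvalidURIError
  true # unparseable hrefs are skipped rather than crawled
end

puts skip_href?('/about', 'www.example.com')                    # false
puts skip_href?('/files/report.PDF', 'www.example.com')         # true
puts skip_href?('http://other.example.org/', 'www.example.com') # true
```

Parsing costs a little more than a regex, but it avoids false matches such as a page path that merely ends in the letters "csv".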

What’s Going On

As the crawler hits each page, it appends the URL of every successfully fetched HTML page to an array:

@visited_pages << url
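
To see the bookkeeping without touching the network, here is a small sketch of the same visit-once recursion over an in-memory link graph. It is a variation rather than the crawler's exact logic: it uses a Set instead of an Array for the history, and the crawl method, the links_for callable, and the SITE hash are all hypothetical stand-ins for Mechanize fetching real pages.

```ruby
require 'set'

# Walks a link graph depth-first, recording each page once.
# `links_for` is a hypothetical callable standing in for "fetch the page
# and return its hrefs"; the real crawler does that with Mechanize.
def crawl(start, links_for, visited = Set.new)
  return visited unless visited.add?(start) # add? returns nil if already present

  links_for.call(start).each { |href| crawl(href, links_for, visited) }
  visited
end

# A tiny made-up "site" with a cycle between / and /about.
SITE = {
  '/'        => ['/about', '/contact'],
  '/about'   => ['/', '/team'],
  '/contact' => [],
  '/team'    => ['/about']
}.freeze

pages = crawl('/', ->(url) { SITE.fetch(url, []) })
puts pages.to_a.inspect # ["/", "/about", "/team", "/contact"]
```

The cycle between / and /about terminates because a page already in the history is never followed again, which is the same job the @visited_pages check does in the crawler.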

Then, we iterate through the array and build the XML:

xml.url {
  xml.loc(@starting_url + url)
  xml.lastmod(Time.now.utc.strftime("%Y-%m-%dT%H:%M:%S+00:00"))
  xml.changefreq('weekly')
}
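
Concatenating the starting URL and the href works when every link on the site is root-relative (for example /about). If your pages also contain absolute internal links, or the starting URL ends in a slash, a sketch like the following, using URI.join from the standard library, may give cleaner loc values. The helper name absolute_loc is ours, not part of the crawler.

```ruby
require 'uri'

# Resolves an href found on a page into an absolute URL for <loc>.
# `base_url` is assumed to be the URL the crawl started from.
def absolute_loc(base_url, href)
  URI.join(base_url, href).to_s
end

puts absolute_loc('http://www.example.com', '/about')
# http://www.example.com/about
puts absolute_loc('http://www.example.com/', '/about')
# http://www.example.com/about
puts absolute_loc('http://www.example.com', 'http://www.example.com/contact')
# http://www.example.com/contact
```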

Note that every page gets the current time as its modification date and ‘weekly’ as its expected update frequency. Feel free to vary these values or move them into additional config settings.
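
If you do want these values configurable, one possible shape is a small helper that prepares the three values for each entry and rejects a change frequency the sitemaps.org protocol does not define. This is only a sketch: sitemap_entry and its options hash are hypothetical names, and the hash it returns would still need to be fed to the Builder calls shown above.

```ruby
# Values permitted for <changefreq> by the sitemaps.org protocol.
CHANGEFREQS = %w[always hourly daily weekly monthly yearly never].freeze

# Builds the per-URL values for one sitemap entry.
# `options` and its keys are hypothetical config settings, not part of the crawler.
def sitemap_entry(loc, options = {})
  changefreq = options.fetch(:changefreq, 'weekly')
  unless CHANGEFREQS.include?(changefreq)
    raise ArgumentError, "invalid changefreq: #{changefreq}"
  end

  lastmod = options.fetch(:lastmod) { Time.now }
  {
    loc: loc,
    lastmod: lastmod.getutc.strftime('%Y-%m-%dT%H:%M:%S+00:00'),
    changefreq: changefreq
  }
end

entry = sitemap_entry('http://www.example.com/about',
                      changefreq: 'monthly',
                      lastmod: Time.utc(2010, 1, 15, 12, 0, 0))
puts entry[:lastmod]    # 2010-01-15T12:00:00+00:00
puts entry[:changefreq] # monthly
```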

Connecting it Together

In this iteration, we packaged the crawler as a Rake task: save the class above as lib/crawler.rb and the following task as lib/tasks/admin.rake (both inside the Rails application directory). If you want to run this from Capistrano instead, just add the task to config/deploy.rb.

require 'lib/crawler'
 
namespace :admin do
 
  desc "Crawl pages using the Mechanize gem. Set URL variable as a starting point. Set CREDS as username:password if you are hitting a password protected site. To generate a Google Sitemap in /public/sitemap.xml, set SITEMAP=true. To suppress output and only show errors, set QUIET=true. To show more details during output, set DEBUG=true."
  task :crawl_pages do
    start_url = ENV["URL"] || "http://localhost:3000"
    # ENV values are strings, so compare explicitly; otherwise QUIET=false would still be truthy
    Crawler.new(start_url, ENV["CREDS"], ENV["QUIET"] == "true", ENV["SITEMAP"] == "true", ENV["DEBUG"] == "true")
  end
 
end
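
One thing to keep in mind with flags like QUIET, SITEMAP and DEBUG: environment variables always arrive as strings, and in Ruby any string (even "false") is truthy. A small helper along these lines can make the intent explicit. The name env_flag and the list of accepted spellings are our own assumptions.

```ruby
# Interprets an environment-style string as a boolean.
# `env_flag` is a hypothetical helper; `env` defaults to the real ENV
# but accepts any hash-like object, which keeps it easy to try out.
def env_flag(name, env = ENV)
  %w[1 true yes on].include?(env[name].to_s.strip.downcase)
end

settings = { 'SITEMAP' => 'true', 'QUIET' => 'false' }
puts env_flag('SITEMAP', settings) # true
puts env_flag('QUIET', settings)   # false
puts env_flag('DEBUG', settings)   # false (unset)
```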

Sitemap-ilicious!

Run your new rake task from the command line interface:

rake admin:crawl_pages URL=http://www.webficient.com SITEMAP=true

Google is automatically pinged as the final step. A copy of the sitemap is also saved to public/sitemap.xml. You can point Google’s Webmaster Tools to this URL (http://mywebsite.com/sitemap.xml) to track Google crawler statistics.
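
The ping passes the sitemap URL as a query parameter, so it has to be escaped. The crawler uses URI.escape for this, which was removed in Ruby 3.0. On newer Rubies, a sketch like the one below builds the same request path with URI.encode_www_form_component. The helper name sitemap_ping_path is hypothetical, and the endpoint path is simply the one used in the crawler code above.

```ruby
require 'uri'

# Builds the request path for the sitemap ping.
# `site_url` is an assumed argument: the root URL of the crawled site.
def sitemap_ping_path(site_url)
  sitemap_url = URI.join(site_url, '/sitemap.xml').to_s
  '/webmasters/sitemaps/ping?sitemap=' + URI.encode_www_form_component(sitemap_url)
end

puts sitemap_ping_path('http://www.example.com')
# /webmasters/sitemaps/ping?sitemap=http%3A%2F%2Fwww.example.com%2Fsitemap.xml
```

The resulting string can be passed as the path argument to Net::HTTP.get in place of the concatenation in update_google.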

