Generate a Google Sitemap in Rails
A while back, we produced a Capistrano task for automatically crawling your site to pre-cache static content (and find any broken links, exceptions, etc.). We’ve extended the task to also generate a Google Sitemap for use with the Google Webmaster Tools.
Our needs were simple: create an XML sitemap without having to add any new controllers, model attributes, and so on. Since we already had a fully functional, deployed Web app, pointing our crawler to it and generating a map containing only valid, internal links made a lot of sense.
Thanks to Alastair’s code, we learned about the structure and mechanism for constructing a Google Sitemap, which involves using Rails’ Builder::XmlMarkup class.
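For orientation, a sitemap is a urlset element containing one url entry per page, each with a loc and optional lastmod and changefreq children. The following is a minimal sketch of that structure using only Ruby's standard library rather than Builder, so it can be run on its own. The build_sitemap helper and the example.com URLs are made up for illustration and are not part of the crawler.

```ruby
require 'cgi'
require 'time'

# Hypothetical helper (not part of the crawler): assembles a sitemap string
# by hand to show the element structure that the Builder code produces.
def build_sitemap(urls, lastmod = Time.now.utc)
  entries = urls.map do |url|
    "  <url>\n" \
    "    <loc>#{CGI.escapeHTML(url)}</loc>\n" \
    "    <lastmod>#{lastmod.iso8601}</lastmod>\n" \
    "    <changefreq>weekly</changefreq>\n" \
    "  </url>\n"
  end

  %(<?xml version="1.0" encoding="UTF-8"?>\n) +
    %(<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n) +
    entries.join +
    "</urlset>\n"
end

puts build_sitemap(['http://example.com/about',
                    'http://example.com/search?a=1&b=2'])
```

Note that special characters such as & must be entity-escaped inside loc; Builder does this for you, while the hand-rolled version above has to call CGI.escapeHTML itself.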
Here’s the code for the revised crawler. Please refer to the original article for configuration details.
require 'mechanize'
require 'net/http'
require 'uri'
# Builder::XmlMarkup ships with Rails; add require 'builder' if you run this outside a Rails environment.

class Crawler
  EXTENSIONS_IGNORED = %w[.csv .doc .docx .gif .jpg .jpeg .js .mp3 .mp4 .mpg .mpeg .pdf .png .ppt .rss .swf .txt .xls .xlsx .xml]
  PROTOCOLS_IGNORED = %w[feed ftp itms javascript mailto]

  def initialize(starting_url, credentials = nil, quiet_mode = false, sitemap = false, debug = false)
    @bad_pages = []
    @agent = WWW::Mechanize.new
    @sitemap = sitemap
    @debug = debug
    @visited_pages = []
    if credentials
      creds = credentials.split(':')
      @agent.basic_auth(creds[0], creds[1])
    end
    @quiet_mode = quiet_mode
    @starting_url = starting_url
    @starting_url_domain = starting_url[/([a-z0-9-]+)\.([a-z.]+)/i]
    puts "domain: #{@starting_url_domain}" if @debug
    extract_and_call_urls(starting_url)
    generate_sitemap if @sitemap
  end

  def extract_and_call_urls(url)
    # get page
    puts "#{@visited_pages.size + 1} #{url}" unless @quiet_mode
    begin
      page = @agent.get(url)
    rescue => exception
      @bad_pages << url
      puts "error: #{url}, #{exception.message}"
      return
    end
    # for any content types we may have missed above, exit if content type is not html
    return if page.instance_of?(WWW::Mechanize::File) || page.content_type.index('text/html') == nil
    # add to array
    @visited_pages << url
    # get links found on page
    links = page.links
    # for each link, call the url if not in history
    links.each { |link| extract_and_call_urls(link.href) unless ignore_url?(link.href) || @visited_pages.include?(link.href) }
  end

  private

  def ignore_url?(url)
    begin
      # NOTE: the domain below is hard-coded; replace webficient.com with your own
      return ignored = true if url.nil? ||
        (url.include?('http') && !url.include?('webficient.com')) ||
        @bad_pages.include?(url) ||
        PROTOCOLS_IGNORED.find { |prt| url =~ /#{prt}:/ } != nil ||
        # escape the extension so its leading dot is matched literally
        EXTENSIONS_IGNORED.find { |ext| url =~ /#{Regexp.escape(ext)}$/ } != nil
    ensure
      puts "ignored: #{url}" if ignored && @debug
    end
  end

  def generate_sitemap
    xml_str = ""
    xml = Builder::XmlMarkup.new(:target => xml_str, :indent => 2)
    xml.instruct!
    xml.urlset(:xmlns => 'http://www.sitemaps.org/schemas/sitemap/0.9') {
      @visited_pages.each do |url|
        unless @starting_url == url
          xml.url {
            # assumes crawled hrefs are relative paths such as /about
            xml.loc(@starting_url + url)
            xml.lastmod(Time.now.utc.strftime("%Y-%m-%dT%H:%M:%S+00:00"))
            xml.changefreq('weekly')
          }
        end
      end
    }
    save_file(xml_str)
    update_google
  end

  # Saves the xml file to disc. This could also be used to ping the webmaster tools
  def save_file(xml)
    File.open(RAILS_ROOT + '/public/sitemap.xml', "w+") do |f|
      f.write(xml)
    end
  end

  # Notify google of the new sitemap
  def update_google
    sitemap_uri = @starting_url + '/sitemap.xml'
    escaped_sitemap_uri = URI.escape(sitemap_uri)
    Net::HTTP.get('www.google.com', '/webmasters/sitemaps/ping?sitemap=' + escaped_sitemap_uri)
  end
end
What’s Going On
As the crawler hits each page, it appends the URL of every valid page to an array:
@visited_pages << url
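To see how that array also keeps the crawler from revisiting pages, here is a small sketch that swaps Mechanize for an in-memory link graph so it runs without a network. The SITE hash and the crawl method are hypothetical stand-ins, not the author's code; a path missing from the hash plays the role of a failed page fetch.

```ruby
# Hypothetical in-memory "site": each path maps to the links found on that page.
SITE = {
  '/'        => ['/about', '/blog'],
  '/about'   => ['/', '/contact'],
  '/blog'    => ['/about', '/missing'],
  '/contact' => ['/']
}.freeze

# Depth-first crawl that records each page once, the way @visited_pages does,
# and collects unreachable pages the way @bad_pages does.
def crawl(path, visited = [], bad = [])
  links = SITE[path]
  if links.nil? # stands in for an exception raised while fetching the page
    bad << path
    return [visited, bad]
  end

  visited << path
  links.each do |link|
    crawl(link, visited, bad) unless visited.include?(link) || bad.include?(link)
  end
  [visited, bad]
end

visited, bad = crawl('/')
puts "visited: #{visited.inspect}"
puts "bad:     #{bad.inspect}"
```

Because every link is checked against the visited list before recursing, cycles such as / to /about and back to / terminate instead of looping forever.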
Then, we iterate through the array and build the XML:
xml.url {
  xml.loc(@starting_url + url)
  xml.lastmod(Time.now.utc.strftime("%Y-%m-%dT%H:%M:%S+00:00"))
  xml.changefreq('weekly')
}
Note that every page gets the current time as its modification date and ‘weekly’ as its expected update frequency. Feel free to change these values or move them into additional configuration settings.
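One way such settings might look is sketched below: a list of path patterns mapped to update frequencies, with a fallback default. The CHANGEFREQ_RULES table, the helper names, and the example paths are all assumptions made for illustration; the crawler itself has no such configuration.

```ruby
require 'time'

# Hypothetical settings: the first matching pattern wins, otherwise the default.
CHANGEFREQ_RULES = [
  [%r{\A/blog},  'daily'],
  [%r{\A/about}, 'monthly']
].freeze
DEFAULT_CHANGEFREQ = 'weekly'

# Returns the update frequency to report for a given path.
def changefreq_for(path)
  rule = CHANGEFREQ_RULES.find { |pattern, _freq| path =~ pattern }
  rule ? rule[1] : DEFAULT_CHANGEFREQ
end

# Formats a timestamp as a W3C datetime in UTC, e.g. for <lastmod>.
# getutc returns a copy, so the caller's Time object is left untouched.
def lastmod_stamp(time = Time.now)
  time.getutc.iso8601
end

%w[/blog/2008/09/sitemaps /about /contact].each do |path|
  puts "#{path}: #{changefreq_for(path)}, #{lastmod_stamp}"
end
```

Time#iso8601 writes the UTC offset as Z rather than +00:00; both forms are accepted in a sitemap's lastmod.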
Connecting it Together
In this iteration, we packaged this up as a Rake task by saving the following task as lib/tasks/admin.rake (inside our Rails application directory). If you want to run this from Capistrano instead, just add the task to config/deploy.rb.
require 'lib/crawler'

namespace :admin do
  desc "Crawl pages using the Mechanize gem. Set URL variable as a starting point. Set CREDS as username:password if you are hitting a password protected site. To generate a Google Sitemap in /public/sitemap.xml, set SITEMAP=true. To suppress output and only show errors, set QUIET=true. To show more details during output, set DEBUG=true."
  task :crawl_pages do
    start_url = ENV["URL"] || "http://localhost:3000"
    # ENV values are always strings, so compare against "true" explicitly;
    # otherwise QUIET=false would still count as truthy.
    Crawler.new(start_url,
                ENV["CREDS"],
                ENV["QUIET"] == "true",
                ENV["SITEMAP"] == "true",
                ENV["DEBUG"] == "true")
  end
end
Sitemap-ilicious!
Run your new rake task from the command line interface:
rake admin:crawl_pages URL=http://www.webficient.com SITEMAP=true
Google is automatically pinged as the final step. A copy of the sitemap is also saved to public/sitemap.xml. You can point Google’s Webmaster Tools to this URL (http://mywebsite.com/sitemap.xml) to track Google crawler statistics.
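Since the sitemap address travels inside a query string, it needs to be escaped before it is appended to the ping path. The sketch below builds that path without sending anything, so you can inspect it first. It uses CGI.escape, which is a substitution on our part: the crawler calls URI.escape, which leaves characters such as : and / unescaped and was removed in Ruby 3.0. The ping_path helper is hypothetical.

```ruby
require 'cgi'

# Hypothetical helper: builds (but does not send) the request path that
# update_google passes to Net::HTTP.get on www.google.com.
def ping_path(site_url)
  # chomp avoids a double slash when the site URL ends with one
  sitemap_uri = site_url.chomp('/') + '/sitemap.xml'
  '/webmasters/sitemaps/ping?sitemap=' + CGI.escape(sitemap_uri)
end

puts ping_path('http://www.webficient.com')
```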
About this entry
Posted: Saturday, September 6th, 2008 at 12:51 am
Author: Phil Misiowiec
Category: Solutions
License: Creative Commons
