Pre-Caching Rails Views

If you cache pages in your Rails app to improve performance, you may find this post useful. If your Web application gets a decent amount of traffic, waiting for Rails to cache your Action View templates as HTML after each deployment is not ideal. While visitors are hitting the uncached site, CPU usage spikes, requests queue up, and everything slows down.

We’ve cooked up a Capistrano recipe that uses the Ruby Mechanize gem to crawl your application, starting at the site index or any page you specify. Instead of sitting there and clicking each link yourself to warm the cache, you can let this Cap task do it for you.

Features

  • Crawls a local or remotely hosted application
  • Limits crawling to application pages and does not follow external links
  • Ignores links pointing to non-HTML content
  • Traps HTTP errors and moves on to the next item

Installation

First, be sure you have the Mechanize gem installed:

sudo gem install mechanize

Now save this file to lib/crawler.rb inside your Ruby on Rails application:

require 'mechanize'
 
class Crawler
 
  EXTENSIONS_IGNORED = %w[.csv .doc .docx .gif .jpg .jpeg .js .mp3 
    .mp4 .mpg .mpeg .pdf .png .ppt .rss .swf .txt .xls .xlsx .xml]
 
  def initialize(starting_url, history_size = 1000, credentials = nil, quiet_mode = false)
    @bad_pages = []  
    @agent = Mechanize.new  # use WWW::Mechanize.new on Mechanize releases before 1.0
    @agent.history.max_size = history_size
    @agent.redirect_ok = false
 
    if credentials
      creds = credentials.split(':')
      @agent.basic_auth(creds[0], creds[1])
    end
 
    @quiet_mode = quiet_mode
    @starting_url = starting_url
    extract_and_call_urls(starting_url)
  end
 
  def extract_and_call_urls(url)    
    # Skip previously failed requests and non-HTML document types up front.
    # Regexp.escape keeps the dot in each extension literal (".js" must not match "xjs").
    return if @bad_pages.include?(url) || EXTENSIONS_IGNORED.any? { |ext| url =~ /#{Regexp.escape(ext)}$/ }
 
    #get page
    puts "url: #{url}" unless @quiet_mode
    begin
      page = @agent.get(url)
    rescue => exception
      @bad_pages << url
      puts "url: #{url}, #{exception.message}"
      return
    end
 
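    # For any content types we may have missed above, exit if the content type is not HTML
    # (also covers responses that carry no content type at all)
    return unless page.respond_to?(:content_type) && page.content_type.to_s.include?('text/html')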
 
    #get links found on page
    links = page.links
 
    # For each link, call the URL unless it is ignored or already in the history
    links.each do |link|
      next if ignore_url?(link.href) || @agent.visited?(link.href)
      extract_and_call_urls(link.href)
    end
  end
 
  private
 
  def ignore_url?(url)
    return true if url.nil?
    ( url.index('http://') != nil || 
      url.index('https://') != nil || 
      url.index('ftp://') != nil || 
      url.index('mailto:') != nil || 
      url.index('itms://')) &&
    url.index('http://localhost') == nil &&
    url.index('http://127.0.0.1') == nil
  end
 
end
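
If you want to check which links the crawler would follow before pointing it at a real site, the two filters above can be tried out on their own. The following is only a sketch: followable? is a hypothetical helper name that is not part of the Crawler class, the extension list is shortened for illustration, and nothing here needs Mechanize.

```ruby
# Stand-alone sketch of the crawler's two link filters (plain Ruby, no Mechanize).
# followable? is a hypothetical helper, not a method of the Crawler class.

IGNORED_EXTENSIONS = %w[.gif .jpg .js .pdf .png .xml] # shortened list, for illustration
EXTERNAL_PREFIXES  = %w[http:// https:// ftp:// mailto: itms://]
LOCAL_PREFIXES     = %w[http://localhost http://127.0.0.1]

def followable?(href)
  return false if href.nil?
  # Same idea as the EXTENSIONS_IGNORED check: skip links to non-HTML files
  return false if IGNORED_EXTENSIONS.any? { |ext| href.end_with?(ext) }

  # Same idea as ignore_url?: absolute links are skipped unless they point at localhost
  external = EXTERNAL_PREFIXES.any? { |prefix| href.include?(prefix) }
  local    = LOCAL_PREFIXES.any? { |prefix| href.include?(prefix) }
  !external || local
end

["/about", "/report.pdf", "http://example.com/", "http://localhost:3000/about"].each do |href|
  puts "#{href}: #{followable?(href) ? 'follow' : 'skip'}"
end
```

Note that, as in the class above, only relative links and localhost links count as part of the application, so absolute links back to your own remote host are skipped as well.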

Now, you need to make the above functionality available as a Capistrano task. Add this line to the top of config/deploy.rb:

require './lib/crawler'

And at the bottom of deploy.rb, add this task:

desc "Crawl pages using the Mechanize gem. Supported variables: URL (starting point), CREDS (HTTP authentication, format must be username:password), HISTORY (stores 1000 pages by default), QUIET (set to true to suppress output and only show errors)"
task :crawl_pages do
  start_url = ENV["URL"] || "http://localhost:3000"
  # ENV values are strings, so convert HISTORY to an integer and QUIET to a boolean
  Crawler.new(start_url, (ENV["HISTORY"] || 1000).to_i, ENV["CREDS"], ENV["QUIET"] == "true")
end
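
Keep in mind that everything read from ENV arrives as a string, so a value such as QUIET=false is still truthy in Ruby unless it is converted. Here is a minimal sketch of how the four variables could be turned into typed values before they reach the crawler. The crawl_options helper is a hypothetical name introduced for illustration and is not part of the recipe.

```ruby
# Hypothetical helper: convert the task's environment variables into typed values.
# Accepts any hash-like object, so it can be tried without touching the real ENV.
def crawl_options(env)
  {
    :url     => env["URL"] || "http://localhost:3000",
    :history => Integer(env["HISTORY"] || 1000), # raises ArgumentError on non-numeric input
    :creds   => env["CREDS"],                    # expected format: username:password
    :quiet   => env["QUIET"].to_s.downcase == "true"
  }
end

options = crawl_options("URL" => "http://mysiteurl.com", "HISTORY" => "250", "QUIET" => "true")
puts options.inspect
```

Inside the task you would pass ENV itself, for example crawl_options(ENV), and hand the resulting values to Crawler.new.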

Usage

As with any other Capistrano task, you run this one from a command-line shell inside your application’s directory. Check out Webistrano if you prefer running Capistrano tasks from your browser.

There are several ways you can apply the crawl_pages task. You can run the command on your development machine and add the generated cache files to your source code repository, so that they’re automatically included in your next deployment to production. This approach mostly makes sense for pages with static content.

cap crawl_pages

If, however, your content is dynamic (perhaps generated by a content management system), then you’ll need to run the task against your live site. The task is written so that you can run it from your local machine by specifying a remote URL. Even then, it crawls only pages that belong to that site.

cap crawl_pages URL=http://mysiteurl.com

Automatic Crawling

If you’d like the crawler task to fire off automatically, just add the following lines to deploy.rb:

after :deploy, :crawl_pages
after "deploy:migrations", :crawl_pages

Other Uses

Besides pre-caching pages, you can also use the crawler task to check for broken links and exceptions. Any HTTP exception will be trapped and displayed in the command line output. For exceptions, you can correlate the problem with the corresponding entries in the server log.
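
If you save the crawler’s output, the failures can be pulled out with a few lines of Ruby. The sketch below relies on the format used by the Crawler class above, where a failed request is printed as "url: <url>, <exception message>". The failed_requests helper is a hypothetical name, and the sample log is made up for illustration, so the error text you see from Mechanize may differ.

```ruby
# Hypothetical helper: extract failed requests from saved crawler output.
# A failure line looks like "url: <url>, <exception message>";
# lines for successful requests ("url: <url>") do not match and are ignored.
FAILURE_LINE = /\Aurl: (\S+), (.+)\z/

def failed_requests(output)
  output.each_line
        .map { |line| FAILURE_LINE.match(line.strip) }
        .compact
        .map { |match| [match[1], match[2]] }
end

# Made-up sample output, for illustration only
sample_log = <<LOG
url: /about
url: /missing
url: /missing, 404 => Net::HTTPNotFound
LOG

failed_requests(sample_log).each do |url, message|
  puts "#{url} failed: #{message}"
end
```

The pattern assumes the URL itself contains no spaces, which holds for the hrefs the crawler prints in the common case.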

