Pre-Caching Rails Views
If you’re caching any pages in your Rails app as a performance enhancer, you may find this post useful. If your Web application gets a decent amount of traffic, then waiting for Rails to cache your Action View templates as HTML, post-deployment, is not ideal. While people are banging on your site, the CPU is spiking, things are queuing, and everything slows down.
We’ve cooked up a Capistrano recipe that uses the Ruby Mechanize gem to crawl your application, starting at the site index or any page you specify. So instead of sitting there and clicking each link yourself, read on to see how you can use this Cap task in your next project.
Features
- Crawls a local or remotely hosted application
- Limits crawling to application pages and does not follow external links
- Ignores links pointing to non-HTML content
- Traps HTTP errors and moves on to the next item
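The filtering rules in this list can be sketched with Ruby's standard URI library. This is only an illustration of the idea, not the code the recipe itself uses: the `crawlable?` helper, the shortened `IGNORED_EXTENSIONS` list, and the example hosts are all assumptions made for the example.

```ruby
require 'uri'

# Hypothetical, shortened list of extensions treated as non-HTML content.
IGNORED_EXTENSIONS = %w[.csv .gif .jpg .js .pdf .png .xml].freeze

# Returns true when href stays on the same host as base_url and does not
# point at a file with an ignored extension. Unparseable links are skipped.
def crawlable?(href, base_url)
  return false if href.nil? || href.empty?

  base = URI.parse(base_url)
  target = URI.join(base_url, href) # resolves relative links against the base
  return false unless %w[http https].include?(target.scheme)
  return false unless target.host == base.host

  !IGNORED_EXTENSIONS.include?(File.extname(target.path).downcase)
rescue URI::Error
  false
end

puts crawlable?('/about', 'http://example.com/')
puts crawlable?('http://other.example.org/', 'http://example.com/')
```

Comparing hosts after resolving the link means relative and absolute links to the same site are treated alike, while `mailto:` and other non-HTTP schemes fall out at the scheme check.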
Installation
First, be sure you have the Mechanize gem installed:
sudo gem install mechanize
Now save this file to lib/crawler.rb inside your Ruby on Rails application:
require 'mechanize'

class Crawler
  EXTENSIONS_IGNORED = %w[.csv .doc .docx .gif .jpg .jpeg .js .mp3 .mp4 .mpg .mpeg .pdf .png .ppt .rss .swf .txt .xls .xlsx .xml]

  def initialize(starting_url, history_size = 1000, credentials = nil, quiet_mode = false)
    @bad_pages = []
    # newer Mechanize releases use plain Mechanize.new (the WWW namespace was later removed)
    @agent = WWW::Mechanize.new
    @agent.history.max_size = history_size
    @agent.redirect_ok = false
    if credentials
      # split on the first colon only, so the password may itself contain colons
      creds = credentials.split(':', 2)
      @agent.basic_auth(creds[0], creds[1])
    end
    @quiet_mode = quiet_mode
    @starting_url = starting_url
    extract_and_call_urls(starting_url)
  end

  def extract_and_call_urls(url)
    # catch any previously failed requests as well as non-html doc types up front and exit
    # (Regexp.escape keeps the dot in each extension from matching any character)
    return if @bad_pages.include?(url) || EXTENSIONS_IGNORED.detect { |ext| url =~ /#{Regexp.escape(ext)}$/ } != nil

    # get page
    puts "url: #{url}" unless @quiet_mode
    begin
      page = @agent.get(url)
    rescue => exception
      @bad_pages << url
      puts "url: #{url}, #{exception.message}"
      return
    end

    # for any content types we may have missed above, exit if content type is not html
    return if page.content_type.index('text/html') == nil

    # get links found on page
    links = page.links

    # for each link, call the url if not in history
    links.each { |link| extract_and_call_urls(link.href) unless ignore_url?(link.href) || @agent.visited?(link.href) }
  end

  private

  # ignore absolute and non-HTTP links unless they point at the local machine
  def ignore_url?(url)
    return true if url.nil?
    (url.index('http://') != nil || url.index('https://') != nil || url.index('ftp://') != nil || url.index('mailto:') != nil || url.index('itms://') != nil) &&
      url.index('http://localhost') == nil && url.index('http://127.0.0.1') == nil
  end
end
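The class above calls itself once per link, so a long chain of pages means an equally deep call stack. One alternative is an explicit queue with a visited set. The sketch below illustrates that breadth-first variant and is not part of the original recipe: `crawl` and the `fetch_links` callable are hypothetical names, and the callable stands in for a Mechanize request so the example runs without network access.

```ruby
require 'set'

# Breadth-first crawl driven by an explicit queue instead of recursion.
# fetch_links is any callable that takes a URL and returns the links found
# on that page; here it stands in for an HTTP client such as Mechanize.
def crawl(start_url, fetch_links)
  visited = Set.new([start_url])
  queue = [start_url]
  order = []

  until queue.empty?
    url = queue.shift
    order << url
    begin
      links = fetch_links.call(url)
    rescue StandardError => e
      warn "url: #{url}, #{e.message}" # report the failure and keep going
      next
    end
    links.each do |link|
      queue << link if visited.add?(link) # add? returns nil for known URLs
    end
  end

  order
end

# A tiny in-memory "site" used in place of real HTTP requests.
site = {
  '/' => ['/about', '/blog'],
  '/about' => ['/'],
  '/blog' => ['/blog/1', '/about'],
  '/blog/1' => []
}
pages = crawl('/', ->(url) { site.fetch(url) })
puts pages
```

Each URL enters the queue at most once, so the loop ends even when pages link back to each other.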
Now, you need to make the above functionality available as a Capistrano task. Add this line to the top of config/deploy.rb:
require './lib/crawler'
And at the bottom of deploy.rb, add this task:
desc "Crawl pages using the Mechanize gem. Supported variables: URL (starting point), CREDS (HTTP authentication, format must be username:password), HISTORY (stores 1000 pages by default), QUIET (set to true to suppress output and only show errors)"
task :crawl_pages do
  start_url = ENV["URL"] || "http://localhost:3000"
  # environment variables arrive as strings, so convert them before passing them on
  Crawler.new(start_url, (ENV["HISTORY"] || 1000).to_i, ENV["CREDS"], ENV["QUIET"] == "true")
end
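One thing to watch with the supported variables: environment values are always strings, so `QUIET=false` is still truthy in Ruby and `HISTORY=500` is not yet a number. Here is a minimal sketch of normalizing them before they reach the crawler. The `crawl_options` helper is a hypothetical name, not part of Capistrano or the recipe, and it takes any hash-like object so it can be tried without touching the real `ENV`.

```ruby
# Converts string-valued settings into the types the crawler expects.
# env can be ENV itself or a plain Hash with the same keys.
def crawl_options(env)
  {
    url: env.fetch('URL', 'http://localhost:3000'),
    history: Integer(env.fetch('HISTORY', 1000)), # raises on non-numeric input
    credentials: env['CREDS'],                    # nil when not supplied
    quiet: env['QUIET'] == 'true'                 # only the literal "true" enables it
  }
end

options = crawl_options({ 'QUIET' => 'false', 'HISTORY' => '250' })
puts options.inspect
```

Using `Integer` rather than `to_i` is a design choice in this sketch: a typo such as `HISTORY=abc` raises an error instead of silently becoming zero.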
Usage
As with any other Capistrano task, you’ll need to use your command line shell within your application’s directory. Check out Webistrano if you prefer running Capistrano tasks from your browser.
There are several ways you can apply the crawl_pages task. You can run the command on your development machine and add the resulting cached files to your source code repository, so they’ll be automatically included in your next deployment to production. This approach mostly makes sense for pages with static content.
cap crawl_pages
If, however, your content is dynamic (perhaps generated by a content management system), then you’ll need to run the task against your live site. The way the task is written allows you to run the command from your local machine by specifying a remote URL. However, the task limits itself to crawling only pages associated with the site.
cap crawl_pages URL=http://mysiteurl.com
Automatic Crawling
If you’d like the crawler task to fire off automatically, just add the following lines to deploy.rb:
after :deploy, :crawl_pages
after "deploy:migrations", :crawl_pages
Other Uses
Besides pre-caching pages, you can also use the crawler task to check for broken links and exceptions. Any HTTP error is trapped and displayed in the command line output along with the URL that caused it. For application exceptions, you can match the reported URL against the corresponding entry in the server log.
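If you use the crawler mainly as a link checker, it can help to gather the failures into one report rather than reading them out of the scrolling output. The sketch below shows one way to do that. `FailureLog` is a hypothetical helper, not part of the recipe, and the error messages are made-up stand-ins for whatever your HTTP client reports.

```ruby
# Collects failed URLs with their error messages so they can be reported
# together at the end of a crawl.
class FailureLog
  attr_reader :failures

  def initialize
    @failures = {}
  end

  # Runs the block for url; records the error message if it raises.
  # Returns true on success and false on failure.
  def track(url)
    yield
    true
  rescue StandardError => e
    @failures[url] = e.message
    false
  end

  # One line per failed URL, sorted by URL for stable output.
  def summary
    @failures.sort.map { |url, message| "#{url}: #{message}" }
  end
end

log = FailureLog.new
log.track('/ok') { :fine }
log.track('/missing') { raise '404 Not Found' }       # illustrative message
log.track('/broken') { raise '500 Internal Server Error' } # illustrative message
puts log.summary
```

In a real crawl, the block passed to `track` would wrap the page request, and the summary would be printed once the crawl finishes.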
About this entry
Posted: Saturday, June 21st, 2008 at 12:31 pm
- Author: Phil Misiowiec
- Category: Solutions
- Tags: capistrano, performance, recipes, ruby on rails
- License: Creative Commons
