One of my clients recently needed to copyright their website. After speaking with their lawyers, I learned they needed all of the source code, the images, a list of technologies we didn't create (gems, frameworks, etc.), and a screenshot of every page on the site.
The first few requirements were easily met. I downloaded the GitHub repo for the source code and images, and used the Gemfile and bower.json file to create the list of technologies we didn't create. But the screen capture requirement was a bit of a quandary: the site has over 500 pages, and capturing a full-page screenshot (scroll and all) of each one is a tedious task.
Then I remembered reading about a gem called capybara-screenshot and started tinkering with a short little program to take a screenshot of every page on the site. This is what I came up with:
```ruby
require 'pry'
require 'capybara'
require 'capybara/dsl'
require 'capybara-screenshot'

Capybara.default_driver = :selenium
Capybara.app_host = 'http://www.example.com'
Capybara.save_and_open_page_path = "screenshots"

class Screenshot
  include Capybara::DSL

  def initialize(urls)
    @urls = urls
  end

  def crawl_and_snap
    @urls.each do |url|
      snap url
    end
  end

  def snap(path)
    visit path
    Capybara::Screenshot::Saver.new(
      Capybara, Capybara.page, false, image_name(path)).save
  end

  def image_name(path)
    path.gsub("/", "-")
  end
end
```
It wasn't that difficult, but I hit several small snags along the way...
Getting Capybara to work outside of rails or rack
I love Capybara and use it in almost every Rails app I work on. However, I've never used it outside of Rails. At first I just tried to set up a _spec file and run it, but I guess because the file wasn't located in features, it wasn't expecting to use Capybara. Then I found that you can simply include the Capybara::DSL module and get all of the methods we want (visit, find, etc.).
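As a minimal illustration of that pattern (plain Ruby here, no Capybara; the module and class names below are made up), including a DSL module mixes its methods straight into your own class:

```ruby
# A made-up stand-in for Capybara::DSL: including the module mixes
# its methods (like visit) directly into the including class.
module FakeDSL
  def visit(path)
    "visited #{path}"
  end
end

class Crawler
  include FakeDSL
end

puts Crawler.new.visit("/about")  # prints "visited /about"
```

That's all `include Capybara::DSL` is doing in the Screenshot class above: it makes visit, find, and friends available as instance methods.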
Taking the screenshot
Capybara::Screenshot is a simple little tool to use. Normally you'd just call screenshot_and_save_page, which I did at first, but it gives you a file name like Screenshot-2014-07-28.png. I wanted the filenames to be equivalent to the web URI (ie /about should be saved as -about.png).
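The path-to-filename mapping itself is just the one-line gsub from the Screenshot class above; pulled out on its own, it swaps slashes for dashes:

```ruby
# Flatten a request path into a unique, filesystem-safe filename stem:
# every "/" becomes "-", so nested paths can't collide or create directories.
def image_name(path)
  path.gsub("/", "-")
end

puts image_name("/about")         # prints "-about"
puts image_name("/blog/2014/07")  # prints "-blog-2014-07"
```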
I didn't see a way to pass a filename to that method, so after digging in the source, I saw that it just creates a Capybara::Screenshot::Saver and then calls save on that.
The Saver takes four arguments. The first two are related to the Capybara session, the third is whether to also save an html file, and the fourth is the filename. Notice there's also a config you can set for the default directory to save the images (the Capybara.save_and_open_page_path setting above).
I wanted to use something other than Selenium so Firefox wouldn't be running. I tried webkit and PhantomJS, and both worked, but both had several issues, like missing images and a default page width that caused my responsive site to hit a breakpoint. And the resulting image with the Selenium driver was super crisp.
URLs to Capture
I didn't really have a problem getting all of the URLs we needed to crawl. I use the great sitemap_generator gem to create an XML sitemap for the site, so it was easy to get a list of all the URLs for the site.
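If you already have a sitemap.xml, extracting the URLs takes only a few lines of stdlib Ruby. A minimal sketch (the inline XML below is a stand-in for the real generated file, and I've dropped the sitemaps.org xmlns attribute for brevity):

```ruby
require 'rexml/document'

# A stand-in for the generated sitemap; the real file lives at
# public/sitemap.xml and carries the sitemaps.org namespace.
sitemap = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <urlset>
    <url><loc>http://www.example.com/</loc></url>
    <url><loc>http://www.example.com/about</loc></url>
  </urlset>
XML

# Grab every <loc> element's text: one URL per page.
doc  = REXML::Document.new(sitemap)
urls = REXML::XPath.match(doc, "//loc").map(&:text)
puts urls
```

From there you could map the full URLs down to request paths (URI.parse(url).path) before handing them to Screenshot.new.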
A couple of other ideas if you don't have a list of all the URLs:
2) If you're satisfied that Google has all of your pages indexed, you could use the google-search gem to fetch all of the pages in its index, grab the URLs, and pass them to Screenshot. Example:
```ruby
search = Google::Search::Web.new do |search|
  search.query = "site:www.example.com"
  search.size = :large
end

urls = search.all.map(&:uri)
Screenshot.new(urls).crawl_and_snap
```
I blocked out my client's name and downsized it, but you can see how it does a nice job of capturing the whole page.
I was thinking about creating a small gem out of it. What do you think? Would it be useful?