If you want to skip the theory and get your hands on the code right away, take a look at my sample parser script on GitHub.
Make sure the kindlegen binary is available in your path, so you can call it from anywhere. KindleGen creates books in the binary MOBI format (actually, Amazon’s AZW format is just MOBI with DRM).
Although we can generate an ebook from a plain HTML file, if we want to include navigation, a cover, a table of contents, etc., we need to create a bunch of HTML files (one per chapter) and an OPF file. The OPF file is just an XML file that contains the book’s metadata (author, title, publisher, etc.) and content structure.
In KindleGen’s zip file you can find an example of an OPF file, ready to be processed with this tool.
We could just download an HTML file, strip all the tags except a few (paragraphs, basic text formatting, etc.) and use the result as our e-book content. In practice, you will want only some parts of the page. That’s why you need to analyze (parse) the content and grab only the parts you are interested in.
In this example, we’ll be parsing FanFiction.net, a site that hosts fan-created stories based on existing books, videogames, etc. On this website, stories are divided into chapters, and each chapter is served from a different URL. So, chapter #2 of the story with ID #6718049 can be accessed at http://www.fanfiction.net/s/6718049/2/.
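The URL pattern described above can be captured in a small helper. This is only a sketch: the names BASE_URL and chapter_url are made up for illustration and are not part of the sample script.

```ruby
# Hypothetical helper: builds the URL of one chapter of a story,
# following the /s/<story id>/<chapter number>/ pattern described above.
BASE_URL = "http://www.fanfiction.net/s"

def chapter_url(story_id, chapter)
  "#{BASE_URL}/#{story_id}/#{chapter}/"
end

puts chapter_url(6718049, 2)
# prints http://www.fanfiction.net/s/6718049/2/
```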
If we look at the source code of that page, we can see that the main content is inside a DIV tag with the ID storytext:

Downloading the chapter and grabbing the story content is really easy:
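require "nokogiri"
require "open-uri"
# the second argument of Nokogiri::HTML is the document URL; the encoding is the third
doc = Nokogiri::HTML(URI.open("http://www.fanfiction.net/s/6718049/2/"), nil, "UTF-8")
content = doc.xpath('//div[@id="storytext"]').first.inner_html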
The important method here is xpath, which selects nodes in the document using an XPath expression. This is really handy and usually the easiest way to access the tags we want. If you have never used XPath, there’s a tutorial at W3Schools.
This time, we don’t need to worry about stripping unwanted HTML tags, since FanFiction.net’s CMS only allows basic formatting tags. If this weren’t the case, we could use the inner_text method available in the Node class.
We are also interested in the chapter’s title, which is available in a dropdown list at the top of the page.
Chapters in the dropdown have this format: first comes the chapter number, followed by a dot and the chapter title. For instance: 1. This is a chapter title. The trick here is that the option element that stores this information has its value attribute set to the chapter number (which we already know).

So we can search for that tag, take its inner text, and remove the "1. " prefix with a regular expression:
title = doc.xpath("//option[@value='#{index}']").first.inner_text.gsub(/^\d+\. /, "")
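The prefix removal can be tried on its own, without Nokogiri. The helper name strip_chapter_number below is made up for this sketch. It uses sub with \A instead of gsub with ^, so only a prefix at the very start of the string is removed, which is a slightly stricter variant of the line above.

```ruby
# Hypothetical helper: removes a leading "<number>. " from a dropdown label.
def strip_chapter_number(label)
  label.sub(/\A\d+\.\s*/, "")
end

puts strip_chapter_number("1. This is a chapter title")
# prints This is a chapter title
```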
Once we have downloaded and parsed all the chapters we want to include in the e-book, we need to create an HTML file for each of them. After that, we will also need to create an OPF file with links to those chapters. The easiest way to do this is to use some kind of template system, like ERB (which comes with the Ruby standard library).
The template for the chapters can be as simple as this one:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<body>
<h1>
<small>Chapter <%= index %></small><br />
<%= title %>
</h1>
<%= content %>
</body>
</html>
Anything between <%= and %> is a Ruby expression whose value is printed into the output. Now we only have to create a file, supply the template with the variables that contain the chapter’s data, and render the template into the file we’ve just created.
require "erb"

# extend Hash class to transform key-value pairs into a binding to feed ERB templates
class Hash
  def to_binding
    res = Object.new
    # define a method whose parameters are named after the keys, then call it
    # with the values, so each key becomes a local variable in the binding
    res.instance_eval("def binding_for(#{keys.join(',')}) binding end")
    res.binding_for(*values)
  end
end

# creates an HTML file using the template and data provided
def create_html_file(filename, template, data)
  File.open(filename, "w+:utf-8") do |file|
    file << ERB.new(File.read(template)).result(data.to_binding)
  end
end
To finally create the HTML file for a chapter, we only need to put the data for the template in a Hash, and then call our create_html_file method.
chapter_data = {:title => title, :index => 1, :content => content}
create_html_file("chapter1.html", "chapter.html.erb", chapter_data)
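As an aside, Ruby 2.5 and later ship ERB#result_with_hash, which accepts a Hash directly. If you can rely on such a version, it may spare you the Hash extension. A minimal sketch, with an inline template and made-up chapter data:

```ruby
require "erb"

# Inline template for illustration; the real one would be read from a file.
template = ERB.new("<h1><%= index %>. <%= title %></h1>\n<%= content %>")

# Each key of the hash becomes a local variable inside the template.
html = template.result_with_hash(
  title: "A New Start",
  index: 1,
  content: "<p>Once upon a time</p>"
)

puts html
```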
The only thing left is to create a loop that iterates over all chapters, parses them and creates their corresponding HTML files.
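One possible shape for that loop is sketched below. To keep it runnable offline, the download step is replaced by a stand-in lambda. The names build_chapters, fetcher and CHAPTER_TEMPLATE are assumptions made for this sketch, and it relies on ERB#result_with_hash (Ruby 2.5+) rather than the Hash extension above.

```ruby
require "erb"
require "tmpdir"

# Trimmed-down version of the chapter template, inlined for the sketch.
CHAPTER_TEMPLATE = ERB.new(<<~HTML)
  <html>
  <body>
  <h1><small>Chapter <%= index %></small><br /><%= title %></h1>
  <%= content %>
  </body>
  </html>
HTML

# fetcher is any callable that takes a chapter number and returns a hash
# with :title and :content (a hypothetical interface, not a library API).
# Returns the list of files written, in chapter order.
def build_chapters(chapter_count, output_dir, fetcher)
  (1..chapter_count).map do |index|
    data = fetcher.call(index).merge(index: index)
    filename = File.join(output_dir, "chapter#{index}.html")
    File.write(filename, CHAPTER_TEMPLATE.result_with_hash(data))
    filename
  end
end

# Stand-in for the real download and parse step.
fake_fetcher = ->(index) { { title: "Title #{index}", content: "<p>Text #{index}</p>" } }

Dir.mktmpdir do |dir|
  files = build_chapters(3, dir, fake_fetcher)
  puts files.map { |f| File.basename(f) }.inspect
end
```

In the real script, the lambda would be replaced by code that downloads the page and extracts the title and content with Nokogiri, as shown earlier. The returned list of filenames is what the OPF template needs as its chapters variable.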
With all the chapter HTML files already created, it's time to generate the OPF file. Inside Amazon's KindleGen distribution you can find a fairly complete OPF example file, but here's an ERB template with the bare minimum:
<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="BookId">
<!-- Metadata: -->
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
<dc:title><%= title %></dc:title>
<dc:creator><%= author %></dc:creator>
</metadata>
<!-- manifest (book content) -->
<manifest>
<!-- list of resources -->
<% chapters.each_with_index do |chapter, index| %>
<item id="item<%= index %>" media-type="text/x-oeb1-document" href="<%= chapter %>"></item>
<% end %>
</manifest>
<!-- our content, ordered -->
<spine>
<!-- item-toc needs a matching item in the manifest; drop this line if there is no table of contents -->
<itemref idref="item-toc"/>
<% chapters.each_index do |index| %>
<itemref idref="item<%= index %>"/>
<% end %>
</spine>
<!-- Guide key points -->
<guide>
<reference type="text" title="Beginning" href="<%= chapters.first %>"></reference>
</guide>
</package>
The two most important sections are the manifest and the spine. In the manifest we declare all the resources (in our case, the HTML files), and in the spine we reference those resources in reading order. Therefore, it doesn't matter in which order we declare our resources: the final book content will be ordered according to the spine.
If we want a table of contents, we only need to create a regular HTML file with links to the chapters (we can do this using an ERB template), and then include this file as another resource in the manifest.
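A table of contents template could look roughly like this. It is a sketch under assumptions: the chapter list is a made-up array of hashes with :file and :title keys, and the template is rendered with ERB#result_with_hash (Ruby 2.5+).

```ruby
require "erb"

# Minimal table of contents: one link per chapter file.
TOC_TEMPLATE = ERB.new(<<~HTML)
  <html>
  <body>
  <h1>Table of Contents</h1>
  <ul>
  <% chapters.each do |chapter| %>
  <li><a href="<%= chapter[:file] %>"><%= ERB::Util.html_escape(chapter[:title]) %></a></li>
  <% end %>
  </ul>
  </body>
  </html>
HTML

# Made-up data; in the real script this comes from the parsed chapters.
chapters = [
  { file: "chapter1.html", title: "A New Start" },
  { file: "chapter2.html", title: "The Road" }
]

toc_html = TOC_TEMPLATE.result_with_hash(chapters: chapters)
puts toc_html
```

Once this output is saved to a file, the manifest needs an item pointing at it, with an id that matches the item-toc reference in the spine.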
KindleGen is fairly easy to use: we only need to pass the name of the OPF file and we'll get a MOBI file. There are other flags, but for now we'll only use -unicode, since the website we parsed uses a Unicode encoding (KindleGen's default encoding is Latin-1).
Assuming we have installed KindleGen in some location accessible in our path, we can call it inside our Ruby script by using backticks:
puts `kindlegen book.opf -unicode`
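If you would rather know whether the call worked than just print its output, Kernel#system is an alternative to backticks. The sketch below is an assumption about how you might wrap it. The helper names are made up, and it only passes the same arguments used above.

```ruby
# Hypothetical helper: the argument list for the KindleGen call shown above.
# Passing separate arguments to system avoids shell quoting problems
# with file names that contain spaces.
def kindlegen_command(opf_path)
  ["kindlegen", opf_path, "-unicode"]
end

# Runs KindleGen and returns true on a zero exit status, false otherwise.
def run_kindlegen(opf_path)
  ok = system(*kindlegen_command(opf_path))
  # system returns nil when the command could not be executed at all
  raise "kindlegen could not be executed; is it in your PATH?" if ok.nil?
  ok
end

# Usage (commented out so the sketch loads without KindleGen installed):
# run_kindlegen("book.opf")
```

A false result only tells you that the exit status was non-zero. Check KindleGen's own output to find out why.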
And that's all, folks! If you are curious or get stuck, take a look at my GitHub repository, which contains a full working example, including table of contents creation.
This was a very basic example of how to generate an e-book from a website. In the real world, chances are that you will stumble upon a login screen, or that you will need to navigate a site to automate this process even further. In that case, take a look at Mechanize, a Ruby gem that automates interaction with a website as if it were a browser.