Jonathan Martin
Ruby Extension: HTML Truncation
~ by Jonathan Martin
Another tip (err hurdle) I came across during the production of this blog — truncating an HTML string. Easy, right?
It seems simple enough: show a short excerpt of a long entry. Excerpts are extremely popular in blogs, catalogs, portfolios, etc., and with good reason: the average visitor wants to find content by skimming, not by endless scrolling.
But a good trimmer must keep a few things in mind.
- Don’t split words
- Recognize/respect HTML tags
- Parse HTML according to standards
These add up to some pretty tough requirements once you actually get to coding. First, unless we want to parse HTML by hand, we’ll have to use a standards-based parser and loop through the elements until the specified number of characters or words (excluding tags!) is exceeded. At that point we append a user-defined tail, close any tags that are still open, and discard everything after.
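As a rough illustration of that loop, here is a toy sketch in plain Ruby. It assumes a word budget rather than a character budget, and it deliberately cheats on the parsing requirement: the regex tokenizer and the `naive_truncate_words` name are stand-ins invented for this example, not a standards-compliant parser and not the code used later in this post.

```ruby
# Toy illustration only: the regex tokenizer is a stand-in for a real HTML
# parser and will mishandle comments, scripts, void tags such as <br>, etc.
def naive_truncate_words(html, max_words, tail = '…')
  open_tags = []          # names of tags opened but not yet closed
  out = +''
  remaining = max_words   # words we may still emit

  html.scan(/<[^>]+>|[^<]+/) do |token|
    if token.start_with?('</')
      open_tags.pop
      out << token
    elsif token.start_with?('<')
      open_tags.push(token[/\A<\s*([a-zA-Z0-9]+)/, 1])
      out << token
    else
      words = token.split
      if words.length <= remaining
        # the whole text run fits in the budget
        out << token
        remaining -= words.length
      else
        # over budget: keep whole words only, then append the tail
        kept = words.first(remaining)
        out << token[/\A\s*/] << kept.join(' ') unless kept.empty?
        out << tail
        # close whatever is still open, innermost first, and stop
        open_tags.reverse_each { |name| out << "</#{name}>" }
        return out
      end
    end
  end
  out # the input already fit within the budget
end

naive_truncate_words('<p>Testing this HTML truncater.</p><p>To see if its working.</p>', 6)
# => "<p>Testing this HTML truncater.</p><p>To see…</p>"
```

Tags never count against the budget, words are never split, and the tail lands inside the last open element. A real parser is still needed for anything beyond simple markup.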
Update: the latest version of this handy widget is now available as a gem! Check it out at rubygems.org/gems/butter or install it with gem install butter.
First attempt
Solution 1 came from a blog, and was then heavily modified to make it work with a more modern interface.
require 'rexml/parsers/pullparser'
require 'htmlentities'

# Note: String#first and #html_safe? come from ActiveSupport (Rails).
class String
  # Truncate strings containing HTML code
  # Usage example: "string".truncate_html(50, :word_cut => false, :tail => '[+]')
  def truncate_html(len = 30, opts = {})
    opts = {:word_cut => true, :tail => '…'}.merge(opts)
    p = REXML::Parsers::PullParser.new(self)
    coder = HTMLEntities.new
    tags = []      # stack of currently open tag names
    new_len = len  # characters of text still allowed
    results = ''
    while p.has_next? && new_len > 0
      p_e = p.pull
      case p_e.event_type
      when :start_element
        tags.push p_e[0]
        results << "<#{tags.last} #{attrs_to_s(p_e[1])}>"
      when :end_element
        results << "</#{tags.pop}>"
      when :text
        text = coder.decode(p_e[0])
        if (text.length > new_len) && !opts[:word_cut]
          # cut at the next space after the limit; if there is none,
          # keep the whole text instead of passing nil to #first
          piece = text.first(text.index(' ', new_len) || text.length)
        else
          piece = text.first(new_len)
        end
        results << coder.encode(piece)
        new_len -= text.length
      else
        results << "<!-- #{p_e.inspect} -->"
      end
    end
    # close any tags that are still open, innermost first
    tags.reverse.each do |tag|
      results << "</#{tag}>"
    end
    results << opts[:tail]
    if html_safe?
      results.html_safe
    else
      results
    end
  end

  private

  # Serialize an attribute hash back into name="value" pairs
  def attrs_to_s(attrs)
    if attrs.empty?
      ''
    else
      attrs.to_a.map { |attr| %{#{attr[0]}="#{attr[1]}"} }.join(' ')
    end
  end
end
This worked great…at first. All went well in development, but once I launched into production and started using slightly more complex HTML, it completely crashed the index pages. What was the cause? An <em> tag. Why? I have no clue, but I know where the problem occurred: the REXML parser. On top of being an outdated parser (I suppose that includes choking on em tags), it is one of the slowest parsers out there. So my nice little script from the web was just about useless, and frankly it seemed way too complex and inelegant for our modern gem-based apps.
Final attempt
Thankfully though, with some more searching I found an elegant solution using Nokogiri (the best parser gem by far!) and some creative word boundary logic. The bulk of this code (and comments) was designed by Eleo, but I modified the interface a bit (instance method instead of class method) and added a few other tweaks.
require 'nokogiri'
require 'htmlentities'

# Note: #html_safe? comes from ActiveSupport (Rails).
class String
  # Truncate HTML to num_words words (tags are not counted).
  # :word_cut is accepted but not used by this version yet.
  def truncate_html(num_words = 30, opts = {})
    opts = {:word_cut => true, :tail => "…"}.merge(opts)
    tail = HTMLEntities.new.decode(opts[:tail])
    doc = Nokogiri::HTML(self)
    current = doc.children.first
    count = 0
    while true
      # we found a text node
      if current.is_a?(Nokogiri::XML::Text)
        count += current.text.split.length
        # we reached our limit, let's get outta here!
        break if count > num_words
        previous = current
      end
      if current.children.length > 0
        # this node has children, can't be a text node,
        # lets descend and look for text nodes
        current = current.children.first
      elsif !current.next.nil?
        # this has no children, but has a sibling, let's check it out
        current = current.next
      else
        # we are the last child, we need to ascend until we are
        # either done or find a sibling to continue on to
        n = current
        while !n.is_a?(Nokogiri::HTML::Document) && n.parent.next.nil?
          n = n.parent
        end
        # we've reached the top and found no more text nodes, break
        if n.is_a?(Nokogiri::HTML::Document)
          break
        else
          current = n.parent.next
        end
      end
    end
    if count >= num_words
      unless count == num_words
        new_content = current.text.split
        # If we're here, the last text node we counted eclipsed the number of words
        # that we want, so we need to cut down on words. The easiest way to think about
        # this is that without this node we'd have fewer words than the limit, so all
        # the previous words plus a limited number of words from this node are needed.
        # We simply need to figure out how many words are needed and grab that many.
        # Then we need to -subtract- an index, because the first word would be index zero.
        # For example, given:
        # <p>Testing this HTML truncater.</p><p>To see if its working.</p>
        # Let's say I want 6 words. The correct returned string would be:
        # <p>Testing this HTML truncater.</p><p>To see...</p>
        # All the words in both paragraphs = 9
        # The last paragraph is the one that breaks the limit. How many words would we
        # have without it? 4. But we want up to 6, so we might as well get that many.
        # 6 - 4 = 2, so we get 2 words from this node, but words #1-2 are indices #0-1, so
        # we subtract 1. If this gives us -1, we want nothing from this node. So go back to
        # the previous node instead.
        index = num_words - (count - new_content.length) - 1
        if index >= 0
          new_content = new_content[0..index]
          current.content = new_content.join(' ') + tail
        else
          current = previous
          current.content = current.content + tail
        end
      end
      # remove everything else
      while !current.is_a?(Nokogiri::HTML::Document)
        while !current.next.nil?
          current.next.remove
        end
        current = current.parent
      end
    end
    # now we grab the html and not the text.
    # we do first because nokogiri adds html and body tags
    # which we don't want.
    # note: this returns the inner HTML of the first element under <body>,
    # so the input is expected to be wrapped in a single root element
    truncated = doc.root.children.first.children.first.inner_html
    if html_safe?
      truncated.html_safe
    else
      truncated
    end
  end
end
I’ve been really happy with the success of this particular implementation. In addition to using Nokogiri, the code is easier to understand and allows for the input of a number of words, rather than number of characters.
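The index arithmetic explained in the long comment is easier to check in isolation. The snippet below is a small sketch that pulls just that calculation into a standalone helper; `words_to_keep` is a hypothetical name used for illustration and is not part of the method above.

```ruby
# Hypothetical helper that isolates the word-budget arithmetic for the
# text node that pushes the running count past the limit.
#   node_words - the words of that text node
#   count      - running word count, including this node's words
#   num_words  - the requested limit
def words_to_keep(node_words, count, num_words)
  # count already includes this node, so subtract its words to find
  # how many words came before it
  words_before = count - node_words.length
  # index of the last word to keep; -1 means nothing from this node
  index = num_words - words_before - 1
  index >= 0 ? node_words[0..index] : []
end

second_paragraph = 'To see if its working.'.split
words_to_keep(second_paragraph, 9, 6)  # => ["To", "see"]
```

With the example from the comments (9 words in total, a limit of 6, and 5 words in the last paragraph), 4 words come before the node, so the index is 6 - 4 - 1 = 1 and the first two words are kept.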
Future additions
One of the tweaks I made was the options hash, as I foresee eventually adding more options to the truncate operation, such as using a character count as the length, ability to split before the word boundary, etc. For the moment however, this approach has worked really well. Remember to add this code to string.rb in your lib directory, and include it with auto_require.
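To show why an options hash makes later additions cheap, here is a minimal sketch of the defaults-plus-merge pattern. The `:unit` key and the `resolve_truncate_opts` helper are hypothetical, made up for this example; the method above only knows `:word_cut` and `:tail`.

```ruby
# Sketch only: :unit is a hypothetical future option, not something
# truncate_html supports today.
def resolve_truncate_opts(opts = {})
  defaults = { :word_cut => true, :tail => '…', :unit => :words }
  # reject typos early instead of silently ignoring them
  unknown = opts.keys - defaults.keys
  raise ArgumentError, "unknown option(s): #{unknown.inspect}" unless unknown.empty?
  # caller-supplied values win over the defaults
  defaults.merge(opts)
end

opts = resolve_truncate_opts(:tail => '[+]')
opts[:tail]  # the tail is overridden; the other keys keep their defaults
```

Existing callers keep working when a new key is added, because anything they do not pass falls back to its default.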
As always, if you guys have any other options you think might be useful to the truncate_html method, or alternate solutions for that matter, mention it in the comments!