HTML validation in minitest with Sinatra

You would think that automatic HTML validation or malformed markup detection is a well solved problem in 2016, and you would be right... kind of. It is, for some well-known frameworks. But I had some trouble finding the right solution for the combination of Sinatra and minitest.

Since it took me at least two days to figure it out I decided to share it here, maybe somebody finds it useful!

TL;DR version

I ended up using Eric Beland's HTML validation gem, for which you need tidy-HTML5. I made it optional to install tidy though. In the tests, I make a request and validate the response against the gem's methods, and I skip the HTML validation if tidy is not installed locally. But I do install tidy on Travis so that these tests always run there (this also means that if you broke the HTML with a change to an erb template, and you don't have tidy installed locally, you won't notice until Travis is finished running).

You can find the PR that implements this gem and take a look at the Travis file here. The PR also includes a test using it and the shell script that installs tidy on Travis.

Ok, so now: if you like movies like Indiana Jones, Star Wars or Lord of the Rings, keep reading for some adventures that you may or may not enjoy!

A long time ago in a galaxy far, far away...

...a developer made a PR to add nothing more than four characters: </a>. The reason was so that nations could remember the reviewer as "the human that looked at the most complex PR and survived to tell the story" (please always do tiny PRs. Thank you 🙂 ) .

The reviewer wisely mentioned that bug fixes should always have regression tests (i.e., I didn't have tests to detect malformed markup). Fear not, as our dear heroes were prepared to save the world by using Nokogiri in strict mode and adding a test to check that the errors array was empty.

But this, of course, opened Pandora's box. Otherwise, there would be no story.

Nokogiri doesn't seem to speak HTML5

Despite their documentation page on markup validation, Nokogiri's syntax checker doesn't seem to detect all malformed markup, only some of it. What's more, it doesn't seem to support HTML5 tags. For example, in pry, I tried:

Nokogiri::HTML("<header>") {|config| config.strict }.errors
=> [#<Nokogiri::XML::SyntaxError: Tag header invalid>]

Nokogiri::HTML("<div><a>") {|config| config.strict }.errors
=> []

Nokogiri::HTML("<div></a>") {|config| config.strict }.errors
=> [#<Nokogiri::XML::SyntaxError: Unexpected end tag : a>]

Nokogiri::HTML("<a/><a/><a/>") {|config| config.strict }.errors
=> []

You can use Nokogiri::XML as well, but then you get other errors, like:

Nokogiri::XML::SyntaxError: Premature end of data in tag html line 5

When I opened the page in a browser, line 5 was an innocent and harmless <html class="no-js">.

I Googled and couldn't believe that nobody was complaining about this. I went to Nokogiri's GitHub repo and there are no similar issues (also nothing on their Nokogiri talk page), apart from this, but they basically ignored it. It is from 2013 but, as you can see, I tried it in pry and still get the same result.

I posted a question on Stack Overflow that same day and I am still waiting for a reply.

So I searched for other native options that wouldn't force me to install another gem.

Other native options

Things that came to my mind:

  • Is there a linter that I can use in our editors and maybe drop and version-control a config file (like .eslintrc in JavaScript projects)? I could only find a linter-erb module, but it is for Atom, and it doesn't need any config file.
  • There is a rails-erb-lint gem, but it is for Rails obviously. Nothing for Sinatra.
  • Maybe I could find something like rubocop for erb files? I found nothing like that, or it would lint just the Ruby code in them. But, in Mark's more interesting words:

    evaluating the erb might always produce valid HTML. Or the other way around. In computer science terms, telling if erb always evaluates to valid HTML is equivalent to the Halting Problem, so you can't do it.
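Here is a tiny illustration of Mark's point, using only Ruby's standard erb library. The template and the close_link variable are made up for this example; they are not from the project. The same template yields well-formed or malformed markup depending on the data it receives, so looking at the template alone cannot settle the question:

```ruby
require "erb"

# Hypothetical template: whether the link gets closed depends on runtime data.
TEMPLATE = ERB.new('<p><a href="/home">Home<% if close_link %></a><% end %></p>')

# Renders the template with the given value for the local `close_link`.
def render_link(close_link)
  TEMPLATE.result_with_hash(close_link: close_link)
end

puts render_link(true)   # <p><a href="/home">Home</a></p>
puts render_link(false)  # <p><a href="/home">Home</p>
```

This is why validating the rendered response in a test, rather than linting the erb file, is the approach the rest of this post goes for.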

So, with no native options... I searched for a gem next.

Raiders of the lost gem

Basically this is the summary of what I found:

  • Sanitize (updated a month ago), it is a tool to fix broken HTML, rather than just to detect it, so I discarded it. I want to fix the errors at the template level, before the HTML is produced. But it is based on Gumbo, and it was made by the guy who made the RawGit CDN, so you might find it cool for some of your projects.
  • NokoGumbo (updated Aug 2016), parses the HTML, but doesn't seem to give you errors. I tried it, and not only does it use Nokogiri, but it is also in the category of gems that will fix the HTML instead of telling you what's wrong. Discarded.
  • HTML proofer (updated Aug 2016), checks HTML files, (or directories full of HTML files), so you would have to do some dance to pass it the string of the HTML response in tests:

    or if that doesn't work you can try this:
, type: :file, check_html: true, checks_to_ignore: ["ScriptCheck", "ImageCheck", "LinkCheck"]).process_files

    then a missing </div> causes the return values to be something like this:

    [{:external_urls=>{}, :failures=>[#<HTMLProofer::Issue:0x007fd9b47414a0 @desc="Opening and ending tag mismatch: header and div", @line=" (line 78)", @path="index.html", @status=-1>]}]

    Sadly, it uses Nokogiri for the validation, so I discarded it. The fun fact is that I am already using it in another project (to validate the EP-data docs site). At least they took the trouble to manually ignore the invalid tag errors that appear for HTML5 tags, as well as htmlParseEntityRef errors.

    This gave me the idea of monkey-patching Nokogiri to achieve the same and avoid installing the gem:

    module Nokogiri
      module XML
        class DocumentFragment < Nokogiri::XML::Node
          def errors
            # iterate on document.errors and do stuff with them
          end
        end
      end
    end

    because WHAT COULD POSSIBLY GO WRONG. More importantly, I had already discarded Nokogiri! Please laugh at me. Harder. (Of course I didn't carry on with this).

  • be valid asset (updated Jul 2016) was made for RSpec. The cool thing about it is that it can also validate CSS. But they officially confirmed that it can't be used with minitest. Discarded.
  • Ruby-vnu (updated 2015) is a bit old (a colleague commented "it must be super robust because it has no issues!" LOL).

    But the cool thing about it is that it uses the nu validator, which is the official W3C validator. HOW COOL IS THAT. I tried it and confirmed that I get the same errors as with the online W3C validator.



    ...Travis was red. 🙁

    The nu validator was built in Java, so you have to install the openjdk-8-jre-headless package. I didn't notice any problems locally because I have worked with Java and Android projects in the past, and I probably have more Java stuff on my laptop than you could expect the average Ruby dev to have.

    Ok so this should be easy, I just have to install Java in Travis. How complex could that be?

    Turns out that I couldn't use Java 8 in our container-based build on Travis. I tried installing the openjdk-7-jre-headless package instead, but then the nu validator didn't work. I switched to Trusty on Travis so that I could use Java 8, which also means some commands have to be run with sudo, but I discovered that Travis was still using a container-based build (and you can't use sudo on Travis' container-based builds). I tried many other things in the Travis config file, including the Travis file that nu-validator itself uses, with no success.

    I sent a message to Travis on Twitter.

    Introducing a Java dependency scared the team at more levels than just getting Travis working. So we agreed to make this work in a way that people using it didn't have to install Java (If you are curious about why the nu validator was written in Java, read here, the section "Java? Eww. Why didn’t you write it in Python or Ruby?").

    At this point I just went back to the happy times when I just wanted to close an unclosed link tag. And this also came to my mind:

    So, since our tests all use spec format and I just use minitest to drive them rather than rspec, I thought of trying out one of the solutions made for RSpec, maybe using a thin shim that transforms them to work with whatever minitest does with specs. The other option was moving to RSpec right away. But really the requirements at this point were:

    1. it runs on Travis
    2. it doesn't fail if someone doesn't have it installed
    3. it doesn't force someone to install something that might be complex
    4. it should be easy for anyone to run locally if the prereqs are installed

    So I moved to the next option:

  • Finally, HTML validation (updated Aug 2015), it was made for RSpec, although it seems to work without RSpec as well. However, it requires a previous installation of tidy for HTML5. So it was a decision between installing tidy and installing Java, really. Fortunately, the installation of tidy worked on Travis.

Both Mac and Linux seem to come with a version of Tidy preinstalled, and in both cases it is a version that is about ten years old. So it is recommended that you uninstall the old version of Tidy first. Then, to install tidy-HTML5, on Mac you can do brew install tidy, and on Linux you can follow the instructions in our shell script in viewer sinatra, skipping the step where it creates a bin directory in the home directory. Or you can just download the deb if you use a Debian-based distro.

To exclude running those tests if the dev doesn't have tidy installed, I thought about "tagging" those specs as needing html-tidy and only running them on Travis by default, using something like minispec-metadata, then adding html_validations: true or something to those specs, etc. But a much simpler solution is to just use skip:

def last_response_must_be_valid
  skip if `which tidy`.empty?
  validation =, last_request.url)
  assert validation.valid?, validation.exceptions
end

and then your tests will whisper to you like this: ······SSSS············· 🙂
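One caveat with the `which tidy` check: as mentioned above, Mac and Linux may ship an old Tidy, in which case the check passes even though tidy-HTML5 is missing. A rough sketch of a stricter check follows. The helper names (find_executable, tidy_html5?) are hypothetical, and the version-string format is an assumption: tidy-HTML5 builds seem to print something like "HTML Tidy for Linux version 5.6.0", while the old builds print a release date instead. Check it against the output of tidy -v on your machine before relying on it.

```ruby
# Returns the full path of `name` if it is an executable file in one of the
# directories of `path` (defaults to the PATH environment variable), else nil.
def find_executable(name, path = ENV.fetch("PATH", ""))
  path.split(File::PATH_SEPARATOR)
      .reject(&:empty?)
      .map { |dir| File.join(dir, name) }
      .find { |file| File.file?(file) && File.executable?(file) }
end

# Assumption: tidy-HTML5 reports "version 5.x.y" (or later); older builds
# report "released on <date>" with no version number.
def tidy_html5?(version_output)
  major = version_output[/version (\d+)\./, 1]
  !major.nil? && major.to_i >= 5
end

# Possible use inside the helper, replacing the `which` check:
#   tidy = find_executable("tidy")
#   skip "tidy-HTML5 not installed" unless tidy && tidy_html5?(`#{tidy} -v`)
```

Passing a message to skip also makes the reason show up in minitest's verbose output, which helps whoever wonders where those S's come from.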

Other gems

There are other gems out there, but I didn't bother with them because they seem to have been left forgotten. I leave them here for posterity:

Phew! That was all. I hope I didn't bore you and you found it useful!