Friday, June 19, 2009

Easy solution to strip certain HTML elements using XSLT with Groovy

I won't spend much time explaining XSLT or Groovy. Follow the links associated with each to learn more about them.

The problem at hand: we often scrape html from existing sites to create one mashed-up html page, and many times we don't want certain html elements to be included in the result. How do we do that? Fortunately, XSLT comes to our rescue. To run the XSLT engine I will be using Groovy, a Java-based scripting language. Scripting languages are a good fit when you want quick results: you can bootstrap the scripting engine quickly from your server and get the job done. Groovy is a very useful scripting language, and Grails, a web framework based on Groovy, has now come up to help you develop web applications with minimal effort. I think it is going to be a serious competitor to the much more popular Ruby-based Rails framework.

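So let's jump to the problem and its solution:
The problem is that, from the html defined in the input variable, we don't want certain div elements to be included in the resulting html.
We simply define a valid XSLT pattern for them.
Here it is: div[starts-with(@id, 'ad')]
This simply means: exclude all divs whose id starts with 'ad'. XSLT patterns have no wildcards, so a test like @id='ad*' would only match the literal id 'ad*'; the starts-with() function does the prefix match.

The solution is described in the Groovy script below.

import javax.xml.transform.TransformerFactory
import javax.xml.transform.stream.StreamResult
import javax.xml.transform.stream.StreamSource

// Match every div whose id attribute begins with 'ad'
def pattern = "div[starts-with(@id, 'ad')]"

// The input must be well-formed XML, so everything sits inside one root element
def input = """
<div>
<div id="ad1x1">
this is NOT OK!
</div>
<div id="ab1x1">
this is OK!
</div>
this is OK!
</div>
"""

def xslt = """
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<!-- By default, copy all nodes unchanged -->
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()"/>
</xsl:copy>
</xsl:template>
<!-- but strip the matched ones -->
<xsl:template match="${pattern}" />
</xsl:stylesheet>
"""

// Compile the stylesheet, then run it over the input and print the result
def factory = TransformerFactory.newInstance()
def source = new StreamSource(new StringReader(xslt))
def transformer = factory.newTransformer(source)
transformer.transform(new StreamSource(new StringReader(input)), new StreamResult(System.out))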

As you can see, defining the pattern is simple: all you need is an html element and the attributes to match for exclusion.
The resulting html after the transformation is the input html minus the excluded elements.
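
To reuse the approach with other patterns, the script can be wrapped in a small helper that returns the result as a string instead of printing it. This is only a sketch: the name stripElements, the sample markup and the class-based pattern are made up for illustration, and it assumes the input is already well-formed XML.

```groovy
import javax.xml.transform.OutputKeys
import javax.xml.transform.TransformerFactory
import javax.xml.transform.stream.StreamResult
import javax.xml.transform.stream.StreamSource

// Hypothetical helper: copies the input and drops every node matching the pattern
String stripElements(String html, String pattern) {
    def xslt = """<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="@* | node()">
    <xsl:copy><xsl:apply-templates select="@* | node()"/></xsl:copy>
  </xsl:template>
  <xsl:template match="${pattern}"/>
</xsl:stylesheet>"""
    def transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xslt.toString())))
    // Leave out the <?xml ...?> declaration so the result can be embedded in a page
    transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes")
    def out = new StringWriter()
    transformer.transform(new StreamSource(new StringReader(html)), new StreamResult(out))
    return out.toString()
}

// Made-up sample markup with two kinds of unwanted element
def page = '<div><div id="ad1x1">banner</div><p class="promo">offer</p><p>content</p></div>'

// A union pattern (|) strips several kinds of element in one pass
def cleaned = stripElements(page, "div[starts-with(@id, 'ad')] | p[@class='promo']")
println cleaned
```

The union operator keeps everything in a single stylesheet pass, so adding another kind of element to exclude only means extending the pattern string.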

Hope you found this useful.

If you have any comments, feel free to contact me.

