Will "the Mighty" Strohl

Iterating Through to Find Invalid XML Files

I have a custom DotNetNuke module that displays various information about hotels.  As you might imagine, this information could become quite lengthy for each hotel.  The module reads the requested URL and uses the hotel information in it to determine where to load its information from.

While the hotel information could come from a database table, it instead comes from an XML file.  There is a single XML file for each hotel.  These XML files are downloaded from a data provider regularly to make sure the information is up-to-date.  There are well more than 100,000 XML files in my situation.

Recently, some of these files have become invalid.  After some troubleshooting, I found that some XML fields now contain HTML.  On its own, the new HTML makes the XML files invalid.

Why is HTML Invalid in XML?

By default, you cannot put raw HTML into an XML document.  HTML usually contains unescaped characters or unclosed tags that XML does not allow, so an XML parser, in any language, will throw an error when you try to load the document programmatically.  This is easily fixed.  All you have to do is wrap the content of the affected XML elements in a CDATA section.  For example, the following snippet shows an XML element with invalid content.

<descriptionText>The <u>really cool</u> brown fox ran fast & hard.</descriptionText>

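Unfortunately, XML parsers will throw an error when they attempt to parse this element.  The < and & characters are reserved in XML and cannot appear unescaped in element text.  Here, the bare & is what stops the parser, and even where the <u> tags are balanced, they would be read as a child element rather than as part of the text.  The previous snippet can be parsed, however, once its text is wrapped in a CDATA section.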

<descriptionText><![CDATA[The <u>really cool</u> brown fox ran fast & hard.]]></descriptionText>

All I did was add the <![CDATA[ ... ]]> section to wrap the text.  Anything inside of the CDATA section is treated as literal text, so the parser ignores any reserved or illegal characters in it.  The only sequence that cannot appear inside is ]]>, because it ends the section.
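To see the difference in practice, here is a minimal sketch that feeds both snippets to a parser.  It is written in Python with the standard library's ElementTree rather than .NET, and the names BAD, GOOD, and try_parse are made up for this illustration.  It only assumes that the parser rejects anything that is not well-formed, which is the same behavior described above.

```python
import xml.etree.ElementTree as ET

# The two snippets from this post: raw HTML in the element, then the same text in CDATA.
BAD = "<descriptionText>The <u>really cool</u> brown fox ran fast & hard.</descriptionText>"
GOOD = ("<descriptionText><![CDATA[The <u>really cool</u> brown fox ran fast & hard.]]>"
        "</descriptionText>")


def try_parse(xml_text):
    """Return the element's text if the snippet parses, or None if the parser rejects it."""
    try:
        return ET.fromstring(xml_text).text
    except ET.ParseError:
        return None


bad_result = try_parse(BAD)    # the bare & makes the parser give up
good_result = try_parse(GOOD)  # the CDATA content comes back as plain text

print(bad_result)
print(good_result)
```

The first call comes back empty because of the parse error, while the second returns the original sentence with its HTML tags and ampersand intact.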

Which XML Files are Invalid?

The hard way to find out what files are invalid is to open them one at a time in an editor that parses the file.  We obviously wouldn't want to do that.  In my situation, the first thing I needed to find out is exactly how many of my 100,000+ XML files are invalid.  This would help me decide how to proceed in terms of a solution.

Instead, I wrote a quick and dirty ASPX web page to iterate through all of the XML files, attempt to parse them, and display the file name of each one that is invalid.  Here is a screen shot of what that might look like once the web page is run.

[Screen shot: List of Invalid XML Files]

I have attached my sample web page to this blog entry.  The mark-up and codebehind are commented thoroughly to assist you.  You should be able to just download it and place it in your website.  In order for it to work though, you first need to go into the codebehind, and change the file path it searches through.

This file searches through the directory specified in the codebehind.  Then, it attempts to load each XML file into an in-memory XmlDocument object.  If loading a file throws an error, that XML file is invalid.  It's as simple as that.
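If you just want the idea without the download, the same approach can be sketched in a few lines.  This is not the attached ASPX page: it is a rough Python equivalent that uses ElementTree in place of XmlDocument.  The function name find_invalid_xml_files and the two demo file names are hypothetical, and walking into subfolders is my assumption here.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def find_invalid_xml_files(root_dir):
    """Walk root_dir and return the sorted paths of .xml files that fail to parse."""
    invalid = []
    for folder, _subdirs, files in os.walk(root_dir):
        for name in files:
            if not name.lower().endswith(".xml"):
                continue  # skip anything that is not an XML file
            path = os.path.join(folder, name)
            try:
                ET.parse(path)  # load attempt; the parsed tree itself is not needed
            except ET.ParseError:
                invalid.append(path)  # a parse error means the file is invalid
    return sorted(invalid)


# Small demo: a temporary folder with one valid and one invalid (hypothetical) file.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "hotel-ok.xml"), "w", encoding="utf-8") as f:
        f.write("<hotel><name>Sample</name></hotel>")
    with open(os.path.join(demo_dir, "hotel-bad.xml"), "w", encoding="utf-8") as f:
        f.write("<hotel><name>Fast & hard</name></hotel>")
    invalid_names = [os.path.basename(p) for p in find_invalid_xml_files(demo_dir)]

print(invalid_names)
```

To run it against real data, you would pass your own folder path to find_invalid_xml_files instead of the temporary demo folder.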

This file searched through 100,000+ XML files in less than 30 seconds for me.  In addition, it showed me the file path, file name, and made each file clickable.  This allowed me to open the file to fix it individually if I decided to do so.

Attached File: FindErroredXMLFiles.zip
