Code Answer: Program to analyze a lot of XMLs

I have a lot of XML files and I'd like to generate a report from them. The report should provide information such as:

root 100%
 a*1 90%
 b*1 80%
  c*5 40%

meaning that all documents have a root element, 90% have one a element in the root, 80% have one b element in the root, 40% have 5 c elements in b.

If for example some documents have 4 c elements, some 5 and some 6, it should say something like:

c*4.3 4 6 40%

meaning that 40% have between 4 and 6 c elements there, and the average is 4.3.

I am looking for free software, if it doesn't exist I'll write it. I was about to do it, but I thought about checking it. I may not be the first one to have to analyze and get an structural overview of thousand of XML files.

From stackoverflow J. Pablo Fernández

Here's an XSLT 2.0 method.

Assuming that $docs contains a sequence of document nodes that you want to scan, you want to create one line for each element that appears in the documents. You can use <xsl:for-each-group> to do that:

<xsl:for-each-group select="$docs//*" group-by="name()">
  <xsl:sort select="current-group-key()" />
  <xsl:variable name="name" as="xs:string" select="current-grouping-key()" />
  <xsl:value-of select="$name" />
  ...
</xsl:for-each-group>

Then you want to find out the stats for that element amongst the documents. First, find the documents have an element of that name in them:

<xsl:variable name="docs-with" as="document-node()+"
  select="$docs[//*[name() = $name]" />

Second, you need a sequence of the number of elements of that name in each of the documents:

<xsl:variable name="elem-counts" as="xs:integer+"
  select="$docs-with/count(//*[name() = $name])" />

And now you can do the calculations. Average, minimum and maximum can be calculated with the avg(), min() and max() functions. The percentage is simply the number of documents that contain the element divided by the total number of documents, formatted.

Putting that together:

<xsl:for-each-group select="$docs//*" group-by="name()">
  <xsl:sort select="current-group-key()" />
  <xsl:variable name="name" as="xs:string" select="current-grouping-key()" />
  <xsl:variable name="docs-with" as="document-node()+"
    select="$docs[//*[name() = $name]" />
  <xsl:variable name="elem-counts" as="xs:integer+"
    select="$docs-with/count(//*[name() = $name])" />
  <xsl:value-of select="$name" />
  <xsl:text>* </xsl:text>
  <xsl:value-of select="format-number(avg($elem-counts), '#,##0.0')" />
  <xsl:text> </xsl:text>
  <xsl:value-of select="format-number(min($elem-counts), '#,##0')" />
  <xsl:text> </xsl:text>
  <xsl:value-of select="format-number(max($elem-counts), '#,##0')" />
  <xsl:text> </xsl:text>
  <xsl:value-of select="format-number((count($docs-with) div count($docs)) * 100, '#0')" />
  <xsl:text>%</xsl:text>
  <xsl:text>&#xA;</xsl:text>
</xsl:for-each-group>

What I haven't done here is indented the lines according to the depth of the element. I've just ordered the elements alphabetically to give you statistics. Two reasons for that: first, it's significantly harder (like too involved to write here) to display the element statistics in some kind of structure that reflects how they appear in the documents, not least because different documents may have different structures. Second, in many markup languages, the precise structure of the documents can't be known (because, for example, sections can nest within sections to any depth).

I hope it's useful none the less.

UPDATE:

Need the XSLT wrapper and some instructions for running XSLT? OK. First, get your hands on Saxon 9B.

You'll need to put all the files you want to analyse in a directory. Saxon allows you to access all the files in that directory (or its subdirectories) using a collection using a special URI syntax. It's worth having a look at that syntax if you want to search recursively or filter the files that you're looking at by their filename.

Now the full XSLT:

<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="xs">

<xsl:param name="dir" as="xs:string"
  select="'file:///path/to/default/directory?select=*.xml'" />

<xsl:output method="text" />

<xsl:variable name="docs" as="document-node()*"
  select="collection($dir)" />

<xsl:template name="main">
  <xsl:for-each-group select="$docs//*" group-by="name()">
    <xsl:sort select="current-group-key()" />
    <xsl:variable name="name" as="xs:string" select="current-grouping-key()" />
    <xsl:variable name="docs-with" as="document-node()+"
      select="$docs[//*[name() = $name]" />
    <xsl:variable name="elem-counts" as="xs:integer+"
      select="$docs-with/count(//*[name() = $name])" />
    <xsl:value-of select="$name" />
    <xsl:text>* </xsl:text>
    <xsl:value-of select="format-number(avg($elem-counts), '#,##0.0')" />
    <xsl:text> </xsl:text>
    <xsl:value-of select="format-number(min($elem-counts), '#,##0')" />
    <xsl:text> </xsl:text>
    <xsl:value-of select="format-number(max($elem-counts), '#,##0')" />
    <xsl:text> </xsl:text>
    <xsl:value-of select="format-number((count($docs-with) div count($docs)) * 100, '#0')" />
    <xsl:text>%</xsl:text>
    <xsl:text>&#xA;</xsl:text>
  </xsl:for-each-group>
</xsl:template> 

</xsl:stylesheet>

And to run it you would do something like:

> java -jar path/to/saxon.jar -it:main -o:report.txt dir=file:///path/to/your/directory?select=*.xml

This tells Saxon to start the process with the template named main, to set the dir parameter to file:///path/to/your/directory?select=*.xml and send the output to report.txt.

David Robbins : Jeni, I'm a big fan of yours, as your site has helped me with a great many challenges with XPath. I'm glad that you are joining us here at StackOverflow.

David Robbins : Saw your update - now I get it :)

From JeniT

Check out Gadget

From Mads Hansen
Beautiful Soup makes parsing XML trivial in python.

From jeremy

[community post, here: no karma involved;) ]
I propose a code-challenge here:

parse all xml find in xmlfiles.com/examples and try to come up with the following output:

Analyzing plant_catalog.xml: 
Analyzing note.xml: 
Analyzing portfolio.xml: 
Analyzing note_ex_dtd.xml: 
Analyzing home.xml: 
Analyzing simple.xml: 
Analyzing cd_catalog.xml: 
Analyzing portfolio_xsl.xml: 
Analyzing note_in_dtd.xml: 
Statistical Elements Analysis of 9 xml documents with 34 elements
CATALOG*2 22%
  CD*26 50%
    ARTIST*26 100%
    COMPANY*26 100%
    COUNTRY*26 100%
    PRICE*26 100%
    TITLE*26 100%
    YEAR*26 100%
  PLANT*36 50%
    AVAILABILITY*36 100%
    BOTANICAL*36 100%
    COMMON*36 100%
    LIGHT*36 100%
    PRICE*36 100%
    ZONE*36 100%
breakfast-menu*1 11%
  food*5 100%
    calories*5 100%
    description*5 100%
    name*5 100%
    price*5 100%
note*3 33%
  body*1 100%
  from*1 100%
  heading*1 100%
  to*1 100%
page*1 11%
  para*1 100%
  title*1 100%
portfolio*2 22%
  stock*2 100%
    name*2 100%
    price*2 100%
    symbol*2 100%

tye : This challenge doesn't honor the second part of the original question, tracking how many times a tag appears.

VonC : I am not sure I follow: the number of occurrence of a tag is right after the '*' . It that element were to appear with a different cardinality in different xml files, it would be displayed as "min...max average" cardinality. Since it is not the case in the xmlcodes example files, you have only '*x'.

From VonC

Here is a possible solution in ruby to this code-challenge...
Since it is my very first ruby program, I am sure it is quite terribly coded, but at least it may answer J. Pablo Fernandez's question.

Copy-paste it in a '.rb file and calls ruby on it. If you have an Internet connection, it will work ;)

require "rexml/document"
require "net/http"
require "iconv"
include REXML
class NodeAnalyzer
  @@fullPathToFilesToSubNodesNamesToCardinalities = Hash.new()
  @@fullPathsToFiles = Hash.new() #list of files in which a fullPath node is detected
  @@fullPaths = Array.new # all fullpaths sorted alphabetically
  attr_reader :name, :father, :subNodesAnalyzers, :indent, :file, :subNodesNamesToCardinalities
    def initialize(aName="", aFather=nil, aFile="")
     @name = aName; @father = aFather; @subNodesAnalyzers = []; @file = aFile
    @subNodesNamesToCardinalities = Hash.new(0)
    if aFather && !aFather.name.empty? then @indent = "  " else @indent = "" end
    if aFather
      @indent = @father.indent + self.indent
      @father.subNodesAnalyzers << self
      @father.updateSubNodesNamesToCardinalities(@name)
    end
    end
  @@nodesRootAnalyzer = NodeAnalyzer.new
  def NodeAnalyzer.nodesRootAnalyzer
    return @@nodesRootAnalyzer
  end
  def updateSubNodesNamesToCardinalities(aSubNodeName)
    aSubNodeCardinality = @subNodesNamesToCardinalities[aSubNodeName]
    @subNodesNamesToCardinalities[aSubNodeName] = aSubNodeCardinality + 1
  end
  def NodeAnalyzer.recordNode(aNodeAnalyzer)
    if aNodeAnalyzer.fullNodePath.empty? == false
      if @@fullPaths.include?(aNodeAnalyzer.fullNodePath) == false then @@fullPaths << aNodeAnalyzer.fullNodePath end
      # record a full path in regard to its xml file (records it only one for a given xlm file)
      someFiles = @@fullPathsToFiles[aNodeAnalyzer.fullNodePath]
      if someFiles == nil 
        someFiles = Array.new(); @@fullPathsToFiles[aNodeAnalyzer.fullNodePath] = someFiles; 
      end
      if !someFiles.include?(aNodeAnalyzer.file) then someFiles << aNodeAnalyzer.file end
    end
    #record cardinalties of sub nodes for a given xml file
    someFilesToSubNodesNamesToCardinalities = @@fullPathToFilesToSubNodesNamesToCardinalities[aNodeAnalyzer.fullNodePath]
    if someFilesToSubNodesNamesToCardinalities == nil 
      someFilesToSubNodesNamesToCardinalities = Hash.new(); @@fullPathToFilesToSubNodesNamesToCardinalities[aNodeAnalyzer.fullNodePath] = someFilesToSubNodesNamesToCardinalities ; 
    end
    someSubNodesNamesToCardinalities = someFilesToSubNodesNamesToCardinalities[aNodeAnalyzer.file]
    if someSubNodesNamesToCardinalities == nil
      someSubNodesNamesToCardinalities = Hash.new(0); someFilesToSubNodesNamesToCardinalities[aNodeAnalyzer.file] = someSubNodesNamesToCardinalities
      someSubNodesNamesToCardinalities.update(aNodeAnalyzer.subNodesNamesToCardinalities)
    else
      aNodeAnalyzer.subNodesNamesToCardinalities.each() do |aSubNodeName, aCardinality|
        someSubNodesNamesToCardinalities[aSubNodeName] = someSubNodesNamesToCardinalities[aSubNodeName] + aCardinality
      end
    end  
    #puts "someSubNodesNamesToCardinalities for #{aNodeAnalyzer.fullNodePath}: #{someSubNodesNamesToCardinalities}"
  end
  def file
    #if @file.empty? then @father.file else return @file end
    if @file.empty? then if @father != nil then return @father.file else return '' end else return @file end
  end
  def fullNodePath
    if @father == nil then return '' elsif @father.name.empty? then return @name else return @father.fullNodePath+"/"+@name end
  end
    def to_s
    s = ""
    if @name.empty? == false
      s = "#{@indent}#{self.fullNodePath} [#{self.file}]\n"
    end
    @subNodesAnalyzers.each() do |aSubNodeAnalyzer|
      s = s + aSubNodeAnalyzer.to_s
    end
    return s
    end
  def NodeAnalyzer.displayStats(aFullPath="")
    s = "";
    if aFullPath.empty? then s = "Statistical Elements Analysis of #{@@nodesRootAnalyzer.subNodesAnalyzers.length} xml documents with #{@@fullPaths.length} elements\n" end
    someFullPaths = @@fullPaths.sort
    someFullPaths.each do |aFullPath|
      s = s + getIndentedNameFromFullPath(aFullPath) + "*"
      nbFilesWithThatFullPath = getNbFilesWithThatFullPath(aFullPath);
      aParentFullPath = getParentFullPath(aFullPath)
      nbFilesWithParentFullPath = getNbFilesWithThatFullPath(aParentFullPath);
      aNameFromFullPath = getNameFromFullPath(aFullPath)
      someFilesToSubNodesNamesToCardinalities = @@fullPathToFilesToSubNodesNamesToCardinalities[aParentFullPath]
      someCardinalities = Array.new()
      someFilesToSubNodesNamesToCardinalities.each() do |aFile, someSubNodesNamesToCardinalities|
        aCardinality = someSubNodesNamesToCardinalities[aNameFromFullPath]
        if aCardinality > 0 && someCardinalities.include?(aCardinality) == false then someCardinalities << aCardinality end
      end
      if someCardinalities.length == 1
        s = s + someCardinalities.to_s + " "
      else
        anAvg = someCardinalities.inject(0) {|sum,value| Float(sum) + Float(value) } / Float(someCardinalities.length)
        s = s + sprintf('%.1f', anAvg) + " " + someCardinalities.min.to_s + "..." + someCardinalities.max.to_s + " "
      end
      s = s + sprintf('%d', Float(nbFilesWithThatFullPath) / Float(nbFilesWithParentFullPath) * 100) + '%'
      s = s + "\n"
    end
    return s
  end
  def NodeAnalyzer.getNameFromFullPath(aFullPath)
    if aFullPath.include?("/") == false then return aFullPath end
    aNameFromFullPath = aFullPath.dup
    aNameFromFullPath[/^(?:[^\/]+\/)+/] = ""
    return aNameFromFullPath
  end
  def NodeAnalyzer.getIndentedNameFromFullPath(aFullPath)
    if aFullPath.include?("/") == false then return aFullPath end
    anIndentedNameFromFullPath = aFullPath.dup
    anIndentedNameFromFullPath = anIndentedNameFromFullPath.gsub(/[^\/]+\//, "  ")
    return anIndentedNameFromFullPath
  end
  def NodeAnalyzer.getParentFullPath(aFullPath)
    if aFullPath.include?("/") == false then return "" end
    aParentFullPath = aFullPath.dup
    aParentFullPath[/\/[^\/]+$/] = ""
    return aParentFullPath
  end
  def NodeAnalyzer.getNbFilesWithThatFullPath(aFullPath)
    if aFullPath.empty? 
      return @@nodesRootAnalyzer.subNodesAnalyzers.length
    else
      return @@fullPathsToFiles[aFullPath].length;
    end
  end
end
class REXML::Document
    def analyze(node, aFatherNodeAnalyzer, aFile="")
    anNodeAnalyzer = NodeAnalyzer.new(node.name, aFatherNodeAnalyzer, aFile)
    node.elements.each() do |aSubNode| analyze(aSubNode, anNodeAnalyzer) end
    NodeAnalyzer.recordNode(anNodeAnalyzer)
  end
end

begin
  anXmlFilesDirectory = "xmlfiles.com/examples/"
  anXmlFilesRegExp = Regexp.new("http:\/\/" + anXmlFilesDirectory + "([^\"]*)")
  a = Net::HTTP.get(URI("http://www.google.fr/search?q=site:"+anXmlFilesDirectory+"+filetype:xml&num=100&as_qdr=all&filter=0"))
  someXmlFiles = a.scan(anXmlFilesRegExp)
  someXmlFiles.each() do |anXmlFile|
    anXmlFileContent = Net::HTTP.get(URI("http://" + anXmlFilesDirectory + anXmlFile.to_s))
    anUTF8XmlFileContent = Iconv.conv("ISO-8859-1//ignore", 'UTF-8', anXmlFileContent).gsub(/\s+encoding\s*=\s*\"[^\"]+\"\s*\?/,"?")
    anXmlDocument = Document.new(anUTF8XmlFileContent)
    puts "Analyzing #{anXmlFile}: #{NodeAnalyzer.nodesRootAnalyzer.name}"
    anXmlDocument.analyze(anXmlDocument.root,NodeAnalyzer.nodesRootAnalyzer, anXmlFile.to_s)
  end
  NodeAnalyzer.recordNode(NodeAnalyzer.nodesRootAnalyzer)
  puts NodeAnalyzer.displayStats
end

JesperE : Ouch. If you just used a little shorter identifiers, the code might even be readable.

From VonC

Go with JeniT's answer - she's one of the first XSLT guru's I started learning from back on '02. To really appreciate the power of XML you should work with XPath and XSLT and learn to manipulate the nodes.

VonC : I agree, JeniT's answer is the right track... but it is not a full script that I can just launch and check if it works. Feel free to write a more complete xslt script to answer the challenge ;)

David Robbins : Jeni assumes that you concatenate the docs before you run the script/ process. Wouldn't that suffice? You could concat the files with file streams, load up the XML/XSLT namespace stuff and execute.

JeniT : Actually, I don't assume that the docs are concatenated, but that they're in an XPath sequence of document nodes, which is most easily generated using `collection()`. I've updated my answer above to give the full stylesheet.

From David Robbins

Code Answer

Wednesday, February 9, 2011

Program to analyze a lot of XMLs

0 comments:

Post a Comment

Blog Archive