Wednesday, February 9, 2011

Program to analyze a lot of XMLs

I have a lot of XML files and I'd like to generate a report from them. The report should provide information such as:

root 100%
 a*1 90%
 b*1 80%
  c*5 40%

meaning that all documents have a root element, 90% have one a element in the root, 80% have one b element in the root, 40% have 5 c elements in b.

If for example some documents have 4 c elements, some 5 and some 6, it should say something like:

c*4.3 4 6 40%

meaning that 40% have between 4 and 6 c elements there, and the average is 4.3.
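
The line format described above can be sketched in Ruby as follows. This is only an illustration: `report_line` is a hypothetical helper, and it assumes you already have, for one element, the number of occurrences in each document that contains it plus the total number of documents examined.

```ruby
# Hypothetical helper (not from any existing tool): formats one report line.
# counts     - occurrences of the element in each document that contains it
# total_docs - number of documents examined (assumed denominator for the %)
def report_line(name, counts, total_docs)
  percent = (100.0 * counts.length / total_docs).round
  if counts.min == counts.max
    # Same cardinality everywhere: "name*count percent%"
    "#{name}*#{counts.first} #{percent}%"
  else
    # Varying cardinality: "name*average min max percent%"
    average = counts.sum.to_f / counts.length
    format("%s*%.1f %d %d %d%%", name, average, counts.min, counts.max, percent)
  end
end

puts report_line("c", [4, 4, 5, 4, 4, 6, 4, 4, 4, 4], 25)
puts report_line("a", [1] * 9, 10)
```

With ten of 25 documents containing between 4 and 6 c elements, the first call yields the `c*4.3 4 6 40%` line from the example.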

I am looking for free software; if it doesn't exist, I'll write it. I was about to start, but thought I should check first. I may not be the first person who has had to analyze and get a structural overview of thousands of XML files.

  • Here's an XSLT 2.0 method.

    Assuming that $docs contains a sequence of document nodes that you want to scan, you want to create one line for each element that appears in the documents. You can use <xsl:for-each-group> to do that:

    <xsl:for-each-group select="$docs//*" group-by="name()">
      <xsl:sort select="current-grouping-key()" />
      <xsl:variable name="name" as="xs:string" select="current-grouping-key()" />
      <xsl:value-of select="$name" />
      ...
    </xsl:for-each-group>
    

    Then you want to find out the stats for that element amongst the documents. First, find the documents that have an element of that name in them:

    <xsl:variable name="docs-with" as="document-node()+"
      select="$docs[//*[name() = $name]]" />
    

    Second, you need a sequence of the number of elements of that name in each of the documents:

    <xsl:variable name="elem-counts" as="xs:integer+"
      select="$docs-with/count(//*[name() = $name])" />
    

    And now you can do the calculations. Average, minimum and maximum can be calculated with the avg(), min() and max() functions. The percentage is simply the number of documents that contain the element divided by the total number of documents, formatted.

    Putting that together:

    <xsl:for-each-group select="$docs//*" group-by="name()">
      <xsl:sort select="current-grouping-key()" />
      <xsl:variable name="name" as="xs:string" select="current-grouping-key()" />
      <xsl:variable name="docs-with" as="document-node()+"
        select="$docs[//*[name() = $name]]" />
      <xsl:variable name="elem-counts" as="xs:integer+"
        select="$docs-with/count(//*[name() = $name])" />
      <xsl:value-of select="$name" />
      <xsl:text>* </xsl:text>
      <xsl:value-of select="format-number(avg($elem-counts), '#,##0.0')" />
      <xsl:text> </xsl:text>
      <xsl:value-of select="format-number(min($elem-counts), '#,##0')" />
      <xsl:text> </xsl:text>
      <xsl:value-of select="format-number(max($elem-counts), '#,##0')" />
      <xsl:text> </xsl:text>
      <xsl:value-of select="format-number((count($docs-with) div count($docs)) * 100, '#0')" />
      <xsl:text>%</xsl:text>
      <xsl:text>&#xA;</xsl:text>
    </xsl:for-each-group>
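
    For comparison, here is a rough Ruby sketch of the same flat, per-name statistics using REXML, for readers without an XSLT 2.0 processor to hand. The helper names (`count_names`, `flat_report`) and the sample documents are made up for illustration. Two differences from the stylesheet are assumptions of this sketch: REXML's `name` drops any namespace prefix, and the percentage is rounded with Ruby's `round` rather than `format-number()`.

    ```ruby
    require "rexml/document"

    # Hypothetical helper: tallies element names in one document, root included.
    def count_names(element, counts)
      counts[element.name] += 1   # local name, without any namespace prefix
      element.elements.each { |child| count_names(child, counts) }
      counts
    end

    # Hypothetical helper: one line per element name, sorted alphabetically:
    # "name* average min max percent%"
    def flat_report(xml_strings)
      per_doc = xml_strings.map do |xml|
        count_names(REXML::Document.new(xml).root, Hash.new(0))
      end
      per_doc.flat_map(&:keys).uniq.sort.map do |name|
        # Keep only the documents that actually contain the element.
        counts = per_doc.map { |c| c[name] }.select(&:positive?)
        average = counts.sum.to_f / counts.length
        percent = (100.0 * counts.length / per_doc.length).round
        format("%s* %.1f %d %d %d%%", name, average, counts.min, counts.max, percent)
      end
    end

    sample_docs = [
      "<root><a/><b><c/><c/></b></root>",
      "<root><b><c/><c/><c/><c/></b></root>"
    ]
    puts flat_report(sample_docs)
    ```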
    

    What I haven't done here is indent the lines according to the depth of the element; I've just ordered the elements alphabetically to give you statistics. There are two reasons for that. First, it's significantly harder (too involved to write here) to display the element statistics in a structure that reflects how they appear in the documents, not least because different documents may have different structures. Second, in many markup languages, the precise structure of the documents can't be known in advance (because, for example, sections can nest within sections to any depth).

    I hope it's useful none the less.
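
    If it is acceptable to treat every distinct root-to-element path as its own row, a nested view can be approximated by grouping on the path instead of the name. The Ruby sketch below shows the idea; `path_counts` and `nested_report` are hypothetical names, the percentage is taken over all documents (not over the documents containing the parent), and counts are totals per document rather than per parent instance. Recursive structures simply produce one row per nesting level encountered.

    ```ruby
    require "rexml/document"

    # Hypothetical helper: counts each root-to-element path (e.g. "root/b/c")
    # within a single document.
    def path_counts(element, prefix = "", counts = Hash.new(0))
      path = prefix.empty? ? element.name : "#{prefix}/#{element.name}"
      counts[path] += 1
      element.elements.each { |child| path_counts(child, path, counts) }
      counts
    end

    # Hypothetical helper: one indented line per path, "name min..max percent%".
    def nested_report(xml_strings)
      per_doc = xml_strings.map { |xml| path_counts(REXML::Document.new(xml).root) }
      # Sorting on the split path keeps children directly under their parent.
      paths = per_doc.flat_map(&:keys).uniq.sort_by { |path| path.split("/") }
      paths.map do |path|
        counts = per_doc.map { |c| c[path] }.select(&:positive?)
        percent = (100.0 * counts.length / per_doc.length).round
        indent = "  " * path.count("/")
        "#{indent}#{path.split('/').last} #{counts.min}..#{counts.max} #{percent}%"
      end
    end

    nested_sample = [
      "<root><a/><b><c/><c/></b></root>",
      "<root><b><c/><c/><c/><c/></b></root>"
    ]
    puts nested_report(nested_sample)
    ```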

    UPDATE:

    Need the XSLT wrapper and some instructions for running XSLT? OK. First, get your hands on Saxon 9B.

    You'll need to put all the files you want to analyse in a directory. Saxon lets you access all the files in that directory (or its subdirectories) through collection(), using a special URI syntax. It's worth having a look at that syntax if you want to search recursively or filter the files by filename.

    Now the full XSLT:

    <xsl:stylesheet version="2.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:xs="http://www.w3.org/2001/XMLSchema"
      exclude-result-prefixes="xs">
    
    <xsl:param name="dir" as="xs:string"
      select="'file:///path/to/default/directory?select=*.xml'" />
    
    <xsl:output method="text" />
    
    <xsl:variable name="docs" as="document-node()*"
      select="collection($dir)" />
    
    <xsl:template name="main">
      <xsl:for-each-group select="$docs//*" group-by="name()">
        <xsl:sort select="current-grouping-key()" />
        <xsl:variable name="name" as="xs:string" select="current-grouping-key()" />
        <xsl:variable name="docs-with" as="document-node()+"
          select="$docs[//*[name() = $name]]" />
        <xsl:variable name="elem-counts" as="xs:integer+"
          select="$docs-with/count(//*[name() = $name])" />
        <xsl:value-of select="$name" />
        <xsl:text>* </xsl:text>
        <xsl:value-of select="format-number(avg($elem-counts), '#,##0.0')" />
        <xsl:text> </xsl:text>
        <xsl:value-of select="format-number(min($elem-counts), '#,##0')" />
        <xsl:text> </xsl:text>
        <xsl:value-of select="format-number(max($elem-counts), '#,##0')" />
        <xsl:text> </xsl:text>
        <xsl:value-of select="format-number((count($docs-with) div count($docs)) * 100, '#0')" />
        <xsl:text>%</xsl:text>
        <xsl:text>&#xA;</xsl:text>
      </xsl:for-each-group>
    </xsl:template> 
    
    </xsl:stylesheet>
    

    And to run it you would do something like:

    > java -jar path/to/saxon.jar -it:main -o:report.txt "dir=file:///path/to/your/directory?select=*.xml"
    

    This tells Saxon to start the process with the template named main, to set the dir parameter to file:///path/to/your/directory?select=*.xml and send the output to report.txt.

    David Robbins : Jeni, I'm a big fan of yours, as your site has helped me with a great many challenges with XPath. I'm glad that you are joining us here at StackOverflow.
    David Robbins : Saw your update - now I get it :)
    From JeniT
  • Check out Gadget


  • Beautiful Soup makes parsing XML trivial in python.

    From jeremy
  • [community post, here: no karma involved;) ]
    I propose a code-challenge here:

    Parse all the XML files found at xmlfiles.com/examples and try to come up with the following output:

    Analyzing plant_catalog.xml: 
    Analyzing note.xml: 
    Analyzing portfolio.xml: 
    Analyzing note_ex_dtd.xml: 
    Analyzing home.xml: 
    Analyzing simple.xml: 
    Analyzing cd_catalog.xml: 
    Analyzing portfolio_xsl.xml: 
    Analyzing note_in_dtd.xml: 
    Statistical Elements Analysis of 9 xml documents with 34 elements
    CATALOG*2 22%
      CD*26 50%
        ARTIST*26 100%
        COMPANY*26 100%
        COUNTRY*26 100%
        PRICE*26 100%
        TITLE*26 100%
        YEAR*26 100%
      PLANT*36 50%
        AVAILABILITY*36 100%
        BOTANICAL*36 100%
        COMMON*36 100%
        LIGHT*36 100%
        PRICE*36 100%
        ZONE*36 100%
    breakfast-menu*1 11%
      food*5 100%
        calories*5 100%
        description*5 100%
        name*5 100%
        price*5 100%
    note*3 33%
      body*1 100%
      from*1 100%
      heading*1 100%
      to*1 100%
    page*1 11%
      para*1 100%
      title*1 100%
    portfolio*2 22%
      stock*2 100%
        name*2 100%
        price*2 100%
        symbol*2 100%
    
    tye : This challenge doesn't honor the second part of the original question, tracking how many times a tag appears.
    VonC : I am not sure I follow: the number of occurrences of a tag is right after the '*'. If that element were to appear with a different cardinality in different XML files, it would be displayed as "min...max average" cardinality. Since that is not the case in the xmlfiles.com example files, you have only '*x'.
    From VonC
  • Here is a possible solution in ruby to this code-challenge...
    Since it is my very first ruby program, I am sure it is quite terribly coded, but at least it may answer J. Pablo Fernandez's question.

    Copy-paste it into a '.rb' file and call ruby on it. If you have an Internet connection, it will work ;)

    require "rexml/document"
    require "net/http"
    require "iconv"
    include REXML
    class NodeAnalyzer
      @@fullPathToFilesToSubNodesNamesToCardinalities = Hash.new()
      @@fullPathsToFiles = Hash.new() #list of files in which a fullPath node is detected
      @@fullPaths = Array.new # all fullpaths sorted alphabetically
      attr_reader :name, :father, :subNodesAnalyzers, :indent, :file, :subNodesNamesToCardinalities
      def initialize(aName="", aFather=nil, aFile="")
        @name = aName; @father = aFather; @subNodesAnalyzers = []; @file = aFile
        @subNodesNamesToCardinalities = Hash.new(0)
        if aFather && !aFather.name.empty? then @indent = "  " else @indent = "" end
        if aFather
          @indent = @father.indent + self.indent
          @father.subNodesAnalyzers << self
          @father.updateSubNodesNamesToCardinalities(@name)
        end
      end
      @@nodesRootAnalyzer = NodeAnalyzer.new
      def NodeAnalyzer.nodesRootAnalyzer
        return @@nodesRootAnalyzer
      end
      def updateSubNodesNamesToCardinalities(aSubNodeName)
        aSubNodeCardinality = @subNodesNamesToCardinalities[aSubNodeName]
        @subNodesNamesToCardinalities[aSubNodeName] = aSubNodeCardinality + 1
      end
      def NodeAnalyzer.recordNode(aNodeAnalyzer)
        if aNodeAnalyzer.fullNodePath.empty? == false
          if @@fullPaths.include?(aNodeAnalyzer.fullNodePath) == false then @@fullPaths << aNodeAnalyzer.fullNodePath end
          # record a full path against its XML file (only once for a given XML file)
          someFiles = @@fullPathsToFiles[aNodeAnalyzer.fullNodePath]
          if someFiles == nil 
            someFiles = Array.new(); @@fullPathsToFiles[aNodeAnalyzer.fullNodePath] = someFiles; 
          end
          if !someFiles.include?(aNodeAnalyzer.file) then someFiles << aNodeAnalyzer.file end
        end
        # record cardinalities of sub nodes for a given XML file
        someFilesToSubNodesNamesToCardinalities = @@fullPathToFilesToSubNodesNamesToCardinalities[aNodeAnalyzer.fullNodePath]
        if someFilesToSubNodesNamesToCardinalities == nil 
          someFilesToSubNodesNamesToCardinalities = Hash.new(); @@fullPathToFilesToSubNodesNamesToCardinalities[aNodeAnalyzer.fullNodePath] = someFilesToSubNodesNamesToCardinalities ; 
        end
        someSubNodesNamesToCardinalities = someFilesToSubNodesNamesToCardinalities[aNodeAnalyzer.file]
        if someSubNodesNamesToCardinalities == nil
          someSubNodesNamesToCardinalities = Hash.new(0); someFilesToSubNodesNamesToCardinalities[aNodeAnalyzer.file] = someSubNodesNamesToCardinalities
          someSubNodesNamesToCardinalities.update(aNodeAnalyzer.subNodesNamesToCardinalities)
        else
          aNodeAnalyzer.subNodesNamesToCardinalities.each() do |aSubNodeName, aCardinality|
            someSubNodesNamesToCardinalities[aSubNodeName] = someSubNodesNamesToCardinalities[aSubNodeName] + aCardinality
          end
        end  
        #puts "someSubNodesNamesToCardinalities for #{aNodeAnalyzer.fullNodePath}: #{someSubNodesNamesToCardinalities}"
      end
      def file
        #if @file.empty? then @father.file else return @file end
        if @file.empty? then if @father != nil then return @father.file else return '' end else return @file end
      end
      def fullNodePath
        if @father == nil then return '' elsif @father.name.empty? then return @name else return @father.fullNodePath+"/"+@name end
      end
      def to_s
        s = ""
        if @name.empty? == false
          s = "#{@indent}#{self.fullNodePath} [#{self.file}]\n"
        end
        @subNodesAnalyzers.each() do |aSubNodeAnalyzer|
          s = s + aSubNodeAnalyzer.to_s
        end
        return s
      end
      def NodeAnalyzer.displayStats(aFullPath="")
        s = "";
        if aFullPath.empty? then s = "Statistical Elements Analysis of #{@@nodesRootAnalyzer.subNodesAnalyzers.length} xml documents with #{@@fullPaths.length} elements\n" end
        someFullPaths = @@fullPaths.sort
        someFullPaths.each do |aFullPath|
          s = s + getIndentedNameFromFullPath(aFullPath) + "*"
          nbFilesWithThatFullPath = getNbFilesWithThatFullPath(aFullPath);
          aParentFullPath = getParentFullPath(aFullPath)
          nbFilesWithParentFullPath = getNbFilesWithThatFullPath(aParentFullPath);
          aNameFromFullPath = getNameFromFullPath(aFullPath)
          someFilesToSubNodesNamesToCardinalities = @@fullPathToFilesToSubNodesNamesToCardinalities[aParentFullPath]
          someCardinalities = Array.new()
          someFilesToSubNodesNamesToCardinalities.each() do |aFile, someSubNodesNamesToCardinalities|
            aCardinality = someSubNodesNamesToCardinalities[aNameFromFullPath]
            if aCardinality > 0 && someCardinalities.include?(aCardinality) == false then someCardinalities << aCardinality end
          end
          if someCardinalities.length == 1
            # single cardinality: print the value itself (Array#to_s prints brackets on Ruby 1.9+)
            s = s + someCardinalities.first.to_s + " "
          else
            anAvg = someCardinalities.inject(0) {|sum,value| Float(sum) + Float(value) } / Float(someCardinalities.length)
            s = s + sprintf('%.1f', anAvg) + " " + someCardinalities.min.to_s + "..." + someCardinalities.max.to_s + " "
          end
          s = s + sprintf('%d', Float(nbFilesWithThatFullPath) / Float(nbFilesWithParentFullPath) * 100) + '%'
          s = s + "\n"
        end
        return s
      end
      def NodeAnalyzer.getNameFromFullPath(aFullPath)
        if aFullPath.include?("/") == false then return aFullPath end
        aNameFromFullPath = aFullPath.dup
        aNameFromFullPath[/^(?:[^\/]+\/)+/] = ""
        return aNameFromFullPath
      end
      def NodeAnalyzer.getIndentedNameFromFullPath(aFullPath)
        if aFullPath.include?("/") == false then return aFullPath end
        anIndentedNameFromFullPath = aFullPath.dup
        anIndentedNameFromFullPath = anIndentedNameFromFullPath.gsub(/[^\/]+\//, "  ")
        return anIndentedNameFromFullPath
      end
      def NodeAnalyzer.getParentFullPath(aFullPath)
        if aFullPath.include?("/") == false then return "" end
        aParentFullPath = aFullPath.dup
        aParentFullPath[/\/[^\/]+$/] = ""
        return aParentFullPath
      end
      def NodeAnalyzer.getNbFilesWithThatFullPath(aFullPath)
        if aFullPath.empty? 
          return @@nodesRootAnalyzer.subNodesAnalyzers.length
        else
          return @@fullPathsToFiles[aFullPath].length;
        end
      end
    end
    class REXML::Document
      def analyze(node, aFatherNodeAnalyzer, aFile="")
        anNodeAnalyzer = NodeAnalyzer.new(node.name, aFatherNodeAnalyzer, aFile)
        node.elements.each() do |aSubNode| analyze(aSubNode, anNodeAnalyzer) end
        NodeAnalyzer.recordNode(anNodeAnalyzer)
      end
    end
    
    begin
      anXmlFilesDirectory = "xmlfiles.com/examples/"
      anXmlFilesRegExp = Regexp.new("http:\/\/" + anXmlFilesDirectory + "([^\"]*)")
      a = Net::HTTP.get(URI("http://www.google.fr/search?q=site:"+anXmlFilesDirectory+"+filetype:xml&num=100&as_qdr=all&filter=0"))
      someXmlFiles = a.scan(anXmlFilesRegExp)
      someXmlFiles.each() do |anXmlFile|
        anXmlFileContent = Net::HTTP.get(URI("http://" + anXmlFilesDirectory + anXmlFile.to_s))
        anUTF8XmlFileContent = Iconv.conv("ISO-8859-1//ignore", 'UTF-8', anXmlFileContent).gsub(/\s+encoding\s*=\s*\"[^\"]+\"\s*\?/,"?")
        anXmlDocument = Document.new(anUTF8XmlFileContent)
        puts "Analyzing #{anXmlFile}: #{NodeAnalyzer.nodesRootAnalyzer.name}"
        anXmlDocument.analyze(anXmlDocument.root,NodeAnalyzer.nodesRootAnalyzer, anXmlFile.to_s)
      end
      NodeAnalyzer.recordNode(NodeAnalyzer.nodesRootAnalyzer)
      puts NodeAnalyzer.displayStats
    end
    
    JesperE : Ouch. If you just used a little shorter identifiers, the code might even be readable.
    From VonC
  • Go with JeniT's answer - she's one of the first XSLT gurus I started learning from back in '02. To really appreciate the power of XML you should work with XPath and XSLT and learn to manipulate the nodes.

    VonC : I agree, JeniT's answer is the right track... but it is not a full script that I can just launch and check if it works. Feel free to write a more complete xslt script to answer the challenge ;)
    David Robbins : Jeni assumes that you concatenate the docs before you run the script/process. Wouldn't that suffice? You could concat the files with file streams, load up the XML/XSLT namespace stuff and execute.
    JeniT : Actually, I don't assume that the docs are concatenated, but that they're in an XPath sequence of document nodes, which is most easily generated using `collection()`. I've updated my answer above to give the full stylesheet.
