I have a lot of XML files and I'd like to generate a report from them. The report should provide information such as:
root 100%
  a*1 90%
  b*1 80%
    c*5 40%
meaning that all documents have a root element, 90% have one a element in the root, 80% have one b element in the root, 40% have 5 c elements in b.
If for example some documents have 4 c elements, some 5 and some 6, it should say something like:
c*4.3 4 6 40%
meaning that 40% have between 4 and 6 c elements there, and the average is 4.3.
I am looking for free software; if it doesn't exist, I'll write it. I was about to do so, but I thought I should check first. I may not be the first one who has to analyze thousands of XML files and get a structural overview of them.
-
Here's an XSLT 2.0 method.
Assuming that $docs contains a sequence of document nodes that you want to scan, you want to create one line for each element that appears in the documents. You can use <xsl:for-each-group> to do that:

    <xsl:for-each-group select="$docs//*" group-by="name()">
      <xsl:sort select="current-grouping-key()" />
      <xsl:variable name="name" as="xs:string" select="current-grouping-key()" />
      <xsl:value-of select="$name" />
      ...
    </xsl:for-each-group>
Then you want to find out the stats for that element amongst the documents. First, find the documents that have an element of that name in them:

    <xsl:variable name="docs-with" as="document-node()+" select="$docs[//*[name() = $name]]" />
Second, you need a sequence of the number of elements of that name in each of the documents:
<xsl:variable name="elem-counts" as="xs:integer+" select="$docs-with/count(//*[name() = $name])" />
And now you can do the calculations. Average, minimum and maximum can be calculated with the avg(), min() and max() functions. The percentage is simply the number of documents that contain the element divided by the total number of documents, formatted.

Putting that together:
    <xsl:for-each-group select="$docs//*" group-by="name()">
      <xsl:sort select="current-grouping-key()" />
      <xsl:variable name="name" as="xs:string" select="current-grouping-key()" />
      <!-- documents that contain at least one element with this name -->
      <xsl:variable name="docs-with" as="document-node()+" select="$docs[//*[name() = $name]]" />
      <!-- number of such elements in each of those documents -->
      <xsl:variable name="elem-counts" as="xs:integer+" select="$docs-with/count(//*[name() = $name])" />
      <xsl:value-of select="$name" />
      <xsl:text>* </xsl:text>
      <xsl:value-of select="format-number(avg($elem-counts), '#,##0.0')" />
      <xsl:text> </xsl:text>
      <xsl:value-of select="format-number(min($elem-counts), '#,##0')" />
      <xsl:text> </xsl:text>
      <xsl:value-of select="format-number(max($elem-counts), '#,##0')" />
      <xsl:text> </xsl:text>
      <xsl:value-of select="format-number((count($docs-with) div count($docs)) * 100, '#0')" />
      <xsl:text>%</xsl:text>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each-group>
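For readers who want to check the arithmetic outside XSLT, here is a minimal Python sketch of the same per-name statistics. It is not part of the original answer: the function name `element_stats` and the three sample documents are made up for illustration, and it assumes documents without namespaces (ElementTree reports namespaced tags as `{uri}local`).

```python
import xml.etree.ElementTree as ET
from collections import Counter


def element_stats(xml_strings):
    """Per element name: (average, minimum, maximum, percent of documents).

    Hypothetical helper mirroring the $docs-with / $elem-counts logic above.
    """
    # One Counter per document: element name -> occurrences in that document.
    per_doc = [Counter(el.tag for el in ET.fromstring(s).iter())
               for s in xml_strings]
    stats = {}
    for name in sorted({n for c in per_doc for n in c}):
        # Counts only for the documents that contain the element at all.
        counts = [c[name] for c in per_doc if name in c]
        stats[name] = (
            sum(counts) / len(counts),
            min(counts),
            max(counts),
            100 * len(counts) / len(per_doc),
        )
    return stats


# Three made-up documents, used only for illustration.
docs = [
    "<root><a/><b><c/><c/></b></root>",
    "<root><b><c/><c/><c/><c/></b></root>",
    "<root><a/></root>",
]

for name, (avg, lo, hi, pct) in element_stats(docs).items():
    print(f"{name}* {avg:.1f} {lo} {hi} {pct:.0f}%")
```

As in the stylesheet, the average, minimum and maximum are taken only over the documents that contain the element, while the percentage is relative to all documents.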
What I haven't done here is indent the lines according to the depth of the element. I've just ordered the elements alphabetically to give you statistics. Two reasons for that: first, it's significantly harder (too involved to write here) to display the element statistics in some kind of structure that reflects how they appear in the documents, not least because different documents may have different structures. Second, in many markup languages, the precise structure of the documents can't be known (because, for example, sections can nest within sections to any depth).
I hope it's useful nonetheless.
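For comparison, one way to get an indented report is to group by the full path from the root rather than by element name. The Python sketch below is only an illustration of that idea, not the answer's method. The names `path_counts` and `structure_report` and the sample documents are hypothetical. It counts elements per path per document (not per parent instance), takes percentages relative to all documents, and assumes no namespaces.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def path_counts(xml_string):
    """Count the elements found at each root-to-element path in one document."""
    counts = Counter()

    def walk(element, parent_path):
        path = parent_path + (element.tag,)
        counts[path] += 1
        for child in element:
            walk(child, path)

    walk(ET.fromstring(xml_string), ())
    return counts


def structure_report(xml_strings):
    """Return one indented line per path, in the question's report format."""
    per_doc = [path_counts(s) for s in xml_strings]
    lines = []
    # Sorting tuples keeps every subtree directly under its parent path.
    for path in sorted({p for c in per_doc for p in c}):
        counts = [c[path] for c in per_doc if path in c]
        pct = 100 * len(counts) / len(per_doc)
        if min(counts) == max(counts):
            cardinality = str(counts[0])
        else:
            avg = sum(counts) / len(counts)
            cardinality = f"{avg:.1f} {min(counts)} {max(counts)}"
        indent = "  " * (len(path) - 1)
        lines.append(f"{indent}{path[-1]}*{cardinality} {pct:.0f}%")
    return lines


# Four made-up documents, used only for illustration.
docs = [
    "<root><a/><b><c/><c/><c/><c/></b></root>",
    "<root><a/><b><c/><c/><c/><c/><c/><c/></b></root>",
    "<root><a/></root>",
    "<root/>",
]

print("\n".join(structure_report(docs)))
```

Recursive structures (sections nested in sections to any depth) simply produce one line per distinct path here, so the report can grow long for deeply nested documents.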
UPDATE:
Need the XSLT wrapper and some instructions for running XSLT? OK. First, get your hands on Saxon 9B.
You'll need to put all the files you want to analyse in a directory. Saxon allows you to access all the files in that directory (or its subdirectories) as a collection, using a special URI syntax. It's worth having a look at that syntax if you want to search recursively or filter the files that you're looking at by their filename.
Now the full XSLT:
    <xsl:stylesheet version="2.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:xs="http://www.w3.org/2001/XMLSchema"
      exclude-result-prefixes="xs">

      <xsl:param name="dir" as="xs:string" select="'file:///path/to/default/directory?select=*.xml'" />

      <xsl:output method="text" />

      <xsl:variable name="docs" as="document-node()*" select="collection($dir)" />

      <xsl:template name="main">
        <xsl:for-each-group select="$docs//*" group-by="name()">
          <xsl:sort select="current-grouping-key()" />
          <xsl:variable name="name" as="xs:string" select="current-grouping-key()" />
          <xsl:variable name="docs-with" as="document-node()+" select="$docs[//*[name() = $name]]" />
          <xsl:variable name="elem-counts" as="xs:integer+" select="$docs-with/count(//*[name() = $name])" />
          <xsl:value-of select="$name" />
          <xsl:text>* </xsl:text>
          <xsl:value-of select="format-number(avg($elem-counts), '#,##0.0')" />
          <xsl:text> </xsl:text>
          <xsl:value-of select="format-number(min($elem-counts), '#,##0')" />
          <xsl:text> </xsl:text>
          <xsl:value-of select="format-number(max($elem-counts), '#,##0')" />
          <xsl:text> </xsl:text>
          <xsl:value-of select="format-number((count($docs-with) div count($docs)) * 100, '#0')" />
          <xsl:text>%</xsl:text>
          <xsl:text>&#10;</xsl:text>
        </xsl:for-each-group>
      </xsl:template>

    </xsl:stylesheet>
And to run it you would do something like:
> java -jar path/to/saxon.jar -it:main -o:report.txt dir=file:///path/to/your/directory?select=*.xml
This tells Saxon to start the process with the template named main, to set the dir parameter to file:///path/to/your/directory?select=*.xml and send the output to report.txt.
David Robbins : Jeni, I'm a big fan of yours, as your site has helped me with a great many challenges with XPath. I'm glad that you are joining us here at StackOverflow.
David Robbins : Saw your update - now I get it :)
From JeniT -
Check out Gadget
From Mads Hansen -
Beautiful Soup makes parsing XML trivial in Python.
From jeremy -
[community post, here: no karma involved;) ]
I propose a code challenge here: parse all the XML files found in xmlfiles.com/examples and try to come up with the following output:
    Analyzing plant_catalog.xml:
    Analyzing note.xml:
    Analyzing portfolio.xml:
    Analyzing note_ex_dtd.xml:
    Analyzing home.xml:
    Analyzing simple.xml:
    Analyzing cd_catalog.xml:
    Analyzing portfolio_xsl.xml:
    Analyzing note_in_dtd.xml:
    Statistical Elements Analysis of 9 xml documents with 34 elements
    CATALOG*2 22%
      CD*26 50%
        ARTIST*26 100%
        COMPANY*26 100%
        COUNTRY*26 100%
        PRICE*26 100%
        TITLE*26 100%
        YEAR*26 100%
      PLANT*36 50%
        AVAILABILITY*36 100%
        BOTANICAL*36 100%
        COMMON*36 100%
        LIGHT*36 100%
        PRICE*36 100%
        ZONE*36 100%
    breakfast-menu*1 11%
      food*5 100%
        calories*5 100%
        description*5 100%
        name*5 100%
        price*5 100%
    note*3 33%
      body*1 100%
      from*1 100%
      heading*1 100%
      to*1 100%
    page*1 11%
      para*1 100%
      title*1 100%
    portfolio*2 22%
      stock*2 100%
        name*2 100%
        price*2 100%
        symbol*2 100%
tye : This challenge doesn't honor the second part of the original question, tracking how many times a tag appears.
VonC : I am not sure I follow: the number of occurrences of a tag is right after the '*'. If that element were to appear with a different cardinality in different XML files, it would be displayed as a "min...max average" cardinality. Since that is not the case in the xmlfiles.com example files, you have only '*x'.
From VonC -
Here is a possible solution in Ruby to this code challenge...
Since it is my very first Ruby program, I am sure it is quite terribly coded, but at least it may answer J. Pablo Fernandez's question. Copy-paste it into a '.rb' file and call ruby on it. If you have an Internet connection, it will work ;)
    require "rexml/document"
    require "net/http"
    require "iconv"
    include REXML

    class NodeAnalyzer
      @@fullPathToFilesToSubNodesNamesToCardinalities = Hash.new()
      @@fullPathsToFiles = Hash.new() #list of files in which a fullPath node is detected
      @@fullPaths = Array.new # all fullpaths sorted alphabetically
      attr_reader :name, :father, :subNodesAnalyzers, :indent, :file, :subNodesNamesToCardinalities

      def initialize(aName="", aFather=nil, aFile="")
        @name = aName; @father = aFather; @subNodesAnalyzers = []; @file = aFile
        @subNodesNamesToCardinalities = Hash.new(0)
        if aFather && !aFather.name.empty? then @indent = " " else @indent = "" end
        if aFather
          @indent = @father.indent + self.indent
          @father.subNodesAnalyzers << self
          @father.updateSubNodesNamesToCardinalities(@name)
        end
      end

      @@nodesRootAnalyzer = NodeAnalyzer.new

      def NodeAnalyzer.nodesRootAnalyzer
        return @@nodesRootAnalyzer
      end

      def updateSubNodesNamesToCardinalities(aSubNodeName)
        aSubNodeCardinality = @subNodesNamesToCardinalities[aSubNodeName]
        @subNodesNamesToCardinalities[aSubNodeName] = aSubNodeCardinality + 1
      end

      def NodeAnalyzer.recordNode(aNodeAnalyzer)
        if aNodeAnalyzer.fullNodePath.empty? == false
          if @@fullPaths.include?(aNodeAnalyzer.fullNodePath) == false then @@fullPaths << aNodeAnalyzer.fullNodePath end
          # record a full path in regard to its xml file (records it only once for a given xml file)
          someFiles = @@fullPathsToFiles[aNodeAnalyzer.fullNodePath]
          if someFiles == nil
            someFiles = Array.new(); @@fullPathsToFiles[aNodeAnalyzer.fullNodePath] = someFiles;
          end
          if !someFiles.include?(aNodeAnalyzer.file) then someFiles << aNodeAnalyzer.file end
        end
        #record cardinalities of sub nodes for a given xml file
        someFilesToSubNodesNamesToCardinalities = @@fullPathToFilesToSubNodesNamesToCardinalities[aNodeAnalyzer.fullNodePath]
        if someFilesToSubNodesNamesToCardinalities == nil
          someFilesToSubNodesNamesToCardinalities = Hash.new(); @@fullPathToFilesToSubNodesNamesToCardinalities[aNodeAnalyzer.fullNodePath] = someFilesToSubNodesNamesToCardinalities ;
        end
        someSubNodesNamesToCardinalities = someFilesToSubNodesNamesToCardinalities[aNodeAnalyzer.file]
        if someSubNodesNamesToCardinalities == nil
          someSubNodesNamesToCardinalities = Hash.new(0); someFilesToSubNodesNamesToCardinalities[aNodeAnalyzer.file] = someSubNodesNamesToCardinalities
          someSubNodesNamesToCardinalities.update(aNodeAnalyzer.subNodesNamesToCardinalities)
        else
          aNodeAnalyzer.subNodesNamesToCardinalities.each() do |aSubNodeName, aCardinality|
            someSubNodesNamesToCardinalities[aSubNodeName] = someSubNodesNamesToCardinalities[aSubNodeName] + aCardinality
          end
        end
        #puts "someSubNodesNamesToCardinalities for #{aNodeAnalyzer.fullNodePath}: #{someSubNodesNamesToCardinalities}"
      end

      def file
        #if @file.empty? then @father.file else return @file end
        if @file.empty? then
          if @father != nil then return @father.file else return '' end
        else
          return @file
        end
      end

      def fullNodePath
        if @father == nil then return ''
        elsif @father.name.empty? then return @name
        else return @father.fullNodePath+"/"+@name
        end
      end

      def to_s
        s = ""
        if @name.empty? == false
          s = "#{@indent}#{self.fullNodePath} [#{self.file}]\n"
        end
        @subNodesAnalyzers.each() do |aSubNodeAnalyzer|
          s = s + aSubNodeAnalyzer.to_s
        end
        return s
      end

      def NodeAnalyzer.displayStats(aFullPath="")
        s = "";
        if aFullPath.empty? then
          s = "Statistical Elements Analysis of #{@@nodesRootAnalyzer.subNodesAnalyzers.length} xml documents with #{@@fullPaths.length} elements\n"
        end
        someFullPaths = @@fullPaths.sort
        someFullPaths.each do |aFullPath|
          s = s + getIndentedNameFromFullPath(aFullPath) + "*"
          nbFilesWithThatFullPath = getNbFilesWithThatFullPath(aFullPath);
          aParentFullPath = getParentFullPath(aFullPath)
          nbFilesWithParentFullPath = getNbFilesWithThatFullPath(aParentFullPath);
          aNameFromFullPath = getNameFromFullPath(aFullPath)
          someFilesToSubNodesNamesToCardinalities = @@fullPathToFilesToSubNodesNamesToCardinalities[aParentFullPath]
          someCardinalities = Array.new()
          someFilesToSubNodesNamesToCardinalities.each() do |aFile, someSubNodesNamesToCardinalities|
            aCardinality = someSubNodesNamesToCardinalities[aNameFromFullPath]
            if aCardinality > 0 && someCardinalities.include?(aCardinality) == false then someCardinalities << aCardinality end
          end
          if someCardinalities.length == 1
            s = s + someCardinalities.to_s + " "
          else
            anAvg = someCardinalities.inject(0) {|sum,value| Float(sum) + Float(value) } / Float(someCardinalities.length)
            s = s + sprintf('%.1f', anAvg) + " " + someCardinalities.min.to_s + "..." + someCardinalities.max.to_s + " "
          end
          s = s + sprintf('%d', Float(nbFilesWithThatFullPath) / Float(nbFilesWithParentFullPath) * 100) + '%'
          s = s + "\n"
        end
        return s
      end

      def NodeAnalyzer.getNameFromFullPath(aFullPath)
        if aFullPath.include?("/") == false then return aFullPath end
        aNameFromFullPath = aFullPath.dup
        aNameFromFullPath[/^(?:[^\/]+\/)+/] = ""
        return aNameFromFullPath
      end

      def NodeAnalyzer.getIndentedNameFromFullPath(aFullPath)
        if aFullPath.include?("/") == false then return aFullPath end
        anIndentedNameFromFullPath = aFullPath.dup
        anIndentedNameFromFullPath = anIndentedNameFromFullPath.gsub(/[^\/]+\//, " ")
        return anIndentedNameFromFullPath
      end

      def NodeAnalyzer.getParentFullPath(aFullPath)
        if aFullPath.include?("/") == false then return "" end
        aParentFullPath = aFullPath.dup
        aParentFullPath[/\/[^\/]+$/] = ""
        return aParentFullPath
      end

      def NodeAnalyzer.getNbFilesWithThatFullPath(aFullPath)
        if aFullPath.empty?
          return @@nodesRootAnalyzer.subNodesAnalyzers.length
        else
          return @@fullPathsToFiles[aFullPath].length;
        end
      end
    end

    class REXML::Document
      def analyze(node, aFatherNodeAnalyzer, aFile="")
        anNodeAnalyzer = NodeAnalyzer.new(node.name, aFatherNodeAnalyzer, aFile)
        node.elements.each() do |aSubNode|
          analyze(aSubNode, anNodeAnalyzer)
        end
        NodeAnalyzer.recordNode(anNodeAnalyzer)
      end
    end

    begin
      anXmlFilesDirectory = "xmlfiles.com/examples/"
      anXmlFilesRegExp = Regexp.new("http:\/\/" + anXmlFilesDirectory + "([^\"]*)")
      a = Net::HTTP.get(URI("http://www.google.fr/search?q=site:"+anXmlFilesDirectory+"+filetype:xml&num=100&as_qdr=all&filter=0"))
      someXmlFiles = a.scan(anXmlFilesRegExp)
      someXmlFiles.each() do |anXmlFile|
        anXmlFileContent = Net::HTTP.get(URI("http://" + anXmlFilesDirectory + anXmlFile.to_s))
        anUTF8XmlFileContent = Iconv.conv("ISO-8859-1//ignore", 'UTF-8', anXmlFileContent).gsub(/\s+encoding\s*=\s*\"[^\"]+\"\s*\?/,"?")
        anXmlDocument = Document.new(anUTF8XmlFileContent)
        puts "Analyzing #{anXmlFile}: #{NodeAnalyzer.nodesRootAnalyzer.name}"
        anXmlDocument.analyze(anXmlDocument.root,NodeAnalyzer.nodesRootAnalyzer, anXmlFile.to_s)
      end
      NodeAnalyzer.recordNode(NodeAnalyzer.nodesRootAnalyzer)
      puts NodeAnalyzer.displayStats
    end
JesperE : Ouch. If you just used a little shorter identifiers, the code might even be readable.
From VonC -
Go with JeniT's answer - she's one of the first XSLT gurus I started learning from back in '02. To really appreciate the power of XML you should work with XPath and XSLT and learn to manipulate the nodes.
VonC : I agree, JeniT's answer is the right track... but it is not a full script that I can just launch and check if it works. Feel free to write a more complete XSLT script to answer the challenge ;)
David Robbins : Jeni assumes that you concatenate the docs before you run the script/process. Wouldn't that suffice? You could concat the files with file streams, load up the XML/XSLT namespace stuff and execute.
JeniT : Actually, I don't assume that the docs are concatenated, but that they're in an XPath sequence of document nodes, which is most easily generated using `collection()`. I've updated my answer above to give the full stylesheet.
From David Robbins