Monday, April 25, 2011

In Haskell how do you extract strings from an XML document?

If I have an XML document like this:

<root>
  <elem name="Greeting">
    Hello
  </elem>
  <elem name="Name">
    Name
  </elem>
</root>

and some Haskell type/data definitions like this:

 type Name = String
 type Value = String
 data LocalizedString = LS Name Value

and I wanted to write a Haskell function with the following signature:

 getLocalizedStrings :: String -> [LocalizedString]

where the first parameter was the XML text, and the returned value was:

 [LS "Greeting" "Hello", LS "Name" "Name"]

how would I do this?

If HaXml is the best tool, how would I use HaXml to achieve the above goal?

Thanks!

From stackoverflow
  • I've never actually bothered to figure out how to extract bits out of XML documents using HaXml; HXT has met all my needs.

    {-# LANGUAGE Arrows #-}
    import Data.Maybe
    import Text.XML.HXT.Arrow  -- in HXT 9 and later, import Text.XML.HXT.Core instead
    
    type Name = String
    type Value = String
    data LocalizedString = LS Name Value
    
    getLocalizedStrings :: String -> Maybe [LocalizedString]
    getLocalizedStrings = (.) listToMaybe . runLA $ xread >>> getRoot
    
    atTag :: ArrowXml a => String -> a XmlTree XmlTree
    atTag tag = deep $ isElem >>> hasName tag
    
    getRoot :: ArrowXml a => a XmlTree [LocalizedString]
    getRoot = atTag "root" >>> listA getElem
    
    getElem :: ArrowXml a => a XmlTree LocalizedString
    getElem = atTag "elem" >>> proc x -> do
        name <- getAttrValue "name" -< x
        value <- getChildren >>> getText -< x
        returnA -< LS name value
    

    You'd probably like a little more error-checking (e.g. don't just lazily use atTag like me; actually verify that <root> is the root, that <elem> is a direct child, etc.), but this works just fine on your example.
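
    A rough sketch of that stricter version, assuming HXT 9 or later (where the combinators are exported from Text.XML.HXT.Core). The names getLocalizedStringsStrict and trim are made up for illustration, and the type synonyms are dropped to keep it short. It only accepts <root> at the top level and <elem> as a direct child of it, and it also strips the whitespace around each text value:

    ```haskell
    import Data.Char (isSpace)
    import Text.XML.HXT.Core

    -- Same shape as the LocalizedString above, without the type synonyms.
    data LocalizedString = LS String String
        deriving (Show, Eq)

    -- Hypothetical helper: strip leading and trailing whitespace only.
    trim :: String -> String
    trim = dropWhile isSpace . reverse . dropWhile isSpace . reverse

    -- Hypothetical stricter variant: no `deep`, so nesting is checked explicitly.
    getLocalizedStringsStrict :: String -> [LocalizedString]
    getLocalizedStringsStrict = runLA (xread >>> rootElem >>> elemChild >>> toLS)
      where
        -- the top-level node must be an element named "root"
        rootElem  = isElem >>> hasName "root"
        -- only direct children named "elem" are considered
        elemChild = getChildren >>> isElem >>> hasName "elem"
        -- all text children of one <elem>, joined into a single string
        textOf    = listA (getChildren >>> getText) >>^ concat
        toLS      = (getAttrValue "name" &&& textOf) >>^ \(n, v) -> LS n (trim v)
    ```

    On the XML from the question this should give the two LS values with "Hello" and "Name" as their text; a document whose top-level element is not <root> gives an empty list.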


    Now, if you need an introduction to Arrows, unfortunately I don't know of a good one. I myself learned them the "thrown into the ocean to learn how to swim" way.

    Something that may be helpful to keep in mind is that the proc/-< syntax is simply sugar for the basic arrow operations (arr, >>>, etc.), just like do/<- is simply sugar for the basic monad operations (return, >>=, etc.). The following are equivalent:

    getAttrValue "name" &&& (getChildren >>> getText) >>^ uncurry LS
    
    proc x -> do
        name <- getAttrValue "name" -< x
        value <- getChildren >>> getText -< x
        returnA -< LS name value
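
    If you want to play with that equivalence without installing HXT, here is a small sketch that uses plain functions, which are arrows too. Entry, entryPointFree and entryProc are invented names standing in for LS and the HXT arrows; only Control.Arrow from base is assumed:

    ```haskell
    {-# LANGUAGE Arrows #-}
    import Control.Arrow (returnA, (&&&), (>>>), (>>^))

    -- Hypothetical stand-in for LS: the first word of a string and its word count.
    data Entry = Entry String Int
        deriving (Show, Eq)

    -- Point-free form: run both arrows on the same input, then combine the pair.
    entryPointFree :: String -> Entry
    entryPointFree = takeWhile (/= ' ') &&& (words >>> length) >>^ uncurry Entry

    -- The same computation written with proc/-< sugar.
    entryProc :: String -> Entry
    entryProc = proc s -> do
        w <- takeWhile (/= ' ') -< s
        n <- words >>> length -< s
        returnA -< Entry w n
    ```

    In GHCi, both entryPointFree "hello big world" and entryProc "hello big world" evaluate to Entry "hello" 3.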
    
    Tim Stewart : Thank you very much for a very informative answer!
    Paul Johnson : There is a HXT tutorial at http://www.haskell.org/haskellwiki/HXT, but it is relentlessly point-free, so understanding how this relates to the arrow do-notation (as in the example above) is not easy.
  • FWIW, HXT seems like overkill where a simple TagSoup will do :)

  • Here's my second attempt (after receiving some good input from others) with TagSoup:

    module Xml where
    
    import Data.Char
    import Text.HTML.TagSoup
    
    type SName = String
    type SValue = String
    
    data LocalizedString = LS SName SValue
         deriving Show
    
    getLocalizedStrings :: String -> [LocalizedString]
    getLocalizedStrings = create . filterTags . parseTags
      where 
        -- Tag is parameterised over the string type in current tagsoup releases.
        filterTags :: [Tag String] -> [Tag String]
        filterTags = filter (\x -> isTagOpenName "elem" x || isTagText x)
    
        create :: [Tag String] -> [LocalizedString]
        create (TagOpen "elem" [("name", name)] : TagText text : rest) = 
          LS name (trimWhiteSpace text) : create rest
        create (_:rest) = create rest
        create [] = []               
    
    trimWhiteSpace :: String -> String
    trimWhiteSpace = dropWhile isSpace . reverse . dropWhile isSpace . reverse
    
    main :: IO ()
    main = do
      xml <- readFile "xml.xml"  -- xml.xml contains the XML from the original question.
      print (getLocalizedStrings xml)
    

    The first attempt showcased a naive (and faulty) method for trimming whitespace off of a string.
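
    The first attempt is not reproduced here, so the following is only a guess at the difference, based on the comments below: removeAllWhiteSpace is a hypothetical reconstruction of the faulty behaviour, next to the trimming version used above. It needs nothing beyond base:

    ```haskell
    import Data.Char (isSpace)

    -- Hypothetical reconstruction of the faulty behaviour: drops every
    -- whitespace character, including the ones between words.
    removeAllWhiteSpace :: String -> String
    removeAllWhiteSpace = filter (not . isSpace)

    -- The version used above: strips leading and trailing whitespace only.
    trimWhiteSpace :: String -> String
    trimWhiteSpace = dropWhile isSpace . reverse . dropWhile isSpace . reverse
    ```

    For " a b c ", removeAllWhiteSpace gives "abc" while trimWhiteSpace gives "a b c"; newlines count as whitespace in both, so "\n    Hello\n  " becomes "Hello" either way.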

    ephemient : TagSoup happily accepts malformed input -- which you might actually like :) -- unfortunately IMO this solution is harder to read. Minor nit: I'd have expected something more like `trimWhiteSpace = dropWhile isSpace . reverse . dropWhile isSpace . reverse`; yours is more like `removeAllWhiteSpace`.
    Tim Stewart : Thanks ephemient. I should have had some better sample data. :) I'll have to make sure that isSpace gets rid of newlines because I had some newlines embedded in my XML.
    ephemient : Just try for yourself: type `Data.Char.isSpace '\n'` into GHCi. Yes, newlines are, and have always been, whitespace. My nit wasn't about that, more along the lines of your `trimWhiteSpace " a b c " == "abc"` which is non-intuitive to me. Or maybe I'm strange.
    Tim Stewart : You're absolutely right. I want to keep those internal spaces. Thanks.
  • Use one of the XML packages.

    The most popular are, in order,

    1. haxml
    2. hxt
    3. xml-light
    4. hexpat
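
    As one possible starting point, here is a sketch with xml-light, assuming the xml package from Hackage (its modules live under Text.XML.Light). The trim helper is made up for illustration and the type synonyms from the question are dropped for brevity; unparsable input simply gives an empty list, which may or may not be the error handling you want:

    ```haskell
    import Data.Char (isSpace)
    import Data.Maybe (mapMaybe)
    import Text.XML.Light (Element, findAttr, findChildren, parseXMLDoc, strContent, unqual)

    data LocalizedString = LS String String
        deriving (Show, Eq)

    -- Hypothetical helper: strip leading and trailing whitespace only.
    trim :: String -> String
    trim = dropWhile isSpace . reverse . dropWhile isSpace . reverse

    getLocalizedStrings :: String -> [LocalizedString]
    getLocalizedStrings xml =
        case parseXMLDoc xml of
            Nothing   -> []   -- input could not be parsed
            Just root -> mapMaybe toLS (findChildren (unqual "elem") root)
      where
        -- an <elem> without a name attribute is skipped
        toLS :: Element -> Maybe LocalizedString
        toLS e = do
            name <- findAttr (unqual "name") e
            return (LS name (trim (strContent e)))
    ```

    Note that this sketch looks at the direct <elem> children of whatever the top-level element is; it does not check that the element is actually named root.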
