If I have an XML document like this:
<root>
<elem name="Greeting">
Hello
</elem>
<elem name="Name">
Name
</elem>
</root>
and some Haskell type/data definitions like this:
type Name = String
type Value = String
data LocalizedString = LS Name Value
and I wanted to write a Haskell function with the following signature:
getLocalizedStrings :: String -> [LocalizedString]
where the first parameter was the XML text, and the returned value was:
[LS "Greeting" "Hello", LS "Name" "Name"]
how would I do this?
If HaXml is the best tool, how would I use HaXml to achieve the above goal?
Thanks!
-
I've never actually bothered to figure out how to extract bits out of XML documents using HaXml; HXT has met all my needs.
{-# LANGUAGE Arrows #-}
import Data.Maybe
import Text.XML.HXT.Arrow
type Name = String
type Value = String
data LocalizedString = LS Name Value
getLocalizedStrings :: String -> Maybe [LocalizedString]
getLocalizedStrings = (.) listToMaybe . runLA $ xread >>> getRoot
atTag :: ArrowXml a => String -> a XmlTree XmlTree
atTag tag = deep $ isElem >>> hasName tag
getRoot :: ArrowXml a => a XmlTree [LocalizedString]
getRoot = atTag "root" >>> listA getElem
getElem :: ArrowXml a => a XmlTree LocalizedString
getElem = atTag "elem" >>> proc x -> do
    name <- getAttrValue "name" -< x
    value <- getChildren >>> getText -< x
    returnA -< LS name value
You'd probably like a little more error-checking (i.e. don't just lazily use atTag like me; actually verify that <root> is root, <elem> is direct descendent, etc.) but this works just fine on your example.
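As a rough illustration of the stricter checks mentioned above, here is a minimal sketch that only accepts a top-level <root> element and only looks at <elem> elements that are its direct children. It assumes a recent hxt release, where the combinators are imported from Text.XML.HXT.Core rather than Text.XML.HXT.Arrow; the helper names directElem and strictRoot are made up for this example, and it uses the combinator style instead of proc notation.

```haskell
import Control.Arrow
import Data.Maybe (listToMaybe)
import Text.XML.HXT.Core  -- assumption: hxt 9.x module layout

data LocalizedString = LS String String deriving (Eq, Show)

-- Hypothetical helper: matches only <elem> elements that are direct
-- children of the current node and that carry a name attribute.
directElem :: ArrowXml a => a XmlTree LocalizedString
directElem =
    getChildren >>> isElem >>> hasName "elem" >>> hasAttr "name"
    >>> (getAttrValue "name" &&& (getChildren >>> getText))
    >>^ uncurry LS

-- Hypothetical helper: succeeds only when the node itself is a <root>
-- element; there is no deep search as with atTag.
strictRoot :: ArrowXml a => a XmlTree [LocalizedString]
strictRoot = isElem >>> hasName "root" >>> listA directElem

-- Nothing when the top-level element is not <root>.
getLocalizedStrings :: String -> Maybe [LocalizedString]
getLocalizedStrings = listToMaybe . runLA (xread >>> strictRoot)

main :: IO ()
main = print (getLocalizedStrings "<root><elem name=\"Greeting\">Hello</elem></root>")
```

The text is returned as it appears in the document, so with the question's input the values would most likely still carry their surrounding newlines and need trimming.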
Now, if you need an introduction to Arrows, unfortunately I don't know of any good one. I myself learned it the "thrown into the ocean to learn how to swim" way.
Something that may be helpful to keep in mind is that the proc/-< syntax is simply sugar for the basic arrow operations (arr, >>>, etc.), just like do/<- is simply sugar for the basic monad operations (return, >>=, etc.). The following are equivalent:
getAttrValue "name" &&& (getChildren >>> getText) >>^ uncurry LS

proc x -> do
    name <- getAttrValue "name" -< x
    value <- getChildren >>> getText -< x
    returnA -< LS name value
Tim Stewart : Thank you very much for a very informative answer!
Paul Johnson : There is a HXT tutorial at http://www.haskell.org/haskellwiki/HXT, but it is relentlessly point-free, so understanding how this relates to the arrow do-notation (as in the example above) is not easy.
-
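A footnote to the proc/-< discussion in the HXT answer: the same equivalence can be tried without any XML library, because ordinary functions are arrows too. The sketch below is only an illustration; the Summary type and both function names are invented for it and do not come from the answer.

```haskell
{-# LANGUAGE Arrows #-}
import Control.Arrow

-- Hypothetical stand-in for LocalizedString, so no XML library is needed.
data Summary = Summary String Int deriving (Eq, Show)

-- Combinator form: feed the input to both arrows, then combine the pair.
summaryCombinators :: String -> Summary
summaryCombinators = (reverse &&& length) >>^ uncurry Summary

-- proc form: the same data flow, with named intermediate values.
summaryProc :: String -> Summary
summaryProc = proc s -> do
    r <- reverse -< s
    n <- length  -< s
    returnA -< Summary r n

main :: IO ()
main = do
    print (summaryCombinators "abc")
    print (summaryProc "abc")
```

Both definitions give the same result for every input; the proc version simply names the values that &&& and >>^ pass along implicitly.
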
FWIW, HXT seems like overkill where a simple TagSoup will do :)
-
Here's my second attempt (after receiving some good input from others) with TagSoup:
module Xml where
import Data.Char
import Text.HTML.TagSoup
type SName = String
type SValue = String
data LocalizedString = LS SName SValue deriving Show
getLocalizedStrings :: String -> [LocalizedString]
getLocalizedStrings = create . filterTags . parseTags
  where
    filterTags :: [Tag] -> [Tag]
    filterTags = filter (\x -> isTagOpenName "elem" x || isTagText x)
    create :: [Tag] -> [LocalizedString]
    create (TagOpen "elem" [("name", name)] : TagText text : rest) =
        LS name (trimWhiteSpace text) : create rest
    create (_:rest) = create rest
    create [] = []
    trimWhiteSpace :: String -> String
    trimWhiteSpace = dropWhile isSpace . reverse . dropWhile isSpace . reverse
main = do
    xml <- readFile "xml.xml" -- xml.xml contains the xml in the original question.
    putStrLn . show . getLocalizedStrings $ xml
The first attempt showcased a naive (and faulty) method for trimming whitespace off of a string.
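To make the trimming behaviour concrete, here is a minimal sketch that contrasts trimming at both ends with stripping every whitespace character. The names trimEnds and removeAllWhiteSpace are chosen for illustration, and dropWhileEnd from Data.List is used as a shorter way to get the same result as the double reverse above.

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd)

-- Removes whitespace at the start and end only; inner spaces survive.
trimEnds :: String -> String
trimEnds = dropWhileEnd isSpace . dropWhile isSpace

-- Removes every whitespace character, including the ones between words.
removeAllWhiteSpace :: String -> String
removeAllWhiteSpace = filter (not . isSpace)

main :: IO ()
main = do
    print (trimEnds " a b c ")            -- "a b c"
    print (removeAllWhiteSpace " a b c ") -- "abc"
    print (trimEnds "\nHello\n")          -- "Hello" (newlines count as whitespace)
```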
ephemient : TagSoup happily accepts malformed input -- which you might actually like :) -- unfortunately IMO this solution is harder to read. Minor nit: I'd have expected something more like `trimWhiteSpace = dropWhile isSpace . reverse . dropWhile isSpace . reverse`; yours is more like `removeAllWhiteSpace`.
Tim Stewart : Thanks ephemient. I should have had some better sample data. :) I'll have to make sure that isSpace gets rid of newlines because I had some newlines embedded in my XML.
ephemient : Just try for yourself: type `Data.Char.isSpace '\n'` into GHCi. Yes, newlines are, and have always been, whitespace. My nit wasn't about that, more along the lines of your `trimWhiteSpace " a b c " == "abc"` which is non-intuitive to me. Or maybe I'm strange.
Tim Stewart : You're absolutely right. I want to keep those internal spaces. Thanks.
-
Use one of the XML packages.
The most popular are, in order,
- haxml
- hxt
- xml-light
- hexpat
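As one possible starting point with a package from this list, here is a minimal sketch of the question's function using xml-light. It is an illustration, not a recommendation of one package over the others; the trimEnds helper is invented for the example, and returning an empty list on a parse failure or an unexpected root element is an assumption about what the caller wants.

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd)
import Text.XML.Light

data LocalizedString = LS String String deriving (Eq, Show)

-- Hypothetical helper: strips whitespace at both ends of the element text.
trimEnds :: String -> String
trimEnds = dropWhileEnd isSpace . dropWhile isSpace

getLocalizedStrings :: String -> [LocalizedString]
getLocalizedStrings src =
    case parseXMLDoc src of
        -- Only a document whose top-level element is <root> is accepted.
        Just root | qName (elName root) == "root" ->
            [ LS name (trimEnds (strContent e))
            | e <- findChildren (unqual "elem") root      -- direct children only
            , Just name <- [findAttr (unqual "name") e]   -- skip elems without a name
            ]
        _ -> []

main :: IO ()
main = print (getLocalizedStrings
    "<root>\n<elem name=\"Greeting\">\nHello\n</elem>\n<elem name=\"Name\">\nName\n</elem>\n</root>")
```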