This topic contains 6 replies, has 2 voices, and was last updated by  Eliot Muir 3 years, 5 months ago.

xml.parse() reads incomplete (thus, malformed) XML

  • Good morning,

    we have stumbled upon a behaviour of Iguana which is most probably “works as designed”… Check this out:

       local content = "<hallo><huhu hoho='lala'>456</huhu><ha"
       local xmltree = xml.parse({data = content})

    The XML is incomplete, but xml.parse() reads it nonetheless. It ignores a started tag at the end, and it implicitly adds missing close tags.

    This makes XML parsing more robust (Postel’s law rears its ugly head), but is there a way in the API to check an XML document for well-formedness? I could search the string for the closing root tag; is this sufficient?

    Interesting – I have raised ticket #24788 about it.

    Hey Robin,

    Just wanted to give you a little feedback on this. Jonathan had a look into it and found the issue. Under the hood we use the expat XML parser, which is an event-oriented XML parser – à la SAX.

    The <ha is just seen as trailing character data.

    We get a stream of events, and it’s possible for us to maintain a count of start tags encountered and corresponding end tags, so we can detect the problem. We’ve just been debating a little internally as to what the best fix is.
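Until such a check exists in xml.parse itself, the start/end tag bookkeeping described above can be approximated in plain Lua before the string is handed to the parser. The following is only a heuristic sketch, not how Iguana or expat does it: `check_balanced` is a hypothetical helper name, and it assumes that no attribute value contains a literal `>` and that any DOCTYPE has no internal subset.

```lua
-- Markup that is skipped wholesale: comments, CDATA sections, processing instructions.
local SKIP = { {"<!--", "-->"}, {"<![CDATA[", "]]>"}, {"<?", "?>"} }

-- Heuristic well-formedness check (hypothetical helper, not part of Iguana).
-- Returns true, or false plus a short reason.
local function check_balanced(data)
  local stack = {}    -- names of the elements that are currently open
  local elements = 0  -- number of element start tags seen so far
  local pos = 1
  while true do
    local lt = data:find("<", pos, true)
    if not lt then break end
    local skipped = false
    for _, pair in ipairs(SKIP) do
      local open, close = pair[1], pair[2]
      if data:sub(lt, lt + #open - 1) == open then
        local _, e = data:find(close, lt + #open, true)
        if not e then return false, "unterminated " .. open .. " section" end
        pos = e + 1
        skipped = true
        break
      end
    end
    if not skipped then
      local gt = data:find(">", lt + 1, true)
      if not gt then return false, "unterminated tag at offset " .. lt end
      local tag = data:sub(lt + 1, gt - 1)
      pos = gt + 1
      if tag:sub(1, 1) == "!" then
        -- declaration such as <!DOCTYPE ...>: ignored by this sketch
      elseif tag:sub(1, 1) == "/" then
        -- end tag: must match the most recently opened element
        local name = tag:match("^/([^%s]+)%s*$")
        local top = table.remove(stack)
        if not name or name ~= top then
          return false, "mismatched closing tag </" .. tostring(name) .. ">"
        end
      elseif tag:sub(-1) ~= "/" then
        -- start tag: remember the element name
        local name = tag:match("^([^%s/]+)")
        if not name then return false, "malformed tag at offset " .. lt end
        stack[#stack + 1] = name
        elements = elements + 1
      else
        -- self-closing tag such as <br/>: nothing to push
        elements = elements + 1
      end
    end
  end
  if #stack > 0 then
    return false, "missing closing tag for <" .. stack[#stack] .. ">"
  end
  if elements == 0 then return false, "no root element" end
  return true
end
```

With the string from the first post, `check_balanced(content)` returns false because the trailing `<ha` is never terminated, so the caller can skip xml.parse and retry later.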

    There are applications where it’s useful to have an XML parser that can handle a continuous stream. But really xml.parse isn’t that type of interface – it’s a DOM-style parser which takes a complete XML document.

    So basically we should be able to modify xml.parse so that it can say something along the lines of “this is an incomplete XML document – expected closing tag”. It should be a reasonable solution, and we’ll probably get that into Iguana 5.6.8.

    Thanks for bringing this to our attention – as always impressed with your attention to detail…

    Have a great weekend!

    Hi Eliot,

    thank you for the kind words! We found this because our upstream instruments apparently take some time to write their output files, and we were sometimes unlucky enough to read a file that was not yet finished. We also plan to test whether the file is still locked by calling io.open(filename, 'r+') and checking for the EACCES (Permission denied) error.
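A minimal sketch of that io.open probe, using only the standard Lua io library. The helper name `can_open_for_update` is made up for illustration. Whether a file that is still being written actually fails to open depends on the operating system and on how the writer opened it, so a successful open is not proof that the file is complete.

```lua
-- Tries to open the file for update ("r+") without modifying it.
-- Returns true on success; otherwise false, the error message and the
-- numeric error code that io.open reports.
local function can_open_for_update(filename)
  local handle, err, errno = io.open(filename, "r+")
  if not handle then
    -- A "Permission denied" message here would suggest the file is still
    -- locked; a missing file fails as well, with a different message.
    return false, err, errno
  end
  handle:close()
  return true
end
```

The caller would inspect the returned message or code to tell a locked file apart from one that does not exist.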

    Cheers, Robin

    Ah okay.

    The ideal thing would be to have those instruments write to a temp file, close it and rename it when it’s done. But you may not have the ability to modify their behavior.

    Another good trick is to look at the age of the files – from memory, os.fs.* has a function that would give you that information.
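A rough sketch of the file-age test. The thread does not name the function, so this assumes it is something like `os.fs.stat(filename)` returning a table with an `mtime` field in seconds since the epoch; that is an assumption to verify against the Iguana help. The stat function is passed in as a parameter, so the sketch itself does not depend on it, and `is_settled` is a hypothetical helper name.

```lua
-- Returns true if the file was last modified at least min_age_seconds ago.
-- stat_fn: function(filename) -> table with an mtime field (epoch seconds).
--          ASSUMPTION: inside Iguana this would be os.fs.stat or similar.
-- now:     optional current time in epoch seconds, defaults to os.time().
local function is_settled(filename, min_age_seconds, stat_fn, now)
  local info = stat_fn(filename)
  if not info or not info.mtime then
    return false  -- no usable timestamp: treat the file as not ready
  end
  now = now or os.time()
  return (now - info.mtime) >= min_age_seconds
end
```

A poller could then skip any file for which `is_settled(path, 10, stat_fn)` is false and pick it up on the next run; the threshold of 10 seconds is arbitrary and would need tuning to how slowly the instruments write.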

    Good luck!

    Yeah, you can say that again! We just have to put up with what we get from them.

    Also, thank you for the trick with the file age. That may be nicer than provoking an IO error.
    And of course, thank you as well for the fast reaction about xml.parse!

    You are welcome – it’s a very common problem. Good luck!
