Friday 14 June 2019

.net - why does xmltextreader convert html encoded utf8 characters to utf8 string automatically?



I receive an XML file with encoding "ISO-8859-1" (Latin-1)



Within the file (among other tags) I have Example &quot;content&quot; And &#9472;



Now, for some reason, when I load this into XmlTextReader and read "XmlReader.Value" to get the value, it returns: "content" And ─



This then, when confronted with a database only accepting Latin-1 encoding, obviously errors.




I have tried the following:




  • Converting into bytes and using
    Encoding.Convert to change from UTF-8
    into Latin-1 (which successfully
    gives me a bunch of "?" instead)

  • Using
    StreamReader(file,Encoding.whatever)
    to load the file into XmlTextReader




And several variations thereof, plus different methods found on the internet and on StackOverflow itself.



I understand that .NET strings are UTF-16. What I don't understand is why, given a fully Latin-1 encoded XML file with CORRECT markup wherever UTF-8 characters occur (markup that is compatible with older databases AND the web, e.g. HTML), the reader simply overrides that and outputs the decoded string ANYWAY.



Is there no way to get around this other than writing my own custom text parser?


Answer



I do not believe this is a problem with the encoding. What you're seeing is the XML string being un-escaped.




The problem is that &quot; is an XML escape sequence, so XmlTextReader will un-escape it for you.
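The un-escaping described above can be reproduced in a few lines. This is a minimal sketch, not the asker's actual code: the element name `OtherText`, the class name `UnescapeDemo`, and the sample string (including the decimal form `&#9472;` for ─) are assumptions made for illustration.

```csharp
using System;
using System.IO;
using System.Xml;

public static class UnescapeDemo
{
    // Returns the value of the first text node in the XML, or null if there is none.
    public static string ReadFirstText(string xml)
    {
        using (var reader = new XmlTextReader(new StringReader(xml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Text)
                {
                    // Predefined entities and character references
                    // have already been expanded at this point.
                    return reader.Value;
                }
            }
        }
        return null;
    }

    public static void Main()
    {
        // Hypothetical sample input, modelled on the question.
        string xml = "<OtherText>Example &quot;content&quot; And &#9472;</OtherText>";
        string value = ReadFirstText(xml);

        // The value contains literal quotes and U+2500, not the escape sequences.
        Console.WriteLine(value.Contains("&quot;"));  // False
        Console.WriteLine(value.Contains("\u2500")); // True
    }
}
```

The file's declared encoding plays no part here: the reader returns the parsed character data as a .NET string, which is why changing the StreamReader encoding does not bring the escape sequences back.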



If you change this:



Example &quot;content&quot; And &#9472;


To this:



Example &amp;quot;content&amp;quot; And &#9472;



Then



   XmlReader.Value = "&quot;content&quot; And ─";


You'll need to wrap your value in CDATA so it is ignored by the parser.
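As a rough illustration of the CDATA route, the sketch below reads the same kind of value wrapped in a CDATA section. The element name `OtherText`, the class name `CDataDemo`, and the sample string are assumptions for illustration; in practice the producer of the XML file would have to emit the CDATA section.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class CDataDemo
{
    // Concatenates the values of all text and CDATA nodes in the XML.
    public static string ReadAllText(string xml)
    {
        var sb = new StringBuilder();
        using (var reader = new XmlTextReader(new StringReader(xml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Text ||
                    reader.NodeType == XmlNodeType.CDATA)
                {
                    sb.Append(reader.Value);
                }
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // Hypothetical sample input: the value is wrapped in CDATA.
        string xml =
            "<OtherText><![CDATA[Example &quot;content&quot; And &#9472;]]></OtherText>";

        // Inside CDATA the escape sequences are plain text, so they survive parsing.
        Console.WriteLine(ReadAllText(xml));
    }
}
```

Because nothing inside a CDATA section is treated as markup, the value comes back with `&quot;` and `&#9472;` intact and contains only ASCII characters.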



Another option is to re-escape the string:




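    using System.Security;
    // ...
    // Re-escapes &, <, >, " and ' in the value the reader un-escaped.
    // Characters outside Latin-1, such as ─, are left unchanged.
    string val = SecurityElement.Escape(xmlReader.Value);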
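Note that `SecurityElement.Escape` only handles the five XML markup characters, so a character such as ─ would still reach a Latin-1-only database as-is. One possible way to cover that case is sketched below; the class `Latin1Escaper` and the method `EscapeForLatin1` are hypothetical names introduced here, not part of .NET, and the approach (decimal character references for anything above U+00FF) is an assumption about what the database should store.

```csharp
using System;
using System.Globalization;
using System.Security;
using System.Text;

public static class Latin1Escaper
{
    // Hypothetical helper: escapes XML markup characters, then replaces every
    // character outside the Latin-1 range with a decimal character reference.
    public static string EscapeForLatin1(string value)
    {
        if (value == null)
        {
            return null;
        }

        string escaped = SecurityElement.Escape(value);
        var sb = new StringBuilder(escaped.Length);

        for (int i = 0; i < escaped.Length; i++)
        {
            char c = escaped[i];
            if (c <= '\u00FF')
            {
                sb.Append(c); // Representable in ISO-8859-1, keep as-is.
                continue;
            }

            int codePoint = c;
            // Combine a valid surrogate pair into a single code point.
            if (char.IsHighSurrogate(c) && i + 1 < escaped.Length &&
                char.IsLowSurrogate(escaped[i + 1]))
            {
                codePoint = char.ConvertToUtf32(c, escaped[i + 1]);
                i++;
            }

            sb.Append("&#")
              .Append(codePoint.ToString(CultureInfo.InvariantCulture))
              .Append(';');
        }

        return sb.ToString();
    }

    public static void Main()
    {
        // Prints: &quot;content&quot; And &#9472;
        Console.WriteLine(EscapeForLatin1("\"content\" And \u2500"));
    }
}
```

The result contains only ASCII characters, so converting it to Latin-1 bytes afterwards should no longer produce "?" placeholders.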
