Gaaaarrrrrr.... Yet another standards-related problem to keep me scratching my head for hours. I developed and now run the website Rural Escapes. This is great, and I really enjoy doing it even though at the moment it's not earning me much, but it also means I get the job of debugging and fixing it when things go wrong. Today's little problem is foreign characters not displaying correctly.
While developing the site I went to great lengths to make sure that I could correctly display all the characters used in Europe, as Rural Escapes is a Europe-wide service. To this end I made sure that every page sent back stated UTF-8 as its character encoding. Everything seemed to be working: all the foreign characters I entered were displayed correctly, so I was happy. What I forgot to check, though, was what happens when non-ASCII characters are sent up as form data. Well, it turns out that Tomcat (and probably every other container and web server) interprets them using whatever default encoding it is set to use. In the case of Tomcat this seems to be ISO-8859-1, which means that it mangles characters such as £. The reason for this monumental screw-up? You guessed it: IE.
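To see the kind of mangling involved, here is a minimal sketch in plain Java. The class and method names are made up for illustration and are not taken from the Rural Escapes code. It encodes £ as UTF-8, as the browser does, and then decodes those bytes as ISO-8859-1, as the container does by default:

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Hypothetical helper: encode text as UTF-8, then (wrongly) decode the bytes as ISO-8859-1.
    static String misdecode(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String pound = "\u00A3"; // the pound sign, written as an escape so the source file encoding doesn't matter
        String mangled = misdecode(pound);
        // The single pound sign has become two characters: U+00C2 followed by U+00A3.
        System.out.println(mangled.length()); // prints 2
    }
}
```

The pound sign is two bytes in UTF-8 (0xC2 0xA3), and ISO-8859-1 treats every byte as a character of its own, which is why one character turns into two.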
There is a header called Content-Type which is sent up with POST data and which should have the format:
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
However, back at the dawn of time Microsoft, when developing IE, left off the "; charset=UTF-8" parameter. At the time this wasn't so bad because basically everywhere used ISO-8859-1, so you could be pretty sure everything would be interpreted correctly. Now, though, a multitude of different character sets are in use and you can't rely on the data being sent in one particular encoding, which leads to problems.
There is a partial solution to this problem, but it's not pretty. Mozilla and IE (at least) can now include an extra parameter called "_charset_" in a POST request (as described in this Mozilla bug report), which can be used to determine the character set of the posted data. There are some potential problems with parameter name clashes, but they are probably quite minimal. To include this extra parameter you simply add the attribute "accept-charset" to your form element, place a hidden form field called "_charset_" in the form, and cross your fingers. The browser will then fill in that field when it sends up the data. It is described in this W3C document. A word of warning, though: although this works, the Java Servlet API doesn't recognize it as a valid way of specifying the form data character set, and therefore a call to request.getCharacterEncoding() still returns null.
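Since the Servlet API won't act on "_charset_" for you, you would have to read it yourself. The following is only a rough sketch of that idea in plain Java, working on a raw form body rather than a real servlet request. The class name, the decodeForm helper and the fallback parameter are all my own inventions for illustration:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

class CharsetFieldDemo {
    private static final String FIELD = "_charset_=";

    // Hypothetical helper: decode an application/x-www-form-urlencoded body using the
    // charset named in its "_charset_" field, or fallbackCharset if that field is absent or empty.
    static Map<String, String> decodeForm(String body, String fallbackCharset) {
        String charset = fallbackCharset;
        // First pass: find the charset name. Charset names are plain ASCII.
        for (String pair : body.split("&")) {
            if (pair.startsWith(FIELD) && pair.length() > FIELD.length()) {
                charset = decode(pair.substring(FIELD.length()), "US-ASCII");
            }
        }
        // Second pass: decode every name and value with that charset.
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : body.split("&")) {
            int eq = pair.indexOf('=');
            if (eq < 0) continue; // skip fragments that are not name=value
            params.put(decode(pair.substring(0, eq), charset),
                       decode(pair.substring(eq + 1), charset));
        }
        return params;
    }

    private static String decode(String s, String charset) {
        try {
            return URLDecoder.decode(s, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> form = decodeForm("town=%C2%A3&_charset_=UTF-8", "ISO-8859-1");
        System.out.println(form.get("town").length()); // prints 1: the pound sign survived
    }
}
```

A real application would also want to check the charset name against a short list of encodings it is prepared to accept, since the value comes straight from the client.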
You might be wondering why Microsoft and Mozilla don't just correctly implement the specification. It's quite simple really: Microsoft have been doing it wrong for so long that numerous websites now rely on the incorrectly specified header and handle the correct header very badly. The Mozilla team tried to include the correct header and ran into serious compatibility problems so removed it again.
So are there any nice solutions? There is a de facto standard. Both IE and Mozilla send back form data encoded using whatever encoding the page was supplied in, even though they don't actually set the header correctly. This, ironically, is the root cause of my problems. The data has been coming up as UTF-8, but because the header isn't set the servlet container has been decoding it as ISO-8859-1 and screwing it up. If you want to read more about this problem I suggest you have a look here for a great review. That page describes using UTF-8 (well, Unicode) with Linux and other POSIX systems and details a number of problems.
If you are using Java Servlets to process forms, the simplest and probably most effective way to ensure you get all the correct characters is to set the page character set everywhere you can (headers and meta data) and then rely on the browser using that character set when submitting the form. This will work as long as the user doesn't change the page's character encoding before submitting the form, but as most users don't know what "character encoding" means they will leave that setting well alone. The only modification you have to make to your code is to ensure that you call request.setCharacterEncoding() BEFORE you read ANY parameters from the request. This technique is discussed here.
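Why the order matters is easier to see with a toy model. The class below is a hypothetical stand-in that I wrote so the example runs without a servlet container; it is NOT the real Servlet API, it only imitates the documented rule that setCharacterEncoding() has no effect once parameters have been read:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a servlet request, for illustration only.
class FakeRequest {
    private final String rawBody;
    private String encoding = "ISO-8859-1"; // assumed default when the browser sends no charset
    private Map<String, String> parsed;     // filled in on the first parameter read

    FakeRequest(String rawBody) {
        this.rawBody = rawBody;
    }

    // Imitates the real rule: silently ignored once the parameters have been parsed.
    void setCharacterEncoding(String enc) {
        if (parsed == null) {
            encoding = enc;
        }
    }

    String getParameter(String name) {
        if (parsed == null) {
            parsed = new HashMap<>();
            for (String pair : rawBody.split("&")) {
                int eq = pair.indexOf('=');
                if (eq < 0) continue; // skip fragments that are not name=value
                parsed.put(decode(pair.substring(0, eq)), decode(pair.substring(eq + 1)));
            }
        }
        return parsed.get(name);
    }

    private String decode(String s) {
        try {
            return URLDecoder.decode(s, encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + encoding, e);
        }
    }

    public static void main(String[] args) {
        FakeRequest early = new FakeRequest("price=%C2%A3");
        early.setCharacterEncoding("UTF-8");          // set BEFORE reading
        System.out.println(early.getParameter("price").length()); // prints 1

        FakeRequest late = new FakeRequest("price=%C2%A3");
        late.getParameter("price");                   // first read fixes the encoding
        late.setCharacterEncoding("UTF-8");           // too late, ignored
        System.out.println(late.getParameter("price").length());  // prints 2
    }
}
```

In a real application the usual place for the setCharacterEncoding("UTF-8") call is a filter that runs ahead of every servlet, so that nothing gets a chance to read a parameter first.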
Incorrect Encoding With SAXBuilder
One part of the Rural Escapes software builds a JDOM document from a string retrieved from the database. The database is set to use UTF-8 and all UTF-8 strings seem to be stored correctly. The problem I was having was that on my development computer all strings recovered from the database were displayed correctly, but the exact same code on the server (with the same version of the JVM, all the same libraries and the exact same version of Tomcat) would mangle the display of non-ASCII characters, typically displaying them as a question mark. I thought the problem would be fixed by the above solution but it wasn't, although it did help: with the above solution in place multi-byte characters were at least being correctly interpreted, if not displayed.
I am not sure exactly how I fixed this problem, but while doing some debugging I noticed that non-ASCII characters were being displayed as ? in the terminal window as well. I guessed that the terminal was set to some other encoding that couldn't display such far-fetched characters as £. I reconfigured my system to use UTF-8 throughout and restarted the terminal, but to no avail: the characters were still displayed as question marks.
The next day I attacked the problem again and added some debugging statements around the bit of code that I had determined must be doing something wrong, namely the section that built a document from a string. I installed the new code on the server and, lo and behold, everything started working. It seemed that for some reason the debugging code (a logging statement and an XMLOutputter) had somehow fixed the problem. I couldn't believe this, so I had a think. The best explanation I can come up with is that the SAXBuilder I am using to build the document will, in the absence of any other evidence, use the system character encoding, and that restarting Tomcat caused it to re-read the profile with the updated locale and default character encoding settings.
I have discounted the logging statements as the possible fix, as I have since removed them. What I did add, though, was a missing XML declaration to the top of the string that gets turned into a document. This will hopefully force the SAXBuilder to treat the text as UTF-8 regardless of the system locale settings.
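For what it's worth, here is a small sketch of the "add the declaration" idea. To keep it free of dependencies it uses the XML parser built into the JDK rather than JDOM's SAXBuilder, and it goes one step further than I described above by handing the parser explicit UTF-8 bytes, so nothing is left to the platform default. The class and method names are hypothetical:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class XmlDeclarationDemo {
    // Hypothetical helper: make sure the text starts with an XML declaration naming UTF-8,
    // then parse it from bytes that really are UTF-8.
    // Assumption: if the string already has a declaration, it also says UTF-8.
    static Document parseUtf8(String xml) {
        String withDeclaration = xml.startsWith("<?xml")
                ? xml
                : "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" + xml;
        // Always name the charset; getBytes() with no argument uses the platform default.
        byte[] bytes = withDeclaration.getBytes(StandardCharsets.UTF_8);
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseUtf8("<price>\u00A3</price>");
        // The pound sign comes back as a single character whatever the system locale is.
        System.out.println(doc.getDocumentElement().getTextContent().length()); // prints 1
    }
}
```

The same principle should carry over to JDOM: wherever bytes are turned into characters or back again, name the encoding explicitly instead of trusting the machine's default.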