FilePath encoding in Haskell

Posted on June 28, 2021

There are a lot of articles on the internet about the various string types in Haskell. There is, in particular, a lot of hate for the String type.

type String = [Char]

But today I want to talk about String's close relative, FilePath.

type FilePath = String

A lot of people have the wrong idea about FilePath. Even many otherwise excellent resources on Haskell blunder the introduction.

FilePath is just a type synonym for String

– Learn You a Haskell

FilePath is simply another name for String.

– Real World Haskell

FilePath is a type synonym for String. So, for instance, the readFile function takes a String.

– wikibooks.org/wiki/Haskell

For the longest time, I treated the FilePath synonym as a hint to the user that this particular String is meant to be a file path. I was also always annoyed by the fact that file paths are not really Strings on Unix platforms. They are arbitrary sequences of non-0 bytes. How do you refer to a file whose name is not valid Unicode?

So what gives? Do Haskellers just run around assuming that all filenames are Unicode, and we all live on some perpetual happy path? This was deeply unsatisfying for me. I spent a lot of time messing around with the System.Posix.ByteString library instead, in the hope of being able to work outside of those happy path assumptions. But I was wrong.

The documentation gives a hint that there is more to it, though it is subtle.

File and directory names are values of type String, whose precise meaning is operating system dependent.

While it never explicitly says this, what it means is that FilePath values have a special encoding that is operating system dependent. In particular, a FilePath value uses the file system encoding. From the documentation for getFileSystemEncoding:

The Unicode encoding of the current locale, but allowing arbitrary undecodable bytes to be round-tripped through it.

That answers one question: we can ship arbitrary bytes through a FilePath. For non-Windows platforms, the (probably internal) details are that it uses “UTF-8b”, which uses “surrogate escaping” to embed arbitrary bytes in the stream. Surrogates are Unicode code points that are illegal in UTF-8 streams. They are Char values in the range of 0xD800 to 0xDBFF and 0xDC00 to 0xDFFF. What the UTF-8b encoding does is, upon encountering an invalid UTF-8 byte, it emits the raw byte bit-wise-ORed with 0xDC00, so a byte such as 0x99 becomes the Char 0xDC99. So we get a totally valid stream of Chars out. The trick is to undo this on the other end. And indeed, every single function (in base) that takes a FilePath will undo this encoding, and the exact same bytes will come out the other end as came in. If you write a function that takes a FilePath, you had better be doing this as well!
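To make the escaping concrete, here is a minimal sketch of the per-byte mapping, assuming the escape maps a byte b in the range 0x80 to 0xFF to the code point 0xDC00 + b. The names escapeByte and unescapeChar are made up for illustration; they are not the functions base uses internally.

```haskell
import Data.Bits ((.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Escape one undecodable byte (0x80..0xFF) as a lone low surrogate.
-- Hypothetical helper, named for this example only.
escapeByte :: Word8 -> Char
escapeByte b = chr (0xDC00 .|. fromIntegral b)

-- | Recover the original byte if the Char is one of the escape characters.
unescapeChar :: Char -> Maybe Word8
unescapeChar c
  | n >= 0xDC80 && n <= 0xDCFF = Just (fromIntegral (n .&. 0xFF))
  | otherwise                  = Nothing
  where
    n = ord c

main :: IO ()
main = do
  let c = escapeByte 0x99
  print (ord c)             -- 56473, which is 0xDC99
  print (unescapeChar c)    -- Just 153, which is 0x99
  print (unescapeChar 'a')  -- Nothing
```

Ordinary characters are left alone; only the 128 code points from 0xDC80 to 0xDCFF carry raw bytes.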

This encoding scheme has some pretty important consequences. In particular, the bytestring and text libraries provide no mechanism to either create or consume FilePaths, while they both provide the same for String. Do not be fooled! String is not FilePath! You can NOT use Data.Text.pack on a FilePath. It will give you incorrect results! You can not use Data.ByteString.Char8.pack on a FilePath either!
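A small sketch of what goes wrong with Data.Text.pack: Text cannot hold surrogate code points, so pack swaps each one for the replacement character U+FFFD and the original byte is lost. The path below is a made-up example containing one escaped byte.

```haskell
import qualified Data.Text as T

main :: IO ()
main = do
  -- A FilePath whose last character is the escape for the raw byte 0x99.
  let path    = "caf\xDC99" :: FilePath
      viaText = T.unpack (T.pack path)
  print (viaText == path)         -- False: the escape did not survive
  print (viaText == "caf\xFFFD")  -- True: it became U+FFFD
```

Once the escape has been replaced, there is no way to get back to the bytes that name the file.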

Even within base you can’t treat FilePath the same as String. You can NOT (by default) use putStrLn on a FilePath. You can not (by default) use getLine to read a FilePath. putStrLn expects a String and encodes it with the handle's encoding (the locale encoding by default), so it will not undo the surrogate escaping in a FilePath. Likewise, you will not be able to read arbitrary byte sequences with getLine, and thus can’t read newline (or, more robustly, \NUL) terminated file names from stdin without hitting the dreaded

hGetLine: invalid argument (invalid byte sequence)

on funny file names.

If you do want to use these with FilePaths instead of Strings, you will need to tell GHC to do so by using hSetEncoding on the relevant handles and setting the encoding to the file system encoding.
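As a sketch of that setup, the following filter copies file names from stdin to stdout, one per line. The helper name useFileSystemEncoding is invented for this example; hSetEncoding and getFileSystemEncoding are the real functions from base.

```haskell
import GHC.IO.Encoding (getFileSystemEncoding)
import System.IO (Handle, hSetEncoding, stdin, stdout)

-- | Switch a handle to the file system encoding, so that the Strings read
-- from or written to it can be treated as FilePaths.
-- Hypothetical helper, named for this example only.
useFileSystemEncoding :: Handle -> IO ()
useFileSystemEncoding h = getFileSystemEncoding >>= hSetEncoding h

main :: IO ()
main = do
  useFileSystemEncoding stdin
  useFileSystemEncoding stdout
  -- Undecodable bytes are escaped on the way in and unescaped on the way out.
  paths <- lines <$> getContents
  mapM_ putStrLn paths
```

Something like `ls | ./copyNames` (with a binary name of your choosing) should then pass funny file names through unchanged instead of failing with an invalid byte sequence error.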

Be aware that using ByteString IO, as is often recommended, will not save you. You can not convert a ByteString to a FilePath, at least not in a way supported by the bytestring package.

What about FFI? Well, do NOT use Foreign.C.String to marshal FilePaths. That module is for Strings only! The functions in that module use the locale encoding, not the file system encoding. You want to use the functions in GHC.Foreign, and provide them with the correct TextEncoding (probably the one you get when you query getFileSystemEncoding). If you know the encoding you want is the file system encoding (and not, say, the argv encoding) you can also use withFilePath from base's internal System.Posix.Internals module, which does it for you. Note that the similarly named System.Posix.ByteString.FilePath.withFilePath from the unix package takes a RawFilePath, which is a ByteString, not a FilePath.
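Here is a minimal sketch of marshalling a FilePath with GHC.Foreign.withCString, assuming a POSIX system. It calls C's strlen only as a stand-in for whatever foreign function you actually need; pathByteLength is a made-up name.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Foreign.C.String (CString)
import Foreign.C.Types (CSize (..))
import qualified GHC.Foreign as GHC
import GHC.IO.Encoding (getFileSystemEncoding)

-- strlen stands in for a real foreign function that takes a path.
foreign import ccall unsafe "string.h strlen"
  c_strlen :: CString -> IO CSize

-- | Number of bytes in the path as the operating system will see it.
-- Hypothetical helper, named for this example only.
pathByteLength :: FilePath -> IO Int
pathByteLength path = do
  enc <- getFileSystemEncoding
  -- withCString encodes with the given TextEncoding, undoing the escaping.
  fromIntegral <$> GHC.withCString enc path c_strlen

main :: IO ()
main = pathByteLength "caf\xDC99" >>= print
```

The escaped character goes out as the single raw byte it stands for, which Foreign.C.String.withCString would not do with a plain locale encoding.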

You can also use GHC.Foreign to construct and deconstruct a ByteString if you are using a library that has made the misguided, but probably well-meaning, decision to accept or supply ByteStrings for file names. Do this by composing GHC.Foreign.newCStringLen with Data.ByteString.Unsafe.unsafePackMallocCStringLen and GHC.Foreign.peekCStringLen with Data.ByteString.useAsCStringLen. Note: The functions in the GHC.Foreign module accept Strings, but here it is OK because they also accept the TextEncoding.
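The two compositions described above might look like the sketch below. The function names filePathToByteString and byteStringToFilePath are invented for this example, and it assumes a POSIX system and a GHC that ships the bytestring package.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Unsafe as BU
import qualified GHC.Foreign as GHC
import GHC.IO.Encoding (getFileSystemEncoding)

-- | Encode a FilePath to the raw bytes of the file name.
-- newCStringLen mallocs the buffer; unsafePackMallocCStringLen takes
-- ownership of it and frees it when the ByteString is collected.
filePathToByteString :: FilePath -> IO B.ByteString
filePathToByteString path = do
  enc <- getFileSystemEncoding
  GHC.newCStringLen enc path >>= BU.unsafePackMallocCStringLen

-- | Decode raw file name bytes to a FilePath, escaping undecodable bytes.
byteStringToFilePath :: B.ByteString -> IO FilePath
byteStringToFilePath bytes = do
  enc <- getFileSystemEncoding
  B.useAsCStringLen bytes (GHC.peekCStringLen enc)

main :: IO ()
main = do
  let raw = B.pack [0x66, 0x6F, 0x6F, 0x99]  -- "foo" plus one stray byte
  path <- byteStringToFilePath raw
  back <- filePathToByteString path
  print (back == raw)  -- should print True: the bytes round-trip
```

Both directions run in IO because the file system encoding has to be looked up at run time.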

And finally, getArgs is the mischievous rule breaker here. By default, on non-windows systems, argvEncoding is the same as the file system encoding. So, when you use System.Environment.getArgs on Unix systems, you are actually getting a [FilePath] instead of a [String] as documented.

Exercise: can you now implement a simple echo in Haskell? One solution:
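module Main where

import System.Environment (getArgs)
import GHC.IO.Encoding (argvEncoding, setLocaleEncoding)

main :: IO ()
main = do
    -- On Unix these are really FilePaths: undecodable bytes arrive escaped.
    args <- getArgs
    -- Make the locale encoding match the one getArgs decoded with, so that
    -- putStrLn undoes the escaping and writes the original bytes.
    argvEncoding >>= setLocaleEncoding
    putStrLn (unwords args)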

Did your simple echo program work with the following invocation?

./myEcho One Two $(printf '\231')

Conclusion

Be aware of which TextEncoding is in use. String and FilePath tend to use different ones, and mixing them gives wrong results. Ideally each different encoding would be realised in the type system, but alas, it is not so.