databuffer utf8 issue
 | ||
| Hello, today I ran into a problem with DataBuffers and special characters like "ä":
Strict
Import brl.databuffer
Function Main:Int()
    Local s:String = "ä"
    Local d:DataBuffer = New DataBuffer(s.Length())
    'UTF8
    d.PokeString(0, s)
    Print "Result UTF8:"
    Print "original: "+s
    Print "read: "+d.PeekString(0)
    Print "--------"
    'ASCII
    d.PokeString(0, s, "ascii")
    Print "Result ASCII:"
    Print "original: "+s
    Print "read: "+d.PeekString(0,"ascii")
    Return 0
End Function
The results are:
Result UTF8:
original: ä
read: ᅢ
--------
Result ASCII:
original: ä
read: ä
Any idea what the problem is? | 
| 
 | ||
| The DataBuffer length looks wrong: you set it to 1 (the length of s), but UTF-8 requires multiple bytes to store Unicode values > 127. Try using s.Length * 3 for a 'worst case' buffer size. | 
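The worst-case sizing suggested here can be sketched roughly as follows. This is only an illustration, not a confirmed fix: it assumes that PokeString returns the number of bytes it wrote and that PeekString has an overload taking a byte count, so that the unused tail of the oversized buffer is not read back as part of the string.

```monkey
Strict
Import brl.databuffer

Function Main:Int()
    Local s:String = "ä"

    ' Worst case: up to 3 bytes per string character
    Local d:DataBuffer = New DataBuffer(s.Length() * 3)

    ' Assumption: PokeString returns the number of bytes it wrote
    Local byteLen:Int = d.PokeString(0, s)

    ' Assumption: PeekString(address, count) reads only the given number
    ' of bytes, so the unused tail of the buffer is ignored
    Local s2:String = d.PeekString(0, byteLen)

    Print "bytes written: " + byteLen
    If s = s2
        Print "comparison: correct"
    Else
        Print "comparison: WRONG"
    Endif

    d.Discard()
    Return 0
End
```

The trade-off is that the buffer may be up to three times larger than needed, in exchange for not having to compute the exact encoded size up front.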
| 
 | ||
| Would it be possible to extend the Length method of a string to accept an encoding, e.g. Length("utf8") or Length("ascii")? | 
| 
 | ||
| marksibly wrote: Try using s.Length * 3 for a 'worst case' buffer size.
UTF-8 encoding takes 1 to 4 bytes per code point, please see http://en.wikipedia.org/wiki/UTF-8:
"UTF-8 encodes each of the 1,112,064 valid code points in the Unicode code space (1,114,112 code points minus 2,048 surrogate code points) using one to four 8-bit bytes (a group of 8 bits is known as an octet in the Unicode Standard)."
k.o.g. wrote: Is it possible, that you extend the length method of a string to Length("utf8") or Length("ascii").
You could write a function for it:
Strict
Import brl.databuffer
Function Utf8Length:Int(s:String) ' returns the byte size of Utf8 encoded string
    Local sLen:Int = s.Length()
    If sLen = 0 Then Return 0
    Local buf:DataBuffer = New DataBuffer(4)
    Local byteLen:Int    = 0
    
    For Local i:Int = 0 To sLen-1
        buf.PokeString(0, s[i..i+1] )
        Local firstByte:Int = buf.PeekByte(0)
        If     firstByte & %10000000 = 0
            byteLen += 1
        ElseIf firstByte & %11100000 = %11000000
            byteLen += 2
        ElseIf firstByte & %11110000 = %11100000
            byteLen += 3
        ElseIf firstByte & %11111000 = %11110000
            byteLen += 4
        Endif
    Next
    
    buf.Discard()
    
    Return byteLen
End
Function Main:Int()
    Local strings:String[] = [ "ä",
                               "€",
                               "Hallo €uro",
                               "Thai: สวัสดี",
                               "Burmese: မင်္ဂလာပါ",
                               "Arabic: مرحبا",
                               "Russian: здравствуйте",
                               "Slovak: haló",
                               "Chinese: 您好",
                               "Hebrew: שלום",
                               "Korean: 안녕하세요." ]
    For Local s:String = Eachin strings
        Print "string:           " + s
        Print "character length: " + s.Length()
        Print "Utf8 byte length: " + Utf8Length( s )
        Local d:DataBuffer = New DataBuffer( Utf8Length(s) )
        d.PokeString(0, s)
        
        Local s2:String = d.PeekString(0)
        Print "Utf8 PeekString:  " + s2
        
        If s = s2
            Print "comparison:       correct"
        Else
            Print "comparison:       >>> WRONG! <<<"
        Endif
    
        Print "---------------------------"
        
        d.Discard()
    Next
    Return 0
End
AsciiLength = string.Length(), but you can't map most Unicode characters to ASCII anyway, so it does not make much sense to convert Unicode strings into ASCII strings. | 
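As a possible alternative to poking each character into a temporary DataBuffer, the byte length can also be derived from the character codes alone. The sketch below is an assumption-laden illustration: Utf8LengthFromChars is a hypothetical name, and it assumes every string character is a 16-bit value that is encoded on its own, so no character takes more than 3 bytes.

```monkey
Strict

' Hypothetical helper: UTF-8 byte length computed from char codes only,
' without allocating a DataBuffer.
Function Utf8LengthFromChars:Int(s:String)
    Local byteLen:Int = 0
    For Local i:Int = 0 Until s.Length()
        ' Assumption: each character is a 16-bit value encoded on its own
        Local c:Int = s[i] & $ffff
        If c < $80
            byteLen += 1    ' 7-bit values fit in one byte
        ElseIf c < $800
            byteLen += 2    ' up to 11 bits need two bytes
        Else
            byteLen += 3    ' remaining 16-bit values need three bytes
        Endif
    Next
    Return byteLen
End

Function Main:Int()
    Local strings:String[] = [ "ä", "€", "Hallo €uro" ]
    For Local s:String = Eachin strings
        Print "string:           " + s
        Print "Utf8 byte length: " + Utf8LengthFromChars(s)
    Next
    Return 0
End
```

If the string encoder on a given target handles characters differently, the result could diverge from what PokeString actually writes, so it is worth comparing it against the DataBuffer-based function before relying on it.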
| 
 | ||
| Not sure if this is what you need, since you're using a DataBuffer with peek/poke, which seems to imply you actually want to modify things at the byte level. But if you want to work at the character level without a lot of hassle, I wrote a UTF-8 library to handle this...  I believe I either wrote it before brl.DataBuffer was a thing or was unaware of it (it uses FileStreams instead), but it shouldn't be too difficult to use a DataBuffer instead. Many string operations are supported, but as Danilo said, mapping Unicode to ASCII isn't 1 to 1, and character folding is a bit more involved. |