encoding

Utilities for encoding detection and conversion.

encoding.convert_options

handle_from_bom

(field) handle_from_bom: boolean

Strips the byte order mark, if applicable.


handle_to_bom

(field) handle_to_bom: boolean

Adds the byte order mark, if applicable.


strict

(field) strict: boolean

When true, the conversion fails if errors are found.


convert

function encoding.convert(tocharset: "ARMSCII-8"|"BIG5"|"BIG5-HKSCS"|"CP866"|"CP932"...(+57), fromcharset: "ARMSCII-8"|"BIG5"|"BIG5-HKSCS"|"CP866"|"CP932"...(+57), text: string, options?: encoding.convert_options)
-> converted_text: string|nil
2. errmsg: string

Converts the given text from one encoding into another.

tocharset:
-> "ARMSCII-8"
| "BIG5"
| "BIG5-HKSCS"
| "CP866"
| "CP932"
| "EUC-JP"
| "EUC-KR"
| "EUC-TW"
| "GB18030"
| "GB2312"
| "GBK"
| "GEORGIAN-ACADEMY"
| "HZ"
| "IBM850"
| "IBM852"
| "IBM855"
| "IBM857"
| "IBM862"
| "IBM864"
| "ISO-2022-JP"
| "ISO-2022-KR"
| "ISO-8859-1"
| "ISO-8859-2"
| "ISO-8859-3"
| "ISO-8859-4"
| "ISO-8859-5"
| "ISO-8859-6"
| "ISO-8859-7"
| "ISO-8859-8"
| "ISO-8859-8-I"
| "ISO-8859-9"
| "ISO-8859-10"
| "ISO-8859-13"
| "ISO-8859-14"
| "ISO-8859-15"
| "ISO-8859-16"
| "ISO-IR-111"
| "JOHAB"
| "KOI8-R"
| "KOI8-U"
| "SHIFT_JIS"
| "TCVN"
| "TIS-620"
| "UCS-2BE"
| "UCS-2LE"
| "UHC"
| "UTF-16BE"
| "UTF-16LE"
| "UTF-32BE"
| "UTF-32LE"
| "UTF-7"
| "UTF-8"
| "VISCII"
| "WINDOWS-1250"
| "WINDOWS-1251"
| "WINDOWS-1252"
| "WINDOWS-1253"
| "WINDOWS-1254"
| "WINDOWS-1255"
| "WINDOWS-1256"
| "WINDOWS-1257"
| "WINDOWS-1258"

fromcharset:
-> "ARMSCII-8"
| "BIG5"
| "BIG5-HKSCS"
| "CP866"
| "CP932"
| "EUC-JP"
| "EUC-KR"
| "EUC-TW"
| "GB18030"
| "GB2312"
| "GBK"
| "GEORGIAN-ACADEMY"
| "HZ"
| "IBM850"
| "IBM852"
| "IBM855"
| "IBM857"
| "IBM862"
| "IBM864"
| "ISO-2022-JP"
| "ISO-2022-KR"
| "ISO-8859-1"
| "ISO-8859-2"
| "ISO-8859-3"
| "ISO-8859-4"
| "ISO-8859-5"
| "ISO-8859-6"
| "ISO-8859-7"
| "ISO-8859-8"
| "ISO-8859-8-I"
| "ISO-8859-9"
| "ISO-8859-10"
| "ISO-8859-13"
| "ISO-8859-14"
| "ISO-8859-15"
| "ISO-8859-16"
| "ISO-IR-111"
| "JOHAB"
| "KOI8-R"
| "KOI8-U"
| "SHIFT_JIS"
| "TCVN"
| "TIS-620"
| "UCS-2BE"
| "UCS-2LE"
| "UHC"
| "UTF-16BE"
| "UTF-16LE"
| "UTF-32BE"
| "UTF-32LE"
| "UTF-7"
| "UTF-8"
| "VISCII"
| "WINDOWS-1250"
| "WINDOWS-1251"
| "WINDOWS-1252"
| "WINDOWS-1253"
| "WINDOWS-1254"
| "WINDOWS-1255"
| "WINDOWS-1256"
| "WINDOWS-1257"
| "WINDOWS-1258"
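
The signature above can be sketched in use as follows. This is a minimal example, assuming the `encoding` table is available as a global in the host application's Lua environment; the helper name `to_utf8` and the option values chosen here are illustrative and not part of the API.

```lua
-- Hypothetical helper: convert `text` from `fromcharset` to UTF-8.
-- Assumes the global `encoding` module is provided by the host application.
local function to_utf8(text, fromcharset)
  local options = {
    handle_from_bom = true,  -- strip the byte order mark, if applicable
    handle_to_bom = false,   -- do not add a byte order mark
    strict = true            -- fail if errors are found
  }
  -- Note the argument order: target charset first, then source charset.
  local converted, errmsg = encoding.convert("UTF-8", fromcharset, text, options)
  if not converted then
    return nil, "conversion failed: " .. tostring(errmsg)
  end
  return converted
end
```

Calling `to_utf8(raw_bytes, "ISO-8859-1")` would then return either the converted string or `nil` plus a prefixed error message, mirroring the two return values of `encoding.convert`.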

detect

function encoding.detect(filename: string)
-> charset: string|nil
2. errmsg: string

Tries to detect the encoding of the given file, to the best of its capabilities. Returns nil and an error message on failure.
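
Because detection can fail, callers will usually want a fallback. A minimal sketch, assuming the global `encoding` table is available; `detect_or_default` and its `default` parameter are hypothetical names introduced for illustration.

```lua
-- Hypothetical helper: detect a file's charset, falling back to a default.
-- Assumes the global `encoding` module is provided by the host application.
local function detect_or_default(filename, default)
  local charset, errmsg = encoding.detect(filename)
  if not charset then
    -- Detection failed; pass the error message along with the fallback.
    return default, errmsg
  end
  return charset
end
```

The second return value is only set when the fallback was used, so the caller can log why detection failed.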


detect_string

function encoding.detect_string(text: string)
-> charset: string|nil
2. errmsg: string

Same as encoding.detect(), but operates on a string instead of a file.
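
Detection and conversion can be chained to normalize in-memory text. The sketch below assumes the global `encoding` table is available and that the charset name returned by `encoding.detect_string` is accepted by `encoding.convert`; the helper name `normalize_to_utf8` is hypothetical.

```lua
-- Hypothetical helper: detect the charset of `text` and convert it to UTF-8.
-- Assumes the global `encoding` module is provided by the host application.
local function normalize_to_utf8(text)
  local charset, errmsg = encoding.detect_string(text)
  if not charset then
    return nil, errmsg
  end
  if charset == "UTF-8" then
    -- Already UTF-8; returned unchanged (any byte order mark is kept).
    return text
  end
  -- Returns converted_text on success, or nil plus an error message.
  return encoding.convert("UTF-8", charset, text)
end
```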


get_charset_bom

function encoding.get_charset_bom(charset: "ARMSCII-8"|"BIG5"|"BIG5-HKSCS"|"CP866"|"CP932"...(+57))
-> bom: string

Gets the byte order mark for the given charset, if applicable.

charset:
-> "ARMSCII-8"
| "BIG5"
| "BIG5-HKSCS"
| "CP866"
| "CP932"
| "EUC-JP"
| "EUC-KR"
| "EUC-TW"
| "GB18030"
| "GB2312"
| "GBK"
| "GEORGIAN-ACADEMY"
| "HZ"
| "IBM850"
| "IBM852"
| "IBM855"
| "IBM857"
| "IBM862"
| "IBM864"
| "ISO-2022-JP"
| "ISO-2022-KR"
| "ISO-8859-1"
| "ISO-8859-2"
| "ISO-8859-3"
| "ISO-8859-4"
| "ISO-8859-5"
| "ISO-8859-6"
| "ISO-8859-7"
| "ISO-8859-8"
| "ISO-8859-8-I"
| "ISO-8859-9"
| "ISO-8859-10"
| "ISO-8859-13"
| "ISO-8859-14"
| "ISO-8859-15"
| "ISO-8859-16"
| "ISO-IR-111"
| "JOHAB"
| "KOI8-R"
| "KOI8-U"
| "SHIFT_JIS"
| "TCVN"
| "TIS-620"
| "UCS-2BE"
| "UCS-2LE"
| "UHC"
| "UTF-16BE"
| "UTF-16LE"
| "UTF-32BE"
| "UTF-32LE"
| "UTF-7"
| "UTF-8"
| "VISCII"
| "WINDOWS-1250"
| "WINDOWS-1251"
| "WINDOWS-1252"
| "WINDOWS-1253"
| "WINDOWS-1254"
| "WINDOWS-1255"
| "WINDOWS-1256"
| "WINDOWS-1257"
| "WINDOWS-1258"
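
One possible use of the returned value is checking whether a string begins with a charset's byte order mark. This is a sketch only: it assumes the global `encoding` table is available and, since the documentation does not say what is returned for charsets without a BOM, it treats both nil and an empty string as "no BOM". The helper name `starts_with_bom` is hypothetical.

```lua
-- Hypothetical helper: report whether `text` starts with the BOM of `charset`.
-- Assumes the global `encoding` module is provided by the host application.
local function starts_with_bom(text, charset)
  local bom = encoding.get_charset_bom(charset)
  -- Assumption: charsets without a BOM yield nil or an empty string.
  if not bom or bom == "" then
    return false
  end
  -- Compare the leading bytes of the text against the BOM bytes.
  return text:sub(1, #bom) == bom
end
```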

strip_bom

function encoding.strip_bom(text: string, charset?: "ARMSCII-8"|"BIG5"|"BIG5-HKSCS"|"CP866"|"CP932"...(+57))
-> cleaned_text: string

Removes the byte order mark from the given text.

@param text — A string that may contain a byte order mark.

@param charset — Charset to scan for. If nil, all charsets that have a BOM are scanned.

charset:
-> "ARMSCII-8"
| "BIG5"
| "BIG5-HKSCS"
| "CP866"
| "CP932"
| "EUC-JP"
| "EUC-KR"
| "EUC-TW"
| "GB18030"
| "GB2312"
| "GBK"
| "GEORGIAN-ACADEMY"
| "HZ"
| "IBM850"
| "IBM852"
| "IBM855"
| "IBM857"
| "IBM862"
| "IBM864"
| "ISO-2022-JP"
| "ISO-2022-KR"
| "ISO-8859-1"
| "ISO-8859-2"
| "ISO-8859-3"
| "ISO-8859-4"
| "ISO-8859-5"
| "ISO-8859-6"
| "ISO-8859-7"
| "ISO-8859-8"
| "ISO-8859-8-I"
| "ISO-8859-9"
| "ISO-8859-10"
| "ISO-8859-13"
| "ISO-8859-14"
| "ISO-8859-15"
| "ISO-8859-16"
| "ISO-IR-111"
| "JOHAB"
| "KOI8-R"
| "KOI8-U"
| "SHIFT_JIS"
| "TCVN"
| "TIS-620"
| "UCS-2BE"
| "UCS-2LE"
| "UHC"
| "UTF-16BE"
| "UTF-16LE"
| "UTF-32BE"
| "UTF-32LE"
| "UTF-7"
| "UTF-8"
| "VISCII"
| "WINDOWS-1250"
| "WINDOWS-1251"
| "WINDOWS-1252"
| "WINDOWS-1253"
| "WINDOWS-1254"
| "WINDOWS-1255"
| "WINDOWS-1256"
| "WINDOWS-1257"
| "WINDOWS-1258"
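
A typical place to strip a byte order mark is right after reading a file's raw bytes. The sketch below assumes the global `encoding` table is available; the helper name `read_without_bom` is hypothetical, and the file is read with the standard Lua `io` library.

```lua
-- Hypothetical helper: read a file in binary mode and strip any leading BOM.
-- Assumes the global `encoding` module is provided by the host application.
local function read_without_bom(path, charset)
  local file, errmsg = io.open(path, "rb")
  if not file then
    return nil, errmsg
  end
  local content = file:read("*a")
  file:close()
  -- When `charset` is nil, all charsets that have a BOM are scanned.
  return encoding.strip_bom(content, charset)
end
```

Passing an explicit charset such as `"UTF-8"` limits the scan to that charset's byte order mark.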