UTFCast Professional Manual

Updated: 1/13/2017

Copyright © RotatingScrew.com

 

 

What is a warning

You may often see warnings when detecting or converting files. Warnings are not errors: a warning is raised when the detection engine is not sure what kind of file it is looking at or which codepage the file uses.

 

Codepage detection is statistical, so no detection technology can be 100% sure of its result. UTFCast Professional runs four different detection engines together to make the result more reliable. However, if the file is small, or the text in the file is too short, it is very hard to be sure what kind of data the file contains. In this case, a warning occurs.

 

Files in warning status are not converted. If you have chosen to copy unconverted files, these files are copied to the output directory without conversion. You can use the preview panel to verify whether the detection result is correct, then right-click the file and use the "Accept result" function to ignore the warning and convert the file with the detected codepage. Alternatively, use the "Make correction" function to specify a different codepage for the conversion.

 

File name filters

Wildcard filter

A wildcard is a symbol that represents an unknown character or a set of characters. UTFCast Professional supports two wildcard symbols: the asterisk (*) for any number of unknown characters, and the question mark (?) for exactly one unknown character.

 

You can mix-and-match the asterisk (*) and the question mark (?), as well as combine multiple wildcard strings with the semicolon (;). If a file name does not match any of your provided wildcard strings, the file will be ignored.

 

Examples:

Given the below string, only the file names starting with the character W and ending with the .TXT extension will be picked:

w*.txt

Given the below string, only the file names with two characters and with the .PHP extension will be picked:

??.php

Given the below string, only the file names having two or three characters, while ending with any extension will be picked:

??.*; ???.*

Note: The wildcard string *.* matches any file name that ends with any extension except a file name without an extension, because a file name without an extension does not have a dot in the middle. To match any file name with and without an extension, use a single asterisk in the wildcard string.
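The matching rules above can be sketched in Python. This is a minimal illustration, not UTFCast's actual implementation: the function names are hypothetical, and case-insensitive matching and trimming of spaces around each semicolon-separated pattern are assumptions drawn from the examples above.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate one wildcard string into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")           # any number of unknown characters
        elif ch == "?":
            parts.append(".")            # exactly one unknown character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    # Assumption: file names are compared case-insensitively.
    return re.compile("".join(parts), re.IGNORECASE | re.DOTALL)


def matches_wildcards(file_name: str, wildcard_string: str) -> bool:
    """Return True if file_name matches any pattern in a ';'-separated list."""
    patterns = [p.strip() for p in wildcard_string.split(";") if p.strip()]
    return any(wildcard_to_regex(p).fullmatch(file_name) for p in patterns)
```

In this sketch, the name README is rejected by the wildcard string *.* but accepted by a single asterisk, which mirrors the note above.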

 

 

Regular expression filter

UTFCast Professional supports ECMAScript (ECMA-262) regular expressions. For the full specification of the ECMAScript standard, please refer to the Ecma website or download the specification document from http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-262.pdf.

 

Example:

Given the below regular expression, the numeric file names with a .TXT extension will be picked:

\d+\.txt
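To try the example pattern outside UTFCast, here is a small Python sketch. Python's re module is a different regular expression dialect from ECMAScript, but the two agree for this simple pattern. Matching the whole file name and ignoring case are assumptions about the filter's behavior, and the function name is hypothetical.

```python
import re

# The pattern from the example above: one or more digits followed by ".txt".
# re.ASCII restricts \d to 0-9, as in ECMAScript.
NUMERIC_TXT = re.compile(r"\d+\.txt", re.IGNORECASE | re.ASCII)


def is_picked(file_name: str) -> bool:
    """Return True if the whole file name matches the filter (assumed behavior)."""
    return NUMERIC_TXT.fullmatch(file_name) is not None
```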

 

Settings

UTFCast Professional works well with its default settings in most cases. However, you may notice a performance difference when converting very large files or a large number of small files. The settings allow you to fine-tune the converter parameters for your workload and maximize performance on your system. They also allow you to change the default behavior of Instant Conversion and the default parameters shown in the Custom Conversion dialog.

 

 

Default Settings

Changing the below settings affects the behavior of Instant Conversion and the default settings of Custom Conversion.

 

Copy unconverted files

 

Instant Conversion does not copy unconverted files to the output directory by default. To change this behavior, change this setting. The setting also affects the Copy Unconverted setting in the Custom Convert function.

 

Write BOM to converted files

 

If you would like Instant Conversion to write a BOM to output files, turn this setting on. It also affects the BOM setting in the Custom Convert function.

 

Process hidden files

 

Hidden files and hidden folders in source folder are ignored when this option is turned off.

 

Output encoding

 

Specify output encoding for Instant Conversion, In-place Conversion and the default setting for Custom Conversion.

 

Return type

 

Specify return type for Instant Conversion, In-place Conversion and the default setting for Custom Conversion.

 

Advanced Settings

Converter Threads

 

The Converter Threads setting is auto-detected by default. Its value is based on how many logical CPUs your system has. Decreasing this value results in lower CPU usage when converting files. However, increasing it does not always mean better performance. The best performance depends on the overall performance of your system, especially that of your hard disk drive and memory. Leaving this setting at its default value suits most cases.

 

Chunk Size

 

A chunk is a block of memory that UTFCast uses to hold the content of a file. A smaller chunk size results in more hard disk input/output operations per file, but faster memory operations and lower memory usage. A larger chunk size has the opposite effect. However, using a very large chunk to process a small file wastes time allocating unnecessary memory. If you regularly convert large files, it is better to increase the Chunk Size. For example, if you need to convert 1000 files of 10 MB each, you can set the Chunk Size to its maximum because doing so reduces hard disk I/O operations and increases performance.
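The trade-off can be made concrete with a little arithmetic. The sketch below uses a simplified model in which each chunk costs one read operation. The function name and the two chunk sizes are hypothetical and are not UTFCast's actual values.

```python
def read_operations(file_size: int, chunk_size: int) -> int:
    """Number of chunk-sized reads needed to consume a file of file_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return (file_size + chunk_size - 1) // chunk_size  # ceiling division


MB = 1024 * 1024
small_chunks = read_operations(10 * MB, 64 * 1024)  # hypothetical 64 KB chunk
large_chunks = read_operations(10 * MB, 4 * MB)     # hypothetical 4 MB chunk
```

Under this model a 10 MB file needs 160 reads with the 64 KB chunk and 3 reads with the 4 MB chunk.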

 

Sample Size

 

The Sample Size is used only for codepage detection. A larger sample size results in a more precise detection result, but also more memory usage and slower detection. This setting does not affect files that are smaller than the sample size.

 

Use codepage GB18030 instead of GB2312

 

See the section: GB18030 Support

 

Binary List Acceleration

 

The binary list is a list of file extensions. Files such as .exe, .rar, .zip and .pdf files are well-known binary file types; detecting them takes a long time and the detection result is imprecise. Enabling this setting skips detection for such files to increase detection speed. If some of your text files are always ignored by the detection engine, turn this option off. Otherwise, keep it on to increase detection speed.
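As a rough illustration of this kind of extension-based skip, consider the Python sketch below. The extension set contains only the four types named above and stands in for the real list, whose contents are not documented here. The function name is hypothetical.

```python
import os

# Hypothetical subset of the binary list; the real list ships with the program.
BINARY_EXTENSIONS = {".exe", ".rar", ".zip", ".pdf"}


def should_detect(file_name: str, acceleration_enabled: bool = True) -> bool:
    """Return False when the file would be skipped as a known binary type."""
    if not acceleration_enabled:
        return True  # option off: every file goes through detection
    extension = os.path.splitext(file_name)[1].lower()
    return extension not in BINARY_EXTENSIONS
```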

 

Use Consolidated Buffer

 

Consolidated Buffer is an improved memory management method available since version 1.5. It reduces memory fragmentation and provides faster memory access, which slightly improves conversion performance, especially when converting a large number of files. A larger buffer size provides higher memory performance but uses more memory. This option is enabled by default.

 

Codepage Reference

GB18030 Support

The Chinese standard codepage GB18030 is a superset of the Chinese standard codepage GB2312. By default, UTFCast Professional does not turn on GB18030 support, for compatibility reasons.

 

To display and process all characters of the Chinese standard GB18030 on Windows XP, the Microsoft GB18030 Support Package must be installed. This support package updates an XP system with, among other things, conversion libraries, fonts and input method editors (IMEs) to correctly support GB18030. The support package is available as a download from the Microsoft website.

 

Windows Vista and later versions have built-in full support for GB18030, and no additional package is needed.

 

If you have confirmed that your operating system supports GB18030, go to the File->Preferences menu and turn on "Use codepage GB18030 instead of codepage GB2312".

 

Supported Codepages

UTFCast Professional supports detecting and reading the below input codepages:

 

ASCII

Big5

EUC-JP (EUC 20932 subset only)

EUC-KR

EUC-TW

GB18030 (*A)

GB2312

HZ-GB2312

IBM855

IBM866

ISO-2022-CN

ISO-2022-JP (JIS)

ISO-2022-KR

ISO-8859-2

ISO-8859-5

ISO-8859-7

ISO-8859-8

KOI8-R

Shift-JIS

UCS-4-2143

UCS-4-3412

UTF-16 Big Endian

UTF-16 Little Endian

UTF-32 Big Endian

UTF-32 Little Endian

UTF-8

Windows-1250

Windows-1251

Windows-1252

Windows-1253

Windows-1255

Windows-874 (TIS 620)

x-mac-cyrillic

 

A. See this section for details: GB18030 Support

 

Supported output codepages:

 

UTF-8 without BOM

UTF-8 with BOM

UTF-16 Big Endian without BOM

UTF-16 Big Endian with BOM

UTF-16 Little Endian without BOM

UTF-16 Little Endian with BOM

UTF-32 Big Endian without BOM

UTF-32 Big Endian with BOM

UTF-32 Little Endian without BOM

UTF-32 Little Endian with BOM

UCS-4-2143 without BOM

UCS-4-2143 with BOM

UCS-4-3412 without BOM

UCS-4-3412 with BOM
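The difference between the "with BOM" and "without BOM" variants can be sketched in Python for the UTF-8, UTF-16 and UTF-32 outputs. This is an illustration only: the function name and the lowercase keys are hypothetical, and the two UCS-4 byte orders are left out because Python's standard library has no codec for them.

```python
import codecs

# Python codec name and BOM bytes for each standard output encoding.
ENCODINGS = {
    "utf8": ("utf-8", codecs.BOM_UTF8),
    "utf16le": ("utf-16-le", codecs.BOM_UTF16_LE),
    "utf16be": ("utf-16-be", codecs.BOM_UTF16_BE),
    "utf32le": ("utf-32-le", codecs.BOM_UTF32_LE),
    "utf32be": ("utf-32-be", codecs.BOM_UTF32_BE),
}


def encode_text(text: str, enc: str = "utf8", bom: bool = True) -> bytes:
    """Encode text and optionally prepend the matching byte order mark."""
    codec_name, bom_bytes = ENCODINGS[enc.lower()]
    data = text.encode(codec_name)  # these codecs never emit a BOM themselves
    return bom_bytes + data if bom else data
```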

 

Supported output Return-Types (Also known as CR/LF Style):

 

No change (Keep the original Return-Type as is)

Force CRLF (Windows Style)

Force CR Only (Macintosh Style)

Force LF Only (Unix/Linux Style)
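The four Return-Type options above amount to a line-break normalization step. A minimal Python sketch follows. It assumes that every CR, LF or CRLF sequence in the input counts as one line break. The function name is hypothetical, and the option names follow the rt values used elsewhere in this manual.

```python
import re

# Matches any of the three line-break styles; CRLF is listed first so that
# a Windows line break is treated as one unit rather than CR followed by LF.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")

_TARGETS = {"CRLF": "\r\n", "CR": "\r", "LF": "\n"}


def apply_return_type(text: str, return_type: str = "NOCHANGE") -> str:
    """Rewrite all line breaks in text to the requested style."""
    key = return_type.upper()
    if key == "NOCHANGE":
        return text  # keep the original Return-Type as is
    target = _TARGETS[key]  # raises KeyError for unknown values
    return _LINE_BREAK.sub(lambda _match: target, text)
```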

 

Command Line Reference

Command Line Syntax

It is possible to run UTFCast Professional in command line mode. The syntax is:

UTFCastPro.exe /switch:argument /switch

 

Switches And Arguments

 

| Switch | Argument | Description | Comment |
| --- | --- | --- | --- |
| /in | "Source" | Specify which folder or file to convert | |
| /out | "Output" | Specify which folder or file to output | In DIR mode, if this switch is not specified, a sibling folder name is generated, for example a folder named 'Source_Folder (Converted)'. In FILE mode, this switch must be specified. If any part of the output path does not exist, a corresponding folder is created. If the /d switch is present, this switch is ignored. |
| /r | | Recursive conversion | |
| /c | | Copy unconverted files | |
| /h | | Process hidden files | If this switch is not specified, hidden files and hidden folders in the source folder are ignored. |
| /quiet | | Quiet mode | No user interaction. |
| /mode | DIR | The Source is a folder | The Output must be a folder. If this switch is not specified, DIR mode is assumed. |
| | FILE | The Source is a file | The Output must be a file. |
| | BACHUITE | The Source is a Bachuite file | |
| /enc | UTF8 | Convert files to UTF-8 | UTF-8 is assumed if this switch is not present. |
| | UTF16 | Equivalent to UTF16LE | |
| | UTF16LE | Convert files to UTF-16 Little Endian | |
| | UTF16BE | Convert files to UTF-16 Big Endian | |
| | UTF32 | Equivalent to UTF32LE | |
| | UTF32LE | Convert files to UTF-32 Little Endian | |
| | UTF32BE | Convert files to UTF-32 Big Endian | |
| | 2143 | Convert files to UCS-4-2143 | |
| | 3412 | Convert files to UCS-4-3412 | |
| /bom | YES | Write a BOM to a converted file | A BOM is written if this switch is not present. |
| | NO | Do not write a BOM to a converted file | |
| /rt | CR | Set return type to CR (Macintosh) | The return type is not changed if this switch is not present. |
| | LF | Set return type to LF (Unix) | |
| | CRLF | Set return type to CRLF (Windows) | |
| /wf | A wildcard string | Apply wildcard filter | If both /wf and /rf are present, /wf is used unless its value is empty. |
| /rf | A regular expression | Apply regular expression filter | |
| /cp | A codepage identifier | Skip auto-detection and manually specify the codepage decoder | If the source file is a Unicode text file with a BOM, the codepage identifier is ignored. Refer to the Codepage Identifiers section for the full list of available identifiers. |
| /logfile | "Path_To_Log_File" | Write debug messages to the specified file | UTFCast Professional must have the required privilege to access the specified log file path; otherwise logging will fail. |
| /d | | Detection only | Detect the file (in FILE mode) or directory (in DIR mode) provided with the /in switch. If this switch is present, the /out switch is ignored. To record the detection result, use it in combination with the /export switch. |
| /export | "Path_To_CSV_Result_File" | Export the detection or conversion result to a CSV file | The exported result file is in UTF-8. To specify a different encoding, use the /exportenc switch. |
| /exportenc | UTF8 | Encode exported result in UTF-8 | UTF-8 is assumed if this switch is not present. |
| | UTF16 | Equivalent to UTF16LE | |
| | UTF16LE | Encode exported result in UTF-16 Little Endian | |
| | UTF16BE | Encode exported result in UTF-16 Big Endian | |
| | UTF32 | Equivalent to UTF32LE | |
| | UTF32LE | Encode exported result in UTF-32 Little Endian | |
| | UTF32BE | Encode exported result in UTF-32 Big Endian | |
| /exportbom | YES | Add a BOM to the exported result file | If this switch is not present, a BOM is added. |
| | NO | No BOM is added to the exported result file | |
| /resetlayout | | Reset the GUI layout data to its initial state | |
| /cmdfile | "Path_To_Command_Line_File" | Read the command line from a text file instead | Windows has a 260-character path length limit. To pass a very long command line to UTFCast Professional, you can use a Command Line File. See Using Command Line File for details. |

 

Codepage Identifiers

A codepage identifier is a numeric value that tells UTFCast which codepage you are referring to. The table below shows the complete list of supported codepage identifiers.

 

| Codepage Name | Codepage Identifier |
| --- | --- |
| Big5 | 950 |
| EUC-JP | 20932 |
| EUC-KR | 51949 |
| EUC-TW | 51950 |
| GB2312 | 936 |
| GB18030 | 54936 |
| HZ-GB-2312 | 52936 |
| IBM855 | 855 |
| IBM866 | 866 |
| ISO-2022-JP | 50222 |
| ISO-2022-KR | 50225 |
| ISO-2022-CN | 50227 |
| ISO-8859-2 | 28592 |
| ISO-8859-5 | 28595 |
| ISO-8859-7 | 28597 |
| ISO-8859-8 | 28598 |
| KOI8-R | 20866 |
| MAC-Cyrillic | 10007 |
| Shift-JIS | 932 |
| UCS-4-3412 | 3412 |
| UCS-4-2143 | 2143 |
| UTF-8 | 65001 |
| UTF-16LE | 1200 |
| UTF-16BE | 1201 |
| UTF-32LE | 12000 |
| UTF-32BE | 12001 |
| Windows-874 | 874 |
| Windows-1250 | 1250 |
| Windows-1251 | 1251 |
| Windows-1252 | 1252 |
| Windows-1253 | 1253 |
| Windows-1255 | 1255 |
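The rule that a BOM takes precedence over a manually specified codepage identifier can be illustrated in Python. The sketch below is hedged: it maps only part of the table above to Python codec names, which are close equivalents rather than UTFCast's own decoders, and the function name is hypothetical.

```python
import codecs

# Partial mapping from identifiers in the table above to Python codec names.
# Assumption: the Python codecs are close equivalents, not UTFCast's decoders.
PYTHON_CODECS = {
    950: "cp950", 936: "gbk", 54936: "gb18030", 932: "cp932",
    51949: "euc_kr", 20866: "koi8_r", 28592: "iso8859_2",
    65001: "utf_8", 1200: "utf_16_le", 1201: "utf_16_be",
    12000: "utf_32_le", 12001: "utf_32_be",
    1250: "cp1250", 1251: "cp1251", 1252: "cp1252",
}

# UTF-32 BOMs come first because the UTF-32 LE BOM starts with the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf_32_le"), (codecs.BOM_UTF32_BE, "utf_32_be"),
    (codecs.BOM_UTF8, "utf_8"),
    (codecs.BOM_UTF16_LE, "utf_16_le"), (codecs.BOM_UTF16_BE, "utf_16_be"),
]


def decode_with_identifier(data: bytes, identifier: int) -> str:
    """Decode data with the given codepage identifier unless a BOM is present."""
    for bom, codec_name in BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(codec_name)  # BOM wins over the identifier
    return data.decode(PYTHON_CODECS[identifier])
```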

 

Using Command Line File

Windows has a 260-character path length limit. This means you cannot access a file or directory, or run a command line, whose total length exceeds 260 characters. UTFCast Professional provides various command line switches and arguments to control the command line mode, and some of them accept a path to a file or a directory. If you combine multiple switches, your command line may exceed 260 characters. In addition, although UTFCast Professional itself can access paths longer than 260 characters, Windows does not allow you to pass such a long path as a command line argument.

 

The Command Line File was introduced in UTFCast Professional version 2.8. It is simply a text file whose content is the command line switches and arguments. Because a text file can hold very long text, UTFCast Professional can read a command line of up to 32768 characters from it.

 

Using a Command Line File is easy. Store all of your command line switches and arguments in the first line of the text file, then pass the /cmdfile switch with an argument pointing to that file, and UTFCast Professional will do the rest. For example:

 

UTFCastPro.exe /cmdfile:"C:\MyUTFCastCommandLine.txt"

C:\MyUTFCastCommandLine.txt can then contain a full command line in its first line, as in the example below (note that the keyword UTFCastPro.exe must not appear in the command line file):

/in:"C:\Very Very Very Long Long Long Path That Exceeds 260 characters\Input.txt" /out:"C:\Very Very Very Long Long Long Path That Exceeds 260 characters\Output.txt" /mode:file /bom:yes /enc:utf8 /export:"D:\My UTFCast Logs\Today.log"
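If the command line file is generated by a script, the rules above (single first line, no UTFCastPro.exe keyword, at most 32768 characters) can be checked before writing. The Python sketch below is one way to do that. The function name is hypothetical, and writing the file as UTF-8 is an assumption, since the expected encoding of the command line file is not stated here.

```python
MAX_COMMAND_LINE_LENGTH = 32768  # limit stated above


def write_command_line_file(path: str, command_line: str) -> str:
    """Write the switches to the first line of a file and return the invocation."""
    if command_line.lstrip().lower().startswith("utfcastpro.exe"):
        raise ValueError("the command line file must not contain UTFCastPro.exe")
    if "\n" in command_line or "\r" in command_line:
        raise ValueError("the command line must fit on the first line")
    if len(command_line) > MAX_COMMAND_LINE_LENGTH:
        raise ValueError("command line exceeds 32768 characters")
    # Assumption: UTF-8 is an acceptable encoding for the command line file.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(command_line)
    return f'UTFCastPro.exe /cmdfile:"{path}"'
```

The returned string is the short command line to run; the long switches live in the file.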

 

Command Line Examples

1) To convert every text file in C:\MyFolder, including files in subfolders but excluding hidden files and hidden folders, and save the converted files to D:\MyOutput as UTF-16BE without a BOM, the command line is:

UTFCastPro.exe /in:"C:\MyFolder" /out:"D:\MyOutput" /r /enc:utf16be /bom:no

2) To convert the file C:\MyFile.txt to D:\MyConvertedFile.txt as UTF-8 with a BOM, skipping auto-detection and manually specifying the Windows-1252 decoder to read the source file, the command line is:

UTFCastPro.exe /in:"C:\MyFile.txt" /out:"D:\MyConvertedFile.txt" /enc:utf8 /bom:yes /mode:file /cp:1252

3) To write debug messages to a log file during conversion, add the /logfile switch:

UTFCastPro.exe /in:"C:\MyFiles" /out:"D:\MyConvertedFiles" /enc:utf8 /bom:yes /mode:file /cp:1252 /logfile:"D:\UTFCastPro.log"
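When command lines like these are generated by a script, the /switch:argument syntax can be assembled programmatically. The Python sketch below only builds the text of the command line and does not run UTFCast. The function name and the set of switches that get quoted are assumptions based on the examples above.

```python
# Switches whose argument is a path and is quoted in the examples above.
QUOTED_SWITCHES = {"in", "out", "logfile", "export", "cmdfile"}


def build_command_line(switches: dict, exe: str = "UTFCastPro.exe") -> str:
    """Join switches into a command line; a value of None means no argument."""
    parts = [exe]
    for name, value in switches.items():
        if value is None:
            parts.append(f"/{name}")            # switch without argument
        elif name in QUOTED_SWITCHES:
            parts.append(f'/{name}:"{value}"')  # quoted path argument
        else:
            parts.append(f"/{name}:{value}")
    return " ".join(parts)


example_1 = build_command_line(
    {"in": r"C:\MyFolder", "out": r"D:\MyOutput", "r": None,
     "enc": "utf16be", "bom": "no"}
)
```

Here example_1 reproduces the command line of example 1 above.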

Bachuite Reference

Merging multiple tasks

Let’s start with an example. The below command line converts your files to another folder:

UTFCastPro.exe /in:"D:\My Files" /out:"D:\My Output" /enc:utf8 /rt:crlf /bom:YES

With Bachuite, you instead use an XML element to describe the same command; the command line arguments become XML attributes:

<dir in="D:\My Files" out="D:\My Output" enc="utf8" rt="crlf" bom="yes"/>

If you want to convert multiple sibling folders and some single files with the command line, you’ll need to run the command line multiple times, one time for each task:

UTFCastPro.exe /in:"D:\My Files A" /out:"D:\My Output A" /enc:UTF8 /bom:YES /rt:CRLF
UTFCastPro.exe /in:"D:\My Files B" /out:"D:\My Output B" /enc:UTF8 /bom:NO /rt:CRLF
UTFCastPro.exe /in:"D:\Single File A.txt" /out:"D:\Single File Output A.txt" /enc:UTF16LE /mode:FILE /rt:CRLF /bom:YES
UTFCastPro.exe /in:"D:\Single File B.txt" /out:"D:\Single File Output B.txt" /enc:UTF16LE /mode:FILE /rt:CRLF /bom:NO
UTFCastPro.exe /in:"D:\Single File C.txt" /out:"D:\Single File Output C.txt" /enc:UTF16LE /mode:FILE /rt:CRLF /bom:YES

With Bachuite, you can instead wrap the command lines in Bachuite XML and get the job done by running UTFCast only once:

<dir in="D:\My Files A" out="D:\My Output A" enc="utf8" bom="yes" rt="crlf"/>
<dir in="D:\My Files B" out="D:\My Output B" enc="utf8" bom="no" rt="crlf"/>
<file in="D:\Single File A.txt" out="D:\Single File Output A.txt" enc="utf16le" rt="crlf" bom="yes" />
<file in="D:\Single File B.txt" out="D:\Single File Output B.txt" enc="utf16le" rt="crlf" bom="no" />
<file in="D:\Single File C.txt" out="D:\Single File Output C.txt" enc="utf16le" rt="crlf" bom="yes" />

In fact, simple wrapping is just one of the options. Bachuite can do multiple tasks with ease by using Sets.

 

Using Sets

A set is a group of elements that share the same predefined attributes. Here’s an example of using Sets:

<set rt="crlf" bom="yes" enc="utf8">

  <!-- The below tasks inherit rt, bom and enc from the parent set -->
  <dir in="D:\My Files A" out="D:\My Output A" />
  <dir in="D:\My Files B" out="D:\My Output B" bom="no" />

    <!-- A child set inherits properties too -->
    <set enc="utf16le">
      <file in="D:\Single File A.txt" out="D:\Single File Output A.txt" />
      <file in="D:\Single File B.txt" out="D:\Single File Output B.txt" bom="no"/>
      <file in="D:\Single File C.txt" out="D:\Single File Output C.txt" />
    </set>

</set>

As you can see, you do not need to specify an attribute for a child element (either a child set or a child task) if its value is identical to that of its parent. Attributes are inherited by default, but you can override them at any time.
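The inheritance rule can be checked with a short Python sketch that walks a set and computes the effective attributes of each task. It illustrates only the rule described above; it is not how UTFCast parses Bachuite, it ignores profiles and links, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

BACHUITE_SET = r"""
<set rt="crlf" bom="yes" enc="utf8">
  <dir in="D:\My Files A" out="D:\My Output A" />
  <dir in="D:\My Files B" out="D:\My Output B" bom="no" />
  <set enc="utf16le">
    <file in="D:\Single File A.txt" out="D:\Single File Output A.txt" />
    <file in="D:\Single File B.txt" out="D:\Single File Output B.txt" bom="no"/>
  </set>
</set>
"""


def resolve_tasks(element, inherited=None):
    """Return (tag, effective attributes) for every dir/file task under element."""
    effective = {**(inherited or {}), **element.attrib}  # own values override
    if element.tag in ("dir", "file"):
        return [(element.tag, effective)]
    tasks = []
    for child in element:  # a set passes its effective attributes down
        tasks.extend(resolve_tasks(child, effective))
    return tasks


tasks = resolve_tasks(ET.fromstring(BACHUITE_SET.strip()))
```

Each entry in tasks then carries the rt, bom and enc values it would run with, for example enc "utf16le" and bom "yes" for Single File A.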

 

Using Links

Links can be used for reusing Bachuite files. Here’s an example:

<link src="D:\MyBachuite1.xml" />
<link src="MyBachuite2.xml" enc="utf32be" bom="no" />

The src attribute must be pointing to an existing Bachuite file. Otherwise the whole Bachuite will not run.

 

Using Profiles

A profile is a set of predefined attributes that can be reused in other elements. A Profile element is like a Set element with a name, but it cannot have children. Once a profile is defined, its scope covers all sibling elements defined below it and their children, and only elements within that scope can access it. If an element is assigned an existing profile, the element does not inherit any attribute from its own parent; instead, it copies all attributes from the profile, including those the profile inherited from its parent. Bachuite applies all attributes of the profile to the element first, and then applies the attributes explicitly present on the element.

 

NOTE: Bachuite profile elements are similar to the setting profile feature in the GUI, but they are not the same feature. They are designed for and work in different environments. There is no way to load a saved setting profile in Bachuite, or a Bachuite profile in the GUI.

 

Here's an example of using profiles:

<set enc="utf8" in="D:\Text Files">
	<!-- A profile also inherits attributes from its parent, just like other elements. -->
	<!-- The below profiles also have the enc attribute set to "utf8" and the in attribute set to "D:\Text Files" even though these attributes are not explicitly present. -->
	<profile name="with_bom" bom="yes" rt="crlf" />
	<profile name="without_bom" bom="no" rt="crlf" />

	<!-- All elements below here and their children can use the two profiles defined above -->
	<dir out="D:\Output with bom" profile="with_bom" />
	<dir out="D:\Output without bom" profile="without_bom" />

	<!-- A profile can also be assigned to a set, a link, or even another profile -->
	<!-- Explicitly setting an attribute value (enc="utf16" in this example) overrides it -->
	<profile name="different_enc" enc="utf16" out="D:\Profile Out" profile="without_bom" />

	<set out="D:\New Output">
		<!-- Because a profile is assigned, this element does not inherit any attribute from its parent; the out attribute value "D:\Profile Out", which is copied from the profile, is used -->
		<dir profile="different_enc" />
	</set>
</set>
<set enc="utf16le">
	<!-- ERROR, the below element is out of the "with_bom" profile's scope -->
	<dir in="D:\Text Files" out="D:\Output with bom" profile="with_bom" />
</set>

 

Resolving absolute and relative paths

Any path in an element can be an absolute path such as C:\MyFile.txt, or a relative path such as MyFile.txt or ..\MyFile.txt. A relative path is resolved relative to the location of the Bachuite file that contains it. For example:

 

In C:\MyBachuite1.xml:

<link src="D:\abc\MyBachuite2.xml" />
<link src="MyBachuite3.xml" />
<file in="MyFile.txt" out="SubDir\MyFileOutput.txt" />

In D:\abc\MyBachuite2.xml:

<link src="MyBachuite3.xml" />
<file in="MyFile.txt" out="SubDir\MyFileOutput.txt" />

When linking to MyBachuite3.xml in C:\MyBachuite1.xml, the path of MyBachuite3.xml is resolved to C:\MyBachuite3.xml. When linking to MyBachuite3.xml in D:\abc\MyBachuite2.xml, the path of MyBachuite3.xml is resolved to D:\abc\MyBachuite3.xml.

 

The same applies to paths in other elements. In C:\MyBachuite1.xml, MyFile.txt and SubDir\MyFileOutput.txt are resolved to C:\MyFile.txt and C:\SubDir\MyFileOutput.txt. In D:\abc\MyBachuite2.xml, they are resolved to D:\abc\MyFile.txt and D:\abc\SubDir\MyFileOutput.txt.
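The resolution rule can be reproduced with Python's ntpath module, which applies Windows path rules on any platform. This sketch only mirrors the behavior described above, and the function name is hypothetical.

```python
import ntpath  # Windows path rules, usable on any platform


def resolve_path(bachuite_file: str, path: str) -> str:
    """Resolve path relative to the folder of the Bachuite file that contains it."""
    base_dir = ntpath.dirname(bachuite_file)
    # ntpath.join keeps an absolute second argument unchanged.
    return ntpath.normpath(ntpath.join(base_dir, path))
```

For example, resolving MyBachuite3.xml against D:\abc\MyBachuite2.xml gives D:\abc\MyBachuite3.xml, as described above.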

 

Running Bachuite

Bachuite XML must be saved as an XML file. The file is a normal XML document that uses the Bachuite root element and follows the Bachuite XML schema. For example, save the XML below to D:\MyBachuite.xml:

<?xml version="1.0" encoding="UTF-8"?>
<bachuite version="1.0">

  <!-- Your Bachuite XML goes here -->

</bachuite>

Run the Bachuite file using the below command line:

UTFCastPro.exe /in:"D:\MyBachuite.xml" /mode:bachuite
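A Bachuite file with this root element can also be generated by a script. The Python sketch below builds one from a list of dir and file tasks using the standard xml.etree.ElementTree module. The function name is hypothetical, and sets, links and profiles are left out for brevity.

```python
import xml.etree.ElementTree as ET


def build_bachuite(tasks, version: str = "1.0") -> str:
    """Build Bachuite XML from (tag, attributes) pairs, tag being 'dir' or 'file'."""
    root = ET.Element("bachuite", {"version": version})
    for tag, attributes in tasks:
        if tag not in ("dir", "file"):
            raise ValueError(f"unsupported task element: {tag}")
        ET.SubElement(root, tag, attributes)
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

Save the returned text as a UTF-8 file so that it matches the encoding named in the XML declaration, then run it with /mode:bachuite as shown above.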

Bachuite Attributes

All available attributes are listed in the table below. Element and attribute names are case sensitive; attribute values, however, are case insensitive.

 

| Elements | Attribute | Value | Description | Remarks |
| --- | --- | --- | --- | --- |
| set, file, dir, link, profile | in | A path to a file or a folder | Specify which folder or file to convert. | |
| | out | A path to a file or a folder | Specify which folder or file to output. | In a DIR element, if this attribute is not specified, a sibling folder name is generated, for example a folder named 'Source_Folder (Converted)'. In a FILE element, this attribute must be specified. If any part of the output path does not exist, a corresponding folder is created. |
| | r | YES | Recursive conversion. | NO is assumed if the attribute is not present. |
| | | NO | Non-recursive conversion. | |
| | c | YES | Copy unconverted files. | NO is assumed if the attribute is not present. |
| | | NO | Ignore unconverted files. | |
| | h | YES | Process hidden files. | NO is assumed if the attribute is not present. |
| | | NO | Do not process hidden files. | |
| | enc | UTF8 | Convert to UTF-8. | UTF-8 is assumed if the attribute is not present. |
| | | UTF16LE | Convert to UTF-16 Little Endian. | |
| | | UTF16BE | Convert to UTF-16 Big Endian. | |
| | | UTF32LE | Convert to UTF-32 Little Endian. | |
| | | UTF32BE | Convert to UTF-32 Big Endian. | |
| | | 2143 | Convert to UCS-4-2143. | |
| | | 3412 | Convert to UCS-4-3412. | |
| | bom | YES | Write a BOM to a converted file. | YES is assumed if the attribute is not present. |
| | | NO | Do not write a BOM to a converted file. | |
| | rt | CR | Set return type to CR (Macintosh). | NOCHANGE is assumed if the attribute is not present. |
| | | LF | Set return type to LF (Unix). | |
| | | CRLF | Set return type to CRLF (Windows). | |
| | | NOCHANGE | Do not change the return type. | |
| | cp | A codepage identifier | Skip auto-detection and manually specify the codepage decoder. | If the source file is a Unicode text file with a BOM, the codepage identifier is ignored. Refer to the Codepage Identifiers section for the full list of available identifiers. |
| | wf | A wildcard string | Apply wildcard filter. | If both wf and rf are present, wf is used unless its value is empty. To disable filters, either set both values to empty, or do not provide either of them in the element or any parent element. |
| | rf | A regular expression | Apply regular expression filter. | |
| | profile | A profile name | Assign a profile to the element. | Assigning a profile to an element copies all attribute values (both implicit and explicit) from the profile. Only profiles that are defined in the same or a parent scope can be accessed by the current element. |
| link | src | A path to a Bachuite file | Link to an external Bachuite file. | Linking to a non-existent Bachuite file makes the whole Bachuite refuse to run. |
| profile | name | A unique name | Define a profile. | The profile can be accessed by all sibling elements below it and their children. |