Monday, March 28, 2011

script to save file as unicode

Do you know of any way to transform a set of text files saved in ANSI character encoding to Unicode encoding, either programmatically or via a script?

I would like to do the same thing I do when I open the file with Notepad and choose to save it as a Unicode file.

From Stack Overflow
  • Use the System.IO.StreamReader class (to read the file contents) together with the System.Text.Encoding base class (which supplies the encoding objects that do the conversion).

  • pseudo code...

    Dim system, file, contents, newFile, oldFile

    Const ForReading = 1, ForWriting = 2, ForAppending = 3

    Const AnsiFile = -2, UnicodeFile = -1

    Set system = CreateObject("Scripting.FileSystemObject...

    Set file = system.GetFile("text1.txt")

    Set oldFile = file.OpenAsTextStream(ForReading, AnsiFile)

    contents = oldFile.ReadAll()

    oldFile.Close

    system.CreateTextFile "text1.txt"

    Set file = system.GetFile("text1.txt")

    Set newFile = file.OpenAsTextStream(ForWriting, UnicodeFile)

    newFile.Write contents

    newFile.Close

    Hope this approach will work..
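
For comparison, the read-everything-then-rewrite flow of the script above can be sketched in Python. This is only an illustration under assumptions: cp1252 stands in for the "ANSI" code page (yours may differ), and convert_to_utf16 is a made-up helper name, not part of any library.

```python
import os
import tempfile


def convert_to_utf16(path, source_encoding="cp1252"):
    """Rewrite the file at `path` as UTF-16 with a byte order mark.

    `source_encoding` is an assumption; pass the code page the file
    was really saved in.
    """
    # Read the whole file first so it can safely be reopened for writing.
    # newline="" keeps the original line endings untouched.
    with open(path, "r", encoding=source_encoding, newline="") as src:
        contents = src.read()
    # The "utf-16" codec writes a BOM, then the text in native byte order.
    with open(path, "w", encoding="utf-16", newline="") as dst:
        dst.write(contents)


if __name__ == "__main__":
    # Demo on a throwaway file; the file name and sample text are made up.
    with tempfile.TemporaryDirectory() as folder:
        sample = os.path.join(folder, "text1.txt")
        with open(sample, "wb") as f:
            f.write("caf\xe9\r\n".encode("cp1252"))
        convert_to_utf16(sample)
        with open(sample, "rb") as f:
            print(f.read())  # starts with the two BOM bytes
```

Reading the full contents before reopening the file mirrors the ReadAll/Close/Write order of the script above, which is what makes an in-place conversion safe.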

  • You can use iconv. On Windows you can use it under Cygwin.

    iconv -f from_encoding -t to_encoding file
    
    guillermooo : Why's the accepted answer related to Cygwin? The question is tagged as powershell...
    river0 : Yes, at the beginning I was looking for a powershell solution, but it turns out that this worked really well for me and I could also use cygwin. Anyway, all the responses given seem to be valid approaches
  • The easiest way would be Get-Content 'path/to/text/file' | out-file 'name/of/file'.

    Out-File has an -Encoding parameter, the default of which is Unicode (UTF-16 LE) in Windows PowerShell.

    If you wanted to script a batch of them, you could do something like

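    $files = get-childitem 'directory/of/text/files'
    foreach ($file in $files)
    {
      # The parentheses make get-content read the whole file before
      # out-file reopens that same file for writing.
      (get-content $file.fullname) | out-file $file.fullname
    }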
    
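  • You could create a new text file and write the bytes from the original file into the new one, placing a '\0' after each original byte. That yields little-endian UTF-16, which is what Windows calls "Unicode"; placing the '\0' before each byte gives big-endian UTF-16 instead. This only works if the original text is pure ASCII (for example, unaccented English), and the result has no byte order mark unless you write one first.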
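
The byte-interleaving idea above can be sketched in Python as follows. This is a minimal illustration, not a recommended converter: ascii_to_utf16le is a hypothetical helper name, and it assumes little-endian output with a leading byte order mark, as Notepad's "Unicode" files have.

```python
import codecs


def ascii_to_utf16le(raw):
    """Turn ASCII bytes into UTF-16 LE bytes (with BOM) by adding zero bytes."""
    if any(b > 0x7F for b in raw):
        # Bytes above 0x7F depend on the code page, so zero-padding is wrong.
        raise ValueError("input is not pure ASCII")
    out = bytearray(codecs.BOM_UTF16_LE)  # the two bytes FF FE
    for b in raw:
        out.append(b)     # low byte: the original ASCII value
        out.append(0x00)  # high byte: always zero for ASCII
    return bytes(out)


if __name__ == "__main__":
    converted = ascii_to_utf16le(b"Hi\r\n")
    print(converted)
    # Cross-check against Python's own encoder; prints True.
    print(converted == codecs.BOM_UTF16_LE + "Hi\r\n".encode("utf-16-le"))
```

The explicit ASCII check is the important part: as soon as the file contains accented or other non-ASCII characters, it has to be decoded with its real code page rather than padded with zeros.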

  • This could work for you, but notice that it'll grab every file in the current folder:

    
    Get-ChildItem | Foreach-Object { $c = (Get-Content $_); `
    Set-Content -Encoding UTF8 $c -Path ($_.name + "u") }
    

    Same thing using aliases for brevity:

    
    gci | %{ $c = (gc $_); sc -Encoding UTF8 $c -Path ($_.name + "u") }
    

    Steven Murawski suggests using Out-File instead. The differences between both cmdlets are the following:

    • Out-File will attempt to format the input it receives.
    • Out-File's default encoding is Unicode-based, whereas Set-Content uses the system's default.

    Here's an example assuming the file test.txt doesn't exist in either case:

    
    PS> [system.string] | Out-File test.txt
    PS> Get-Content test.txt
    
    IsPublic IsSerial Name                                     BaseType          
    -------- -------- ----                                     --------          
    True     True     String                                   System.Object     
    
    # test.txt encoding is Unicode-based with BOM
    
    
    
    PS> [system.string] | Set-Content test.txt
    PS> Get-Content test.txt
    
    System.String
    
    # test.txt encoding is "ANSI" (Windows character set)
    

    In fact, if you don't need any specific Unicode encoding, you could just as well do the following to convert a text file to Unicode:

    
    PS> Get-Content sourceASCII.txt > targetUnicode.txt
    

    Out-File is a "redirection operator with optional parameters" of sorts.
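
The transcripts above note whether test.txt ends up "Unicode-based with BOM" or "ANSI". One way to check that for yourself is to look at the first bytes of the file. The Python sketch below does only that; describe_bom is a hypothetical helper name, it recognises just the UTF-8 and UTF-16 byte order marks, and a file without a BOM cannot be identified this way.

```python
import codecs
import os
import tempfile

# Byte order marks this sketch knows about (UTF-32 is deliberately left out).
_BOMS = (
    (codecs.BOM_UTF8, "UTF-8 with BOM"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
)


def describe_bom(path):
    """Return a label for the byte order mark at the start of `path`."""
    with open(path, "rb") as f:
        head = f.read(3)  # the longest BOM checked here is 3 bytes
    for bom, label in _BOMS:
        if head.startswith(bom):
            return label
    return "no BOM (ANSI, ASCII, or BOM-less UTF-8)"


if __name__ == "__main__":
    # Demo on throwaway files; names and contents are made up.
    with tempfile.TemporaryDirectory() as folder:
        ansi = os.path.join(folder, "ansi.txt")
        with open(ansi, "wb") as f:
            f.write(b"System.String\r\n")
        uni = os.path.join(folder, "unicode.txt")
        with open(uni, "w", encoding="utf-16") as f:
            f.write("System.String\n")
        print(describe_bom(ansi))
        print(describe_bom(uni))
```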
