ne12abaa3python
Sunday, 22 January 2006
Unicode in Python

The first thing to know about Python's Unicode support is that you may need to install a recent version of Python to get it. Users of RedHat Linux 7.x have Python 1.5.2 by default, for compatibility reasons. Unicode support was introduced in Python 1.6.
Unicode Strings in Python

Python has two different string types: an 8-bit non-Unicode string type (str) and a Unicode string type (unicode), whose characters are stored as 16-bit or 32-bit units depending on how the interpreter was built.

Unicode strings are written with a leading u. They may contain Unicode escape sequences of the form \u0000, just as in Java. For example:

question = u'\u00bfHabla espa\u00f1ol?' # ¿Habla español?

Some Unicode characters have numbers beyond U+FFFF, so Python has another escape: \U00000000, which offers more than enough digits to specify any Unicode codepoint. (Recent C and C++ standards also offer this, but Java does not.)

Python also offers a \N escape which allows you to specify any Unicode character by name.

# This string has 7 characters in all, including the spaces
# between the symbols.
symbols = u'\N{BLACK STAR} \N{WHITE STAR} \N{LIGHTNING} \N{COMET}'

One more way to build a Unicode string object is with the built-in unichr() function, which is the Unicode version of chr().
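As a minimal sketch, the relationship between unichr(), ord() and the \N escape can be checked directly. The compatibility shim at the top is an assumption added here so the snippet also runs on Python 3, where unichr() no longer exists and chr() does the same job; it is not part of the original text.

```python
import unicodedata

# Assumption: on Python 3 there is no unichr(); chr() plays the same role.
try:
    unichr
except NameError:
    unichr = chr

star = unichr(0x2605)             # build a character from its code point
assert star == u'\N{BLACK STAR}'  # the same character as the \N escape
assert ord(star) == 0x2605        # ord() is the inverse of unichr()

# unicodedata.name() returns the official name that the \N escape accepts.
star_name = unicodedata.name(star)
print(star_name)
```

The unicodedata module is part of the standard library, so names for the \N escape can be looked up without any external reference.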
Unicode Support in the Python Standard Library

Unicode strings are very similar to Python's ordinary 8-bit strings. They have the same useful methods (split(), strip(), find(), and so on). The + and * operators work on Unicode strings just as they do for plain strings. And like plain strings, Unicode strings can do printf-like formatting, using the % symbol. For the most part, you'll feel right at home.
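The methods and operators mentioned above can be tried on a small sample. This is only an illustrative sketch; the variable names are made up for the example.

```python
greeting = u'  \u00bfHabla espa\u00f1ol?  '

stripped = greeting.strip()                # remove surrounding whitespace
words = stripped.split()                   # split on whitespace
position = stripped.find(u'espa\u00f1ol')  # index of a substring, or -1
repeated = u'\u00f1' * 3                   # repetition with *
joined = u'se' + u'\u00f1' + u'or'         # concatenation with +

# printf-like formatting with the % operator
formatted = u'%s has %d characters' % (joined, len(joined))
```

Each result is itself a Unicode string (or, for split(), a list of them), so the operations can be chained just as with plain strings.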

This seamlessness extends to most of Python's standard library.

* Python regular expressions can search Unicode strings.
* Python's standard gettext module supports Unicode. This is the module to use for internationalization of Python programs.
* The Tkinter GUI toolkit offers excellent Unicode support.
* Python's standard XML library is Unicode-aware (as required by the XML specification).

Most of the standard library works smoothly with Unicode strings. Some modules still are not fully Unicode-friendly, but the most important pieces are in place.
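As one illustration of the standard library working with Unicode strings, here is a small sketch using the re module. The sample text is made up for the example; the re.UNICODE flag asks the regular expression engine to apply Unicode rules to \w.

```python
import re

text = u'\u00bfHabla espa\u00f1ol?'

# With re.UNICODE, \w treats the n-with-tilde as a letter, so "espanol"
# with its tilde is matched as one word rather than being split in two.
words = re.findall(u'\\w+', text, re.UNICODE)
```

The punctuation marks, including the inverted question mark, are not word characters and are left out of the result.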
Unicode files and Python

Reading and writing Unicode files from Python is simple. Use codecs.open() and specify the encoding.

import codecs
# Open a UTF-8 file in read mode
infile = codecs.open("infile.txt", "r", "utf-8")
# Read its contents as one large Unicode string.
text = infile.read()
# Close the file.
infile.close()

The same function is used to open a file for writing; just use "w" (write) or "a" (append) as the second argument.
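A minimal round-trip sketch of writing, appending and reading back with codecs.open() follows. The scratch path built with tempfile is a hypothetical choice made for the example; any writable file name would do.

```python
import codecs
import os
import tempfile

text = u'Bun\u0103 diminea\u021ba, lume'

# Hypothetical scratch file, created in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'outfile.txt')

# "w" creates (or truncates) the file and encodes what we write as UTF-8.
outfile = codecs.open(path, 'w', 'utf-8')
outfile.write(text)
outfile.close()

# "a" appends to the existing contents.
outfile = codecs.open(path, 'a', 'utf-8')
outfile.write(u'!')
outfile.close()

# Read everything back as one Unicode string.
infile = codecs.open(path, 'r', 'utf-8')
round_trip = infile.read()
infile.close()
```

Because the same encoding is named on both sides, the string read back equals what was written.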

A fourth argument, after the encoding, can be provided to specify error-handling. The possible values are:

* 'strict' - The default. Throw exceptions if errors are detected while encoding or decoding data.
* 'ignore' - Skip over errors or unencodeable characters.
* 'replace' - Replace bad or unencodeable data with a "replacement character", usually a question mark.

Since 'strict' is the default, expect a lot of UnicodeError exceptions to be raised if your data isn't quite right. Once you get the hang of it, those errors become much less frequent.
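The three error-handling modes can be compared side by side. This sketch uses the encode() method of a Unicode string rather than a file, on the assumption that the same mode names behave the same way there; the sample text is made up for the example.

```python
text = u'Bun\u0103 diminea\u021ba'

# 'ignore' silently drops the two characters ISO-8859-1 cannot represent.
ignored = text.encode('iso-8859-1', 'ignore')

# 'replace' substitutes a question mark for each of them.
replaced = text.encode('iso-8859-1', 'replace')

# 'strict' (the default) raises an exception instead.
try:
    text.encode('iso-8859-1')
    strict_failed = False
except UnicodeError:
    strict_failed = True
```

Catching UnicodeError covers both encoding and decoding failures, since the more specific exceptions derive from it.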

Sometimes a program simply needs to encode or decode a single chunk of Unicode data. This, too, is easy in Python: Unicode strings have an encode() method that returns a str, and str objects have a decode() method that returns a unicode string.

# Suppose we are given these bytes, perhaps over a socket
# or perhaps taken from a database.
utf8_bytes = 'Bun\xc4\x83-diminea\xc8\x9ba, lume'

# We want to convert these UTF-8 bytes to a Unicode string.
unicode_strg = utf8_bytes.decode('utf-8')

# Now print it, but in the ISO-8859-1 encoding, because
# (let's suppose) that is the format of our display.
print unicode_strg.encode('iso-8859-1', 'replace')

However, note that in this particular example, the source string contains two characters (ă and ț) that are not available in ISO-8859-1! Unfortunately, if our display can only handle ISO-8859-1 characters, there is no satisfactory answer to this problem. Some characters will be lost. The last line of the sample code instructs Python to use the 'replace' error-handling behavior instead of the default 'strict' behavior. This way, although some characters will be replaced with question marks, at least no exception will be thrown.

Of course, it would be better to use a display that can handle all Unicode characters, such as a Tk GUI.
print and Unicode strings

We now come to the most puzzling aspect of Python's Unicode support. Attempting to print a Unicode string that contains non-ASCII characters can cause an error:

>>> print u'\N{POUND SIGN}'
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)

Two elements combine to cause this error:

1. Python's default encoding is ASCII. The pound sign is not an ASCII character. (By contrast, Java's default encoding is usually something like Latin-1, which covers a bit more ground than ASCII.)
2. The default error behavior is 'strict'. If Python encounters a character that it can't encode, it raises a UnicodeError. (This is different from Java, which silently replaces the character with a ? instead.)

Python defaults to ASCII because ASCII is the only thing likely to work everywhere. The correct encoding is not always Latin-1. In fact, it depends on how you are accessing Python.

When Python executes a print statement, it simply passes the output to the operating system (using fwrite() or something like it), and some other program is responsible for actually displaying that output on the screen. For example, on Windows, it might be the Windows console subsystem that displays the result. Or if you're using Windows and running Python on a Unix box somewhere else, your Windows SSH client is actually responsible for displaying the data. If you are running Python in an xterm on Unix, then xterm and your X server handle the display.

To print data reliably, you must know the encoding that this display program expects.

Earlier it was mentioned that IBM PC computers use the "IBM Code Page 437" character set at the BIOS level. The Windows console still emulates CP437. So this print statement will work, on Windows, under a console window.

# Windows console mode only
>>> s = u'\N{POUND SIGN}'
>>> print s.encode('cp437')
£

Several SSH clients display data using the Latin-1 character set; Tkinter assumes UTF-8, when 8-bit strings are passed into it. So in general it is not possible to determine what encoding to use with print. It is therefore better to send Unicode output to files or Unicode-aware GUIs, not to sys.stdout.
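When output has to go to a byte-oriented display anyway, one option is to make the encoding an explicit choice. The helper below, encode_for_display(), is a hypothetical name introduced for this sketch; it assumes the caller already knows which encoding the display expects.

```python
def encode_for_display(text, encoding):
    """Hypothetical helper: encode text for a display assumed to expect
    the given encoding, replacing anything that cannot be represented."""
    return text.encode(encoding, 'replace')

pound = u'\N{POUND SIGN}'

cp437_bytes = encode_for_display(pound, 'cp437')        # byte 0x9C in CP437
latin1_bytes = encode_for_display(pound, 'iso-8859-1')  # byte 0xA3 in Latin-1
ascii_bytes = encode_for_display(pound, 'ascii')        # not in ASCII: '?'
```

The same character becomes a different byte in each encoding, which is why a guess that suits one display garbles the output on another.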

Posted by ne12abaa3python at 2:27 AM
Learn Python in 10 minutes
Preliminary fluff

So, you want to learn the Python programming language but can't find a concise and yet full-featured tutorial. This tutorial will attempt to teach you Python in 10 minutes. It's probably not so much a tutorial as it is a cross between a tutorial and a cheatsheet. I assume that you are already familiar with programming and will, therefore, skip most of the non-language-specific stuff. The important keywords will be highlighted so you can easily spot them. Also, pay attention because, due to the terseness of this tutorial, some things will be introduced directly in code and only briefly commented on.

Properties

Python is strongly typed (i.e. types are enforced), dynamically and implicitly typed (i.e. you don't have to declare variables), case sensitive (i.e. var and VAR are two different variables) and object-oriented (i.e. everything is an object).

Syntax

Python has no mandatory statement termination characters, and blocks are specified by indentation. Indent in to begin a block, indent out to end one. Statements that expect an indentation level end in a colon (:). Comments start with the pound (#) sign and are single-line. Values are assigned with the equals sign ("="), and equality testing is done using two equals signs ("=="). You can increment or decrement values using the += and -= operators respectively. This works on many datatypes, strings included. For example:

intMyVar = 3
intMyVar += 2
intMyVar -= 1
strMyVar = "Hello"
strMyVar += " world."

Data types

The data structures available in Python are lists, tuples and dictionaries. Sets are available in the sets library. Lists are one-dimensional arrays (but you can have lists of lists), dictionaries are associative arrays (a.k.a. hash tables) and tuples are immutable one-dimensional arrays. The index of the first item in all array types is 0. Negative numbers count from the end towards the beginning; -1 is the last item. Variables can point to functions. The usage is as follows:

lstSample = [1, ["another", "list"], ("a", "tuple")]
lstList = ["List item 1", 2, 3.14]
lstList[0] = "List item 1 again"
lstList[-1] = 3.14
dicDictionary = {"Key 1": "Value 1", 2: 3, "pi": 3.14}
dicDictionary["pi"] = 3.15
tplTuple = (1, 2, 3)
fnVariable = len
>>> print fnVariable(lstList)
3

You can access array ranges using a colon (:). Leaving the start index empty assumes the first item, and leaving the end index empty assumes the last item, like so:

lstList = ["List item 1", 2, 3.14]
>>> print lstList[:]
['List item 1', 2, 3.1400000000000001]
>>> print lstList[0:2]
['List item 1', 2]
>>> print lstList[-3:-1]
['List item 1', 2]
>>> print lstList[1:]
[2, 3.1400000000000001]

Strings

Python strings can use either single or double quotation marks, and you can have quotation marks of one kind inside a string that uses the other kind (i.e. "He said 'hello'." is valid). Multiline strings are enclosed in triple double (or single) quotes ("""). Python supports Unicode out of the box, using the syntax u"This is a unicode string". To fill a string with values, you use the % (modulo) operator and a tuple. Each %s gets replaced with an item from the tuple, left to right, and you can also use dictionary substitutions, like so:

# myclass is assumed to be an existing object whose
# name attribute is "Poromenos".
>>> print "Name: %s\nNumber: %s\nString: %s" % (myclass.name, 3, 3 * "-")
Name: Poromenos
Number: 3
String: ---

strString = """This is
a multiline
string."""

# WARNING: Watch out for the trailing s in "%(key)s".
>>> print "This %(verb)s a %(noun)s." % {"noun": "test", "verb": "is"}
This is a test.

Flow control statements

Flow control statements are while, if, and for. There is no select; instead, use if. Use for to enumerate through members of a list. To obtain a list of numbers, use range(). These statements' syntax is thus:

lstRange = range(10)
>>> print lstRange
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

for intNumber in lstRange:
    # Check if intNumber is one of
    # the numbers in the tuple.
    if intNumber in (3, 4, 7, 9):
        # Break terminates a for without
        # executing the "else" clause.
        break
else:
    # The "else" clause is optional and is
    # executed only if the loop didn't "break".
    pass # Do nothing

if lstRange[1] == 2:
    print "1 == 2"
elif lstRange[2] == 3:
    print "3 == 4"
else:
    print "Dunno"

# Careful: this loops forever, because lstRange[1] is always 1.
while lstRange[1] == 1:
    pass

Functions

Functions are declared with the "def" keyword. Optional arguments are set in the function declaration after the mandatory arguments by being assigned a default value. For named arguments, the name of the argument is assigned a value. Functions can return a tuple (and using tuple unpacking you can effectively return multiple values). Lambda functions are ad hoc functions that consist of a single expression. Arguments are passed by object reference: the function receives a reference to the caller's object, so changes made in place to a mutable argument are visible to the caller, while rebinding the parameter name is not. For example:

# arg2 and arg3 are optional, they have default values
# if one is not passed (100 and "test", respectively).
def fnMyFunction(arg1, arg2 = 100, arg3 = "test"):
    return arg3, arg2, arg1

ret1, ret2, ret3 = fnMyFunction("Argument 1", arg3 = "Named argument")

fnVariable = lambda x: x + 1
>>> print fnVariable(1)
2

Classes

Python supports a limited form of multiple inheritance in classes. Private variables and methods can be declared by adding at least two leading underscores and at most one trailing one (e.g. "__spam"). We can also assign arbitrary variables to class instances. An example follows:

class MyClass:
    def __init__(self):
        self.varMyVariable = 3
    def fnMyFunction(self, arg1, arg2):
        return self.varMyVariable

# This is the class instantiation
>>> clsInstance = MyClass()
>>> clsInstance.fnMyFunction(1, 2)
3

# This class inherits from MyClass. Multiple
# inheritance is declared as:
# class OtherClass(MyClass1, MyClass2, MyClassN)
class OtherClass(MyClass):
    def __init__(self, arg1):
        self.varMyVariable = 3
        print arg1

>>> clsInstance = OtherClass("hello")
hello
>>> clsInstance.fnMyFunction(1, 2)
3

# This class doesn't have a .test member, but
# we can add one to the instance anyway. Note
# that this will only be a member of clsInstance.
>>> clsInstance.test = 10
>>> clsInstance.test
10

Exceptions

Exceptions in Python are handled with try-except [exceptionname] blocks:

def fnExcept():
    try:
        # Division by zero raises an exception
        10 / 0
    except ZeroDivisionError:
        print "Oops, invalid."

>>> fnExcept()
Oops, invalid.

Importing

External libraries are used with the import [libname] keyword. You can also use from [libname] import [funcname] for individual functions. Here is an example:

import random
from time import clock

intRandom = random.randint(1, 100)
>>> print intRandom
64

File I/O

Python has a wide array of libraries built in. As an example, here is how serializing (converting data structures to strings using the pickle library) with file I/O is used:

import pickle
lstList = ["This", "is", 4, 13327]
# Open the file C:\file.dat for writing. The letter r before the
# filename string is used to prevent backslash escaping.
flFile = file(r"C:\file.dat", "w")
pickle.dump(lstList, flFile)
flFile.close()

flFile = file(r"C:\file.txt", "w")
flFile.write("This is a sample string")
flFile.close()

flFile = file(r"C:\file.txt")
>>> print flFile.read()
This is a sample string
flFile.close()

# Open the file for reading.
flFile = file(r"C:\file.dat")
lstLoaded = pickle.load(flFile)
flFile.close()
>>> print lstLoaded
['This', 'is', 4, 13327]

Miscellaneous

* Conditions can be chained. 1 < a < 3 checks that a is both less than 3 and more than 1.
* You can use del to delete variables or items in arrays.
* List comprehensions provide a powerful way to create and manipulate lists. They consist of an expression followed by a for clause followed by zero or more if or for clauses, like so:

lst1 = [1, 2, 3]
lst2 = [3, 4, 5]
>>> print [x * y for x in lst1 for y in lst2]
[3, 4, 5, 6, 8, 10, 9, 12, 15]
>>> print [x for x in lst1 if 4 > x > 1]
[2, 3]
del lst1[0]
>>> print lst1
[2, 3]
del lst1

Epilogue

This tutorial is not meant to be an exhaustive list of all (or even a subset) of Python. Python has a vast array of libraries and much more functionality, which you will have to discover through other means, such as the excellent online book Dive into Python. I hope I have made your transition to Python easier. Please leave comments if you believe there is something that could be improved or added.

Update: This article has been Dugg. Please post a comment here if there is anything else you would like to see (classes, error handling, anything).

Posted by ne12abaa3python at 2:07 AM
