Friday, September 5, 2014

Python Pitfalls

#1: Misusing expressions as defaults for function arguments

Python allows you to specify that a function argument is optional by providing a default value for it. While this is a great feature of the language, it can lead to some confusion when the default value is mutable. For example, consider this Python function definition:
>>> def foo(bar=[]):        # bar is optional and defaults to [] if not specified
...    bar.append("baz")    # but this line could be problematic, as we'll see...
...    return bar
A common mistake is to think that the optional argument will be set to the specified default expression each time the function is called without supplying a value for the optional argument. In the above code, for example, one might expect that calling foo() repeatedly (i.e., without specifying a bar argument) would always return 'baz', since the assumption would be that each time foo() is called (without a bar argument specified) bar is set to [] (i.e., a new empty list).
But let’s look at what actually happens when you do this:
>>> foo()
["baz"]
>>> foo()
["baz", "baz"]
>>> foo()
["baz", "baz", "baz"]
Huh? Why did it keep appending the default value of "baz" to an existing list each time foo() was called, rather than creating a new list each time?
The answer is that the default value for a function argument is only evaluated once, at the time that the function is defined. Thus, the bar argument is initialized to its default (i.e., an empty list) only when foo() is first defined, but then calls to foo() (i.e., without a bar argument specified) will continue to use the same list to which bar was originally initialized.
FYI, a common workaround for this is as follows:
>>> def foo(bar=None):
...    if bar is None:  # or if not bar:
...        bar = []
...    bar.append("baz")
...    return bar
...
>>> foo()
["baz"]
>>> foo()
["baz"]
>>> foo()
["baz"]

#2: Using class variables incorrectly

Consider the following example:
>>> class A(object):
...     x = 1
...
>>> class B(A):
...     pass
...
>>> class C(A):
...     pass
...
>>> print A.x, B.x, C.x
1 1 1
Makes sense.
>>> B.x = 2
>>> print A.x, B.x, C.x
1 2 1
Yup, again as expected.
>>> A.x = 3
>>> print A.x, B.x, C.x
3 2 3
What the $%#!&?? We only changed A.x. Why did C.x change too?
In Python, class variables are internally handled as dictionaries and follow what is often referred to as Method Resolution Order (MRO). So in the above code, since the attribute x is not found in class C, it will be looked up in its base classes (only A in the above example, although Python supports multiple inheritance). In other words, C doesn’t have its own x property, independent of A. Thus, references to C.x are in fact references to A.x.

#3: Specifying parameters incorrectly for an exception block

Suppose you have the following code:
>>> try:
...     l = ["a", "b"]
...     int(l[2])
... except ValueError, IndexError:  # To catch both exceptions, right?
...     pass
...
Traceback (most recent call last):
  File "", line 3, in 
IndexError: list index out of range
The problem here is that the except statement does not take a list of exceptions specified in this manner. Rather, In Python 2.x, the syntax except Exception, e is used to bind the exception to the optional second parameter specified (in this case e), in order to make it available for further inspection. As a result, in the above code, the IndexError exception is not being caught by the except statement; rather, the exception instead ends up being bound to a parameter named IndexError.
The proper way to catch multiple exceptions in an except statement is to specify the first parameter as a tuple containing all exceptions to be caught. Also, for maximum portability, use the as keyword, since that syntax is supported by both Python 2 and Python 3:
>>> try:
...     l = ["a", "b"]
...     int(l[2])
... except (ValueError, IndexError) as e:  
...     pass
...
>>>

#4: Misunderstanding Python scope rules

Python scope resolution is based on what is known as the LEGB rule, which is shorthand for Local, Enclosing, Global, Built-in. Seems straightforward enough, right? Well, actually, there are some subtleties to the way this works in Python. Consider the following:
>>> x = 10
>>> def foo():
...     x += 1
...     print x
...
>>> foo()
Traceback (most recent call last):
  File "", line 1, in 
  File "", line 2, in foo
UnboundLocalError: local variable 'x' referenced before assignment
What’s the problem?
The above error occurs because, when you make an assignment to a variable in a scope, that variable is automatically considered by Python to be local to that scope and shadows any similarly named variable in any outer scope.
Many are thereby surprised to get an UnboundLocalError in previously working code when it is modified by adding an assignment statement somewhere in the body of a function. (You can read more about this here.)
It is particularly common for this to trip up developers when using lists. Consider the following example:
>>> lst = [1, 2, 3]
>>> def foo1():
...     lst.append(5)   # This works ok...
...
>>> foo1()
>>> lst
[1, 2, 3, 5]

>>> lst = [1, 2, 3]
>>> def foo2():
...     lst += [5]      # ... but this bombs!
...
>>> foo2()
Traceback (most recent call last):
  File "", line 1, in 
  File "", line 2, in foo
UnboundLocalError: local variable 'lst' referenced before assignment
Huh? Why did foo2 bomb while foo1 ran fine?
The answer is the same as in the prior example, but is admittedly more subtle. foo1 is not making an assignment to lst, whereas foo2 is. Remembering that lst += [5] is really just shorthand for lst = lst + [5], we see that we are attempting to assign a value to lst (therefore presumed by Python to be in the local scope). However, the value we are looking to assign to lst is based on lst itself (again, now presumed to be in the local scope), which has not yet been defined. Boom.

#5: Modifying a list while iterating over it 

The problem with the following code should be fairly obvious:
>>> odd = lambda x : bool(x % 2)
>>> numbers = [n for n in range(10)]
>>> for i in range(len(numbers)):
...     if odd(numbers[i]):
...         del numbers[i]  # BAD: Deleting item from a list while iterating over it
...
Traceback (most recent call last):
     File "", line 2, in 
IndexError: list index out of range
Deleting an item from a list or array while iterating over it is a faux pas well known to any experienced software developer. But while the example above may be fairly obvious, even advanced developers can be unintentionally bitten by this in code that is much more complex.
Fortunately, Python incorporates a number of elegant programming paradigms which, when used properly, can result in significantly simplified and streamlined code. A side benefit of this is that simpler code is less likely to be bitten by the accidental-deletion-of-a-list-item-while-iterating-over-it bug. One such paradigm is that of list comprehensions. Moreover, list comprehensions are particularly useful for avoiding this specific problem, as shown by this alternate implementation of the above code which works perfectly:
>>> odd = lambda x : bool(x % 2)
>>> numbers = [n for n in range(10)]
>>> numbers[:] = [n for n in numbers if not odd(n)]  # ahh, the beauty of it all
>>> numbers
[0, 2, 4, 6, 8]

#6: Confusing how Python binds variables in closures

Considering the following example:
>>> def create_multipliers():
...     return [lambda x : i * x for i in range(5)]
>>> for multiplier in create_multipliers():
...     print multiplier(2)
...
You might expect the following output:
0
2
4
6
8
But you actually get:
8
8
8
8
8
Surprise!
This happens due to Python’s late binding behavior which says that the values of variables used in closures are looked up at the time the inner function is called. So in the above code, whenever any of the returned functions are called, the value of i is looked up in the surrounding scope at the time it is called (and by then, the loop has completed, so i has already been assigned its final value of 4).
The solution to this is a bit of a hack:
>>> def create_multipliers():
...     return [lambda x, i=i : i * x for i in range(5)]
...
>>> for multiplier in create_multipliers():
...     print multiplier(2)
...
0
2
4
6
8
VoilĂ ! We are taking advantage of default arguments here to generate anonymous functions in order to achieve the desired behavior. Some would call this elegant. Some would call it subtle. Some hate it. But if you’re a Python developer, it’s important to understand in any case.

#7: Creating circular module dependencies

Let’s say you have two files, a.py and b.py, each of which imports the other, as follows:
In a.py:
import b

def f():
    return b.x
 
print f()
And in b.py:
import a

x = 1

def g():
    print a.f()
First, let’s try importing a.py:
>>> import a
1
Worked just fine. Perhaps that surprises you. After all, we do have a circular import here which presumably should be a problem, shouldn’t it?
The answer is that the mere presence of a circular import is not in and of itself a problem in Python. If a module has already been imported, Python is smart enough not to try to re-import it. However, depending on the point at which each module is attempting to access functions or variables defined in the other, you may indeed run into problems.
So returning to our example, when we imported a.py, it had no problem importing b.py, since b.py does not require anything from a.py to be defined at the time it is imported. The only reference in b.py to a is the call to a.f(). But that call is in g() and nothing in a.py or b.py invokes g(). So life is good.
But what happens if we attempt to import b.py (without having previously imported a.py, that is):
>>> import b
Traceback (most recent call last):
     File "", line 1, in <module>
     File "b.py", line 1, in <module>
    import a
     File "a.py", line 6, in <module>
 print f()
     File "a.py", line 4, in f
 return b.x
AttributeError: 'module' object has no attribute 'x'
Uh-oh. That’s not good! The problem here is that, in the process of importing b.py, it attempts to import a.py, which in turn calls f(), which attempts to access b.x. But b.x has not yet been defined. Hence the AttributeError exception.
At least one solution to this is quite trivial. Simply modify b.py to import a.py within g():
x = 1

def g():
    import a # This will be evaluated only when g() is called
    print a.f()
No when we import it, everything is fine:
>>> import b
>>> b.g()
1 # Printed a first time since module 'a' calls 'print f()' at the end
1 # Printed a second time, this one is our call to 'g'

#8: Name clashing with Python Standard Library modules

One of the beauties of Python is the wealth of library modules that it comes with “out of the box”. But as a result, if you’re not consciously avoiding it, it’s not that difficult to run into a name clash between the name of one of your modules and a module with the same name in the standard library that ships with Python (for example, you might have a module named email.py in your code, which would be in conflict with the standard library module of the same name).
This can lead to gnarly problems, such as importing another library which in turns tries to import the Python Standard Library version of a module but, since you have a module with the same name, the other package mistakenly imports your version instead of the one within the Python Standard Library. This is where bad stuff happens.
Care should therefore be exercised to avoid using the same names as those in the Python Standard Library modules. It’s way easier for you to change the name of a module within your package than it is to file a Python Enhancement Proposal (PEP) to request a name change upstream and to try and get that approved.

#9: Failing to address differences between Python 2 and Python 3

Consider the following file foo.py:
import sys

def bar(i):
    if i == 1:
        raise KeyError(1)
    if i == 2:
        raise ValueError(2)

def bad():
    e = None
    try:
        bar(int(sys.argv[1]))
    except KeyError as e:
        print('key error')
    except ValueError as e:
        print('value error')
    print(e)

bad()
On Python 2, this runs fine:
$ python foo.py 1
key error
1
$ python foo.py 2
value error
2
But now let’s give it a whirl on Python 3:
$ python3 foo.py 1
key error
Traceback (most recent call last):
  File "foo.py", line 19, in 
    bad()
  File "foo.py", line 17, in bad
    print(e)
UnboundLocalError: local variable 'e' referenced before assignment
What has just happened here? The “problem” is that, in Python 3, the exception object is not accessible beyond the scope of the except block. (The reason for this is that, otherwise, it would keep a reference cycle with the stack frame in memory until the garbage collector runs and purges the references from memory. More technical detail about this is available here).
One way to avoid this issue is to maintain a reference to the exception object outside the scope of the except block so that it remains accessible. Here’s a version of the previous example that uses this technique, thereby yielding code that is both Python 2 and Python 3 friendly:
import sys

def bar(i):
    if i == 1:
        raise KeyError(1)
    if i == 2:
        raise ValueError(2)

def good():
    exception = None
    try:
        bar(int(sys.argv[1]))
    except KeyError as e:
        exception = e
        print('key error')
    except ValueError as e:
        exception = e
        print('value error')
    print(exception)

good()
Running this on Py3k:
$ python3 foo.py 1
key error
1
$ python3 foo.py 2
value error
2
Yippee!

#10: Misusing the __del__ method

Let’s say you had this in a file called mod.py:
import foo

class Bar(object):
        ...
    def __del__(self):
        foo.cleanup(self.myhandle)
And you then tried to do this from another_mod.py:
import mod
mybar = mod.Bar()
You’d get an ugly AttributeError exception.
Why? Because, as reported here, when the interpreter shuts down, the module’s global variables are all set to None. As a result, in the above example, at the point that __del__ is invoked, the name foo has already been set to None.
A solution would be to use atexit.register() instead. That way, when your program is finished executing (when exiting normally, that is), your registered handlers are kicked off before the interpreter is shut down.
With that understanding, a fix for the above mod.py code might then look something like this:
import foo
import atexit

def cleanup(handle):
    foo.cleanup(handle)


class Bar(object):
    def __init__(self):
        ...
        atexit.register(cleanup, self.myhandle)
This implementation provides a clean and reliable way of calling any needed cleanup functionality upon normal program termination. Obviously, it’s up to foo.cleanup to decide what to do with the object bound to the name self.myhandle, but you get the idea. But while were at it..

#11: __del__ Can't be Trusted

The mere existence of this method makes objects that are part of a reference cycle uncollectable by Python's garbage collector and could lead to memory leaks.
Use a weakref.ref object with a callback to run code when an object is being removed instead.
See also Python gc module documentation

#12: Using os.system or os.popen instead of subprocess

Starting with the non-controversial: Anything that has been marked deprecated should be avoided. The deprecation warning should have instructions with safe alternatives you can use.
Some of the most frequent offenders are parts of the language that make it difficult to safely call other programs:
os.system()

os.popen()

import commands
We have the excellent subprocess module for these now, use it.

#13: Not using duck typing

Explicitly checking the type of a parameter passed to a function breaks the expected duck-typing convention of Python. Common type checking includes:
isinstance(x, X)

type(x) == X
With type() being the worse of the two.
If you must have different behaviour for different types of objects passed, try treating the object as the first data type you expect, and catching the failure if that type wasn't that type, and then try the second. This allows users to create objects that are close enough to the types you expect and still use your code.
See also isinstance() considered harmful.

#14: Using pickle to serialize data

import pickle # or cPickle
Objects serialized with pickle are tied to their implementations in the code at that time. Restoring an object after an underlying class has changed will lead to undefined behaviour. Unserializing pickled data from an untrusted source can lead to remote exploits. The pickled data itself is opaque binary that can't be easily edited or reviewed.
This leaves only one place where pickle makes sense -- short lived data being passed between processes, just like what the multiprocessing module does.
Anywhere else use a different format. Use a database or use JSON with a well-defined structure. Both are restricted to simple data types and are easily verified or updated outside of your Python script. See also Alex Gaynor's presentation on pickle.

#15: Misusing demonstration modules

Many people are drawn to these modules because they are part of Python's standard library. Some people even try to do serious work with them.
import asyncore

import asynchat

import SimpleHTTPServer
The former resembles a reasonable asynchronous library, until you find out there are no timers. At all. Use Twisted or Tornado instead.
The latter makes for a neat demo by letting giving you a web server in your pocket with the one command python -m SimpleHTTPServer. But this code was never intended for production use, and certainly not designed to be run as a public web server. There are plenty of real, hardened web servers out there that will run your python code as a WSGI script. Choose one of them instead.

#16: Using import array

import array

All the flexibility and ease of use of C arrays, now in Python!
If you really really need this you will know. Interfacing with C code in an extension module is one valid reason.
If you're looking for speed, try just using regular python lists with PyPy . Another good choice is NumPy for its much more capable arrays types.


#17: Split Personality

reload(x)
It looks like the code you just changed is there, except the old versions of everything is still there too. Objects created before the reload will still use the code as it was when they were created, leading to situations with interesting effects that are almost impossible to reproduce.
Just re-run your program. If you're debugging at the interactive prompt consider debugging with a small script and python -i instead.

#18: Copy is Almost Reasonable

import copy
The copy module is harmless enough when used on objects that you create and you fully understand. The problem is once you get in the habit of using it, you might be tempted to use it on objects passed to you by code you don't control.
Copying arbitrary objects is troublesome because you will often copy too little or too much. If this object has a reference to an external resource it's unclear what copying that even means. It can also easily lead to subtle bugs introduced into your code by a change outside your code.
If you need a copy of a list or a dict, use list() or dict()``because you can be sure what you will get after they are called.  ``copy(), however might return anything, and that should scare you.

#19: Admit You Always Hated It

if __name__ == '__main__':
This little wart has long been a staple of many python introductions. It lets you treat a python script as a module, or a module as a python script. Clever, sure, but it's better to keep your scripts and modules separate in the first place.
If you treat a module like a script, then something imports the module you're in trouble: now you have two copies of everything in that module.
I have used this trick to make running tests easier, but setuptools already provides a better hook for running tests. For scripts setuptools has an answer too, just give it a name and a function to call, and you're done.
My last criticism is that a single line of python should never be 10 alphanumeric characters and 13 punctuation characters. All those underscores are there as a warning that some special non-obvious language-related thing is going on, and it's not even necessary.
See also setuptools/distribute automatic script creation
and also PEP 366 pointed out by agentultra on HN

#20: Don't Emulate stdlib

It's in standard library, it must be well written, right?
May I present the implementation of namedtuple, which is really handy little class that used properly can significantly improve your code's readability.
def namedtuple(typename, field_names, verbose=False, rename=False):
    # Parse and validate the field names.  Validation serves two purposes,
    # generating informative error messages and preventing template injection attacks.
Wait, what? "preventing template injection attacks"?
This is followed by 27 lines of code that validates field_names. And then:
template = '''class %(typename)s(tuple):
   '%(typename)s(%(argtxt)s)' \n
   __slots__ = () \n
   _fields = %(field_names)r \n
   def __new__(_cls, %(argtxt)s):
       'Create new instance of %(typename)s(%(argtxt)s)'
       return _tuple.__new__(_cls, (%(argtxt)s)) \n
   @classmethod
   def _make(cls, iterable, new=tuple.__new__, len=len):
       'Make a new %(typename)s object from a sequence or iterable'
       result = new(cls, iterable)
       if len(result) != %(numfields)d:
       raise TypeError('Expected %(numfields)d arguments, got %%d' %% len(result))
       return result \n
   def __repr__(self):
       'Return a nicely formatted representation string'
       return '%(typename)s(%(reprtxt)s)' %% self \n
   def _asdict(self):
       'Return a new OrderedDict which maps field names to their values'
       return OrderedDict(zip(self._fields, self)) \n
   __dict__ = property(_asdict) \n
   def _replace(_self, **kwds):
       'Return a new %(typename)s object replacing specified fields with new values'
       result = _self._make(map(kwds.pop, %(field_names)r, _self))
       if kwds:
           raise ValueError('Got unexpected field names: %%r' %% kwds.keys())
       return result \n
   def __getnewargs__(self):
       'Return self as a plain tuple.  Used by copy and pickle.'
       return tuple(self) \n\n''' % locals()
Yes, that's a class definition in a big Python string, filled with variables from locals(). The result is then exec -ed in the right namespace, and some further magic is applied to "fix" copy() and pickle().
I believe this code was meant some sort of warning to people that would contribute code to Python -- something like "We make it look like we know what we're doing, but we're really just nuts" (love ya Raymond)
See also collections.py source code
and also an attempted fix pointed out by ffrinch on reddit

#21: Trying Too Hard

hasattr(obj, 'foo')
hasattr() has always been defined to swallow all exceptions, even ones you might be interested in (such as a KeyboardInterrupt) and turn them into a False return value. This interface just can't be fixed, use getattr() with a sentinel value instead.

#22: Off by One

'hello'.find('H')
str.find() and str.rfind() return -1 on failure. This can lead to some really hard to find bugs when combined with containers like strings that treat -1 as the last element. Use str.index() and str.rindex() instead.

No comments:

Post a Comment