Thursday 20 December 2012

Mypy Development Update #1

The mypy project has been progressing smoothly in the two weeks or so since the source release.

Latest changes:

  • I added a lot of content to the mypy language overview. It now covers more language features, explains common issues encountered when using static typing and describes the translation process to Python in some detail.
  • The wiki now contains instructions for adding support for additional Python modules by creating library stubs.
  • There have been several other updates to the wiki. It's starting to be a useful tool for developers and users.
  • Several bugs in the mypy implementation have been fixed, and the type checker now supports type inference for lambdas. Type inference for generic functions such as map has also improved. Code like this now works as expected:
        print(list(map(str, [1, 2, 3])))
        
  • Work on the C back end has begun. I started porting some 2000 lines of code from my earlier Alore-to-Java compiler prototype to mypy. It's still going to take a few more days to port the code. If everything goes as planned, we should be able to compile some simple mypy code to reasonably efficient native code in 3 or 4 weeks, perhaps.
  • Even though the development focus is now on the C back end, I will also continue improving the type checker and the mypy-to-Python translator. One of the important milestones will be being able to port some Python standard library modules to static typing without too much effort.
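
To make the lambda and map inference mentioned in the list above a little more concrete, here is a minimal sketch. It assumes the annotation syntax of present-day Python and the typing module rather than the syntax the mypy prototype accepts, and apply_twice is a made-up helper used only for illustration.

```python
from typing import Callable, List


def apply_twice(f: Callable[[int], int], x: int) -> int:
    """Apply f to x two times (hypothetical helper for illustration)."""
    return f(f(x))


# The lambda has no annotations of its own. A type checker can infer that
# n is an int from the Callable[[int], int] parameter it is passed to.
doubled_twice = apply_twice(lambda n: n * 2, 3)

# map is generic: from map(str, [1, 2, 3]) a checker can infer that the
# result yields str values, so the list below is a list of str.
strings: List[str] = list(map(str, [1, 2, 3]))

print(doubled_twice)
print(strings)
```

In both cases no annotation appears at the call site; the types come from the surrounding context.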

I'm going to continue posting periodic updates like this that highlight the latest developments in the project.

Friday 14 December 2012

Friday 7 December 2012

Source Code Released

Mypy source code is available on GitHub:

https://github.com/JukkaL/mypy

Clone the repo and give it a try! Currently the mypy implementation lets you mix static types and dynamic types and translate mypy programs to readable Python. Type annotations and casts are treated as comments when translating to Python. As such there is no performance boost yet. The current prototype supports a useful but somewhat limited subset of Python features (library support is still limited).
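
As a rough analogy for "annotations and casts are treated as comments", the sketch below uses present-day Python, where annotations and typing.cast likewise have no effect at run time. This is an assumption made for illustration, not the output of the mypy-to-Python translator, and add is a hypothetical function.

```python
from typing import cast


def add(a: int, b: int) -> int:
    """Annotated function; the annotations are not enforced at run time."""
    return a + b


# The interpreter ignores the annotations, so passing strings still "works".
# A static type checker would report this call as an error.
joined = add("a", "b")

# cast() only informs the type checker. At run time it returns its
# argument unchanged, without any conversion or check.
value = cast(int, "not an int")

print(joined)
print(value)
```

Because nothing is checked or specialized at run time, the translated program runs at ordinary Python speed.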

There is also an issue tracker for reporting bugs.

Why Mypy Can Be More Efficient Than Python

Several projects have tried to fix the inefficiency of CPython, with varying degrees of success. The most notable of these is PyPy, which is a JIT compiler for Python (among other things). Mypy is different from the previous projects in several ways, and I believe that these differences are the keys to understanding why I chose the mypy approach.

Some projects base their work on the CPython VM; examples include Cython and Nuitka. These have two major problems:

  1. The GIL (Global Interpreter Lock) is deeply ingrained in the CPython implementation. There seems to be no way to get rid of it without either actually degrading performance further or rearchitecting a lot of the VM and potentially breaking backward compatibility.
  2. The CPython VM was designed to be easy to develop, not to have high performance. Again, it is practically impossible to improve performance significantly without major rewrites and breaking compatibility, as this was not a goal in the original architecture. Mind you, Python is a very useful language and widely used, so I'm not implying that the original goals were somehow wrong; only that some of the goals are different from what I and many other people want.

Mypy will not have these problems since it uses a fresh VM implementation. These issues are not exceedingly difficult to solve if they are taken into consideration early enough in development. Still, a new VM alone does not make it easy to get good performance or to get rid of the GIL effectively.

PyPy takes a different approach to implementing Python. PyPy has reimplemented all the VM infrastructure and it has a new interpreter and a JIT compiler. However, PyPy aims at almost perfect compatibility with CPython: it holds the Python semantics as almost sacred (different Python implementations actually differ in semantics in important ways; a good example is reference counting versus garbage collection). Even though I have a great deal of respect for the PyPy project, their approach still has major problems that the project has not, in my opinion, solved in a satisfactory way, and that seem to be fundamentally difficult to solve with their chosen constraints:

  1. The GIL is actually deeply ingrained in Python semantics, not only in the implementation. To be fully compatible with CPython, a parallel Python implementation has to effectively introduce locking at a very fine granularity (Discussion of GIL in the Python documentation). By "effectively" I mean that these locking semantics need to be at least emulated. PyPy is in the process of adopting Software Transactional Memory (STM) for this (blog post). However, Armin Rigo wrote that code running with STM is at least 2x slower than sequential code, which I find unacceptable. Besides, to me it's still not obvious whether STM is a good match for many programs: STM is still an active research area, and it has not seen wide mainstream adoption yet.
  2. The dynamism of Python ("almost everything is a dictionary") makes it difficult to compile Python to efficient native code even without the problem of the GIL, and this especially affects heavily object-oriented programs. PyPy tackles it by using a trace compiler, but this often results in very slow program "warm-up" (sometimes several minutes!) and high memory usage. In my admittedly small experiments with PyPy, a program that runs for a couple of seconds ran almost fully interpreted (according to the PyPy profiler). As the PyPy interpreter is about twice as slow as the CPython interpreter, the program ran about 2x as slowly with PyPy as with CPython!
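
The "almost everything is a dictionary" remark above can be seen directly in plain Python. The following is a small sketch; Point is a hypothetical class used only for illustration.

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def norm1(self):
        return abs(self.x) + abs(self.y)


p = Point(1, -2)

# Instance attributes live in a per-object dictionary.
instance_attrs = dict(p.__dict__)

# Methods live in the class namespace and are looked up by name on each call.
has_method = "norm1" in Point.__dict__

# New attributes can appear at any time, either through normal assignment
# or by writing to the instance dictionary directly.
p.z = 5
p.__dict__["w"] = 7

print(instance_attrs, has_method, p.w)
```

Since the set of attributes and methods can change at any point, a compiler cannot simply assume a fixed object layout or a fixed target for a method call.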

Mypy deals with these problems in three ways. These are key to why I believe that mypy can achieve its goals. These are also key to why mypy will never be 100% compatible with Python. In my opinion this is a fair tradeoff, and in practice I believe that it is almost a necessary tradeoff in order to make the performance of Python competitive with other languages such as Java or C#, which don't have the above issues:

  1. Mypy will make more things immutable by default than Python. In particular, classes are immutable unless otherwise specified (analogous to extension or C classes in CPython). This reduces the need for fine-grained locking a lot, and it also makes it possible to optimize many operations such as method lookups in ways that are difficult to achieve with Python semantics.
  2. Mypy does not guarantee implicit locking at boundaries corresponding to CPython bytecodes. Due to the additional immutability mentioned above, often this actually does not make any practical difference for programs. For example, dictionaries have no implicit locking. However, relying on such low-level implementation details for implicit synchronization in Python programs is begging for trouble. The implicit locking is defined in terms of the CPython bytecode format, and not many programmers know or even want to know into what kind of bytecode their code compiles.
  3. Mypy will support static typing with reasonable runtime soundness guarantees (the soundness properties will sometimes be stronger than those of Java, sometimes similar). These allow a mypy implementation to get rid of the final stumbling block of running Python programs efficiently: dynamic typing. Because of points 1 and 2 above, even dynamically typed mypy code can be faster than Python, but for close to optimal performance programs should use static typing at least in the hot spots.
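
To illustrate the point about bytecode boundaries, the sketch below uses the standard dis module to list the instructions CPython generates for a one-line increment. The exact opcode names vary between CPython versions, so treat the printed list as version-dependent; increment and counter are hypothetical names.

```python
import dis

counter = 0


def increment():
    """A single source line that is not a single bytecode instruction."""
    global counter
    counter += 1


# Collect the opcode names CPython generated for increment().
ops = [ins.opname for ins in dis.get_instructions(increment)]
print(ops)

# The global is loaded and stored by separate instructions. Under the GIL,
# a thread switch may happen between instructions, so two threads calling
# increment() concurrently can lose an update.
load_index = ops.index("LOAD_GLOBAL")
store_index = ops.index("STORE_GLOBAL")
print(load_index < store_index)
```

Whether an operation appears atomic therefore depends on how it happens to compile, which is exactly the kind of detail most programmers never look at.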

So the net result is that mypy can support efficient ahead-of-time compilation with fast program startup, and GIL-free efficient threading, with overhead comparable to that of statically typed languages with safe runtime semantics. Mypy can still support a lot of the dynamism of Python; however, it is opt-in. Unlike Python, you only pay for the dynamism that you use. You will be able to declare a method mutable and do monkey patching, but most programs hardly ever do this, and they don't have to pay a hefty performance penalty for merely having the option of doing so.
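
For readers unfamiliar with monkey patching, here is a short plain-Python sketch of what it looks like, followed by __slots__ as a loose analogy for giving up some dynamism. The analogy is my assumption for illustration only; it is not how mypy classes will work, and Greeter and Fixed are hypothetical classes.

```python
class Greeter:
    def greet(self):
        return "hello"


g = Greeter()
before = g.greet()


def shout(self):
    return "HELLO"


# Monkey patching: replace a method on the class at run time. Existing
# instances see the change, so every call must look the method up again.
Greeter.greet = shout
after = g.greet()


class Fixed:
    # __slots__ removes the per-instance dictionary, so the set of
    # instance attributes is fixed when the class is defined.
    __slots__ = ("value",)

    def __init__(self, value):
        self.value = value


f = Fixed(1)
try:
    f.other = 2  # not declared in __slots__
    rejected = False
except AttributeError:
    rejected = True

print(before, after, rejected)
```

In Python the first style is always available and always paid for; the idea described above is to make it something a program asks for explicitly.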

Thursday 6 December 2012

Source Code Release on Friday

I've been busy implementing new Python features and fixing bugs this week. The type checker now supports a pretty good Python subset, even though some important functionality is still missing. A week or so ago almost every new piece of real Python code that I adapted to static typing unearthed a bug or two, but now things are significantly better (though there will be bugs, trust me).

More importantly, I believe that mypy is starting to be good enough for experimentation by outside developers. I've decided that the source release will be tomorrow (Friday), barring something unforeseen such as a December blizzard that leaves me without electricity. ;-)

Remember that the current implementation translates mypy code to Python, so there is no performance boost. However, the current implementation should give you a good idea of what it is like to program with mypy and static typing (though obviously this is just a prototype and the experience will get better!). I'm starting to shift development efforts towards the native code compiler, which is the really interesting bit.

Tuesday 4 December 2012

First Post

Several people have asked me about an easy way of keeping track of updates to the mypy project. I will use this blog to periodically post about what's happening in mypy development. I am also planning to write more technical posts about various language features, design issues and other topics that might be of interest to the readership. Also feel free to leave comments and ask questions about any topic related to the blog posts.