Thursday, 31 January 2013

Mypy Native Code Back End: C vs LLVM

Mypy will initially use C as an intermediate step to compile to native code. I also seriously considered LLVM, and several people have recommended LLVM.

It was a tough decision; LLVM would probably be a good match as well. Here's my reasoning for choosing C:

  • C works everywhere. It's very stable and supported on older and more exotic systems as well.
  • A huge number of developers know C. If the back end uses C, more people will be able to help and debug problems. By contrast, LLVM is mostly used by specialists such as programming language implementors and researchers.
  • I know C very well. LLVM might have some problems or limitations that I'm not aware of yet, and these might bite me. LLVM is also fairly complex and takes time to learn.
  • We would probably have to implement mypy bindings to the LLVM API. This would have to be in C++, since the LLVM C API does not seem to be very well supported. The API is large, so this would probably take some effort (maybe a few weeks, maybe longer). We would also have to maintain these bindings. There are Python LLVM bindings, but they haven't been updated recently and I have no idea how complete and usable they are -- another unknown.
  • C is probably "efficient enough", at least initially. LLVM has some low-level features that could be useful, but I doubt the difference is large in practice.
  • LLVM is slightly lower level than C. This probably translates to more development work.
  • My VM / runtime support code is in C, so it's probably easier to integrate it with a C back end and debug it than when using LLVM.
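
To make "C as an intermediate step" concrete, here is a minimal sketch of a back end that emits C source text from a tiny expression tree. All names here (Num, Var, BinOp, emit_expr, emit_function) are hypothetical and chosen for illustration; this is not mypy's actual code generator, and it assumes every value is a C long.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Num:
    value: int


@dataclass
class Var:
    name: str


@dataclass
class BinOp:
    op: str          # assumed to be a C binary operator such as "+" or "*"
    left: "Expr"
    right: "Expr"


Expr = Union[Num, Var, BinOp]


def emit_expr(e: Expr) -> str:
    """Translate an expression tree into a C expression string."""
    if isinstance(e, Num):
        return str(e.value)
    if isinstance(e, Var):
        return e.name
    if isinstance(e, BinOp):
        # Always parenthesize so C operator precedence cannot change the meaning.
        return f"({emit_expr(e.left)} {e.op} {emit_expr(e.right)})"
    raise TypeError(f"unsupported node: {e!r}")


def emit_function(name: str, params: List[str], body: Expr) -> str:
    """Emit a C function that returns the value of a single expression."""
    args = ", ".join(f"long {p}" for p in params)
    return f"long {name}({args}) {{\n    return {emit_expr(body)};\n}}\n"


if __name__ == "__main__":
    tree = BinOp("+", Var("x"), BinOp("*", Var("y"), Num(2)))
    print(emit_function("f", ["x", "y"], tree))
```

The output is ordinary C that any developer can read and step through in a debugger, which is much of the appeal of this approach.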

LLVM would also have benefits:

  • LLVM is designed for use in VMs; C is slightly awkward for this purpose (but it works and has been used in many projects).
  • It would be fairly easy to support JIT/dynamic compilation with LLVM, but with C it would be a pain (e.g. running gcc in a subprocess).
  • An LLVM-based compiler would probably have faster compiles, since we wouldn't need the intermediate C parsing step. The code generator may also be faster. On the other hand, we can always use clang if the difference in code generation speed is large.
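
As a rough illustration of why dynamic compilation through C is a pain, here is a minimal sketch of the "run the compiler in a subprocess" approach. It assumes a Unix-like system with cc, gcc or clang on the PATH; the helper name compile_and_load and the file names are made up for this example. Every call pays for writing a file, starting a process, parsing the C and loading a shared library.

```python
import ctypes
import os
import shutil
import subprocess
import tempfile

# Generated C code would normally come from the back end; this is a fixed example.
C_SOURCE = "int add(int a, int b) { return a + b; }\n"


def compile_and_load(source: str, func_name: str):
    """Compile C source to a shared library and return the named function.

    Returns None if no C compiler is found (an assumption of this sketch).
    """
    compiler = next((c for c in ("cc", "gcc", "clang") if shutil.which(c)), None)
    if compiler is None:
        return None
    # The temporary directory is left in place; cleanup is omitted for brevity.
    workdir = tempfile.mkdtemp()
    src_path = os.path.join(workdir, "gen.c")
    lib_path = os.path.join(workdir, "gen.so")
    with open(src_path, "w") as f:
        f.write(source)
    # Spawn the C compiler as a separate process to build a shared library.
    subprocess.run([compiler, "-shared", "-fPIC", "-o", lib_path, src_path],
                   check=True)
    return getattr(ctypes.CDLL(lib_path), func_name)


if __name__ == "__main__":
    add = compile_and_load(C_SOURCE, "add")
    if add is not None:
        add.argtypes = [ctypes.c_int, ctypes.c_int]
        add.restype = ctypes.c_int
        print(add(2, 3))
```

With LLVM, the equivalent step would happen in-process, without temporary files or an external compiler.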

Implementing C generation (+ support code) is going to be a pretty small part of the entire mypy project, so rewriting it later to use LLVM is not a big deal. In the long term, LLVM is probably a better bet since we will want to support runtime code generation at some point.

In summary, I'm pretty sure that we can save several weeks of development time by using C initially, and most importantly, it reduces risks and uncertainty. Tackling two big uncertainties at the same time (a new programming language and an unfamiliar back end technology) would be asking for trouble. However, developing an alternative LLVM back end would be a useful project for anybody interested in LLVM and mypy internals.
