Recently (well, in a loose sense anyway) I had the need to build a document bank in PHP for a client at Mosaic. It was a fairly involved application with various public and private APIs for integration into the client's network of websites.
The core PHP code was written on top of the Agavi framework and various PHP libraries for extracting text and metadata from documents. One of the major features the client required was for the system to detect similar files, to prevent unintentional duplicates making it into the document bank.
The idea was that this document bank would be the one central resource for all of the documents written and managed by the organisation. Duplicates or near-duplicates would of course make this a pointless exercise. So I turned to Stack Overflow for some pointers, but came up empty.
After some research and much searching of the web I came across an open source package called ssdeep, written by Jesse Kornblum. I found it through reading his research paper, "Identifying almost identical files using context triggered piecewise hashing".
ssdeep is based upon work by Andrew Tridgell of Samba fame, who produced spamsum and the basis of the mathematics behind ssdeep. In short, ssdeep computes a fuzzy hash (a signature) for each file, and comparing two signatures reveals homologous files: files that share long runs of identical content even though they differ elsewhere.
Although ssdeep was originally intended to be used for malware detection, it is equally suited to the more mundane task of detecting duplicate documents.
With this discovery I immediately began creating a prototype version written in basic PHP that would serve as a wrapper around the ssdeep binary. I have, by request, made this code public, but it is a pretty old hack and I would not recommend using it.
As I got this prototype up and running I began to see how powerful ssdeep was, but with one small caveat - it works best on files above 4KB as noted in an erroneous bug report on the ssdeep package in PECL. In my application this was fine as it was handling large PDF and Word documents for the most part.
Soon I became aware that there was an API for the ssdeep package that I could build on to create a PHP extension. So I spent quite some time figuring out how to actually write a PHP extension from various sources, and then even more time looking into the autoconf build process.
If you are interested in writing your own extension I have documented my resources previously on this blog.
After some annoying errors with baffling outputs I finally had my extension written, building and tested. Pretty soon it was on a production box and working like a dream on thousands of documents.
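To give a feel for how the extension gets used for duplicate detection, here is a minimal sketch rather than the code from the actual application. It relies on ssdeep_fuzzy_compare() from the extension, which returns a match score from 0 to 100 for two signatures. The function name findNearDuplicates, the "document id => signature" array shape and the threshold of 80 are all made up for illustration. The comparator can be swapped out, which is handy for trying the logic on a machine without the extension installed.

```php
<?php
declare(strict_types=1);

/**
 * Return the stored documents whose signature scores at or above $threshold
 * against the candidate signature, best match first.
 *
 * Hypothetical helper: the name, the array shape and the default threshold
 * are assumptions for this sketch, not part of the ssdeep extension.
 *
 * @param string        $candidateHash ssdeep signature of the new upload
 * @param array         $knownHashes   document id => stored ssdeep signature
 * @param int           $threshold     minimum match score (0-100) to report
 * @param callable|null $compare       comparator taking two signatures and
 *                                     returning an int score; defaults to the
 *                                     extension's ssdeep_fuzzy_compare()
 * @return array document id => match score, highest score first
 */
function findNearDuplicates(
    string $candidateHash,
    array $knownHashes,
    int $threshold = 80,
    ?callable $compare = null
): array {
    // Fall back to the extension function when no comparator is supplied.
    $compare = $compare ?? 'ssdeep_fuzzy_compare';

    $matches = [];
    foreach ($knownHashes as $id => $hash) {
        $score = $compare($candidateHash, $hash);
        if ($score >= $threshold) {
            $matches[$id] = $score;
        }
    }

    // Sort by score, descending, keeping the document ids as keys.
    arsort($matches);

    return $matches;
}
```

With the extension loaded, the candidate signature would come from ssdeep_fuzzy_hash_filename() on the uploaded file (or ssdeep_fuzzy_hash() on a string), and the stored signatures from wherever the application keeps them. What counts as "too similar" depends on the documents, so the threshold is something to tune rather than a recommended value.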
Now I wanted to share the code for others to use, so initially I hosted it all on GitHub, but I soon realised it would get far more exposure if it were included in PECL. I also wanted to have ssdeep properly documented in the main PHP manual to further promote the extension.
This did look unlikely in the beginning, as the ssdeep package is licensed under the GPL, which cannot be accepted into PECL due to its viral nature. Thankfully, after contacting Jesse it became clear that the original work by Tridgell had been dual-licensed and he could therefore grant me an exemption from the GPL for the purposes of PECL.
After completing the application process, accepting code reviews and jumping the legal hurdles I was finally ready to publish my first PECL extension! I began building the PECL extension locally and installing it on as many machines as possible through the PECL method.
Thankfully it all went to plan, apart from a bug I discovered in Pyrus, and I managed to get a release up. Next was the process of documenting it all for PHP. Suffice it to say, it's not as easy as it sounds. (Two spaces of indent, not four!)
In the end, though, with thanks to Pierre, Johannes, Gustavo and of course Jesse, I had released my first extension into the wild.
There you have it. That is how a PHP extension is born and merged into PECL.