Sunday, July 15, 2012

Binary data processing in JavaScript

A few days ago, I came across an interesting post about Speaker.js on Hacker News. Speaker.js is a client-side library that enables text-to-speech using only JavaScript and HTML. The library makes no server-side calls to do the conversion. This got me thinking about the techniques used in JavaScript to process large amounts of binary data.

Old-school JavaScript doesn’t provide any support for storing binary data. Traditionally, normal arrays were used to simulate binary arrays by storing a number in the range 0 to 255 in each element. Since every element is then an ordinary JavaScript number rather than a single byte, this technique is not suitable for applications that need to process large amounts of data. Then, with the introduction of the HTML canvas element, developers started using Canvas’ ‘ImageData’ to hold the binary data needed by their applications. Canvas ImageData is still one of the most widely used techniques for dealing with binary data.

If you are following the developments in HTML5 standardization and web browsers, you are probably aware of “Typed Arrays”. JavaScript “Typed Arrays” provide a mechanism for accessing raw binary data much more efficiently. The rest of this post focuses on “Typed Arrays” and on performance metrics for all three techniques.
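For reference, the plain-array approach can be sketched roughly as follows. This is only an illustration: `writeByte` is a hypothetical helper, not something from Speaker.js or any library, and the masking step is one assumed way of keeping values in the 0 to 255 range.

```javascript
// Plain array used as a byte store: every element is an ordinary Number,
// so the code has to keep values in the 0-255 range itself.
var bytes = new Array(8);
for (var i = 0; i < bytes.length; i++) {
  bytes[i] = 0;
}

// Hypothetical helper: mask with 0xFF to emulate 8-bit wrap-around.
function writeByte(store, index, value) {
  store[index] = value & 0xFF;
}

writeByte(bytes, 0, 65);
writeByte(bytes, 1, 300); // 300 & 0xFF === 44
```

Nothing stops other code from storing a string or a float in the same array, which is part of why this approach is hard for an engine to optimize.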

Typed Arrays

JavaScript “Typed Arrays” provide a mechanism for accessing raw binary data much more efficiently. The specification defines two kinds of types: a buffer, which is a generic fixed-length chunk of binary data, and views, which are accessor types that allow access to the data stored within the buffer.

  • Buffer (implemented by ArrayBuffer). The ArrayBuffer is a data type used to represent a generic, fixed-length binary data buffer. You can't directly manipulate the contents of an ArrayBuffer; instead, you create an ArrayBufferView object, which represents the buffer in a specific format, and use that to read and write the contents of the buffer. The following line of code creates a 16-byte chunk of memory pre-initialized to 0. Note that you cannot read or write the data through the buffer variable itself.

var buffer = new ArrayBuffer(16);
  • View (implemented by ArrayBufferView and its subclasses). A view provides a context (a data type, a starting offset, and a number of elements) that turns the data into an actual typed array. Float32Array, Float64Array, Int8Array, Int16Array, Int32Array, Uint8Array, Uint16Array and Uint32Array are some of the available view classes. There is also a generic view, DataView, for reading and writing data in an ArrayBuffer. In the following lines of code, we create a view that treats the data in the buffer as an array of 32-bit signed integers. We can access the data in the buffer just like a normal array. It is possible to create multiple views on the same buffer. By combining a single buffer with multiple views of different types, starting at different offsets into the buffer, we can interact with complex data structures (such as data read from a structured file, WebGL data, etc.).

var int32View = new Int32Array(buffer);
for (var i = 0; i < int32View.length; i++) {
  int32View[i] = i * 2;
}
// 16-bit signed integer view on the same buffer. This is allowed.
var int16View = new Int16Array(buffer);
for (var i = 0; i < int16View.length; i++) {
  console.log("Entry " + i + ": " + int16View[i]);
}
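To illustrate the point about views of different types at different offsets, here is a minimal sketch. The 16-byte record layout (an id, a value, and eight tag bytes) is invented purely for this example; only the constructors and DataView methods are standard.

```javascript
// Hypothetical 16-byte record: the field layout is an assumption for illustration.
var record = new ArrayBuffer(16);

// Each view takes (buffer, byteOffset, elementCount).
var idView = new Uint32Array(record, 0, 1);     // bytes 0-3
var valueView = new Float32Array(record, 4, 1); // bytes 4-7
var tagView = new Uint8Array(record, 8, 8);     // bytes 8-15

idView[0] = 42;
valueView[0] = 1.5;
tagView[0] = 255;

// DataView reads and writes at arbitrary byte offsets with an explicit
// byte order; passing false as the last argument means big-endian.
var dataView = new DataView(record);
dataView.setUint16(10, 0x0102, false); // byte 10 = 0x01, byte 11 = 0x02
```

Because all four views share the same buffer, the DataView write is visible through `tagView[2]` and `tagView[3]`. Note that the byte offset passed to a typed view must be a multiple of its element size, which is why the Float32Array starts at byte 4.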

Browser Support

Performance Tests

Kanaka has written some test cases to measure the performance of these three techniques: normal arrays, ImageData and Typed Arrays. The test cases are hosted as part of his noVNC project on GitHub. I ran the same test cases on my MacBook Pro (2 GHz Intel quad-core i7, 4 GB 1333 MHz DDR3), and it turns out that Chrome performs much better than the other browsers. In Chrome, ‘Typed Arrays’ prove to be the most efficient technique for manipulating binary data. There are some drastic changes in these metrics compared to the tests run by Kanaka in April 2011. Test results averaged over 50 test iterations can be found here.

The Four Tests:

  • Create - For each test iteration, an array is created and then initialized to zero and this is repeated 2000 times.

  • Random read - For each test iteration, 5 million reads are issued to pseudo-random locations in an array.

  • Sequential read - For each test iteration, 5 million reads are issued sequentially to an array. The reads loop around to the beginning of the array when they reach the end of the array.

  • Sequential write - For each test iteration, 5 million updates are made sequentially to an array. The writes loop around to the beginning of the array when they reach the end of the array.
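The four tests can be sketched roughly as below. This is not Kanaka's actual test code: `timeIt`, `runTests`, the index generator, and the small sizes and counts in the usage lines are all assumptions chosen to keep the example short, and ImageData is left out because it needs a canvas.

```javascript
// Hypothetical harness: function names, sizes, and counts are illustrative.
function timeIt(fn) {
  var start = Date.now();
  fn();
  return Date.now() - start; // elapsed milliseconds
}

function runTests(createArray, size, ops, creates) {
  var results = {};
  var sink = 0; // accumulates reads so the loops have an observable result
  var i, n, pos;

  // Working array for the read and write tests, zeroed up front.
  var arr = createArray(size);
  for (i = 0; i < size; i++) { arr[i] = 0; }

  results.create = timeIt(function () {
    for (n = 0; n < creates; n++) {
      var a = createArray(size);
      for (i = 0; i < size; i++) { a[i] = 0; }
    }
  });

  results.randomRead = timeIt(function () {
    pos = 1;
    for (i = 0; i < ops; i++) {
      pos = (pos * 7919 + 13) % size; // cheap pseudo-random index
      sink += arr[pos];
    }
  });

  results.sequentialRead = timeIt(function () {
    for (i = 0; i < ops; i++) { sink += arr[i % size]; } // wraps at the end
  });

  results.sequentialWrite = timeIt(function () {
    for (i = 0; i < ops; i++) { arr[i % size] = i & 0xFF; }
  });

  results.sink = sink;
  return results;
}

// Usage with deliberately small numbers (the post used 2000 creates and 5 million operations).
var plainResults = runTests(function (n) { return new Array(n); }, 1024, 10000, 10);
var typedResults = runTests(function (n) { return new Uint8Array(n); }, 1024, 10000, 10);
```

Timings from a sketch like this are only meaningful when averaged over many runs, which is why the results in this post are averages over 50 iterations.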

Chrome is the only browser where “Typed Arrays” seem to perform well. If you want to use a standard technique, you can go with “Typed Arrays”. Otherwise, you will have to wait for the other browser vendors to improve the performance of “Typed Arrays”. More information about “Typed Arrays” can be found here.

Update: I have re-run the test cases after making the changes pointed out by mraleph (here):

  • Explicitly specifying the array size while creating normal arrays.
  • Instead of a single test_something function shared by all array types, creating separate functions for each array type.

It turns out that the performance of the other browsers improved significantly after this change. An interesting inference from this result is that the JS engines in Firefox, Safari and Opera do not seem to handle polymorphism well.
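The difference between the two versions can be sketched like this. The function names are hypothetical and the summing loop merely stands in for the real test bodies; the point is only that the shared function receives two different array types, while each separate function sees exactly one.

```javascript
// Shared function: it is called with both Array and Uint8Array arguments,
// so the engine has to handle more than one type at the same code location.
function sumShared(arr) {
  var total = 0;
  for (var i = 0; i < arr.length; i++) { total += arr[i]; }
  return total;
}

// Separate functions: each one only ever sees a single array type.
function sumPlain(arr) {
  var total = 0;
  for (var i = 0; i < arr.length; i++) { total += arr[i]; }
  return total;
}

function sumTyped(arr) {
  var total = 0;
  for (var i = 0; i < arr.length; i++) { total += arr[i]; }
  return total;
}

var plainData = new Array(256); // size given up front, as in the first change
var typedData = new Uint8Array(256);
for (var i = 0; i < 256; i++) {
  plainData[i] = i;
  typedData[i] = i;
}

var sharedTotals = [sumShared(plainData), sumShared(typedData)];
var separateTotals = [sumPlain(plainData), sumTyped(typedData)];
```

Both versions compute the same totals; only the way the engine can specialize the code differs.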

-- Varun

Saturday, July 7, 2012

Postel’s law and modern day browsers

Today, while I was skimming through the feeds from Hacker News, I came across this interesting post: “A file that’s both an acceptable HTML page and a JPEG image”. Apparently this URL (http://lcamtuf.coredump.cx/squirrel/) can be viewed as an HTML document when opened in a browser, and the same URL can be used as the source of an image element. Yes, it does work. As mentioned on the above page, there is no server-side trick involved. After inspecting the page for some time and trying out different tools in my arsenal, I figured out that this is related to Postel’s law.

What is Postel’s law?

Postel’s law, or the robustness principle, is a general design guideline for software: Be conservative in what you do, be liberal in what you accept from others (often reworded as "Be conservative in what you send, liberal in what you accept"). (Definition sourced from Wikipedia.) Modern browsers follow this principle to a great extent. Let me explain this with an example. Create an empty file and name it “something.html”. Add a single line of text without any HTML tags. Browsers will render the content even though there are no HTML tags. Now add html, head and body tags and leave them unclosed. Browsers render the page even though it is not well formed. In other words, browsers are liberal in what they accept for rendering.

The Squirrel page

Now, let’s use Postel’s law to understand how the Squirrel page works. Whether we request the URL as an HTML document or as an image, the same content is delivered to the browser (check the screenshot of the Chrome Network panel). The content sent from the server is actually an image with HTML contents embedded in it. The browser interprets the content depending on the context in which it is used.

Chrome Network panel showing the data transferred when the page is viewed as an HTML document and as an image

Page source showing junk characters

  • Open the page in a browser to view the HTML contents. If you view the page source, you will find a set of junk characters at the beginning of the file and some more content at the end of the file within HTML comments.
  • The initial content is outside any HTML tag. It corresponds to the headers of the image file and has to be outside any HTML tag for image parsers to recognize the file as an image. However, when we render the URL as an HTML document, the browser will try to render those junk characters even though they are outside any HTML tag. Hence, they are explicitly hidden on the HTML page with the help of CSS: body { visibility: hidden; }.
  • The actual image contents are placed inside an HTML comment block. Browsers safely ignore comments while rendering the page, whereas when the file is used as an image, the image parser understands those contents and decodes the image data.

Since browsers are liberal enough to interpret the content based on the context, the same URL can be used as an HTML document and as an image. Try saving the source and opening it in your favorite text editor; you will find the junk characters. Rename the saved file with a .jpeg extension and open it in a browser; the contents will be rendered as an image and not as HTML.
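One plausible way to get HTML into the header region of a JPEG is a JPEG comment (COM) segment. The sketch below is an assumption about how such a file could be laid out, not a confirmed reconstruction of the Squirrel page: `buildComSegment` is a hypothetical helper, the HTML string is made up, and the result is only the first few bytes of a file, not a decodable image.

```javascript
// Hypothetical helper: wrap ASCII text in a JPEG COM (comment) segment.
// Layout: marker bytes 0xFF 0xFE, then a 2-byte big-endian length that
// counts the two length bytes plus the payload, then the payload itself.
function buildComSegment(text) {
  var length = text.length + 2;
  if (length > 0xFFFF) {
    throw new Error("payload too large for one segment");
  }
  var segment = new Uint8Array(2 + length);
  var view = new DataView(segment.buffer);
  view.setUint8(0, 0xFF);
  view.setUint8(1, 0xFE);
  view.setUint16(2, length, false); // false = big-endian
  for (var i = 0; i < text.length; i++) {
    segment[4 + i] = text.charCodeAt(i) & 0xFF; // assumes ASCII input
  }
  return segment;
}

// Illustrative HTML: hide the junk bytes, then open a comment that would
// swallow the image data following this segment.
var html = "<html><body><style>body { visibility: hidden; }</style><!--";
var com = buildComSegment(html);

var file = new Uint8Array(2 + com.length);
file[0] = 0xFF; // JPEG start-of-image marker is 0xFF 0xD8
file[1] = 0xD8;
file.set(com, 2);
// A real file would continue with the image segments and would need a
// closing "-->" somewhere after them so the HTML comment ends.
```

An image decoder skips the comment segment and keeps reading image data, while an HTML parser ignores everything between `<!--` and `-->`, which matches the behavior described above.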

Update: I just read through the comments on the Hacker News page. It looks like the same technique is used to compress JS files inside PNG files for the JS1K contest.

-- Varun
