Unstructured Text: How do I write my own parser? (for JSON)

This blog post first appeared on http://techblog.procurios.nl/k/n618/news/view/14605/14863/how-do-i-write-my-own-parser-(for-json).html

05 December 2008

If no parser is available for the file you need, writing one yourself may be easier than you think. What file-structures are managable? What would be the design of such a parser? How do you make sure it is complete? Here we describe the process for building a JSON parser in C#, and issue the source code.

By Patrick van Bergen

[Download the JSON parser / generator for C#]

The software is subject to the MIT license: you are free to use it in any way you like, but it must keep its license.

For our synchronisation-module (which we use to synchronize data between diverse business applications) we chose JSON for data exchange. JSON is just a little better suited for a PHP web-environment than XML, because:

The PHP functions json_encode() and json_decode() allow you to convert data structures from and to JSON strings
JSON can be sent directly to the browser in an Ajax request
It takes up less space than XML, which is important in server > browser traffic.
A JSON string can be composed of only ASCII characters, while still being able to express all UNICODE characters, thus avoiding all possible conversion issues a transport may carry.

So JSON is very convenient for PHP. But of course we wanted to be able to synchronize with Windows applications as well, and because C# is better suited to this environment, this part of the module was written in this language. The .Net framework just didn't have its own JSON parser / encoder and the open-source software written for this task often contained a whole package of classes and constraints and sometimes the JSON implementation wasn't even complete.

We just wanted a single class that could be imported and that used the most basic building blocks of our application: the ArrayList and the Hashtable. Also, all aspects of JSON should would have to be implemented, there should a JSON generator, and of course it should be fast.

More reasons to write our own parser weren't necessary. Writing a parser happens to be a very thing satisfying to do. It is the best way to learn a new programming language thoroughly. Especially if you're using unit-testing to guarantee the parser / generator matches the language specification exactly. JSON's specification is easy to find. The website http://www.json.org/ is as clear as one could wish for.

You start by writing the unit-tests. You should really write all test before starting the implementation, but such patience is seldomly found in a programmer. You can at least start by writing some obvious tests that help you to create a consistent API. This is an example of a simple object-test:

string json;
Hashtable o;
bool success = true;

json = "{\"name\":123,\"name2\":-456e8}";
o = (Hashtable)JSON.JsonDecode(json);
success = success && ((double)o["name"] == 123);
success = success && ((double)o["name2"] == -456e8);

Eventually you should write all tests needed to check all aspects of the language, because your users (other programmers) will assume that the parser just works.

OK. Parsers. Parsers are associated with specialized software: so called compiler compilers (of which Yacc is the most well known). Using this software will make sure that the parser will be fast, but it does not do all the work for you. What's more, it can be even easier to write the entire parser yourself than to do all the preparatoy work for the cc.

The compiler compiler is needed for languages with a high level of ambiguity. A language expression is parsed from-left-to-right. If a language contains many structures that cannot be identified at the start of te parse, it is advisable to use a tool that is able to manage the emerging complexity.

Unambiguous languages are better suitable for building the parser manually, using recursive functions to process the recursive nature of the language. The parser looks ahead one or more tokens to identify the next construct. For JSON it is even sufficient to look ahead a single token. This classifies it as an LL(1) language (see also http://en.wikipedia.org/wiki/LL_parser).

A parser takes as input a string of tokens. Tokens are the most elementary building blocks of a language, like "+", "{", "[", but also complete numbers like "-1.345e5" and strings like "'The scottish highlander looked around.'". The parse-phase is usually preceded by a tokenization phase. In our JSON parser this step is integrated in the parser, because to determine the next token, in almost all cases, it is enough to just read the next character in the string. This saves the allocation of a token table in memory.

The parser takes a string as input and returns a C# datastructure, consisting of ArrayLists, Hashtables, a number of scalar value types and null. The string is processed from left-to-right. An index (pointer) keeps track of the current position in the string at any moment. At each level of the parse process the parser performs these steps:

Look ahead 1 token to determine the type of the next construct
Choose the function to parse the construct
Call this function and integrate the returned value in the construct that is currently built.

A nice example is the recursive function "ParseObject" that parses an object:

protected Hashtable ParseObject(char[] json, ref int index)
{
  Hashtable table = new Hashtable();
  int token;

  // {
  NextToken(json, ref index);

  bool done = false;
  while (!done) {
    token = LookAhead(json, index);
    if (token == JSON.TOKEN_NONE) {
      return null;
    } else if (token == JSON.TOKEN_COMMA) {
      NextToken(json, ref index);
    } else if (token == JSON.TOKEN_CURLY_CLOSE) {
      NextToken(json, ref index);
      return table;
    } else {

      // name
      string name = ParseString(json, ref index);
      if (name == null) {
        return null;
      }

      // :
      token = NextToken(json, ref index);
      if (token != JSON.TOKEN_COLON) {
        return null;
      }

      // value
      bool success = true;
      object value = ParseValue(json, ref index, ref success);
      if (!success) {
        return null;
      }

      table[name] = value;
  }
}

return table;
}

The function is only called if a look ahead has determined that a construct starts with an opening curly brace. So this token may be skipped. Next, the string is parsed just as long as the closing brace is not found, or the end of the string is found (a syntax error, but one that needs to be caught). Between the braces there are a number of "'name': value" pairs, separated by comma's. This algorithm is can be found literally in the function, which makes it very insightful and thus easy to debug. The function builds an ArrayList and returns it to the calling function. The parser mainly consists of these types of functions.

If you create your own parser, you will always need to take into account that the incoming string may be grammatically incorrect. Users expect the parser to be able to tell on which line the error occurred. Our parser only remembers the index, but it also contains an extra function that returns the immediate context of the position of the error, comparable to the error messages that MySQL generates.

If you want to know more about parsers, it is good to know there consists a een standard work on this subject, that recently (2006) saw its second version:

Compilers: principles, techniques, and tools, Aho, A.V., Sethi, R. and Ullman ,J.D. (1986)

Unstructured Text

Monday, February 1, 2021

How do I write my own parser? (for JSON)

No comments:

Post a Comment

On SQLAlchemy

Report Abuse