Pairify - how to match balanced string pairs

Photo by Grégoire Bertaud

I'm now actively working on a VSCode extension. I started it as a theme but then decided to add some more features. Like for example a tin line on the left side of the editor marking the current function scope. In order to do that I had to analyze the current file's code and find the lines that are included in that scope. The obvious approach will be to translate the code to AST and then traverse the tree finding the information that I need. This however will require the usage of a language server which now I don't want to deal with. So I decided to explore a brute force approach. Looping over the string characters and finding balanced matches. I quickly wrapped it into a library. I called it Pairify. It consumes text and returns an array of pairs. This blog post will show you how it works.

If you need this exact functionality and don't want to read the article but use the tool directly go to this place https://github.com/krasimir/pairify. An online interactive playground is available here https://pairify.now.sh/.

Setup & challenges

Let's first define what pairs we want to capture. I'm mainly writing JavaScript so those are brackets, comments and strings. We will start by defining a few key constants that will be used during the processing.

const CATEGORY_COMMENT = 'comment';
const CATEGORY_STRING = 'string';
const CATEGORY_BLOCK = 'block';

const COMMENTS = [
  { type: 'comment-single-line', start: '//', end: '\n ', category: CATEGORY_COMMENT },
  { type: 'comment-block', start: '/*', end: '*/', category: CATEGORY_COMMENT }
]
const STRINGS = [
  { type: 'single-quotes', start: '\'', end: '\'', category: CATEGORY_STRING },
  { type: 'double-quotes', start: '"', end: '"', category: CATEGORY_STRING },
  { type: 'template-literal', start: '`', end: '`', category: CATEGORY_STRING }
]
const BLOCKS = [
  { type: 'round', start: '(', end: ')', category: CATEGORY_BLOCK },
  { type: 'curly', start: '{', end: '}', category: CATEGORY_BLOCK },
  { type: 'square', start: '[', end: ']', category: CATEGORY_BLOCK },
  { type: 'angle', start: '<', end: '>', category: CATEGORY_BLOCK }
]

const ALL = COMMENTS.concat(STRINGS, BLOCKS);
const NEW_LINE = '\n';

Notice that we are splitting the pairs into groups and we are mostly interested into the BLOCKS one. Or to be more specific for decorating a function scope we need all the curly bracket matches.

There are couple of challenges:

We may have nested pairs.
Some characters have different meaning depending on the context. For example the > symbol is a closing token of the angle block type. However, that same character is used also when we define a fat arrow function. Having such function in a JSX tag attribute may confuse the analysis.
For some types the starting and ending characters are the same.
Some types need not one but two characters to be recognized. Like for example the single line comment //.
We need to count the number of lines so the \n will be used extensively for that. However one of the types comment-single-line also uses that as a closing token.

I'm completely aware that I can't cover all the possible edge cases with a simple for loop. The thing is that I don't need to. For the purpose of my extension is enough to have a high percentage of accuracy. Not 100%. For example, how many times you see a code like this:

<MyComponent something={ for(let i = 0; i > 10; i++) { } } />

And this will fail because the > in i > 10 will close the angle pair that starts with the very first character <. I decided to focus on more typical use cases like for example this one:

function getSomething() {
  return `foo } bar`;
}

Here the } symbol needs to be used as a closing token for the curly block. However, we have one inside the template literal that is returned by the function. This needs to be properly parsed because this happens often. We may have whatever closing token in a string. If you go to https://pairify.now.sh/ and paste that same code on the left side you'll see that we have three matched pairs:

[
  { "type":"round", "from":[1,22], "to":[1,24], "body":[21,2] },
  { "type":"template-literal", "from":[2,10], "to":[2,21], "body":[35,11] },
  { "type":"curly", "from":[1,25], "to":[3,2], "body":[24,25] }
]

Pairify successfully parsed the template literal ignoring the } symbol.

∞ Looping over

Here is how our analyze function begins:

function analyze(code) {
  let finds = [];
  for(let i = 0; i < code.length; i++) {
    // magic that pushes to `finds`
  }
  return finds;
}

This is the overall shape of our logic right. We receive a string and we read it character by character. There is some magic code that recognizes pairs and pushes them to an array. At the end we return the array.

Let's create a couple of other variables that will help us on the way. The COMMENTS, STRINGS and BLOCKS arrays above are not very easy to work with. That's because we have arrays of objects and we have to compare against a field in those objects. If we don't transform these three structures to something else we will have an additional loop for every single character. So, let's come up with something better:

// somewhere at the top
const ALL = COMMENTS.concat(STRINGS, BLOCKS);

// in the `analyze` function
let starters = ALL.reduce((res, token) => {
  res[token.start] = token;
  return res;
}, {});
let enders = ALL.reduce((res, token) => {
  res[token.end] = token;
  return res;
}, {});

After this code we have two nicely formatted maps starters and enders. A simple lookup like starters[<character>] will give us the exact token (if any).

On top of that we need to track the current line and the current position on that line. It is not the same as the i variable that we defined in the for loop. Consider the following code:

function test(a, b) {
  return a + b;
}

The + sign has a signature of 2:12 which means line 2, character position 12. While the i in this case will be 33.

So we will go with something like this:

let line = 1;
let position = 0;
for(let i = 0; i < code.length; i++) {
  const char = code[i];
  const nextChar = code[i + 1];

  if (char === NEW_LINE) {
    line += 1;
    position = 0;
  } else {
    position += 1;
    // magic that pushes to `finds`
  }
}

One last thing is the definition of a stack - let stack = []. It will help us deal with the nesting pairs. More about that in a bit.

The matching

Let's use the following line of code as an input:

console.log({ name: 'Dave' });

In here we have three balanced pairs. We have a round brackets marking the function call. We have curly bracket that represent the object with property name and single-quotes pair for the Dave string.

To construct a pair we have to find its starting and ending tokens. For the round bracket pair those are ( and ). When we encounter ( symbol we have to note somewhere that we are looking for the ending ). This place is our stack array.

For each character that we are processing we have to check if it is a starter or a ender. Based on that we will decide what to push or pull from the stack.

const char = code[i];
const nextChar = code[i + 1];
const starter = starters[char] || starters[char + nextChar] || null;
const ender = enders[char] || enders[char + nextChar] || null;

Notice that we are searching for one or two characters as a valid token. This is because of the COMMENTS pairs.

Let's begin with pushing stack items when a starter is captured:

if (starter) {
  stack.push({ type: starter.type, category: starter.category, line, position });
}

At the end of the loop, with the provided console.log({ name: 'Dave' }) string, we will have the following entries:

[
  { type: 'round', line: 1, position: 12 },
  { type: 'curly', line: 1, position: 13 },
  { type: 'single-quotes', line: 1, position: 21 },
  { type: 'single-quotes', line: 1, position: 26 }
]

Looking at that array we can spot the same-character-token problem. The single-quotes opening token is the same as its closing token. When matching such token we can't be sure if this is the start or the end of the matching pair. Here is the role of the stack. When encountering an ender we have to read the stack backwards and search for a starter with the same type. Only then we can conclude that we have a match. The simple check above need to become a bit more complicated.

if (stack.length > 0 && ender) {
    let foundInStack = false;
    for (let j = stack.length - 1; j >= 0; j--) {
      if (ender.type === stack[j].type) {
        foundInStack = true;
        const s = stack[j];
        stack.splice(j);
        finds.push({
          type: ender.type,
          from: [s.line, s.position],
          to: [line, position + 1],
        });
        break;
      }
    }
    if (!foundInStack && starter !== null) {
      stack.push({ type: starter.type, category: starter.category, line, position });
    }
  } else if (starter) {
    stack.push({ type: starter.type, category: starter.category, line, position });
  }
}

We first check if there are items in the stack and if we are meeting a closing token. If that is not the case we may directly push a starter (if any) in the stack. If there is a matching ender we loop back the stack until we find its starter. When we find it we splice the stack meaning that we delete all the items that we traversed so far. The following bit is also really important:

if (!foundInStack && starter !== null) {
  stack.push({ type: starter.type, line, position });
}

It covers the case of the same tokens. Like the single quotes in { name: 'Dave' }. When our script sees the first ' it recognizes it as a open but also as a closing token. However, foundInStack is false saying "Nope, there is no opening token with the same type". In this case if there is a starter we simply agree to push it to the stack.

Here is a short animation illustrating the process:

Pairify animation

Let's take another example where we have contextual matching. Consider the following code:

<Nav breadcrumbs="Home > User > Profile">

If we run it through our analyze function as it is now the result will be:

[
  { type: 'angle', from: [ 1, 1 ], to: [ 1, 25 ] }
]

This is basically <Nav breadcrumbs="Home > string. We have to somehow ignore the closing > when we are in a string. This may happen with defining a new flag inStringOrComment which we set to true.

let inStringOrComment = false;
for (let j = stack.length - 1; j >= 0; j--) {
  if (stack[j].category === CATEGORY_STRING) {
    inStringOrComment = true;
  }
  ...
}

Then later when we find identical types we consider it.

if (ender.type === stack[j].type) {
  if (ender.category !== CATEGORY_STRING && inStringOrComment) {
    break;
  }
  ...
}

ender.category !== CATEGORY_STRING is needed because otherwise we will break even the string pair closing.

The full, more complete implementation of the analyze function could be found here.

Final words

The code that we wrote so far is far from ideal. In fact, the actual implementation which is part of the Pairify library covers a bit more edge cases which we didn't even mention here. If you are planning to use this approach please first test your code here. If something is not complete or wrong then you probably have to use a more robust tooling like Esprima.