Overhaul link and emphasis resolution (#345)

* Overhaul link and emphasis resolution

Resolution of complex link and emphasis text follows very specific rules which
were incompatible with the current TagState stack. The new algorithms follow
the process outlined in the [CommonMark
spec](https://spec.commonmark.org/0.29/#an-algorithm-for-parsing-nested-emphasis-and-links).

The crux of the issue which required such an overhaul is that the current
TagState stack did not include any ability to wait to parse a tag's inner text
until it was known that a tag could be closed at the current position, then
parse that inner text, then close the tag. This unfortunately requires a
breaking change for downstream packages which subclass TagSyntax.

* BREAKING: TagSyntax constructor no longer takes an `end` parameter. TagSyntax
  no longer implements `onMatchEnd`. Instead, TagSyntax implements a method
  called `close` which creates and returns a Node, if a Node can be created and
  closed at the current position. If the TagSyntax instance cannot create a
  Node at the current position, the method should return `null`. Some TagSyntax
  subclasses will unconditionally create a tag in `close`, while others may be
  unable to, such as LinkSyntax, if an inline or reference link could not be
  resolved.

* Loosely, the stack of TagStates is replaced with a stack of Delimiters and a
  tree of parsed HTML nodes.
* Opening delimiters for emphasis, strong emphasis, links, and images are
  handled with the "look for link or image" and "process emphasis" algorithms.
* Adjacent text is combined in a more intentional, and likely more efficient,
  manner.
* The _DelimiterRun class is replaced with three classes: the abstract
  Delimiter class and its subclasses SimpleDelimiter and DelimiterRun.
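
The deferred-close idea described above can be illustrated with a small,
self-contained sketch. It is not the package's implementation: `EmphasisTag`
and `resolve` are hypothetical names, only single `*` delimiters are handled,
and the flanking rules of the spec are ignored. It only shows the shape of the
contract: inner text is gathered once a closer is seen, and `close` may return
`null` to decline.

```dart
/// Hypothetical stand-in for a TagSyntax-like object (not the real class).
class EmphasisTag {
  /// Returns an <em> fragment, or null when no node can be created here.
  String? close(List<String> children) {
    if (children.isEmpty) return null; // nothing to emphasize, decline
    return '<em>${children.join()}</em>';
  }
}

/// Resolves single-`*` emphasis in [source] using a stack of openers.
/// Simplification: any `*` may open or close; flanking rules are ignored.
String resolve(String source) {
  final tag = EmphasisTag();
  final nodes = <String>[]; // already-parsed output fragments
  final openers = <int>[]; // indices into [nodes] of unmatched '*' text

  for (var i = 0; i < source.length; i++) {
    final char = source[i];
    if (char != '*') {
      nodes.add(char);
      continue;
    }
    if (openers.isNotEmpty) {
      final start = openers.last;
      // The inner text is only collected now that a closer has been seen.
      final node = tag.close(nodes.sublist(start + 1));
      if (node != null) {
        openers.removeLast();
        // Replace the literal '*' and its children with the closed node.
        nodes.replaceRange(start, nodes.length, [node]);
        continue;
      }
    }
    // No opener, or close() declined: keep '*' as literal text and
    // remember it as a potential opener.
    openers.add(nodes.length);
    nodes.add('*');
  }
  return nodes.join();
}

void main() {
  print(resolve('a *b* c')); // a <em>b</em> c
  print(resolve('**')); // **
  print(resolve('*unclosed')); // *unclosed
}
```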

These changes result in no new spec failures. Emphasis compliance rises from
96% to 99%. Link compliance rises from 90% to 93%. Total CommonMark compliance
rises from 93% to 94%. Total GFM compliance rises from 92% to 93%.

* documentation and simplification

* Fix test

* revert gitignore

* bump to 4.0.0-dev
14 files changed
tree: 32abcadcb5c7d26c24e8dd60b96504f8ed3f647b
  1. benchmark/
  2. bin/
  3. example/
  4. lib/
  5. test/
  6. tool/
  7. .gitignore
  8. .travis.yml
  9. analysis_options.yaml
  10. AUTHORS
  11. build.yaml
  12. CHANGELOG.md
  13. LICENSE
  14. pubspec.yaml
  15. README.md
README.md

A portable Markdown library written in Dart. It can parse Markdown into HTML on both the client and server.

Play with it at dart-lang.github.io/markdown.

Usage

import 'package:markdown/markdown.dart';

void main() {
  print(markdownToHtml('Hello *Markdown*'));
  //=> <p>Hello <em>Markdown</em></p>
}

Syntax extensions

A few Markdown extensions, beyond what was specified in the original Perl Markdown implementation, are supported. By default, the ones supported in CommonMark are enabled. Any individual extension can be enabled by specifying a List of extension syntaxes in the blockSyntaxes or inlineSyntaxes argument of markdownToHtml.

The currently supported inline extension syntaxes are:

  • new InlineHtmlSyntax() - approximately CommonMark's definition of “Raw HTML”.

The currently supported block extension syntaxes are:

  • const FencedCodeBlockSyntax() - Code blocks familiar to Pandoc and PHP Markdown Extra users.
  • const HeaderWithIdSyntax() - ATX-style headers have generated IDs, for link anchors (akin to Pandoc's auto_identifiers).
  • const SetextHeaderWithIdSyntax() - Setext-style headers have generated IDs for link anchors (akin to Pandoc's auto_identifiers).
  • const TableSyntax() - Table syntax familiar to GitHub, PHP Markdown Extra, and Pandoc users.

For example:

import 'package:markdown/markdown.dart';

void main() {
  print(markdownToHtml('Hello <span class="green">Markdown</span>',
      inlineSyntaxes: [new InlineHtmlSyntax()]));
  //=> <p>Hello <span class="green">Markdown</span></p>
}
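
Block syntaxes are passed the same way. As a minimal sketch, assuming the default extension set is in effect (so tables are not already enabled), the following adds TableSyntax through blockSyntaxes. The sample table text is made up for illustration.

```dart
import 'package:markdown/markdown.dart';

void main() {
  // A small pipe table; the content is illustrative only.
  const source = '| a | b |\n|---|---|\n| 1 | 2 |';

  // TableSyntax is added on top of the default block syntaxes.
  final html = markdownToHtml(source, blockSyntaxes: [const TableSyntax()]);
  print(html);
}
```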

Extension sets

To make extension management easy, you can also just specify an extension set. Both markdownToHtml() and Document() accept an extensionSet named parameter. Currently, there are four pre-defined extension sets:

  • ExtensionSet.none includes no extensions. With no extensions, Markdown documents will be parsed with a default set of block and inline syntax parsers that closely match how the document might be parsed by the original Perl Markdown implementation.

  • ExtensionSet.commonMark includes two extensions in addition to the default parsers to bring the parsed output closer to the CommonMark specification:

    • Block Syntax Parser

      • const FencedCodeBlockSyntax()
    • Inline Syntax Parser

      • InlineHtmlSyntax()
  • ExtensionSet.gitHubFlavored includes five extensions in addition to the default parsers to bring the parsed output close to the GitHub Flavored Markdown specification:

    • Block Syntax Parser

      • const FencedCodeBlockSyntax()
      • const TableSyntax()
    • Inline Syntax Parser

      • InlineHtmlSyntax()
      • StrikethroughSyntax()
      • AutolinkExtensionSyntax()
  • ExtensionSet.gitHubWeb includes eight extensions: the same set of parsers used in the gitHubFlavored extension set, plus the block syntax parsers HeaderWithIdSyntax and SetextHeaderWithIdSyntax, which add id attributes to headers, and the inline syntax parser EmojiSyntax, for parsing GitHub-style emoji characters:

    • Block Syntax Parser

      • const FencedCodeBlockSyntax()
      • const HeaderWithIdSyntax(), which adds id attributes to ATX-style headers, for easy intra-document linking.
      • const SetextHeaderWithIdSyntax(), which adds id attributes to Setext-style headers, for easy intra-document linking.
      • const TableSyntax()
    • Inline Syntax Parser

      • InlineHtmlSyntax()
      • StrikethroughSyntax()
      • EmojiSyntax()
      • AutolinkExtensionSyntax()
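
As a brief sketch of the extensionSet parameter described above, the following passes ExtensionSet.gitHubFlavored to both markdownToHtml() and Document(). The input string is made up for illustration; with this set, the strikethrough text is expected to render inside a del element.

```dart
import 'package:markdown/markdown.dart';

void main() {
  const source = 'Hello ~~Markdown~~';

  // One-shot conversion with a pre-defined extension set.
  print(markdownToHtml(source, extensionSet: ExtensionSet.gitHubFlavored));

  // The same extension set passed to a Document, which parses lines into
  // nodes that are then rendered separately.
  final document = Document(extensionSet: ExtensionSet.gitHubFlavored);
  final nodes = document.parseLines(source.split('\n'));
  print(renderToHtml(nodes));
}
```

Using Document directly is mainly useful when the parsed nodes need to be inspected or transformed before rendering.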

Custom syntax extensions

You can create and use your own syntaxes.

import 'package:markdown/markdown.dart';

void main() {
  var syntaxes = [new TextSyntax('nyan', sub: '~=[,,_,,]:3')];
  print(markdownToHtml('nyan', inlineSyntaxes: syntaxes));
  //=> <p>~=[,,_,,]:3</p>
}

HTML sanitization

This package offers no features in the way of HTML sanitization. Read Estevão Soares dos Santos's great article, “Markdown's XSS Vulnerability (and how to mitigate it)”, to learn more.

The authors recommend that you perform any necessary sanitization on the resulting HTML, for example via dart:html's NodeValidator.

CommonMark compliance

This package contains a number of files in the tool directory for tracking compliance with CommonMark.

Updating CommonMark stats when changing the implementation

  1. Update the library and test code, making sure that tests still pass.
  2. Run dart tool/stats.dart --update-files to update the per-test results (tool/common_mark_stats.json) and the test summary (tool/common_mark_stats.txt).
  3. Verify that more tests now pass – or at least, no more tests fail.
  4. Make sure you include the updated stats files in your commit.

Updating the CommonMark test file for a spec update

  1. Check out the CommonMark source. Make sure you check out a major release.

  2. Dump the test output overwriting the existing tests file.

    > cd /path/to/common_mark_dir
    > python3 test/spec_tests.py --dump-tests > \
      /path/to/markdown.dart/tool/common_mark_tests.json
    
  3. Update the stats files as described above. Note any changes in the results.

  4. Update any references to the existing spec by searching for https://spec.commonmark.org/0.28 in the repository (including this one). Verify the updated links are still valid.

  5. Commit changes, including a corresponding note in CHANGELOG.md.