***** Current status

Currently it runs all the way through, but the resulting database.json has
all members[] lists empty. Most entries are skipped for "Suspect title";
some fail with ".pageText not found".

Currently this only works on Linux; OS X (or other platforms) will need
minor path changes.

You will need a reasonably modern node.js installed.
0.5.9 is too old; 0.8.8 is not too old.

I needed to add my own "DumpRenderTree_resources/missingImage.gif",
for some reason.

For the reasons above, we're currently just using the checked-in
database.json from Feb 2012, but it has some bogus entries. In
particular, the one for UnknownElement would inject irrelevant German
text into our docs. So a hack in apidoc.dart (_mdnTypeNamesToSkip)
works around this.

***** Overview

Here's a rough walkthrough of how this works. The ultimate output file is
database.filtered.json.

full_run.sh executes all of the scripts in the correct order.

search.js
  - read data/domTypes.json
  - for each dom type:
    - search for page on www.googleapis.com
    - write search results to output/search/<type>.json
      . this is a list of search results and urls to pages
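
For illustration only, a minimal node.js sketch of this step could look
roughly like the following. It assumes data/domTypes.json is a JSON array of
type names and that the search goes through the Google Custom Search JSON
API; the API key and search-engine id are placeholders, not values from this
repo.

  // Rough sketch only -- not the actual search.js.
  var fs = require('fs');
  var https = require('https');

  var API_KEY = 'YOUR_API_KEY';   // placeholder
  var SEARCH_ID = 'YOUR_CX_ID';   // placeholder custom-search engine id

  ['output', 'output/search'].forEach(function(dir) {
    if (!fs.existsSync(dir)) fs.mkdirSync(dir);
  });

  var domTypes = JSON.parse(fs.readFileSync('data/domTypes.json', 'utf8'));

  domTypes.forEach(function(type) {
    var url = 'https://www.googleapis.com/customsearch/v1?key=' + API_KEY +
        '&cx=' + SEARCH_ID + '&q=' + encodeURIComponent(type + ' MDN');
    https.get(url, function(res) {
      var body = '';
      res.on('data', function(chunk) { body += chunk; });
      res.on('end', function() {
        // One file per type, holding the raw search results (titles and urls).
        fs.writeFileSync('output/search/' + type + '.json', body);
      });
    }).on('error', function(e) {
      console.error('search failed for ' + type + ': ' + e.message);
    });
  });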

crawl.js
  - read data/domTypes.json
  - for each dom type:
    - for each output/search/<type>.json:
      - for each result in the file:
        - try to scrape that cached MDN page from webcache.googleusercontent.com
        - write mdn page to output/crawl/<type><index of result>.html
  - write output/crawl/cache.json
    . it maps types -> search result page urls and titles
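
Again purely as an illustration, the crawl step could be sketched in node.js
like this; the items/link/title fields assume the Custom Search response
shape from the previous sketch.

  // Rough sketch only -- not the actual crawl.js.
  var fs = require('fs');
  var http = require('http');

  if (!fs.existsSync('output/crawl')) fs.mkdirSync('output/crawl');

  var domTypes = JSON.parse(fs.readFileSync('data/domTypes.json', 'utf8'));
  var cache = {};  // type -> list of {url, title} for its search results

  domTypes.forEach(function(type) {
    var searchFile = 'output/search/' + type + '.json';
    if (!fs.existsSync(searchFile)) return;
    var results = JSON.parse(fs.readFileSync(searchFile, 'utf8')).items || [];
    cache[type] = results.map(function(r) {
      return {url: r.link, title: r.title};
    });

    results.forEach(function(result, i) {
      // Fetch Google's cached copy of the page instead of hitting MDN directly.
      var cacheUrl = 'http://webcache.googleusercontent.com/search?q=cache:' +
          encodeURIComponent(result.link);
      http.get(cacheUrl, function(res) {
        var body = '';
        res.on('data', function(chunk) { body += chunk; });
        res.on('end', function() {
          fs.writeFileSync('output/crawl/' + type + i + '.html', body);
        });
      }).on('error', function(e) {
        console.error('crawl failed for ' + type + i + ': ' + e.message);
      });
    });
  });

  fs.writeFileSync('output/crawl/cache.json', JSON.stringify(cache, null, 2));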

extract.sh
  - compile extract.dart to js
  - run extractRunner.js
    - read data/domTypes.json
    - read output/crawl/cache.json
    - read data/dartIdl.json
    - for each scraped search result page:
      - create a cleaned up html page in output/extract/<type><index>.html that
        contains the scraped content + a script tag that includes extract.dart.js.
      - create an args file in output/extract/<type><index>.html.json with some
        data on how that file should be processed
      - invoke DumpRenderTree on that file
      - when that returns, parse the console output and add it to database.json
      - add any errors to output/errors.json
    - save output/database.json
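
Very roughly, extractRunner.js has the shape sketched below. The args-file
fields, the exact DumpRenderTree invocation, and the BEGIN/END console
markers are guesses for illustration, not what the real script uses.

  // Rough sketch only -- not the actual extractRunner.js.
  var fs = require('fs');
  var childProcess = require('child_process');

  if (!fs.existsSync('output/extract')) fs.mkdirSync('output/extract');

  var domTypes = JSON.parse(fs.readFileSync('data/domTypes.json', 'utf8'));
  var cache = JSON.parse(fs.readFileSync('output/crawl/cache.json', 'utf8'));
  var dartIdl = JSON.parse(fs.readFileSync('data/dartIdl.json', 'utf8'));

  var database = {};   // type -> list of extraction results, one per page
  var errors = [];
  var pending = [];

  domTypes.forEach(function(type) {
    database[type] = [];
    (cache[type] || []).forEach(function(entry, i) {
      pending.push({type: type, index: i, entry: entry});
    });
  });

  // Process pages one at a time so only one DumpRenderTree runs at once.
  function processNext() {
    if (pending.length === 0) {
      fs.writeFileSync('output/database.json',
          JSON.stringify(database, null, 2));
      fs.writeFileSync('output/errors.json', JSON.stringify(errors, null, 2));
      return;
    }
    var task = pending.shift();
    var name = task.type + task.index;
    var scraped = fs.readFileSync('output/crawl/' + name + '.html', 'utf8');

    // Cleaned-up page: the scraped content plus the compiled extractor script
    // (assumes extract.dart.js has been copied next to these pages).
    var page = 'output/extract/' + name + '.html';
    fs.writeFileSync(page,
        scraped + '\n<script src="extract.dart.js"></script>\n');

    // Args file telling extract.dart how to process this page (fields guessed).
    fs.writeFileSync(page + '.json', JSON.stringify({
      type: task.type,
      title: task.entry.title,
      url: task.entry.url,
      dartIdl: dartIdl[task.type]
    }));

    // Load the page headlessly; extract.dart's result comes back via console
    // output that DumpRenderTree echoes to stdout (DumpRenderTree assumed to
    // be on PATH here).
    childProcess.execFile('DumpRenderTree', [page],
        function(err, stdout, stderr) {
      if (err) {
        errors.push({page: page, error: String(err)});
      } else {
        // BEGIN/END markers are purely illustrative; the real script has its
        // own way of locating the JSON payload in the dump.
        var match = /BEGIN_MDN_JSON([\s\S]*?)END_MDN_JSON/.exec(stdout);
        if (match) {
          database[task.type].push(JSON.parse(match[1]));
        } else {
          errors.push({page: page, error: 'no extraction result found'});
        }
      }
      processNext();
    });
  }

  processNext();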

extract.dart
  - xhr output/extract/<type><index>.html.json
  - all sorts of shenanigans to actually pull the content out of the html
  - build a JSON object with the results
  - do a postMessage with that object so extractRunner.js can pull it out
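
extract.dart is compiled to extract.dart.js and runs inside the page under
DumpRenderTree. Ignoring all of the real scraping work, the page-side handoff
is conceptually something like the plain-JavaScript sketch below; the field
names and console markers are made up to match the runner sketch above.

  // Rough plain-JS sketch of the page-side flow; the real extractor is
  // extract.dart, and everything specific below is illustrative.
  window.addEventListener('message', function(event) {
    // Echo the result to the console so DumpRenderTree's stdout carries it
    // back to extractRunner.js. The markers match the runner sketch above
    // and are not the real protocol.
    console.log('BEGIN_MDN_JSON' + event.data + 'END_MDN_JSON');
  }, false);

  // Fetch the args file that extractRunner.js wrote next to this page.
  var xhr = new XMLHttpRequest();
  xhr.open('GET', window.location.href + '.json', false);  // sync, for brevity
  xhr.send();
  var args = JSON.parse(xhr.responseText);

  // ... the real script does "all sorts of shenanigans" here to pull docs,
  // member descriptions, examples, etc. out of the scraped MDN html ...
  var result = {type: args.type, members: [], srcUrl: args.url};

  // Hand the result off; extractRunner.js pulls it out of the console dump.
  window.postMessage(JSON.stringify(result), '*');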

- run postProcess.dart
  - go through the results for each type looking for the best match
  - write output/database.html
  - write output/examples.html
  - write output/obsolete.html
  - write output/database.filtered.json, which contains the best matches
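
postProcess.dart does the real filtering (plus the html reports); purely to
illustrate the last step, a node.js-flavored sketch might look like this,
with a placeholder notion of "best match".

  // Rough sketch only -- postProcess.dart does the real work (and also writes
  // the database.html / examples.html / obsolete.html reports, omitted here).
  var fs = require('fs');

  var database = JSON.parse(fs.readFileSync('output/database.json', 'utf8'));
  var filtered = {};

  Object.keys(database).forEach(function(type) {
    var candidates = database[type] || [];
    // Placeholder criterion: prefer the candidate with the most members.
    var best = null;
    candidates.forEach(function(candidate) {
      if (!best ||
          (candidate.members || []).length > (best.members || []).length) {
        best = candidate;
      }
    });
    if (best) filtered[type] = best;
  });

  // database.filtered.json holds just the best match per type.
  fs.writeFileSync('output/database.filtered.json',
      JSON.stringify(filtered, null, 2));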

***** Process for updating database.json using these scripts

TODO(eub) when I get the scripts to work all the way through.