***** Current status

Currently it runs all the way through, but the resulting database.json has
all members[] lists empty. Most entries are skipped for "Suspect title";
some fail with ".pageText not found".

Currently this only works on Linux; OS X (or other platforms) will need
minor path changes.

You will need a reasonably modern node.js installed.
0.5.9 is too old; 0.8.8 is not too old.

I needed to add my own "DumpRenderTree_resources/missingImage.gif",
for some reason.

For the reasons above, we're currently just using the checked-in
database.json from Feb 2012, but it has some bogus entries. In
particular, the one for UnknownElement would inject irrelevant German
text into our docs. So a hack in apidoc.dart (_mdnTypeNamesToSkip)
works around this.

***** Overview

Here's a rough walkthrough of how this works. The ultimate output file is
database.filtered.json.

full_run.sh executes all of the scripts in the correct order.

search.js
  - read data/domTypes.json
  - for each dom type:
    - search for page on www.googleapis.com
    - write search results to output/search/<type>.json
      . this is a list of search results and urls to pages
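
For illustration only, a minimal node.js sketch of this step could look
roughly like the following. It assumes data/domTypes.json is a JSON array of
type names and that the search goes through the Google Custom Search JSON
API; the API key and search-engine id are placeholders, not values from this
repo.

  // Rough sketch only -- not the actual search.js.
  var fs = require('fs');
  var https = require('https');

  var API_KEY = 'YOUR_API_KEY';   // placeholder
  var SEARCH_ID = 'YOUR_CX_ID';   // placeholder custom-search engine id

  ['output', 'output/search'].forEach(function(dir) {
    if (!fs.existsSync(dir)) fs.mkdirSync(dir);
  });

  var domTypes = JSON.parse(fs.readFileSync('data/domTypes.json', 'utf8'));

  domTypes.forEach(function(type) {
    var url = 'https://www.googleapis.com/customsearch/v1?key=' + API_KEY +
        '&cx=' + SEARCH_ID + '&q=' + encodeURIComponent(type + ' MDN');
    https.get(url, function(res) {
      var body = '';
      res.on('data', function(chunk) { body += chunk; });
      res.on('end', function() {
        // One file per type, holding the raw search results (titles and urls).
        fs.writeFileSync('output/search/' + type + '.json', body);
      });
    }).on('error', function(e) {
      console.error('search failed for ' + type + ': ' + e.message);
    });
  });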

crawl.js
  - read data/domTypes.json
  - for each dom type:
    - for each output/search/<type>.json:
      - for each result in the file:
        - try to scrape that cached MDN page from webcache.googleusercontent.com
        - write mdn page to output/crawl/<type><index of result>.html
  - write output/crawl/cache.json
    . it maps types -> search result page urls and titles
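
Again purely as an illustration, the crawl step could be sketched in node.js
like this; the items/link/title fields assume the Custom Search response
shape from the previous sketch.

  // Rough sketch only -- not the actual crawl.js.
  var fs = require('fs');
  var http = require('http');

  if (!fs.existsSync('output/crawl')) fs.mkdirSync('output/crawl');

  var domTypes = JSON.parse(fs.readFileSync('data/domTypes.json', 'utf8'));
  var cache = {};  // type -> list of {url, title} for its search results

  domTypes.forEach(function(type) {
    var searchFile = 'output/search/' + type + '.json';
    if (!fs.existsSync(searchFile)) return;
    var results = JSON.parse(fs.readFileSync(searchFile, 'utf8')).items || [];
    cache[type] = results.map(function(r) {
      return {url: r.link, title: r.title};
    });

    results.forEach(function(result, i) {
      // Fetch Google's cached copy of the page instead of hitting MDN directly.
      var cacheUrl = 'http://webcache.googleusercontent.com/search?q=cache:' +
          encodeURIComponent(result.link);
      http.get(cacheUrl, function(res) {
        var body = '';
        res.on('data', function(chunk) { body += chunk; });
        res.on('end', function() {
          fs.writeFileSync('output/crawl/' + type + i + '.html', body);
        });
      }).on('error', function(e) {
        console.error('crawl failed for ' + type + i + ': ' + e.message);
      });
    });
  });

  fs.writeFileSync('output/crawl/cache.json', JSON.stringify(cache, null, 2));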

extract.sh
  - compile extract.dart to js
  - run extractRunner.js
    - read data/domTypes.json
    - read output/crawl/cache.json
    - read data/dartIdl.json
    - for each scraped search result page:
      - create a cleaned up html page in output/extract/<type><index>.html that
        contains the scraped content + a script tag that includes extract.dart.js.
      - create an args file in output/extract/<type><index>.html.json with some
        data on how that file should be processed
      - invoke DumpRenderTree on that file
      - when that returns, parse the console output and add it to database.json
      - add any errors to output/errors.json
    - save output/database.json
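
Very roughly, extractRunner.js has the shape sketched below. The args-file
fields, the exact DumpRenderTree invocation, and the BEGIN/END console
markers are guesses for illustration, not what the real script uses.

  // Rough sketch only -- not the actual extractRunner.js.
  var fs = require('fs');
  var childProcess = require('child_process');

  if (!fs.existsSync('output/extract')) fs.mkdirSync('output/extract');

  var domTypes = JSON.parse(fs.readFileSync('data/domTypes.json', 'utf8'));
  var cache = JSON.parse(fs.readFileSync('output/crawl/cache.json', 'utf8'));
  var dartIdl = JSON.parse(fs.readFileSync('data/dartIdl.json', 'utf8'));

  var database = {};   // type -> list of extraction results, one per page
  var errors = [];
  var pending = [];

  domTypes.forEach(function(type) {
    database[type] = [];
    (cache[type] || []).forEach(function(entry, i) {
      pending.push({type: type, index: i, entry: entry});
    });
  });

  // Process pages one at a time so only one DumpRenderTree runs at once.
  function processNext() {
    if (pending.length === 0) {
      fs.writeFileSync('output/database.json',
          JSON.stringify(database, null, 2));
      fs.writeFileSync('output/errors.json', JSON.stringify(errors, null, 2));
      return;
    }
    var task = pending.shift();
    var name = task.type + task.index;
    var scraped = fs.readFileSync('output/crawl/' + name + '.html', 'utf8');

    // Cleaned-up page: the scraped content plus the compiled extractor script
    // (assumes extract.dart.js has been copied next to these pages).
    var page = 'output/extract/' + name + '.html';
    fs.writeFileSync(page,
        scraped + '\n<script src="extract.dart.js"></script>\n');

    // Args file telling extract.dart how to process this page (fields guessed).
    fs.writeFileSync(page + '.json', JSON.stringify({
      type: task.type,
      title: task.entry.title,
      url: task.entry.url,
      dartIdl: dartIdl[task.type]
    }));

    // Load the page headlessly; extract.dart's result comes back via console
    // output that DumpRenderTree echoes to stdout (DumpRenderTree assumed to
    // be on PATH here).
    childProcess.execFile('DumpRenderTree', [page],
        function(err, stdout, stderr) {
      if (err) {
        errors.push({page: page, error: String(err)});
      } else {
        // BEGIN/END markers are purely illustrative; the real script has its
        // own way of locating the JSON payload in the dump.
        var match = /BEGIN_MDN_JSON([\s\S]*?)END_MDN_JSON/.exec(stdout);
        if (match) {
          database[task.type].push(JSON.parse(match[1]));
        } else {
          errors.push({page: page, error: 'no extraction result found'});
        }
      }
      processNext();
    });
  }

  processNext();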

extract.dart
  - xhr output/extract/<type><index>.html.json
  - all sorts of shenanigans to actually pull the content out of the html
  - build a JSON object with the results
  - do a postMessage with that object so extractRunner.js can pull it out
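
extract.dart is compiled to extract.dart.js and runs inside the page under
DumpRenderTree. Ignoring all of the real scraping work, the page-side handoff
is conceptually something like the plain-JavaScript sketch below; the field
names and console markers are made up to match the runner sketch above.

  // Rough plain-JS sketch of the page-side flow; the real extractor is
  // extract.dart, and everything specific below is illustrative.
  window.addEventListener('message', function(event) {
    // Echo the result to the console so DumpRenderTree's stdout carries it
    // back to extractRunner.js. The markers match the runner sketch above
    // and are not the real protocol.
    console.log('BEGIN_MDN_JSON' + event.data + 'END_MDN_JSON');
  }, false);

  // Fetch the args file that extractRunner.js wrote next to this page.
  var xhr = new XMLHttpRequest();
  xhr.open('GET', window.location.href + '.json', false);  // sync, for brevity
  xhr.send();
  var args = JSON.parse(xhr.responseText);

  // ... the real script does "all sorts of shenanigans" here to pull docs,
  // member descriptions, examples, etc. out of the scraped MDN html ...
  var result = {type: args.type, members: [], srcUrl: args.url};

  // Hand the result off; extractRunner.js pulls it out of the console dump.
  window.postMessage(JSON.stringify(result), '*');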

- run postProcess.dart
  - go through the results for each type looking for the best match
  - write output/database.html
  - write output/examples.html
  - write output/obsolete.html
  - write output/database.filtered.json, which contains the best matches
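
postProcess.dart does the real filtering (plus the html reports); purely to
illustrate the last step, a node.js-flavored sketch might look like this,
with a placeholder notion of "best match".

  // Rough sketch only -- postProcess.dart does the real work (and also writes
  // the database.html / examples.html / obsolete.html reports, omitted here).
  var fs = require('fs');

  var database = JSON.parse(fs.readFileSync('output/database.json', 'utf8'));
  var filtered = {};

  Object.keys(database).forEach(function(type) {
    var candidates = database[type] || [];
    // Placeholder criterion: prefer the candidate with the most members.
    var best = null;
    candidates.forEach(function(candidate) {
      if (!best ||
          (candidate.members || []).length > (best.members || []).length) {
        best = candidate;
      }
    });
    if (best) filtered[type] = best;
  });

  // database.filtered.json holds just the best match per type.
  fs.writeFileSync('output/database.filtered.json',
      JSON.stringify(filtered, null, 2));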

***** Process for updating database.json using these scripts

TODO(eub) when I get the scripts to work all the way through.