


Page history last edited by PBworks 15 years, 7 months ago

Anthracite Idioms


There are many ways to use Anthracite to accomplish web mining tasks. Here are a few common "idioms" used frequently in Anthracite.


Drag and Drop


Dragging a URL from the location bar of the browser is an easy way to start an Anthracite document with the source of the page you're looking at now, but it's also possible to drag a section of selected text into Anthracite to create a new Static Text source object.


Using this technique, one quick way to strip all tags from HTML source is to drag the selected HTML into Anthracite, run it through a "Strip Tags" processor object and into a result, and then copy the result back out into whatever other application you're using.
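
Outside Anthracite, the same idea can be approximated on the command line. The sketch below is not how the "Strip Tags" object is implemented; it is a rough sed equivalent that assumes every tag opens and closes on the same line. The function name strip_tags is made up for this example.

```shell
# strip_tags: hypothetical helper, not part of Anthracite.
# Deletes anything that looks like <...> on each input line.
# Assumes no tag spans a line break and no literal '<' or '>' in the text.
strip_tags() {
  sed -e 's/<[^>]*>//g'
}

printf '<p>Hello, <b>world</b>!</p>\n' | strip_tags
```

On Mac OS X, `pbpaste | strip_tags | pbcopy` would run the clipboard contents through the same filter, which mirrors the drag-in, copy-out round trip described above.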


UNIX


Anthracite is built in part on the UNIX power of Mac OS X, so instead of writing hard-to-maintain shell scripts to clean up text, you can use the same powerful commands in an easy-to-understand visual interface.


For example, if you know how to create a ranked list of words in UNIX, you can use the same technique in Anthracite:


sort | uniq -c | sort -nr
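
The pipeline above expects one word per line on its input. A fuller sketch, assuming plain text on standard input and treating any run of letters as a word, puts a tr step in front to split and lowercase the text. The function name rank_words is only for illustration.

```shell
# rank_words: illustrative name, not an Anthracite or UNIX built-in.
rank_words() {
  tr -cs '[:alpha:]' '\n' |       # put each run of letters on its own line
    tr '[:upper:]' '[:lower:]' |  # fold case so "The" and "the" count together
    sort |                        # group identical words next to each other
    uniq -c |                     # collapse each group to "count word"
    sort -nr                      # highest count first
}

printf 'The cat and the dog and the bird\n' | rank_words
```

For the sample sentence, the first line of output reports a count of 3 for "the". Words with equal counts may appear in any order relative to each other.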


Here's the same pipeline as an Anthracite process chain:



"The Two Step"


Many large spidering and scraping projects are easier to tackle with Anthracite in multiple steps. A typical division breaks the task into two primary steps: 1) creating a list of URLs that need to be processed, and 2) loading and parsing those URLs.
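
Expressed as shell commands, the two steps might look like the sketch below. It is a stand-in for the Anthracite documents, not a copy of them: the function names, the urls.txt file name, the INDEX_URL variable, and the href-matching pattern are all assumptions, and the pattern only handles double-quoted absolute http(s) links.

```shell
# Step one: read HTML on stdin, print each distinct absolute link once.
# Hypothetical helper; only matches href="http..." with double quotes.
collect_urls() {
  grep -oE 'href="https?://[^"]+"' |
    sed -e 's/^href="//' -e 's/"$//' |
    sort -u
}

# Fetch one URL. Uses curl here; swap in any other downloader.
fetch() {
  curl -fsSL "$1"
}

# Step two: read URLs on stdin, fetch each one, and hand the page to
# whatever parsing command was given as arguments.
process_urls() {
  while IFS= read -r url; do
    # </dev/null keeps the downloader from consuming the URL list.
    fetch "$url" </dev/null | "$@"
  done
}

# Usage (not run here because it needs network access):
#   curl -fsSL "$INDEX_URL" | collect_urls > urls.txt
#   process_urls grep -i 'risk' < urls.txt
```

Keeping the URL list in a file between the steps means step two can be rerun or adjusted without revisiting the index page.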


The distribution disk image contains two examples of this two-step processing, one in the "SEC10Qs" example, and the other in the "Small Biz" example.


In "SEC10Qs_StepOne", Anthracite visits the SEC EDGAR website, collects the latest filing documents by form, keeps only the 10-Q forms among those, and formats them for output and subsequent use.
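
The filtering part of that step can be sketched with awk. The input layout here is invented for the example (tab-separated form type, company, and URL) and is not EDGAR's real format; the function name only_10q is likewise hypothetical.

```shell
# only_10q: hypothetical helper. Expects tab-separated lines of
#   FORM <TAB> COMPANY <TAB> URL
# (an invented layout, not EDGAR's real format) and prints the URL
# of every line whose form type is exactly "10-Q".
only_10q() {
  awk -F '\t' '$1 == "10-Q" { print $3 }'
}

printf '10-Q\tExample Corp\thttp://example.com/a\n8-K\tOther Inc\thttp://example.com/b\n10-Q/A\tThird LLC\thttp://example.com/c\n' | only_10q
```

The exact comparison skips related form types such as amended filings; loosen it if those should be kept.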


Step two then loads those 10-Q filings, finds the company name, any text near the term "risk", and formats them into a report suitable for e-mailing to the user.
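
One way to approximate the "text near the term risk" part of step two is a grep that keeps a window of characters on each side of the match. This is only a sketch: the 40-character window and the function name near_risk are assumptions, and the Anthracite example may select its context differently.

```shell
# near_risk: hypothetical helper. Prints every occurrence of "risk"
# (any case) with up to 40 characters of same-line context on each side.
near_risk() {
  grep -ioE '.{0,40}risk.{0,40}'
}

printf 'Item 1A. Risk Factors: our results may vary.\nNothing to see here.\n' | near_risk
```

Lines that never mention the term produce no output, so the result is already trimmed to the passages of interest.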


In "Small Biz Step One", a page of directory listings of individual company records is parsed to find three different ways of accessing the URLs.


In step two, each individual company information page is visited. Each of these is a dynamic webpage that uses a template to lay out a database record of the company information. From each page the software extracts the Name, Contact, Address and Category, then formats an HTML report showing all the company information found on a single webpage.
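
Because every company page comes from the same template, each field sits in a predictable place. The sketch below assumes a made-up template in which each field is a table cell with a class attribute (name, contact, address, category); the real "Small Biz" pages will differ, and the helper names are hypothetical.

```shell
# field: hypothetical helper. Prints the text of the first
# <td class="NAME">...</td> cell found in the page on stdin.
field() {
  sed -n "s/.*<td class=\"$1\">\([^<]*\)<\/td>.*/\1/p" | head -n 1
}

# record_row: turn one company page (stdin) into one HTML table row
# holding the four fields in a fixed order.
record_row() {
  page=$(cat)
  printf '<tr>'
  for name in name contact address category; do
    printf '<td>%s</td>' "$(printf '%s\n' "$page" | field "$name")"
  done
  printf '</tr>\n'
}

# Demo on an inline page that follows the assumed template.
record_row <<'EOF'
<table>
<td class="name">Acme</td>
<td class="contact">J. Doe</td>
<td class="address">1 Main St</td>
<td class="category">Retail</td>
</table>
EOF
```

Running record_row once per company page and wrapping the rows in a single table yields the one-page report described above.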


