public class Crawler
extends Object
| Modifier and Type | Field and Description | 
|---|---|
| static int | currentID - ID of the URL that is next up to be parsed. | 
| static String | domain - Domain to restrict our crawler to. | 
| static int | limit - Number of URLs to parse. | 
| static List<Page> | parsed - List of parsed pages. | 
| static Parser | parser - Parser object. | 
| MyQueue | toParse - Queue of pages that need to be parsed. | 
| static int | totalURLs - Total URLs enqueued. | 
| static List<String> | visited - (optional) Throwaway List of visited URLs. | 
| static List<Word> | words - List of Word objects. | 
| Constructor and Description | 
|---|
| Crawler(String seed, String domain, int limit) - Constructor for Crawler where a seed URL, a domain, and a page count limit are provided. | 
| Modifier and Type | Method and Description | 
|---|---|
| void | addPageToList(Page p) - Add a Page to the parsed pages List. | 
| void | addToQueue(String url) - Add a URL to be parsed. | 
| void | addWordToList(String word, int id) - Add a Word to the word postings list or increment if necessary. | 
| void | crawl() - Method to initiate and run the crawling process. | 
| boolean | isInDomain(String url) - Check that the candidate URL is in the specified domain. | 
| boolean | isValidURL(String url) - Check that the URL begins with http:// or https:// | 
| boolean | parse(Document doc, int id) - Parse driver to manage parsing for links and text. | 
| void | parseLinks(Document doc) - Parse a Document object for available links. | 
| void | parseText(Document doc, int id) - Parse a Document for the body of text. | 
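The summary for addWordToList says only that a Word is added to the postings list "or incremented if necessary". The sketch below shows one plausible reading of that behavior. The names WordEntry and PostingsSketch are hypothetical, and the per-URL count map is an assumption: the real Word class and its fields are not documented on this page.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for Word; the real class is not documented here.
class WordEntry {
    final String text;
    // Assumption: maps a URL ID to how often the word was seen on that page.
    final Map<Integer, Integer> postings = new LinkedHashMap<>();

    WordEntry(String text) {
        this.text = text;
    }
}

class PostingsSketch {
    final List<WordEntry> words = new ArrayList<>();

    // Add the word if it is new, then increment its count for the given URL ID.
    void addWordToList(String word, int id) {
        WordEntry entry = null;
        for (WordEntry w : words) {
            if (w.text.equals(word)) {
                entry = w;
                break;
            }
        }
        if (entry == null) {
            entry = new WordEntry(word);
            words.add(entry);
        }
        // merge() inserts 1 for a new URL ID or adds 1 to the existing count.
        entry.postings.merge(id, 1, Integer::sum);
    }
}
```

The linear scan mirrors the List<Word> field in the summary. A map keyed by word text would make lookups faster, but that would be a departure from the documented field type.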
public MyQueue toParse
public static List<Page> parsed
public static List<String> visited
public static List<Word> words
public static Parser parser
public static int currentID
public static int totalURLs
public static int limit
public static String domain
public Crawler(String seed,
               String domain,
               int limit)
Parameters:
seed - starting URL for the crawler.
domain - root domain that the crawler is restricted to.
limit - total number of URLs to crawl.
public void crawl()
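The page does not describe how crawl() works internally. As a rough illustration of how a seed, a domain restriction, a URL limit, and a queue of pages to parse can fit together, here is a minimal breadth-first sketch. Everything in it is an assumption: the class CrawlLoopSketch is hypothetical, an in-memory map of links stands in for fetching and parsing real pages, and the substring domain check is a deliberate simplification.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

class CrawlLoopSketch {
    // Hypothetical in-memory "web": each URL maps to the links found on that page.
    private final Map<String, List<String>> links;
    private final Queue<String> toParse = new ArrayDeque<>();
    private final Set<String> visited = new HashSet<>();
    private final String domain;
    private final int limit;
    private int totalURLs = 0;

    CrawlLoopSketch(String seed, String domain, int limit,
                    Map<String, List<String>> links) {
        this.domain = domain;
        this.limit = limit;
        this.links = links;
        addToQueue(seed);
    }

    // Enqueue a URL once, only while under the limit and inside the domain.
    // Assumption: a plain substring test stands in for the real domain check.
    void addToQueue(String url) {
        if (totalURLs < limit && url.contains(domain) && visited.add(url)) {
            toParse.add(url);
            totalURLs++;
        }
    }

    // Breadth-first traversal; returns URLs in the order they were parsed.
    List<String> crawl() {
        List<String> parsed = new ArrayList<>();
        while (!toParse.isEmpty()) {
            String url = toParse.remove();
            parsed.add(url);
            for (String link : links.getOrDefault(url, Collections.<String>emptyList())) {
                addToQueue(link);
            }
        }
        return parsed;
    }
}
```

In this sketch the limit caps how many URLs are ever enqueued, which matches the totalURLs field description ("Total URLs enqueued"). Whether the real class enforces the limit at enqueue time or at parse time is not stated here.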
public boolean parse(Document doc,
                     int id)
Parameters:
doc - Document object for the URL.
id - currentID that is being parsed.
public void parseLinks(Document doc)
Parameters:
doc - page to parse.
public void parseText(Document doc,
                      int id)
Parameters:
doc - page to parse.
id - URL ID of the current page that is being parsed.
public void addWordToList(String word,
                          int id)
Parameters:
word - candidate word to be added.
id - URL ID under consideration.
public void addToQueue(String url)
Parameters:
url - String URL to be added.
public void addPageToList(Page p)
Parameters:
p - Page to be added.
public boolean isInDomain(String url)
Parameters:
url - candidate URL to check.
public boolean isValidURL(String url)
Parameters:
url - candidate URL to check.
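The two URL checks could look something like the sketch below. The isValidURL logic follows the summary directly (the URL begins with http:// or https://). The isInDomain logic is an assumption: this page does not say how domain membership is decided, so the sketch treats a URL as in-domain when its host equals the domain or is a subdomain of it. The class name UrlCheckSketch is hypothetical, and the domain is passed as a parameter here rather than read from the static domain field.

```java
import java.net.URI;
import java.net.URISyntaxException;

class UrlCheckSketch {
    // True when the URL starts with http:// or https://, as the summary describes.
    static boolean isValidURL(String url) {
        return url != null
                && (url.startsWith("http://") || url.startsWith("https://"));
    }

    // Assumption: "in domain" means the host equals the domain
    // or is a subdomain of it (for example www.example.com for example.com).
    static boolean isInDomain(String url, String domain) {
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return false;
            }
            String h = host.toLowerCase();
            String d = domain.toLowerCase();
            return h.equals(d) || h.endsWith("." + d);
        } catch (URISyntaxException e) {
            // A URL that cannot be parsed is treated as out of domain.
            return false;
        }
    }
}
```

Comparing hosts rather than searching the whole URL string avoids accepting look-alike addresses such as https://example.com.evil.org/ when the domain is example.com.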