Crawler

Object
- Crawler

```
public class Crawler
extends Object
```
Class for managing crawling pages, calling parse methods, and storing the results.

Field Summary

Fields
Modifier and Type	Field and Description
`static int`	`currentID` Current URL ID that is next up to be parsed.
`static String`	`domain` Domain to restrict our crawler to.
`static int`	`limit` Number of URLs to parse.
`static List<Page>`	`parsed` List of parsed pages.
`static Parser`	`parser` Parser Object.
`MyQueue`	`toParse` Queue of pages that need to be parsed.
`static int`	`totalURLs` Total URLs enqueued.
`static List<String>`	`visited` (optional) This is a throwaway List of visited URLs.
`static List<Word>`	`words` List of Word objects.

Constructor Summary

Constructors
Constructor and Description
`Crawler(String seed, String domain, int limit)` Constructor for Crawler where a seed URL, domain, and page count limit are provided.

Method Summary

Modifier and Type	Method and Description
`void`	`addPageToList(Page p)` Add a Page to the parsed pages List.
`void`	`addToQueue(String url)` Add a URL to be parsed.
`void`	`addWordToList(String word, int id)` Add a Word to the word postings list or increment if necessary.
`void`	`crawl()` Method to initiate and run the crawling process.
`boolean`	`isInDomain(String url)` Check that the candidate URL is in the specified domain.
`boolean`	`isValidURL(String url)` Check that the URL begins with http:// or https://.
`boolean`	`parse(Document doc, int id)` Parse driver to manage parsing for links and text.
`void`	`parseLinks(Document doc)` Parse a Document object for available links.
`void`	`parseText(Document doc, int id)` Parse a Document for the body of text.

Methods inherited from class Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - toParse
```
public MyQueue toParse
```
    Queue of pages that need to be parsed. You will be constantly enqueuing pages that you have not seen that will then be visited later.
  - parsed
```
public static List<Page> parsed
```
    List of parsed pages. When you have finished parsing a Page, add it to this List so that you can look up Page objects by URL ID when you are searching.
  - visited
```
public static List<String> visited
```
    (optional) This is a throwaway List of visited URLs. You can use the default contains method of ArrayList to check whether the List already holds a given String.
  - words
```
public static List<Word> words
```
    List of Word objects. This is the postings list mentioned in the handout, where each Word will be represented by a Word object that itself maintains its own ArrayList of URL IDs that it has seen.
  - parser
```
public static Parser parser
```
    Parser Object. You will call methods from the Parser class using this object.
  - currentID
```
public static int currentID
```
    Current URL ID that is next up to be parsed.
  - totalURLs
```
public static int totalURLs
```
    Total URLs enqueued. Unlike currentID, this count includes URLs that are still waiting to be parsed, so currentID remains the value to compare against limit.
  - limit
```
public static int limit
```
    Number of URLs to parse. When you reach this limit, you should stop adding items to your Queue.
  - domain
```
public static String domain
```
    Domain that the crawler is restricted to. For example, a domain of cs.purdue.edu prevents the Crawler from going to ece.purdue.edu, because ece.purdue.edu does not contain the substring cs.purdue.edu.
- Constructor Detail
  - Crawler
```
public Crawler(String seed,
               String domain,
               int limit)
```
    Constructor for Crawler where a seed URL, domain, and page count limit are provided. The pages should be crawled in this constructor and the results stored in Crawler.words and Crawler.parsed.
    
    Parameters:
    
    seed - starting URL for the crawler.
    
    domain - root domain that the crawler is restricted to.
    
    limit - total number of URLs to crawl.
- Method Detail
  - crawl
```
public void crawl()
```
    Method to initiate and run the crawling process.
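    
    The overall loop can be pictured roughly as follows. This is only a sketch under stated assumptions, not the required implementation: `CrawlLoopSketch` and its `links` map are hypothetical, the map stands in for fetching and parsing real pages, `ArrayDeque` stands in for `MyQueue`, and page URLs stand in for `Page` objects. It stops once `currentID` reaches `limit`, and it only enqueues links that pass the validity, domain, and duplicate checks.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;

final class CrawlLoopSketch {
    // Crawls an in-memory link graph (assumption: replaces real network access)
    // and returns the URLs in the order they were parsed.
    static List<String> crawl(String seed, String domain, int limit,
                              Map<String, List<String>> links) {
        Queue<String> toParse = new ArrayDeque<>(); // stand-in for MyQueue
        List<String> visited = new ArrayList<>();   // URLs already enqueued
        List<String> parsed = new ArrayList<>();    // stand-in for List<Page>
        int currentID = 0;                          // next URL ID to parse

        toParse.add(seed);
        visited.add(seed);

        while (!toParse.isEmpty() && currentID < limit) {
            String url = toParse.remove();
            // "Parsing" here just looks up the outgoing links of the page.
            for (String link : links.getOrDefault(url, List.of())) {
                boolean valid = link.startsWith("http://") || link.startsWith("https://");
                if (valid && link.contains(domain) && !visited.contains(link)) {
                    visited.add(link);
                    toParse.add(link);
                }
            }
            parsed.add(url);
            currentID++;
        }
        return parsed;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
            "http://cs.purdue.edu/",
            List.of("http://cs.purdue.edu/a", "http://ece.purdue.edu/"),
            "http://cs.purdue.edu/a",
            List.of("http://cs.purdue.edu/", "http://cs.purdue.edu/b"));
        System.out.println(crawl("http://cs.purdue.edu/", "cs.purdue.edu", 2, links));
    }
}
```

    For brevity the sketch does not cap how many URLs are enqueued; it only bounds the number of pages parsed.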
  - parse
```
public boolean parse(Document doc,
                     int id)
```
    Parse driver to manage parsing for links and text.
    
    Parameters:
    
    doc - Document object for the URL
    
    id - URL ID (currentID) of the page that is being parsed
    
    Returns:
    
    true if parsing is successful, false otherwise
  - parseLinks
```
public void parseLinks(Document doc)
```
    Parse a Document object for available links.
    
    Parameters:
    
    doc - page to parse
  - parseText
```
public void parseText(Document doc,
                      int id)
```
    Parse a Document for the body of text.
    
    Parameters:
    
    doc - page to parse
    
    id - URL ID of the current page that is being parsed
  - addWordToList
```
public void addWordToList(String word,
                          int id)
```
    Add a Word to the word postings list or increment if necessary.
    
    Parameters:
    
    word - candidate word to be added
    
    id - URL ID under consideration
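    
    One possible reading of this method is sketched below. The names `PostingsSketch` and `WordEntry` are hypothetical; `WordEntry` is only a stand-in for the assignment's `Word` class, whose real interface is not shown here. The sketch also assumes that each URL ID is recorded at most once per word, which the description above does not spell out.

```java
import java.util.ArrayList;
import java.util.List;

final class PostingsSketch {
    // Hypothetical stand-in for the Word class: a word plus the URL IDs it appears on.
    static final class WordEntry {
        final String word;
        final List<Integer> urlIDs = new ArrayList<>();

        WordEntry(String word) {
            this.word = word;
        }
    }

    // Postings list, playing the role of Crawler.words.
    static final List<WordEntry> words = new ArrayList<>();

    static void addWordToList(String word, int id) {
        // Linear search for an existing entry for this word.
        for (WordEntry entry : words) {
            if (entry.word.equals(word)) {
                // Assumption: record each URL ID only once per word.
                if (!entry.urlIDs.contains(id)) {
                    entry.urlIDs.add(id);
                }
                return;
            }
        }
        // Word not seen before: create a new entry with its first URL ID.
        WordEntry created = new WordEntry(word);
        created.urlIDs.add(id);
        words.add(created);
    }

    public static void main(String[] args) {
        addWordToList("purdue", 0);
        addWordToList("purdue", 1);
        addWordToList("crawler", 1);
        for (WordEntry entry : words) {
            System.out.println(entry.word + " -> " + entry.urlIDs);
        }
    }
}
```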
  - addToQueue
```
public void addToQueue(String url)
```
    Add a URL to be parsed. Duplicate URLs should not be enqueued.
    
    Parameters:
    
    url - String url to be added
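    
    A minimal sketch of duplicate-free enqueuing follows. `QueueSketch` is a hypothetical class, `ArrayDeque` stands in for `MyQueue`, and the fields are instance fields here purely to keep the example self-contained. Checking `totalURLs` against `limit` before enqueuing is an assumption based on the field descriptions; your handout may bound the crawl differently.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

final class QueueSketch {
    final Queue<String> toParse = new ArrayDeque<>(); // stand-in for MyQueue
    final List<String> visited = new ArrayList<>();   // URLs already enqueued
    final int limit;                                   // maximum number of URLs
    int totalURLs = 0;                                 // URLs enqueued so far

    QueueSketch(int limit) {
        this.limit = limit;
    }

    void addToQueue(String url) {
        // Assumption: stop enqueuing once the limit has been reached.
        if (totalURLs >= limit) {
            return;
        }
        // Skip URLs that were enqueued earlier.
        if (visited.contains(url)) {
            return;
        }
        visited.add(url);
        toParse.add(url);
        totalURLs++;
    }

    public static void main(String[] args) {
        QueueSketch q = new QueueSketch(2);
        q.addToQueue("http://cs.purdue.edu/");
        q.addToQueue("http://cs.purdue.edu/"); // duplicate, ignored
        q.addToQueue("http://cs.purdue.edu/a");
        q.addToQueue("http://cs.purdue.edu/b"); // over the limit, ignored
        System.out.println(q.toParse);
    }
}
```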
  - addPageToList
```
public void addPageToList(Page p)
```
    Add a Page to the parsed pages List.
    
    Parameters:
    
    p - Page to be added
  - isInDomain
```
public boolean isInDomain(String url)
```
    Check that the candidate URL is in the specified domain. To simplify this, you can just check if the url contains the domain as a substring.
    
    Parameters:
    
    url - candidate URL to check
    
    Returns:
    
    true if in domain, false otherwise
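    
    The simplified substring check might look like the sketch below. `DomainCheckSketch` is a hypothetical class, and the method takes the domain as a second parameter (instead of reading `Crawler.domain`) only so the example runs on its own. The null handling is also an assumption.

```java
final class DomainCheckSketch {
    // Returns true when the URL contains the domain anywhere as a substring.
    static boolean isInDomain(String url, String domain) {
        // Assumption: null inputs are treated as "not in domain".
        if (url == null || domain == null) {
            return false;
        }
        return url.contains(domain);
    }

    public static void main(String[] args) {
        String domain = "cs.purdue.edu";
        System.out.println(isInDomain("https://www.cs.purdue.edu/people", domain));
        System.out.println(isInDomain("https://ece.purdue.edu/", domain));
        // A plain substring test also accepts URLs that merely mention the domain.
        System.out.println(isInDomain("https://example.com/?ref=cs.purdue.edu", domain));
    }
}
```

    The last call in `main` shows the trade-off of the simplification: any URL that mentions the domain, even in a query string, is treated as in domain.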
  - isValidURL
```
public boolean isValidURL(String url)
```
    Check that the URL begins with http:// or https://.
    
    Parameters:
    
    url - candidate URL to check
    
    Returns:
    
    true if valid, false otherwise
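    
    A minimal sketch of the prefix check, assuming a hypothetical standalone class `UrlCheckSketch` with a static method so that it can be run in isolation. Treating null as invalid is an assumption, not something the description requires.

```java
final class UrlCheckSketch {
    // Returns true only when the URL starts with one of the two accepted schemes.
    static boolean isValidURL(String url) {
        // Assumption: a null URL is simply invalid.
        if (url == null) {
            return false;
        }
        return url.startsWith("http://") || url.startsWith("https://");
    }

    public static void main(String[] args) {
        System.out.println(isValidURL("https://www.cs.purdue.edu/"));
        System.out.println(isValidURL("mailto:someone@cs.purdue.edu"));
    }
}
```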
