public class Crawler
extends Object
| Modifier and Type | Field and Description |
|---|---|
| `static int` | `currentID`: Current URL ID that is next up to be parsed. |
| `static String` | `domain`: Domain to restrict our crawler to. |
| `static int` | `limit`: Number of URLs to parse. |
| `static List<Page>` | `parsed`: List of parsed pages. |
| `static Parser` | `parser`: Parser object. |
| `MyQueue` | `toParse`: Queue of pages that need to be parsed. |
| `static int` | `totalURLs`: Total URLs enqueued. |
| `static List<String>` | `visited`: (Optional) Throwaway list of visited URLs. |
| `static List<Word>` | `words`: List of Word objects. |
| Constructor and Description |
|---|
| `Crawler(String seed, String domain, int limit)`: Constructor for Crawler where a seed URL and page count limit are provided. |
| Modifier and Type | Method and Description |
|---|---|
| `void` | `addPageToList(Page p)`: Add a Page to the parsed pages list. |
| `void` | `addToQueue(String url)`: Add a URL to be parsed. |
| `void` | `addWordToList(String word, int id)`: Add a Word to the word postings list or increment if necessary. |
| `void` | `crawl()`: Method to initiate and run the crawling process. |
| `boolean` | `isInDomain(String url)`: Check that the candidate URL is in the specified domain. |
| `boolean` | `isValidURL(String url)`: Check that the URL begins with http:// or https://. |
| `boolean` | `parse(Document doc, int id)`: Parse driver to manage parsing for links and text. |
| `void` | `parseLinks(Document doc)`: Parse a Document object for available links. |
| `void` | `parseText(Document doc, int id)`: Parse a Document for the body of text. |
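The summary describes `addWordToList` as adding a Word to the postings list "or increment if necessary". The page does not say what is incremented, so the following is only a minimal sketch under stated assumptions: `WordSketch` is a hypothetical stand-in for the real `Word` class, each word keeps a list of URL IDs (its postings) plus a running occurrence count, and a linear scan over the list stands in for whatever lookup the actual class uses.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for Word; its fields are assumptions, not the documented class. */
final class WordSketch {
    final String text;
    final List<Integer> postings = new ArrayList<>(); // URL IDs of pages containing the word
    int count = 0;                                    // total occurrences seen so far

    WordSketch(String text) {
        this.text = text;
    }
}

/** Sketch of a postings list with add-or-increment behavior. */
final class PostingsSketch {
    final List<WordSketch> words = new ArrayList<>();

    void addWordToList(String word, int id) {
        for (WordSketch w : words) {
            if (w.text.equals(word)) {
                // Known word: bump the count and record the URL ID once.
                w.count++;
                if (!w.postings.contains(id)) {
                    w.postings.add(id);
                }
                return;
            }
        }
        // New word: create its entry with a first occurrence on this page.
        WordSketch w = new WordSketch(word);
        w.count = 1;
        w.postings.add(id);
        words.add(w);
    }
}
```

A map keyed by the word text would avoid the linear scan; the list is kept here only because the documented `words` field is a `List<Word>`.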
public MyQueue toParse
public static List<Page> parsed
public static List<String> visited
public static List<Word> words
public static Parser parser
public static int currentID
public static int totalURLs
public static int limit
public static String domain
public Crawler(String seed, String domain, int limit)
seed - starting URL for the crawler.
domain - root domain that the crawler is restricted to.
limit - total number of URLs to crawl.

public void crawl()

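The page does not show how `crawl()` works internally. As a rough illustration of how a seed, a domain restriction, and a page limit can combine, here is a breadth-first sketch. Everything in it is an assumption: `CrawlLoopSketch` and the `fetchLinks` function are hypothetical, an `ArrayDeque` and a `HashSet` replace the documented `MyQueue` and `List<String> visited`, and the domain test is a plain substring match. Link extraction is passed in as a function so the sketch runs without network access.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical breadth-first crawl loop; not the documented implementation. */
final class CrawlLoopSketch {
    static List<String> crawl(String seed, String domain, int limit,
                              Function<String, List<String>> fetchLinks) {
        Queue<String> toParse = new ArrayDeque<>(); // URLs waiting to be parsed
        Set<String> visited = new HashSet<>();      // URLs already enqueued
        List<String> parsed = new ArrayList<>();    // URLs parsed so far, in order

        toParse.add(seed);
        visited.add(seed);

        // Stop when the queue runs dry or the page limit is reached.
        while (!toParse.isEmpty() && parsed.size() < limit) {
            String url = toParse.remove();
            parsed.add(url);
            for (String link : fetchLinks.apply(url)) {
                // Enqueue only in-domain links that have not been seen before.
                if (link.contains(domain) && visited.add(link)) {
                    toParse.add(link);
                }
            }
        }
        return parsed;
    }
}
```

In this sketch the limit counts parsed pages rather than enqueued URLs; the documented `limit` and `totalURLs` fields may be used differently.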
public boolean parse(Document doc, int id)
doc - Document object for the URL.
id - currentID that is being parsed.

public void parseLinks(Document doc)
doc - page to parse.

public void parseText(Document doc, int id)
doc - page to parse.
id - URL ID of the current page that is being parsed.

public void addWordToList(String word, int id)
word - candidate word to be added.
id - URL ID under consideration.

public void addToQueue(String url)
url - String URL to be added.

public void addPageToList(Page p)
p - Page to be added.

public boolean isInDomain(String url)
url - candidate URL to check.

public boolean isValidURL(String url)
url - candidate URL to check.
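
The two URL checks can be sketched as below. This is an illustration only, not the class's actual code: `UrlChecksSketch` is a hypothetical name, `isInDomain` here takes the domain as a second parameter instead of reading the static `domain` field, and "in the domain" is assumed to mean the host equals the domain or is a subdomain of it.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

/** Hypothetical helpers mirroring isValidURL and isInDomain. */
final class UrlChecksSketch {
    /** True when the URL begins with http:// or https://. */
    static boolean isValidURL(String url) {
        return url != null
                && (url.startsWith("http://") || url.startsWith("https://"));
    }

    /** True when the URL's host is the domain itself or one of its subdomains (assumed rule). */
    static boolean isInDomain(String url, String domain) {
        if (url == null || domain == null) {
            return false;
        }
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return false; // no host component, e.g. a relative link
            }
            String h = host.toLowerCase(Locale.ROOT);
            String d = domain.toLowerCase(Locale.ROOT);
            return h.equals(d) || h.endsWith("." + d);
        } catch (URISyntaxException e) {
            return false; // unparseable URLs are treated as out of domain
        }
    }
}
```

Comparing parsed hosts rather than searching the whole URL string keeps a URL such as `https://example.com.evil.test/` from passing a check for `example.com`.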