public class Crawler
extends Object
Modifier and Type | Field and Description |
---|---|
static int | currentID: Current URL ID that is next up to be parsed. |
static String | domain: Domain to restrict our crawler to. |
static int | limit: Number of URLs to Parse. |
static List<Page> | parsed: List of parsed pages. |
static Parser | parser: Parser Object. |
MyQueue | toParse: Queue of pages that need to be parsed. |
static int | totalURLs: Total URLs enqueued. |
static List<String> | visited: (optional) This is a throw away List of visited URLs. |
static List<Word> | words: List of Word objects. |
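
The fields above suggest a queue-driven crawl: URLs wait in toParse, visited guards against enqueuing the same URL twice, and currentID counts up toward limit. The following is a minimal sketch of how such fields could interact, not the Crawler class's actual code. The class name CrawlLoopSketch and the in-memory links map (which replaces real page fetches) are hypothetical, a plain Queue<String> stands in for MyQueue, and a Set is used for visited where the original keeps a List.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch; the names mirror the documented fields but the logic is assumed.
class CrawlLoopSketch {
    final Queue<String> toParse = new ArrayDeque<>(); // stand-in for MyQueue
    final Set<String> visited = new HashSet<>();      // URLs that were already enqueued
    final List<String> parsed = new ArrayList<>();    // stand-in for List<Page>
    final int limit;                                  // number of URLs to parse
    int currentID = 0;                                // ID of the next URL to parse
    int totalURLs = 0;                                // total URLs enqueued so far

    CrawlLoopSketch(String seed, int limit) {
        this.limit = limit;
        addToQueue(seed);
    }

    // Enqueue a URL only the first time it is seen.
    void addToQueue(String url) {
        if (visited.add(url)) {
            toParse.add(url);
            totalURLs++;
        }
    }

    // links is an assumed in-memory link graph: page URL -> URLs it links to.
    void crawl(Map<String, List<String>> links) {
        while (!toParse.isEmpty() && currentID < limit) {
            String url = toParse.remove();
            parsed.add(url);
            for (String next : links.getOrDefault(url, List.of())) {
                addToQueue(next);
            }
            currentID++;
        }
    }
}
```

With a seed of "a" and a limit of 3, calling crawl on a small link map visits pages in breadth-first order and stops once three pages have been parsed, even if more URLs remain queued.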
Constructor and Description |
---|
Crawler(String seed, String domain, int limit): Constructor for Crawler where a seed URL, a domain restriction, and a page count limit are provided. |
Modifier and Type | Method and Description |
---|---|
void | addPageToList(Page p): Add a Page to the parsed pages List. |
void | addToQueue(String url): Add a URL to be parsed. |
void | addWordToList(String word, int id): Add a Word to the word postings list or increment if necessary. |
void | crawl(): Method to initiate and run the crawling process. |
boolean | isInDomain(String url): Check that the candidate URL is in the specified domain. |
boolean | isValidURL(String url): Check that the URL begins with http:// or https://. |
boolean | parse(Document doc, int id): Parse driver to manage parsing for links and text. |
void | parseLinks(Document doc): Parse a Document object for available links. |
void | parseText(Document doc, int id): Parse a Document for the body of text. |
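
The "add or increment" behavior described for addWordToList can be sketched as follows. This is an illustration under stated assumptions, not the documented implementation: the Word type is not described here, so the hypothetical WordEntry class below assumes each entry stores a term plus a count per URL ID, and the class name PostingsSketch is also invented for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a word postings list; not the actual Word or Crawler code.
class PostingsSketch {

    // Assumed shape of a postings entry: the term and how often it occurs per URL ID.
    static class WordEntry {
        final String text;
        final Map<Integer, Integer> countsByUrlId = new LinkedHashMap<>();

        WordEntry(String text) {
            this.text = text;
        }
    }

    final List<WordEntry> words = new ArrayList<>();

    // Increment the count for (word, id) if the word is known, otherwise add a new entry.
    void addWordToList(String word, int id) {
        for (WordEntry entry : words) {
            if (entry.text.equals(word)) {
                entry.countsByUrlId.merge(id, 1, Integer::sum);
                return;
            }
        }
        WordEntry created = new WordEntry(word);
        created.countsByUrlId.put(id, 1);
        words.add(created);
    }
}
```

The linear scan keeps the sketch close to the documented List<Word> field. A map keyed by the word text would avoid scanning the list on every call, at the cost of departing from the documented field type.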
public MyQueue toParse
public static List<Page> parsed
public static List<String> visited
public static List<Word> words
public static Parser parser
public static int currentID
public static int totalURLs
public static int limit
public static String domain
public Crawler(String seed, String domain, int limit)
seed - starting URL for the crawler.
domain - root domain that the crawler is restricted to.
limit - total number of URLs to crawl.

public void crawl()

public boolean parse(Document doc, int id)
doc - Document object for the URL.
id - currentID that is being parsed.

public void parseLinks(Document doc)
doc - page to parse.

public void parseText(Document doc, int id)
doc - page to parse.
id - URL ID of the current page that is being parsed.

public void addWordToList(String word, int id)
word - candidate word to be added.
id - URL ID under consideration.

public void addToQueue(String url)
url - String URL to be added.

public void addPageToList(Page p)
p - Page to be added.

public boolean isInDomain(String url)
url - candidate URL to check.

public boolean isValidURL(String url)
url - candidate URL to check.
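
The two URL checks can be sketched as below. This is a minimal sketch, not the Crawler class's actual code. The class name UrlChecks is hypothetical, and the domain is passed as a second parameter here because the sketch has no access to the static domain field. It also assumes that "in the specified domain" means the URL's host equals the domain or is a subdomain of it; the original may compare differently (for example with a substring test).

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical stand-in for the documented checks.
class UrlChecks {

    // True when the URL begins with http:// or https://, as the method summary states.
    static boolean isValidURL(String url) {
        return url != null && (url.startsWith("http://") || url.startsWith("https://"));
    }

    // Assumption: a URL is in the domain when its host equals the domain
    // or ends with "." + domain (a subdomain).
    static boolean isInDomain(String url, String domain) {
        if (url == null || domain == null) {
            return false;
        }
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return false;
            }
            String lowerHost = host.toLowerCase(Locale.ROOT);
            String root = domain.toLowerCase(Locale.ROOT);
            return lowerHost.equals(root) || lowerHost.endsWith("." + root);
        } catch (URISyntaxException e) {
            return false; // malformed URLs are treated as out of domain
        }
    }
}
```

Comparing hosts rather than raw strings means that, with the domain example.edu, a URL on cs.example.edu passes while one on notexample.edu does not, which a plain substring test would get wrong.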