public class WebCrawler extends AbstractCrawler implements java.lang.Runnable, CrawlerAccess, Crawler
A WebCrawler is given one or more ContentHandler objects, a start page
(or several start pages), and a URLMask. Files are retrieved from URLs and the
content is then passed to one of the given ContentHandlers according to the
content type of the URL. If no suitable ContentHandler is found, the file is
simply discarded.
ContentHandlers may add other pages to the crawl queue according to the links they find in the content they
are processing. The WebCrawler ensures that the same page is not crawled more than once.
WebCrawler implements Runnable and can be used with multiple threads. Simply initialize the WebCrawler object,
then create as many Threads with the single WebCrawler instance as you want and start each with Thread.start().
Use Thread.isAlive() to tell whether a thread is still running.
Using more than one thread can be more efficient because threads can download and parse pages simultaneously. A usage sketch follows below.

See Also: ContentHandler, URLMask, CrawlerAccess
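A minimal sketch of the threaded setup described above, not an official example: the package path, the seed object, and the depth value are assumptions, and ContentHandler/URLMask configuration is omitted.

```java
import java.net.URL;
import org.wandora.piccolo.utils.crawler.WebCrawler; // package path assumed

public class WebCrawlerDemo {
    public static void main(String[] args) throws Exception {
        WebCrawler crawler = new WebCrawler();
        // ContentHandler and URLMask configuration is omitted here; see the
        // inherited addHandler and setMask methods of AbstractCrawler.
        crawler.add(new URL("http://example.com/"), 3); // hypothetical seed and depth

        // One shared WebCrawler instance driven by several threads.
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(crawler); // WebCrawler implements Runnable
            workers[i].start();
        }
        // Per the class notes, Thread.isAlive() tells whether a worker is still running.
        for (Thread worker : workers) {
            while (worker.isAlive()) {
                Thread.sleep(500); // poll until this worker finishes
            }
        }
        System.out.println("Pages processed: " + crawler.pagesProcessed());
    }
}
```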
Field Summary

| Modifier and Type | Field and Description |
|---|---|
| private boolean | checkDonePages |
| private int | counter |
| private java.util.HashSet | donePages |
| static int | HTTP_UNAUTHORIZED_INTERRUPTION |
| protected int | numThreads |
| private java.util.HashMap | pagesDone |
| private int | processing |
| private java.util.LinkedList<Tuples.T2<java.lang.Object,java.lang.Integer>> | queue |
| private java.util.HashMap | timeTaken |
Fields inherited from class AbstractCrawler: forceExit, handleCount

Constructor Summary

| Constructor and Description |
|---|
| WebCrawler() Creates a new WebCrawler. |
Method Summary

| Modifier and Type | Method and Description |
|---|---|
| void | add(java.lang.Object crawlObject, int depth) Adds a URL to the queue of the crawler. |
| void | addObject(java.lang.Object data) Gives any object constructed from the crawled page to the callback object. |
| void | clearQueue() Clears the crawl queue. |
| void | crawl() Starts crawling the pages added to the queue with addPageToQueue. |
| void | crawl(java.lang.Object crawlObject) Starts crawling by first adding the given page to the queue. |
| void | crawl(java.lang.Object crawlObject, int depth) Starts crawling by first adding the given page to the queue. |
| java.util.HashSet | getDonePages() Gets the HashSet that contains pages that have already been crawled. |
| java.util.HashMap | getPagesDone() |
| java.util.HashMap | getTimeTaken() |
| void | loadSettings(org.w3c.dom.Element rootElement) |
| static void | main(java.lang.String[] args) |
| int | pagesProcessed() |
| void | run() The Runnable implementation. |
| void | setDonePages(java.util.HashSet hm) Sets the HashSet that contains pages that have already been crawled. |
Methods inherited from class AbstractCrawler: addHandler, addInterruptHandler, createObject, forceExit, getCallBack, getCrawlCounter, getHandledDocumentCount, getHandler, getInterruptHandler, getMask, getProperty, isVerbose, loadSettings, loadSettings, modifyCrawlCounter, setCallBack, setCrawlCounter, setMask, setProperty, setVerbose

Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from implemented interfaces: forceExit, setProperty

Field Detail

public static final int HTTP_UNAUTHORIZED_INTERRUPTION
private java.util.HashSet donePages
private java.util.LinkedList<Tuples.T2<java.lang.Object,java.lang.Integer>> queue
private int counter
private int processing
private boolean checkDonePages
private java.util.HashMap pagesDone
private java.util.HashMap timeTaken
protected int numThreads
Method Detail

public void clearQueue()
Clears the crawl queue.

public void crawl(java.lang.Object crawlObject)
Starts crawling by first adding the given page to the queue.

public void crawl(java.lang.Object crawlObject, int depth)
Starts crawling by first adding the given page to the queue.

public void crawl()
Starts crawling the pages added to the queue with addPageToQueue.

public void run()
The Runnable implementation.
Specified by: run in interface java.lang.Runnable

public void add(java.lang.Object crawlObject, int depth)
Adds a URL to the queue of the crawler.
Specified by: add in interface CrawlerAccess

public void addObject(java.lang.Object data)
Gives any object constructed from the crawled page to the callback object; it is up to the CrawlerAccess implementation to decide what to do with it.
Specified by: addObject in interface CrawlerAccess

public void setDonePages(java.util.HashSet hm)
Sets the HashSet that contains pages that have already been crawled. With setDonePages and getDonePages you can set up multiple Crawlers that don't crawl pages that some other Crawler has already processed.

public java.util.HashSet getDonePages()
Gets the HashSet that contains pages that have already been crawled.
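A minimal sketch of the multi-crawler setup that setDonePages and getDonePages enable, under stated assumptions: the seed objects and depths are hypothetical, and both crawlers are presumed configured elsewhere.

```java
import java.util.HashSet;
import org.wandora.piccolo.utils.crawler.WebCrawler; // package path assumed

public class SharedDonePagesDemo {
    public static void main(String[] args) throws Exception {
        WebCrawler first = new WebCrawler();
        WebCrawler second = new WebCrawler();

        // Both crawlers consult the same set of already-crawled pages,
        // so neither revisits a page the other has finished.
        HashSet done = new HashSet(); // raw type, matching the documented signatures
        first.setDonePages(done);
        second.setDonePages(first.getDonePages());

        first.crawl("http://example.com/a", 2);  // hypothetical seeds and depths
        second.crawl("http://example.com/b", 2);
    }
}
```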
public int pagesProcessed()

public void loadSettings(org.w3c.dom.Element rootElement)
                  throws java.lang.Exception
Overrides: loadSettings in class AbstractCrawler
Throws: java.lang.Exception

public java.util.HashMap getPagesDone()

public java.util.HashMap getTimeTaken()
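Because loadSettings above takes a org.w3c.dom.Element, a caller typically parses an XML settings file first. A minimal sketch, assuming a hypothetical settings.xml; the actual settings schema is not documented here.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.wandora.piccolo.utils.crawler.WebCrawler; // package path assumed

public class LoadSettingsDemo {
    public static void main(String[] args) throws Exception {
        // Parse the settings file with the standard DOM API.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse("settings.xml"); // hypothetical file; schema assumed
        Element root = doc.getDocumentElement();

        WebCrawler crawler = new WebCrawler();
        crawler.loadSettings(root); // signature as documented; may throw Exception
        crawler.crawl();            // crawl whatever the settings configured (assumed behavior)
    }
}
```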
public static void main(java.lang.String[] args)
                 throws java.lang.Exception
Throws: java.lang.Exception