Web-scraping with Java

This tutorial introduces scraping information from websites using Java. It starts with the structure of webpages, then looks at how we pull them from the web and find the information inside them.

First up, we need to understand how webpages are written, and their general structure. Webpages are text files that contain a mix of text which you see on the page in a web browser, and "tags" (also made of text) which you don't. Tags are interpreted by the web browser rather than shown as text, and they tell the browser to do various things (insert images, make a link, etc.). You can view the complete text, tags and all, by picking your browser's "View Source" option, which you can usually get to by right-clicking a webpage.

A typical webpage source looks like the example below. Here's this as a webpage: start.html.

Note that most tags, with the exception of <BR />, have opening and closing tags (closing tags are the same, but with a forward-slash before the name), and that tags nest inside other tags.

Note also that the format of the text in the file has no effect -- line breaks only occur in the final display where there's a <BR /> (line break) or a <P></P> (paragraph block).

<HTML>
<HEAD>
<TITLE>My first webpage</TITLE>
</HEAD>
<BODY>
<P>
This is some text<BR />
and a <A href="http://www.bbc.co.uk">link</A>
</P>
<IMG src="brushedsteel.jpg"></IMG>
</BODY>
</HTML>

If you're going to do any web scraping, you need to become familiar with HTML. You can find a basic tutorial here.

When we scrape the web, we need to identify the components in our webpage we want to target. We may, for example, have a table like the one shown further down (here it is as a webpage), and want to pull out all of the second column data.

To do this, we need to identify the BODY, and look inside this to find the TABLE, then look in this to find each TR (table row), then in each row to find the second TD (table datacell).

This process is called navigating the "Document Object Model" (or "DOM"). We treat the webpage as if it were objects nested inside each other.
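The walk described above (BODY, then TABLE, then each TR, then the second TD) can be sketched in Java roughly as follows. This is only an illustration under some assumptions: it uses the XML parser that ships with the JDK, which works here only because the small table page from this tutorial is well-formed once its tags are balanced. Real webpages are often not well-formed, and would generally need a more forgiving HTML parser. The class name SecondColumn and the method secondColumn are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SecondColumn {

    // The tutorial's table page, written with balanced tags (assumption:
    // the page is well-formed enough for an XML parser to accept it).
    static final String PAGE =
        "<HTML><BODY><TABLE id=\"datatable\">"
        + "<TR><TH>A</TH><TH>B</TH></TR>"
        + "<TR><TD>1</TD><TD>2</TD></TR>"
        + "<TR><TD>3</TD><TD>4</TD></TR>"
        + "</TABLE></BODY></HTML>";

    // Hypothetical helper: returns the text of the second TD in every row.
    static List<String> secondColumn(String markup) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(markup)));

            // Step down the nesting: BODY, then TABLE, then each TR.
            Element body = (Element) doc.getElementsByTagName("BODY").item(0);
            Element table = (Element) body.getElementsByTagName("TABLE").item(0);
            NodeList rows = table.getElementsByTagName("TR");

            List<String> values = new ArrayList<>();
            for (int i = 0; i < rows.getLength(); i++) {
                NodeList cells = ((Element) rows.item(i)).getElementsByTagName("TD");
                // The header row holds TH cells rather than TD cells, so it is skipped.
                if (cells.getLength() >= 2) {
                    values.add(cells.item(1).getTextContent().trim());
                }
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the markup", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(secondColumn(PAGE)); // prints [2, 4]
    }
}
```

Each getElementsByTagName call looks only inside the element it is called on, which mirrors the idea of objects nested inside each other.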

In addition to the main tag name, tags can have "attributes". You can see this with the A ("anchor") link tag, in the first page above -- it has an href (hypertext reference) attribute showing where to link to, and the IMG ("image") tag, which has a src (source) attribute giving the image filename.
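As a rough sketch of reading attributes in Java, the example below pulls the href from the A tag and the src from the IMG tag of the first page. It makes the same assumption as any XML-based approach: the page text is written with balanced tags so that the JDK's XML parser accepts it, which many real pages would not satisfy. The class name ReadAttributes and the helper attributeOfFirst are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class ReadAttributes {

    // The first page from this tutorial, with its closing tags balanced.
    static final String PAGE =
        "<HTML><HEAD><TITLE>My first webpage</TITLE></HEAD><BODY>"
        + "<P>This is some text<BR />and a "
        + "<A href=\"http://www.bbc.co.uk\">link</A></P>"
        + "<IMG src=\"brushedsteel.jpg\"></IMG>"
        + "</BODY></HTML>";

    // Hypothetical helper: value of one attribute on the first tag with the
    // given name. Returns null if there is no such tag, and an empty string
    // if the tag exists but lacks the attribute.
    static String attributeOfFirst(String markup, String tagName, String attribute) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
            Element element = (Element) doc.getElementsByTagName(tagName).item(0);
            return element == null ? null : element.getAttribute(attribute);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the markup", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(attributeOfFirst(PAGE, "A", "href"));  // prints http://www.bbc.co.uk
        System.out.println(attributeOfFirst(PAGE, "IMG", "src")); // prints brushedsteel.jpg
    }
}
```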

There are two attributes all tags can have that help with scraping. One is the class attribute. Several tags can share the same class, which divides them up into types. The other is the id attribute. An id has to be unique to one specific tag on the page. So you might see, for example:

<TD class=topRow id=topLeft>1</TD> <TD class=topRow id=topRight>4</TD>

This helps when navigating the DOM; we can ask for all TDs of class topRow or a specific TD with id topRight.
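One way to ask for tags by class or by id in plain Java is with XPath expressions, sketched below. The assumptions are worth spelling out: the two TD cells are wrapped in a TR and their attribute values are quoted so that the JDK's XML parser accepts them, and the class test is an exact match, so it would miss a tag carrying several classes at once. The class name ClassAndId and the helper textOf are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ClassAndId {

    // The two example cells, wrapped in a TR, with attribute values quoted.
    static final String ROW =
        "<TR>"
        + "<TD class=\"topRow\" id=\"topLeft\">1</TD>"
        + "<TD class=\"topRow\" id=\"topRight\">4</TD>"
        + "</TR>";

    // Hypothetical helper: text of every node matched by an XPath expression.
    static List<String> textOf(String markup, String xpathExpression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                xpathExpression, doc, XPathConstants.NODESET);

            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent().trim());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + xpathExpression, e);
        }
    }

    public static void main(String[] args) {
        // All TDs of class topRow.
        System.out.println(textOf(ROW, "//TD[@class='topRow']")); // prints [1, 4]
        // The one TD with id topRight.
        System.out.println(textOf(ROW, "//TD[@id='topRight']"));  // prints [4]
    }
}
```

The class query returns a list because many tags may share a class, while the id query should match at most one tag.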

As an example, we've given our TABLE an id:

<HTML>
<BODY>
<TABLE id="datatable">
<TR> <TH>A</TH><TH>B</TH> </TR>
<TR> <TD>1</TD><TD>2</TD> </TR>
<TR> <TD>3</TD><TD>4</TD> </TR>
</TABLE>
</BODY>
</HTML>

Next, let's see how we download a page with Java, and navigate the DOM.