Let’s say you want to find some information inside a web page and display it in a custom way in your app.
This technique is called “scraping.” Let’s also assume you’ve
thought through alternatives to scraping web pages from inside your app,
and are pretty sure that’s what you want to do.
Then you get to the question: how can you programmatically dig
through the HTML and find the part you’re looking for, in the most
robust way possible? Believe it or not, regular expressions won’t cut it!
In this tutorial you’ll find out how! You’ll get hands-on
experience with parsing HTML into an Objective-C data model that your
apps can use.
In fact, you’ll work with some HTML from this very site, downloading a
list of tutorials and also a list of the members of the iOS Tutorial
Team (who are quite awesome, if I do say so myself).
Even if you are pretty sure you never want to parse HTML in your
apps, you might enjoy this tutorial anyway, because it covers some cool
things you can do with XML and querying its elements with XPath.
This tutorial assumes some familiarity with Objective-C and iOS
programming. If you are a complete beginner, you may wish to check out
some of the other tutorials on this site.
Before you begin parsing/scraping web pages in your app, you should
first make sure that this is really the best choice for you.
Do you own the content? If you own the content you’re
scraping, no problem. But if you don’t, it’s kinda dodgy legally. You
can probably get away with it but there are serious questions here. And
remember, IANAL, so if in doubt, check with your lawyer.
What if the content changes? Since web pages are dynamic
things, remember that the format of the page could change at any time. If
you’ve hard-baked assumptions on the format of the web page within your
app, it might break your app, and you’d have to wait through the long
App Store review cycle to get it fixed.
Should you be using a web service instead? If you really
want to be getting data from the web, you should consider using a web
service instead that can return you nice, beautiful XML or JSON. You
could even put the scraping code on the server side instead of in your
app. This has a number of advantages, including offloading the heavy
lifting to the server side, and only having to perform it once,
instead of once per running app instance.
So – assuming you’ve thought this all through, and you’re *really
sure* this is what you want to do, here’s how! But don’t say we didn’t
warn you ;]
Getting Started: How to Climb Trees
As you’re probably aware, HTML (HyperText Markup Language) is a
markup language (it’s in the name!) that tells browsers how to lay out a
web page. By its very nature, this content is in a hierarchy that
defines where within the page a piece of information is to be displayed.
You may also be aware of XML (eXtensible Markup Language). This also
defines a hierarchy of information, and you may at this point be
thinking that perhaps HTML is related to XML. You’d be right to think
that, and also wrong!
There are two flavors of HTML: the one that is pure XML, and the
original, where-it-all-started HTML. You can read about the difference
over at Wikipedia,
but it’s sufficient for the purposes of this tutorial to know that HTML
is “sort of” an XML document, but with more relaxed rules.
Since an XML document has a natural hierarchy in a tree structure, it
makes sense to have some kind of language to describe retrieving
portions of that tree. This is where XPath
comes in. XPath is a language for selecting portions of an XML
document. Fortunately for you, it works just as well with an HTML
document.
For example, consider this portion of HTML:
<html>
  <head>
    <title>Some webpage</title>
  </head>
  <body>
    <p class="normal">This is the first paragraph</p>
    <p class="special">This is the second paragraph. <b>This is in bold.</b></p>
  </body>
</html>
This clearly is in a tree structure that looks like this:
Based on the above diagram, if you wanted to access the title of the
HTML document, then you could use the following XPath expression to walk
the tree and return the corresponding node:
/html/head/title
This would yield a node with just one child: the text “Some webpage.”
Similarly, if you wanted to access the second paragraph, you could use the following XPath expression:
/html/body/p[@class='special']
This would give you access to the node that represents the portion of
the tree underneath <p class='special'>. Note that you have used
the syntax [@class='special'] to say that you want the nodes
which are at html -> body -> p, where the <p> tag has the
“class” attribute set to “special.” If there were more than one
<p> tag with that class, then this expression would have returned
an array of the nodes. But in this case, there’s only one.
With that knowledge in hand, you can now write XPath queries to access anything within the tree!
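To make this concrete, here is a minimal sketch of evaluating an XPath expression against an HTML string by calling the libxml2 C API directly from Objective-C. TextForXPath is a hypothetical helper name, not part of any library, and the sketch assumes your project can find the libxml2 headers and links against libxml2.

```objc
#import <Foundation/Foundation.h>
#import <libxml/HTMLparser.h>
#import <libxml/xpath.h>

// Hypothetical helper: returns the text content of every node in `html`
// that matches the XPath expression `query`.
static NSArray *TextForXPath(NSString *html, NSString *query) {
    NSMutableArray *results = [NSMutableArray array];
    NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];

    // Parse with the forgiving HTML parser rather than the strict XML one.
    htmlDocPtr doc = htmlReadMemory(data.bytes, (int)data.length, NULL, "UTF-8",
                                    HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    if (doc == NULL) {
        return results;
    }

    xmlXPathContextPtr context = xmlXPathNewContext(doc);
    xmlXPathObjectPtr found = NULL;
    if (context != NULL) {
        found = xmlXPathEvalExpression((const xmlChar *)[query UTF8String], context);
    }

    // A query that matches nothing yields a NULL or empty node set.
    if (found != NULL && found->nodesetval != NULL) {
        for (int i = 0; i < found->nodesetval->nodeNr; i++) {
            xmlChar *text = xmlNodeGetContent(found->nodesetval->nodeTab[i]);
            if (text != NULL) {
                [results addObject:[NSString stringWithUTF8String:(const char *)text]];
                xmlFree(text);
            }
        }
    }

    // libxml2 objects are plain C allocations, so free them by hand.
    if (found != NULL) xmlXPathFreeObject(found);
    if (context != NULL) xmlXPathFreeContext(context);
    xmlFreeDoc(doc);
    return results;
}
```

Called with the sample HTML and the query /html/head/title, this returns an array holding the single string “Some webpage”. It is only meant to show the moving parts; it is not a substitute for a proper wrapper.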
Getting Started for Real: Libxml2 and Hpple
Parsing an XML document into a manageable format is a pretty complex
process. But never fear, there is a handy little library that’s included
in the iOS SDK called libxml2.
This may sound scary at first. A C library without a pretty Objective-C wrapping?
Fortunately, thanks to some excellent development, there is an open source library called hpple
that wraps libxml2 nicely using Objective-C objects. Hpple wraps the
creation of the XML document structure, as well as the XPath querying.
While you may feel like you have the hiccups every time you see the
word, in this tutorial you will be using hpple to parse HTML.
Start Xcode and go to File\New\Project, select iOS\Application\Master
Detail Application and click Next. Set up the project like so:
- Project name: HTMLParsing
- Company Identifier: Your usual reverse DNS identifier
- Class Prefix: Leave blank
- Device Family: iPhone
- Use Storyboards: No
- Use Automatic Reference Counting: Yes
- Include Unit Tests: No (you’re living life on the edge)
Click Next and finally, choose a location to save your project.
Creating the Data Model
You’re going to be downloading tutorials and contributor names from
raywenderlich.com, so it would be nice to have these objects modeled in
an Objective-C class for easy access. I know you like to keep your
project organized, so create a group in the project called Model under
the root HTMLParsing group like so (right-click on the HTMLParsing folder to get the context menu):
Next create a new file under the Model group by selecting Model, then
clicking File\New\File (or right-clicking on the folder and selecting
New File…). Select Cocoa Touch\Objective-C class and click Next. Enter
“Tutorial” as the class, and make it a subclass of NSObject. Finally,
click Next and save it along with the rest of the project.
Now select Tutorial.h and make the interface look like this:
@interface Tutorial : NSObject
@property (nonatomic, copy) NSString *title;
@property (nonatomic, copy) NSString *url;
@end
Then select Tutorial.m and make the implementation look like this:
@implementation Tutorial
@synthesize title = _title;
@synthesize url = _url;
@end
Now create another class, again under the Model group, and call it
“Contributor.” Like before, make it a subclass of NSObject. Then make
the interface and implementation look like the following:
// Interface
@interface Contributor : NSObject
@property (nonatomic, copy) NSString *name;
@property (nonatomic, copy) NSString *imageUrl;
@end
// Implementation
@implementation Contributor
@synthesize name = _name;
@synthesize imageUrl = _imageUrl;
@end
Adding the Hpple Code
Note: If you are comfortable with git, then you might want
to consider doing the following by cloning the git repository locally,
rather than downloading the ZIP file.
The hpple project is hosted on GitHub, so open a browser and point it to https://github.com/topfunky/hpple.
Click the “ZIP” button to download a ZIP file containing the project.
Unzip it and open the resulting folder in Finder. You should see
something like this:
Now create another group under the HTMLParsing group called
“hpple,” and drag the TFHpple.h/.m, TFHppleElement.h/.m and
XPathQuery.h/.m files to the newly created group. When you do this, make
sure that you opt to copy the files to the destination group’s folder
and add them to the HTMLParsing target:
Since hpple makes use of libxml2, you need to tell your project where
to find the libxml2 headers, and also to link against it when building.
To do this, select the project root at the top of the project
navigator, go to Build Settings and search for “header search paths.”
Enter the value for the Header Search Paths row as $(SDKROOT)/usr/include/libxml2 and press Enter. It should end up looking like this:
Next select Build Phases and open the “Link Binary With Libraries” section. Click the (+) button and search for libxml2. Select libxml2.dylib and press Add. The project navigator should now look like this:
If you build and run the project now, everything should compile and
link, and you’ll be presented with the standard app that’s created with
the Master-Detail Application template you opted to use:
Sit on Your Arse and Parse
Now that everything is set up, go ahead and parse some HTML! Your first trick will be to parse http://www.raywenderlich.com/tutorials
for a list of tutorials. If you open that page in your
favorite browser and view the source of the page, you should find
something in there like this:
<div class="content-wrapper">
  <h3>Beginning iPhone Programming</h3>
  <ul>
    <li><a href="/?p=1797">How To Create a Simple iPhone App on iOS 5 Tutorial: 1/3</a></li>
    <li><a href="/?p=1845">How To Create a Simple iPhone App on iOS 5 Tutorial: 2/3</a></li>
    <li><a href="/?p=1888">How To Create a Simple iPhone App on iOS 5 Tutorial: 3/3</a></li>
    <li><a href="/?p=10209">My App Crashed – Now What? 1/2</a></li>
    <li><a href="/?p=10505">My App Crashed – Now What? 2/2</a></li>
    <li><a href="/?p=8003">How to Submit Your App to Apple: From No Account to App Store, Part 1</a></li>
    <li><a href="/?p=8045">How to Submit Your App to Apple: From No Account to App Store, Part 2</a></li>
  </ul>
</div>
Note: A lot of irrelevant code has been trimmed out for clarity.
If you draw that in tree format, you come up with something like this:
It should be clear that you can obtain all the tutorials by finding
all the <a> tags within the <li> tags, which are under
<ul> tags, which are in the <div> tag with
class='content-wrapper'. An XPath expression that obtains these is:
//div[@class='content-wrapper']/ul/li/a
Note: The double slash (//) at the front means “search
anywhere in the document for the following tag.” This stops you having
to go right from the top of the tree down through html, then body, etc.
Having located all of the <a> tags, you will then be interested
in the “href” attributes of the <a> tags, and also the text
contents within.
Open MasterViewController.m and add the following imports at the top, since you will need to use these classes later on:
#import "TFHpple.h"
#import "Tutorial.h"
#import "Contributor.h"
Next, add the following method above initWithNibName:bundle:, which will load the list of tutorials from raywenderlich.com:
-(void)loadTutorials {
    // 1
    NSURL *tutorialsUrl = [NSURL URLWithString:@"http://www.raywenderlich.com/tutorials"];
    NSData *tutorialsHtmlData = [NSData dataWithContentsOfURL:tutorialsUrl];
    // 2
    TFHpple *tutorialsParser = [TFHpple hppleWithHTMLData:tutorialsHtmlData];
    // 3
    NSString *tutorialsXpathQueryString = @"//div[@class='content-wrapper']/ul/li/a";
    NSArray *tutorialsNodes = [tutorialsParser searchWithXPathQuery:tutorialsXpathQueryString];
    // 4
    NSMutableArray *newTutorials = [[NSMutableArray alloc] initWithCapacity:0];
    for (TFHppleElement *element in tutorialsNodes) {
        // 5
        Tutorial *tutorial = [[Tutorial alloc] init];
        [newTutorials addObject:tutorial];
        // 6
        tutorial.title = [[element firstChild] content];
        // 7
        tutorial.url = [element objectForKey:@"href"];
    }
    // 8
    _objects = newTutorials;
    [self.tableView reloadData];
}
This might look scary, but let’s break it down and see what’s going on:
- First you need to download the web page, so you create an NSURL with
the appropriate URL string. Then you create an NSData object with the
contents of that URL. This means “tutorialsHtmlData” will contain the
entire HTML document in raw data form.
If you wanted, you could create an NSString from this using NSString’s alloc/initWithData:encoding: to see the data. It would be the same as if you were to “view source” in your browser.
Note:
dataWithContentsOfURL: will block until the data has been returned.
This means that the UI will become unresponsive until the data is
fetched from the server. A better approach is to use NSURLConnection to
asynchronously grab the data, but that’s beyond the scope of this
tutorial.
- Next you create a TFHpple parser with the data that you downloaded.
- Then you set up the appropriate XPath query and ask the parser to
search using the query. This will return an array of nodes (in hpple
land, these are TFHppleElement objects).
- Then you create an array to hold your new tutorial objects and loop over the obtained nodes.
- Inside the loop, you first create a new Tutorial object and add it to the array.
- Then you get the tutorial’s title from the node’s first child’s
contents. If you look back at the tree, you should be able to see that
this is the case.
- Then you get the tutorial’s URL from the “href” attribute of the
node. It’s an <a> tag, so it gives you the linking URL. In our
case, this is the tutorial’s URL.
- Finally you set _objects on the view controller to the new tutorials array you created, and ask the table view to reload its data.
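The blocking download described in the note above can be moved off the main thread. Here is a minimal sketch that does so with Grand Central Dispatch rather than NSURLConnection; FetchHTMLInBackground and its parameters are hypothetical names chosen for illustration, not part of the tutorial project.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: downloads `url` on a background queue, then calls
// `completion` on `callbackQueue` with the raw data (nil if the download failed).
static void FetchHTMLInBackground(NSURL *url,
                                  dispatch_queue_t callbackQueue,
                                  void (^completion)(NSData *data)) {
    dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
        // Still a synchronous call, but it now blocks a background queue
        // instead of the main thread.
        NSData *data = [NSData dataWithContentsOfURL:url];
        dispatch_async(callbackQueue, ^{
            completion(data);
        });
    });
}
```

In a view controller you would pass dispatch_get_main_queue() as the callback queue, and do the parsing and the reloadData call inside the completion block, since UIKit should only be touched from the main thread.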
Before you build and run, do some spring cleaning on this class to
remove some of the default behavior of the template project. Remove the insertNewObject: method, and change viewDidLoad to look like:
-(void)viewDidLoad {
    [super viewDidLoad];
    [self loadTutorials];
}
Make tableView:cellForRowAtIndexPath: look like this:
-(UITableViewCell *)tableView:(UITableView *)tableView cellForRowAtIndexPath:(NSIndexPath *)indexPath {
    static NSString *CellIdentifier = @"Cell";
    UITableViewCell *cell = [tableView dequeueReusableCellWithIdentifier:CellIdentifier];
    if (cell == nil) {
        cell = [[UITableViewCell alloc] initWithStyle:UITableViewCellStyleSubtitle reuseIdentifier:CellIdentifier];
        cell.accessoryType = UITableViewCellAccessoryDisclosureIndicator;
    }
    Tutorial *thisTutorial = [_objects objectAtIndex:indexPath.row];
    cell.textLabel.text = thisTutorial.title;
    cell.detailTextLabel.text = thisTutorial.url;
    return cell;
}
Here we simply set the main label and the detail label to the tutorial title and URL.
Build and run, and you should be greeted with a list of tutorials!
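Notice that the href values in the sample HTML are relative (for example /?p=1797), so the URL shown in each row is not something you could hand to a browser as-is. Here is a minimal sketch of resolving such a value against the page it came from; AbsoluteURLString is a hypothetical helper name, not part of the tutorial project.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: resolves `href` (which may be relative) against the URL
// of the page it was scraped from. Returns nil if `href` is nil or empty.
static NSString *AbsoluteURLString(NSString *href, NSString *pageURLString) {
    if (href.length == 0) {
        return nil;
    }
    NSURL *base = [NSURL URLWithString:pageURLString];
    NSURL *resolved = [NSURL URLWithString:href relativeToURL:base];
    return resolved.absoluteString;
}
```

For instance, resolving /?p=1797 against http://www.raywenderlich.com/tutorials gives http://www.raywenderlich.com/?p=1797. An href that is already absolute passes through unchanged.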
The Fellowship of the Tutorial
Next up, you’ll be downloading the list of Ray’s contributors, i.e. the fellowship of the iOS Tutorial Team. If you open http://www.raywenderlich.com/about in your favorite browser and “View Source” again, somewhere in the file you should see something like this:
<ul class="team-members">
  <li id='mgalloway'>
    <h3>Matt Galloway (Editor, Tutorial Team Member)</h3>
    <img src='/wp-content/images/authors/mgalloway.jpg' alt='Matt Galloway' width='100' height='100'>
  </li>
</ul>
In a tree structure, it looks like this:
This time, your corresponding XPath expression looks like this:
//ul[@class='team-members']/li
This translates to: get me all the <li> tags which are children of a <ul> tag that has class='team-members'.
Back in MasterViewController.m, add an instance variable to
the class continuation category as follows (the class continuation
category is the section beginning with @interface MasterViewController
() at the top of the file, right below the imports section):
@interface MasterViewController () {
    NSMutableArray *_objects;
    NSMutableArray *_contributors;
}
@end
You will be adding the contributors to the new _contributors array.
Next add the following method below loadTutorials in MasterViewController.m:
-(void)loadContributors {
    // 1
    NSURL *contributorsUrl = [NSURL URLWithString:@"http://www.raywenderlich.com/about"];
    NSData *contributorsHtmlData = [NSData dataWithContentsOfURL:contributorsUrl];
    // 2
    TFHpple *contributorsParser = [TFHpple hppleWithHTMLData:contributorsHtmlData];
    // 3
    NSString *contributorsXpathQueryString = @"//ul[@class='team-members']/li";
    NSArray *contributorsNodes = [contributorsParser searchWithXPathQuery:contributorsXpathQueryString];
    // 4
    NSMutableArray *newContributors = [[NSMutableArray alloc] initWithCapacity:0];
    for (TFHppleElement *element in contributorsNodes) {
        // 5
        Contributor *contributor = [[Contributor alloc] init];
        [newContributors addObject:contributor];
        // 6
        for (TFHppleElement *child in element.children) {
            if ([child.tagName isEqualToString:@"img"]) {
                // 7
                @try {
                    contributor.imageUrl = [@"http://www.raywenderlich.com" stringByAppendingString:[child objectForKey:@"src"]];
                }
                @catch (NSException *e) {}
            } else if ([child.tagName isEqualToString:@"h3"]) {
                // 8
                contributor.name = [[child firstChild] content];
            }
        }
    }
    // 9
    _contributors = newContributors;
    [self.tableView reloadData];
}
This should look familiar. That’s because it’s very similar to the
loadTutorials method you wrote! This time, though, there’s slightly
more work that needs to be done to extract the relevant information
about the contributors. Here’s what it all means:
- Same as before, except this time grabbing a different URL. This
time we’re using the About page, so we can get the list of the cool guys
and gals on the team.
- Again, creating a TFHpple parser.
- Execute your desired XPath query.
- Create a new array and loop over the found nodes.
- Create a new Contributor object and add it to your array.
- You need to get at the name and image URL elements from the
<h3> and <img> tags, which are children of the <li>
tag. So you loop over the children and pull out the relevant details as
you find them.
- If this child is an <img> tag, then the “src” attribute tells you the image URL. Note that this is wrapped in a @try{}@catch{} because sometimes internally within hpple, an exception is thrown. I hope that will be fixed upstream at some point.
- If this child is an <h3> tag, then the first child (the text node) will tell you the name of the contributor.
- As before, set the view controller’s _contributors array to the new one you created, and reload the table data.
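If you would rather not rely on @try/@catch, one option is to check the attribute value before using it. The sketch below rests on a guess that the tutorial does not confirm: that at least some of the exceptions come from an <img> with no “src”, since stringByAppendingString: raises when it is passed nil. ContributorImageURLString is a hypothetical helper name.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: builds the absolute image URL from a "src" attribute
// value, returning nil instead of raising when the attribute is missing or empty.
static NSString *ContributorImageURLString(NSString *src) {
    if (src.length == 0) {
        return nil;
    }
    return [@"http://www.raywenderlich.com" stringByAppendingString:src];
}
```

Inside the loop you could then write contributor.imageUrl = ContributorImageURLString([child objectForKey:@"src"]); and keep the @try/@catch only if exceptions still show up.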
All that’s left is to make the table view display the new contributor
data. Change the following table view data source methods to look like
this:
-(NSString*)tableView:(UITableView *)tableView titleForHeaderInSection:(NSInteger)section {
    switch (section) {
        case 0:
            return @"Tutorials";
            break;
        case 1:
            return @"Contributors";
            break;
    }
    return nil;
}

-(NSInteger)numberOfSectionsInTableView:(UITableView *)tableView {
    return 2;
}

-(NSInteger)tableView:(UITableView *)tableView numberOfRowsInSection:(NSInteger)section {
    switch (section) {
        case 0:
            return _objects.count;
            break;
        case 1:
            return _contributors.count;
            break;
    }
    return 0;
}

-(UITableViewCell *)tableView:(UITableView *)tableView cellForRowAtIndexPath:(NSIndexPath *)indexPath {
    static NSString *CellIdentifier = @"Cell";
    UITableViewCell *cell = [tableView dequeueReusableCellWithIdentifier:CellIdentifier];
    if (cell == nil) {
        cell = [[UITableViewCell alloc] initWithStyle:UITableViewCellStyleSubtitle reuseIdentifier:CellIdentifier];
        cell.accessoryType = UITableViewCellAccessoryDisclosureIndicator;
    }
    if (indexPath.section == 0) {
        Tutorial *thisTutorial = [_objects objectAtIndex:indexPath.row];
        cell.textLabel.text = thisTutorial.title;
        cell.detailTextLabel.text = thisTutorial.url;
    } else if (indexPath.section == 1) {
        Contributor *thisContributor = [_contributors objectAtIndex:indexPath.row];
        cell.textLabel.text = thisContributor.name;
    }
    return cell;
}
Finally, add the following at the bottom of your viewDidLoad:
[self loadContributors];
Then build and run. You should see a list of not only the tutorials but also the contributors! Great work!
Where to Go From Here?
Here is a sample project with all of the code from this tutorial.
I’ve shown you how to parse some simple HTML into a data model. I
showed how to grab various bits of information out of the HTML, but you
might want to consider some additions. Can you:
- Parse each tutorial’s HTML data (i.e. the web page at each Tutorial
object’s ‘url’) and extract the contributor who wrote that article?
- Download the image of each contributor and show it in the table view next to the contributor’s name?
- Make the phone open Safari to that tutorial’s or contributor’s URL when you tap on each row?
- Perform the fetching of HTML data and parsing on a background thread so that it doesn’t lock the UI?
I hope you have enjoyed learning about HTML parsing on iOS. If you
have any further questions then I’d love to hear about them in the
forums!
This is a blog post by iOS Tutorial Team member Matt Galloway, founder of SwipeStack, a mobile development team based in London, UK.