How to Parse HTML on iOS

IPHONE 2014. 4. 16. 15:58

This is a blog post by iOS Tutorial Team member Matt Galloway, founder of SwipeStack, a mobile development team based in London, UK.

Let’s say you want to find some information inside a web page and display it in a custom way in your app.

This technique is called “scraping.” Let’s also assume you’ve thought through alternatives to scraping web pages from inside your app, and are pretty sure that’s what you want to do.

Well then you get to the question – how can you programmatically dig through the HTML and find the part you’re looking for, in the most robust way possible? Believe it or not, regular expressions won’t cut it!

Well, in this tutorial you’ll find out how! You’ll get hands-on experience with parsing HTML into an Objective-C data model that your apps can use.

In fact, you’ll work with some HTML from this very site, downloading a list of tutorials and also a list of the members of the iOS Tutorial Team (who are quite awesome, if I do say so myself).

Even if you are pretty sure you never want to parse HTML in your apps, you might enjoy this tutorial anyway, because it covers some cool things you can do with XML and querying its elements with XPath.

This tutorial assumes some familiarity with Objective-C and iOS programming. If you are a complete beginner, you may wish to check out some of the other tutorials on this site.

Let’s start scraping!

But Wait!

Are you sure that's what you want to be doing?


Before you begin parsing/scraping web pages in your app, you should first make sure that this is really the best choice for you.

Scraping web pages from your app is not always the best choice. Ask yourself these questions first:

  • Do you own the content? If you own the content you’re scraping, no problem. But if you don’t, it’s kinda dodgy legally. You can probably get away with it but there are serious questions here. And remember, IANAL – so for more reading, check out this, this, or even better – check with your lawyer.
  • What if the content changes? Since web pages are dynamic things, remember that the format of the page could change at any time. If you’ve hard-baked assumptions on the format of the web page within your app, it might break your app, and you’d have to wait through the long App Store review cycle to get it fixed.
  • Should you be using a web service instead? If you really want to be getting data from the web, you should consider using a web service instead that can return you nice, beautiful XML or JSON. You could even put the scraping code on the server side instead of in your app. This has a number of advantages, including offloading the heavy lifting to the server side, and only having to perform it 1 time, instead of 1 time per running app instance.
So – assuming you’ve thought this all through, and you’re *really sure* this is what you want to do, here’s how! But don’t say we didn’t warn you ;]

    Getting Started: How to Climb Trees

    As you’re probably aware, HTML (HyperText Markup Language) is a markup language (it’s in the name!) that tells browsers how to lay out a web page. By its very nature, this content is in a hierarchy that defines where within the page a piece of information is to be displayed.

    You may also be aware of XML (eXtensible Markup Language). This also defines a hierarchy of information, and you may at this point be thinking that perhaps HTML is related to XML. You’d be right to think that, and also wrong!

    There are two flavors of HTML: the one that is pure XML, and the original, where-it-all-started HTML. You can read about the difference over at Wikipedia, but it’s sufficient for the purposes of this tutorial to know that HTML is “sort of” an XML document, but with more relaxed rules.

    Since an XML document has a natural hierarchy in a tree structure, it makes sense to have some kind of language to describe retrieving portions of that tree. This is where XPath comes in. XPath is a language for selecting portions of an XML document. Fortunately for you, it works just as well with an HTML document.

    For example, consider this portion of HTML:

    <html>
      <head>
        <title>Some webpage</title>
      </head>
      <body>
        <p class="normal">This is the first paragraph</p>
        <p class="special">This is the second paragraph. <b>This is in bold.</b></p>
      </body>
    </html>

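    This clearly is in a tree structure that looks like this:

    html
    ├── head
    │   └── title
    │       └── "Some webpage"
    └── body
        ├── p (class="normal")
        │   └── "This is the first paragraph"
        └── p (class="special")
            ├── "This is the second paragraph. "
            └── b
                └── "This is in bold."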

    Based on the above diagram, if you wanted to access the title of the HTML document, then you could use the following XPath expression to walk the tree and return the corresponding node:

    /html/head/title

    This would yield a node with just one child: the text “Some webpage.”

    Similarly, if you wanted to access the second paragraph, you could use the following XPath expression:

    /html/body/p[@class='special']

    This would give you access to the node that represents the portion of the tree underneath <p class="special">. Note that you have used the syntax [@class='special'] to say that you want the nodes which are at html -> body -> p, where the <p> tag has the “class” attribute set to “special.” If there were more than one <p> tag with that class, then this expression would have returned an array of the nodes. But in this case, there’s only one.

    With that knowledge in hand, you can now write XPath queries to access anything within the tree!
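To make the two expressions above concrete, here is a minimal sketch that evaluates an XPath query against an HTML string by calling libxml2 directly. The helper name TextOfNodesMatchingXPath is hypothetical (it is not part of any library), and the sketch assumes the libxml2 headers are on your header search path and that the target links against libxml2.

```objective-c
#import <Foundation/Foundation.h>
#import <libxml/HTMLparser.h>
#import <libxml/xpath.h>

// Hypothetical helper, for illustration only: returns the text content of
// every node in `html` that matches `xpath`, or nil if the HTML cannot be
// parsed or the expression cannot be evaluated.
static NSArray *TextOfNodesMatchingXPath(NSString *html, NSString *xpath) {
    NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];
    htmlDocPtr doc = htmlReadMemory(data.bytes, (int)data.length, NULL, "UTF-8",
                                    HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    if (doc == NULL) return nil;

    xmlXPathContextPtr context = xmlXPathNewContext(doc);
    xmlXPathObjectPtr result = NULL;
    if (context != NULL) {
        result = xmlXPathEvalExpression((const xmlChar *)xpath.UTF8String, context);
    }

    NSMutableArray *texts = nil;
    if (result != NULL) {
        texts = [NSMutableArray array];
        xmlNodeSetPtr nodes = result->nodesetval;
        int count = (nodes != NULL) ? nodes->nodeNr : 0;
        for (int i = 0; i < count; i++) {
            // xmlNodeGetContent concatenates the text of the node and its descendants.
            xmlChar *content = xmlNodeGetContent(nodes->nodeTab[i]);
            if (content == NULL) continue;
            NSString *text = [NSString stringWithUTF8String:(const char *)content];
            if (text != nil) [texts addObject:text];
            xmlFree(content);
        }
        xmlXPathFreeObject(result);
    }

    // libxml2 is plain C, so everything it allocates must be freed by hand.
    if (context != NULL) xmlXPathFreeContext(context);
    xmlFreeDoc(doc);
    return texts;
}
```

Calling it with the sample page and /html/head/title should give an array holding the title text, while //p should match both paragraphs. The manual memory management is the main reason a wrapper is attractive.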

    Getting Started for Real: Libxml2 and Hpple

    Parsing an XML document into a manageable format is a pretty complex process. But never fear, there is a handy little library that’s included in the iOS SDK called libxml2.

    This may sound scary at first: libxml2 is a C library, without a pretty Objective-C wrapping.

    Fortunately, thanks to some excellent development, there is an open source library called hpple that wraps libxml2 nicely using Objective-C objects. Hpple wraps the creation of the XML document structure, as well as the XPath querying.

    While you may feel like you have the hiccups every time you see the word, in this tutorial you will be using hpple to parse HTML.

    Start Xcode and go to File\New\Project, select iOS\Application\Master Detail Application and click Next. Set up the project like so:

    • Project name: HTMLParsing
    • Company Identifier: Your usual reverse DNS identifier
    • Class Prefix: Leave blank
    • Device Family: iPhone
    • Use Storyboards: No
    • Use Automatic Reference Counting: Yes
    • Include Unit Tests: No (you’re living life on the edge)

    Click Next and finally, choose a location to save your project.

    Creating the Data Model

    You’re going to be downloading tutorials and contributor names from raywenderlich.com, so it would be nice to have these objects modeled in an Objective-C class for easy access. I know you like to keep your project organized, so create a group in the project called Model under the root HTMLParsing group like so (right-click on the HTMLParsing folder to get the context menu):

    Next create a new file under the Model group by selecting Model, then clicking File\New\File (or right-clicking on the folder and selecting New File…). Select Cocoa Touch\Objective-C class and click Next. Enter “Tutorial” as the class, and make it a subclass of NSObject. Finally, click Next and save it along with the rest of the project.

    Now select Tutorial.h and make the interface look like this:

    @interface Tutorial : NSObject
     
    @property (nonatomic, copy) NSString *title;
    @property (nonatomic, copy) NSString *url;
     
    @end

    Then select Tutorial.m and make the implementation look like this:

    @implementation Tutorial
     
    @synthesize title = _title;
    @synthesize url = _url;
     
    @end

    Now create another class, again under the Model group, and call it “Contributor.” Like before, make it a subclass of NSObject. Then make the interface and implementation look like the following:

    // Interface
    @interface Contributor : NSObject
     
    @property (nonatomic, copy) NSString *name;
    @property (nonatomic, copy) NSString *imageUrl;
     
    @end
     
    // Implementation
    @implementation Contributor
     
    @synthesize name = _name;
    @synthesize imageUrl = _imageUrl;
     
    @end

    Adding the Hpple Code

    Note: If you are comfortable with git, then you might want to consider doing the following by cloning the git repository locally, rather than downloading the ZIP file.

    The hpple project is hosted on GitHub, so open a browser and point it to https://github.com/topfunky/hpple. Click the “ZIP” button to download a ZIP file containing the project. Unzip it and open the resulting folder in Finder. You should see something like this:

    Now create another group under the HTMLParsing group called “hpple,” and drag the TFHpple.h/.m, TFHppleElement.h/.m and XPathQuery.h/.m files to the newly created group. When you do this, make sure that you opt to copy the files to the destination group’s folder and add them to the HTMLParsing target:

    Since hpple makes use of libxml2, you need to tell your project where to find the libxml2 headers, and also to link against it when building.

    To do this, select the project root at the top of the project navigator, go to Build Settings and search for “header search paths.” Enter the value for the Header Search Paths row as $(SDKROOT)/usr/include/libxml2 and press Enter. It should end up looking like this:

    Next select Build Phases and open the “Link Binary With Libraries” section. Click the (+) button and search for libxml2. Select libxml2.dylib and press Add. The project navigator should now look like this:

    If you build and run the project now, everything should compile and link, and you’ll be presented with the standard app that’s created with the Master-Detail Application template you opted to use:

    Sit on Your Arse and Parse

    Now that everything is set up, go ahead and parse some HTML! Your first trick will be to parse http://www.raywenderlich.com/tutorials for a list of tutorials. If you open that page in your favorite browser and view the source of the page, you should find something in there like this:

    <div class="content-wrapper">
    <h3>Beginning iPhone Programming</h3>
        <ul>
            <li><a href="/?p=1797">How To Create a Simple iPhone App on iOS 5 Tutorial: 1/3</a></li>
            <li><a href="/?p=1845">How To Create a Simple iPhone App on iOS 5 Tutorial: 2/3</a></li>
            <li><a href="/?p=1888">How To Create a Simple iPhone App on iOS 5 Tutorial: 3/3</a></li>
            <li><a href="/?p=10209">My App Crashed &#8211; Now What? 1/2</a></li>
            <li><a href="/?p=10505">My App Crashed &#8211; Now What? 2/2</a></li>
            <li><a href="/?p=8003">How to Submit Your App to Apple: From No Account to App Store, Part 1</a></li>
            <li><a href="/?p=8045">How to Submit Your App to Apple: From No Account to App Store, Part 2</a></li>
        </ul>
    </div>

    Note: A lot of irrelevant code has been trimmed out for clarity.

    If you draw that in tree format, you come up with something like this:

    Tutorials tree

    It should be clear that you can obtain all the tutorials by finding all the <a> tags within the <li> tags, which are under <ul> tags, which are in the <div> tag with class="content-wrapper". An XPath expression that obtains these is:

    //div[@class='content-wrapper']/ul/li/a

    Note: The double slash (//) at the front means “search anywhere in the document for the following tag.” This stops you having to go right from the top of the tree down through html, then body, etc.

    Having located all of the <a> tags, you will then be interested in the “href” attributes of the <a> tags, and also the text contents within.
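Before handing this over to hpple, it may help to see what "the href attribute and the text contents" means at the libxml2 level. The following is only a sketch under assumptions: LinksMatchingXPath is a hypothetical helper, the dictionary keys "title" and "href" are made up for this example, and libxml2 must be available to the target.

```objective-c
#import <Foundation/Foundation.h>
#import <libxml/HTMLparser.h>
#import <libxml/xpath.h>

// Hypothetical helper, for illustration only: returns one dictionary per
// matching node, holding its text under "title" and its href attribute
// under "href". Nodes without an href attribute are skipped.
static NSArray *LinksMatchingXPath(NSData *htmlData, const char *xpath) {
    NSMutableArray *links = [NSMutableArray array];
    if (htmlData.length == 0 || xpath == NULL) return links;

    // Passing NULL for the encoding lets libxml2 detect it from the document.
    htmlDocPtr doc = htmlReadMemory(htmlData.bytes, (int)htmlData.length, NULL, NULL,
                                    HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    if (doc == NULL) return links;

    xmlXPathContextPtr context = xmlXPathNewContext(doc);
    xmlXPathObjectPtr result = NULL;
    if (context != NULL) {
        result = xmlXPathEvalExpression((const xmlChar *)xpath, context);
    }
    xmlNodeSetPtr nodes = (result != NULL) ? result->nodesetval : NULL;

    for (int i = 0; nodes != NULL && i < nodes->nodeNr; i++) {
        xmlNodePtr node = nodes->nodeTab[i];
        xmlChar *text = xmlNodeGetContent(node);                   // text inside <a>...</a>
        xmlChar *href = xmlGetProp(node, (const xmlChar *)"href"); // attribute value
        if (text != NULL && href != NULL) {
            NSString *title = [NSString stringWithUTF8String:(const char *)text];
            NSString *url = [NSString stringWithUTF8String:(const char *)href];
            if (title != nil && url != nil) {
                [links addObject:@{ @"title" : title, @"href" : url }];
            }
        }
        if (text != NULL) xmlFree(text);
        if (href != NULL) xmlFree(href);
    }

    if (result != NULL) xmlXPathFreeObject(result);
    if (context != NULL) xmlXPathFreeContext(context);
    xmlFreeDoc(doc);
    return links;
}
```

Run against the trimmed HTML above with //div[@class='content-wrapper']/ul/li/a, this should return one entry per tutorial link, with relative hrefs such as /?p=1797 left exactly as they appear in the page.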

    Open MasterViewController.m and add the following imports at the top, since you will need to use these classes later on:

    #import "TFHpple.h"
    #import "Tutorial.h"
    #import "Contributor.h"

    Next, add the following method above initWithNibName:bundle:, which will load the list of tutorials from raywenderlich.com:

    -(void)loadTutorials {
        // 1
        NSURL *tutorialsUrl = [NSURL URLWithString:@"http://www.raywenderlich.com/tutorials"];
        NSData *tutorialsHtmlData = [NSData dataWithContentsOfURL:tutorialsUrl];
     
        // 2
        TFHpple *tutorialsParser = [TFHpple hppleWithHTMLData:tutorialsHtmlData];
     
        // 3
        NSString *tutorialsXpathQueryString = @"//div[@class='content-wrapper']/ul/li/a";
        NSArray *tutorialsNodes = [tutorialsParser searchWithXPathQuery:tutorialsXpathQueryString];
     
        // 4
        NSMutableArray *newTutorials = [[NSMutableArray alloc] initWithCapacity:0];
        for (TFHppleElement *element in tutorialsNodes) {
            // 5
            Tutorial *tutorial = [[Tutorial alloc] init];
            [newTutorials addObject:tutorial];
     
            // 6
            tutorial.title = [[element firstChild] content];
     
            // 7
            tutorial.url = [element objectForKey:@"href"];
        }
     
        // 8
        _objects = newTutorials;
        [self.tableView reloadData];
    }

    This might look scary, but let’s break it down and see what’s going on:

    1. First you need to download the web page, so you create an NSURL with the appropriate URL string. Then you create an NSData object with the contents of that URL. This means “tutorialsHtmlData” will contain the entire HTML document in raw data form.

      If you wanted, you could create an NSString from this using NSString’s alloc/initWithData:encoding: to see the data. It would be the same as if you were to “view source” in your browser.

      Note: dataWithContentsOfURL: will block until the data has been returned. This means that the UI will become unresponsive until the data is fetched from the server. A better approach is to fetch the data asynchronously with NSURLConnection (or NSURLSession on iOS 7 and later), but that’s beyond the scope of this tutorial.

    2. Next you create a TFHpple parser with the data that you downloaded.
    3. Then you set up the appropriate XPath query and ask the parser to search using the query. This will return an array of nodes (in hpple land, these are TFHppleElement objects).
    4. Then you create an array to hold your new tutorial objects and loop over the obtained nodes.
    5. Inside the loop, you first create a new Tutorial object and add it to the array.
    6. Then you get the tutorial’s title from the node’s first child’s contents. If you look back at the tree, you should be able to see that this is the case.
    7. Then you get the tutorial’s URL from the “href” attribute of the node. It’s an <a> tag, so it gives you the linking URL. In our case, this is the tutorial’s URL.
    8. Finally you set _objects on the view controller to the new tutorials array you created, and ask the table view to reload its data.
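Step 1 mentions turning the downloaded data into a string so you can look at it. A small sketch of that idea follows. DebugStringFromHTMLData is a hypothetical name, and the choice to try UTF-8 first and fall back to ISO Latin 1 is an assumption of this example rather than anything the page guarantees.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper, for illustration only: decodes downloaded HTML so it
// can be logged. Tries UTF-8 first, then falls back to ISO Latin 1, which
// maps every byte to a character. Returns nil when there is no data, which
// is what dataWithContentsOfURL: hands back if the download fails.
static NSString *DebugStringFromHTMLData(NSData *htmlData) {
    if (htmlData == nil) return nil;

    NSString *html = [[NSString alloc] initWithData:htmlData
                                           encoding:NSUTF8StringEncoding];
    if (html == nil) {
        // The bytes were not valid UTF-8, so decode them byte-for-byte instead.
        html = [[NSString alloc] initWithData:htmlData
                                     encoding:NSISOLatin1StringEncoding];
    }
    return html;
}
```

You could call NSLog(@"%@", DebugStringFromHTMLData(tutorialsHtmlData)); right after the download in loadTutorials to compare the output with “view source” in your browser.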

    Before you build and run, do some spring cleaning on this class to remove some of the default behavior of the template project. Remove the insertNewObject: method, and change viewDidLoad to look like:

    -(void)viewDidLoad {
        [super viewDidLoad];
     
        [self loadTutorials];
    }

    Make tableView:cellForRowAtIndexPath: look like this:

    -(UITableViewCell *)tableView:(UITableView *)tableView cellForRowAtIndexPath:(NSIndexPath *)indexPath {
        static NSString *CellIdentifier = @"Cell";
     
        UITableViewCell *cell = [tableView dequeueReusableCellWithIdentifier:CellIdentifier];
        if (cell == nil) {
            cell = [[UITableViewCell alloc] initWithStyle:UITableViewCellStyleSubtitle reuseIdentifier:CellIdentifier];
            cell.accessoryType = UITableViewCellAccessoryDisclosureIndicator;
        }
     
        Tutorial *thisTutorial = [_objects objectAtIndex:indexPath.row];
        cell.textLabel.text = thisTutorial.title;
        cell.detailTextLabel.text = thisTutorial.url;
     
        return cell;
    }

    Here we simply set the main label and the detail label to the tutorial title and URL.

    Build and run, and you should be greeted with a list of tutorials!

    The Fellowship of the Tutorial

    Next up, you’ll be downloading the list of Ray’s contributors, i.e. the fellowship of the iOS Tutorial Team. If you open http://www.raywenderlich.com/about in your favorite browser and “View Source” again, somewhere in the file you should see something like this:

    <ul class="team-members">
        <li id='mgalloway'>
            <h3>Matt Galloway (Editor, Tutorial Team Member)</h3>
            <img src='/wp-content/images/authors/mgalloway.jpg' alt='Matt Galloway' width='100' height='100'>
        </li>
    </ul>

    In a tree structure, it looks like this:

    Contributors tree

    This time, your corresponding XPath expression looks like this:

    //ul[@class='team-members']/li

    This translates to: get me all the <li> tags which are children of a <ul> tag that has class="team-members".

    Back in MasterViewController.m, add an instance variable to the class continuation category as follows (the class continuation category is the section beginning with @interface MasterViewController () at the top of the file, right below the imports section):

    @interface MasterViewController () {
        NSMutableArray *_objects;
        NSMutableArray *_contributors;
    }

    You will be adding the contributors to the new _contributors array.

    Next add the following method below loadTutorials in MasterViewController.m:

    -(void)loadContributors {
        // 1
        NSURL *contributorsUrl = [NSURL URLWithString:@"http://www.raywenderlich.com/about"];
        NSData *contributorsHtmlData = [NSData dataWithContentsOfURL:contributorsUrl];
     
        // 2
        TFHpple *contributorsParser = [TFHpple hppleWithHTMLData:contributorsHtmlData];
     
        // 3
        NSString *contributorsXpathQueryString = @"//ul[@class='team-members']/li";
        NSArray *contributorsNodes = [contributorsParser searchWithXPathQuery:contributorsXpathQueryString];
     
        // 4
        NSMutableArray *newContributors = [[NSMutableArray alloc] initWithCapacity:0];
        for (TFHppleElement *element in contributorsNodes) {
            // 5
            Contributor *contributor = [[Contributor alloc] init];
            [newContributors addObject:contributor];
     
            // 6
            for (TFHppleElement *child in element.children) {
                if ([child.tagName isEqualToString:@"img"]) {
                    // 7
                    @try {
                        contributor.imageUrl = [@"http://www.raywenderlich.com" stringByAppendingString:[child objectForKey:@"src"]];
                    }
                    @catch (NSException *e) {}
                } else if ([child.tagName isEqualToString:@"h3"]) {
                    // 8
                    contributor.name = [[child firstChild] content];
                }
            }
        }
     
        // 9
        _contributors = newContributors;
        [self.tableView reloadData];
    }

    This should look familiar. That’s because it’s very similar to the loadTutorials method you wrote! This time, though, there’s slightly more work that needs to be done to extract the relevant information about the contributors. Here’s what it all means:

    1. Same as before, except this time grabbing a different URL. This time you’re using the About page, so you can get the list of the cool guys and gals on the team.
    2. Again, creating a TFHpple parser.
    3. Execute your desired XPath query.
    4. Create a new array and loop over the found nodes.
    5. Create a new Contributor object and add it to your array.
    6. You need to get at the name and image URL elements from the <h3> and <img> tags, which are children of the <li> tag. So you loop over the children and pull out the relevant details as you find them.
    7. If this child is an <img> tag, then the “src” attribute tells you the image URL. Note that this is wrapped in a @try{}@catch{} because sometimes internally within hpple, an exception is thrown. I hope that will be fixed upstream at some point.
    8. If this child is a <h3> tag, then the first child (the text node) will tell you the name of the contributor.
    9. As before, set the view controller’s _contributors array to the new one you created, and reload the table data.
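Step 7 builds the image URL by gluing the site’s address onto the “src” value. One alternative, sketched here, is to let NSURL resolve the relative path against the page’s own URL. AbsoluteURLString is a hypothetical helper name. Note that stringByAppendingString: raises an exception when given nil, so a missing “src” attribute is one possible (unconfirmed) cause of the exception mentioned above, and this version sidesteps it by returning nil.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper, for illustration only: resolves an href or src value
// found in a page against that page's URL. Returns nil if the value is
// missing or cannot be turned into a URL.
static NSString *AbsoluteURLString(NSString *href, NSURL *pageURL) {
    if (href.length == 0) return nil;

    // Relative values are resolved against pageURL, while values that are
    // already absolute come back unchanged.
    NSURL *resolved = [NSURL URLWithString:href relativeToURL:pageURL];
    return resolved.absoluteString;
}
```

Inside the loop you might then write contributor.imageUrl = AbsoluteURLString([child objectForKey:@"src"], contributorsUrl); which also keeps working if the page ever switches to absolute image URLs.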

    All that’s left is to make the table view display the new contributor data. Change the following table view data source methods to look like this:

    -(NSString*)tableView:(UITableView *)tableView titleForHeaderInSection:(NSInteger)section {
        switch (section) {
            case 0:
                return @"Tutorials";
                break;
            case 1:
                return @"Contributors";
                break;
        }
        return nil;
    }
     
    -(NSInteger)numberOfSectionsInTableView:(UITableView *)tableView {
        return 2;
    }
     
    -(NSInteger)tableView:(UITableView *)tableView numberOfRowsInSection:(NSInteger)section {
        switch (section) {
            case 0:
                return _objects.count;
                break;
            case 1:
                return _contributors.count;
                break;
        }
        return 0;
    }
     
    -(UITableViewCell *)tableView:(UITableView *)tableView cellForRowAtIndexPath:(NSIndexPath *)indexPath {
        static NSString *CellIdentifier = @"Cell";
     
        UITableViewCell *cell = [tableView dequeueReusableCellWithIdentifier:CellIdentifier];
        if (cell == nil) {
            cell = [[UITableViewCell alloc] initWithStyle:UITableViewCellStyleSubtitle reuseIdentifier:CellIdentifier];
            cell.accessoryType = UITableViewCellAccessoryDisclosureIndicator;
        }
     
        if (indexPath.section == 0) {
            Tutorial *thisTutorial = [_objects objectAtIndex:indexPath.row];
            cell.textLabel.text = thisTutorial.title;
            cell.detailTextLabel.text = thisTutorial.url;
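        } else if (indexPath.section == 1) {
            Contributor *thisContributor = [_contributors objectAtIndex:indexPath.row];
            cell.textLabel.text = thisContributor.name;
            // Clear the subtitle so a reused tutorial cell doesn't show a stale URL.
            cell.detailTextLabel.text = nil;
        }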
     
        return cell;
    }

    Finally, add the following at the bottom of your viewDidLoad:

    [self loadContributors];

    Then build and run. You should see a list of not only the tutorials but also the contributors! Great work!

    Where to Go From Here?

    Here is a sample project with all of the code from this tutorial.

    I’ve shown you how to parse some simple HTML into a data model. I showed how to grab various bits of information out of the HTML, but you might want to consider some additions. Can you:

    • Parse each tutorial’s HTML data (i.e. the web page at each Tutorial object’s ‘url’) and extract the contributor who wrote that article?
    • Download the image of each contributor and show it in the table view next to the contributor’s name?
    • Make the phone open Safari to that tutorial’s or contributor’s URL when you tap on each row?
    • Perform the fetching of HTML data and parsing on a background thread so that it doesn’t lock the UI?
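As a starting point for the last challenge, here is one possible sketch that uses NSURLSession (available from iOS 7) instead of dataWithContentsOfURL:. FetchAndParse and its two block parameters are hypothetical names invented for this example. The parse block is where your hpple code would go, and the completion block runs on the main queue so it can safely touch the table view.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper, for illustration only: downloads `url` without
// blocking the caller, runs `parse` on the session's background queue, then
// delivers the parsed result (or the error) to `completion` on the main queue.
static void FetchAndParse(NSURL *url,
                          id (^parse)(NSData *htmlData),
                          void (^completion)(id result, NSError *error)) {
    NSURLSessionDataTask *task =
        [[NSURLSession sharedSession] dataTaskWithURL:url
                                    completionHandler:^(NSData *data,
                                                        NSURLResponse *response,
                                                        NSError *error) {
            // Still off the main thread here, so slow parsing won't freeze the UI.
            id result = (data != nil) ? parse(data) : nil;

            dispatch_async(dispatch_get_main_queue(), ^{
                completion(result, error);
            });
        }];
    [task resume]; // Tasks are created suspended and do nothing until resumed.
}
```

In loadTutorials, the parse block would build and return the newTutorials array, and the completion block would assign it to _objects and call reloadData.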

    I hope you have enjoyed learning about HTML parsing on iOS. If you have any further questions then I’d love to hear about them in the forums!


    Posted by 컴스터