Requests-HTML: HTML Parsing for Humans

kmike84 · on Feb 26, 2018

+1 to make nicer APIs! It is always good to have more high-quality API designs to look at.

That said, it looks more like an API experiment, not a practical solution for a day job, at least in its current state:

* response body encoding detection is wrong, as it doesn't take meta tags or BOM into account;

* base url detection is wrong, as it doesn't take the <base> tag into account;

* URL parsing (joining, etc.) is implemented using string operations instead of stdlib, so a careful inspection is required to make sure it works in edge cases. For example, I can see right away that .absolute_links is wrong for protocol-relative URLs (e.g. "//ajax.microsoft.com/ajax/jquery/jquery-1.3.2.min.js");

* html2text used for .markdown is GPL - I know people have different opinions on this, but in my book if you import from a GPL package, your package becomes GPL as well;

* each .xpath call parses a chunk of HTML again, even if a tree is already present;

* shortcuts are opinionated and have no clear behavior, e.g. .links deduplicates URLs by default, and it deduplicates them using string matches (so e.g. a different order of GET arguments => URLs are considered unique); it checks for .startswith('#'), which looks arbitrary (what if these links are used in a headless browser? what if someone wants to fetch them using _escaped_fragment which many sites still support? why filter out such URLs if they are relative, but not if they are absolute?)
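To make the URL points concrete, here is a minimal sketch of what leaning on the stdlib could look like. The helper names absolute and canonical are made up for illustration and are not part of requests-html; the canonical form shown (sorted query arguments) is just one possible choice.

```python
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit


def absolute(base_url, href):
    """Resolve href against base_url with urljoin instead of string concatenation."""
    return urljoin(base_url, href)


def canonical(url):
    """Sort query arguments so that ?b=2&a=1 and ?a=1&b=2 compare equal."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


# A protocol-relative URL inherits only the scheme from the base.
print(absolute("https://www.python.org/",
               "//ajax.microsoft.com/ajax/jquery/jquery-1.3.2.min.js"))

# Deduplicating on the canonical form treats reordered arguments as one URL.
print(canonical("https://example.com/search?b=2&a=1"))
```

Whether reordered arguments really mean the same resource depends on the site, so this kind of normalization is probably better as an opt-in than as a default.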

EmilStenstrom · on Feb 26, 2018

Would you mind posting these as issues?

kmike84 · on Feb 26, 2018

TBH I don't see myself using this package: in its current stage it is very little code, and almost every method has an issue either with edge cases or with API; also, it is tied to requests library, unnecessarily IMHO, and in my opinion it is GPL even if setup.py says it is MIT.

Because there is nothing usable code-wise in requests-html from my point of view (it is no better than existing alternatives), I don't feel like raising these issues, advocating for fixing them, discussing alternative solutions with a goal of improving requests-html. Of course, everyone is free to raise these issues in a repo.

I appreciate the work put into requests-html API design, the design is very nice overall. This might be a way to go: create a nice API design, attract people, fix implementation over time, but this battle is not mine, sorry :(

kenneth_reitz · on Feb 26, 2018

GPL dependency removed.

All of these improvements I'd like to be made to the software. It's all about getting a nice API in place first, then making it perfect second.

kenneth_reitz · on Feb 26, 2018

I addressed most of your issues, like not using urlparse, in the latest release.

With libraries like these, it's all about getting the API right first, then optimizing for perfection second. :)

kenneth_reitz · on Feb 26, 2018

<base> tag is now implemented as well. Thanks for bringing that to my attention — I wasn't aware of it!

kmike84 · on Feb 26, 2018

:thumbs up:

A second iteration of review:

* encoding detection from <meta> tags doesn't normalize encodings - Python doesn't use the same names as HTML;

* I'm still not sure encoding detection is correct, as it is unclear what the priorities are in the current implementation. It should be 1) Content-Type header; 2) BOM marks; 3) encoding in meta tags (or the XML-declared encoding if you support it); 4) content-based guessing - chardet, etc., or just a default value. I.e. encoding in meta should have lower priority than the Content-Type header, but higher priority than chardet, and if I understand it properly, response.text is decoded using both the Content-Type header and chardet.

* lxml's fromstring handles XML (XHTML) encoding declarations, and it may fail in case of unicode data (http://lxml.de/parsing.html#python-unicode-strings), so passing response.text to fromstring is not good. At the same time, relying on lxml to detect encoding is not enough, as http headers should have a higher priority. In parsel we're re-encoding text to utf8, and forcing utf8 parser for lxml to solve it: https://github.com/scrapy/parsel/blob/f6103c8808170546ecf046....

* when extracting links, it is not enough to use raw @href attribute values, as they are allowed to have leading and trailing whitespace (see https://github.com/scrapy/w3lib/blob/c1a030582ec30423c40215f...)

* absolute_links doesn't look correct for base urls which contain path. It also may have issues with urls like tel:1122333, or mailto:.
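For the two link items, a rough sketch of the idea using only the stdlib. The function name resolve_href is hypothetical, and this is not the w3lib or requests-html implementation; it only shows stripping HTML whitespace before joining, and how urljoin treats a base URL with a path and a non-hierarchical scheme such as mailto:.

```python
from urllib.parse import urljoin

# ASCII whitespace that may surround an attribute value in HTML.
HTML_WHITESPACE = " \t\n\r\x0c"


def resolve_href(base_url, raw_href):
    """Strip surrounding whitespace from a raw @href value, then resolve it."""
    return urljoin(base_url, raw_href.strip(HTML_WHITESPACE))


base = "https://example.com/docs/guide/index.html"

# Relative links resolve against the directory of the base path.
print(resolve_href(base, "  intro.html\n"))
print(resolve_href(base, "../api/"))

# mailto: is not a hierarchical scheme, so urljoin returns it unchanged.
print(resolve_href(base, "mailto:someone@example.com"))
```

Other schemes (tel: with a purely numeric body, for example) are worth testing separately on the Python version in use rather than assuming they behave the same way.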

For encoding detection we're using https://github.com/scrapy/w3lib/blob/c1a030582ec30423c40215f... in Scrapy. It works well overall; its weakness is that it doesn't require an HTML tree, and doesn't parse it, extracting meta information only from the first 4 KB using a regex (the 4 KB limit is not good). Other than that, it does all the right things AFAIK.
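The priority order described above (header, then BOM, then meta, then a fallback) can be sketched roughly like this. It is an illustration only, not the w3lib code: detect_encoding, normalize and both regexes are made up for the example, the meta lookup is a naive regex, step 4 is just a default instead of chardet, and normalize only maps labels to Python codec names (it does not implement the HTML spec's label table).

```python
import codecs
import re

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM bytes.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]
HEADER_RE = re.compile(r"charset\s*=\s*[\"']?([^\s\"';]+)", re.I)
META_RE = re.compile(rb"<meta[^>]*charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.I)


def normalize(label):
    """Return Python's codec name for a declared label, or None if unknown."""
    try:
        return codecs.lookup(label.strip()).name
    except LookupError:
        return None


def detect_encoding(content_type, body, default="utf-8"):
    # 1) charset from the Content-Type header
    match = HEADER_RE.search(content_type or "")
    if match and normalize(match.group(1)):
        return normalize(match.group(1))
    # 2) byte order mark (the caller still has to drop the BOM bytes)
    for bom, name in BOMS:
        if body.startswith(bom):
            return name
    # 3) <meta> declaration, normalized to a Python codec name
    match = META_RE.search(body)
    if match and normalize(match.group(1).decode("ascii")):
        return normalize(match.group(1).decode("ascii"))
    # 4) fall back to a default (content-based guessing would go here)
    return default


html = b'<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">'
print(detect_encoding("text/html", html))
```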

kenneth_reitz · on Feb 26, 2018

Thanks for the feedback, integrated w3lib!

cup-of-tea · on Feb 26, 2018

Boo, you should have just made it GPL.

amenod · on Feb 25, 2018

Oooo... I just finished writing a script with BeautifulSoup. While it wasn't all that bad (it works ;) ), I'm sure the "Kenneth Reitz experience" would be much better. I won't be rewriting the script now, but I can't wait to find an excuse to try this. :)

EDIT: first commit 22 hours ago - goes to show that when you have thought about the idea and know what you're doing it doesn't take long to produce the first version. :)

halflings · on Feb 25, 2018

Would not recommend BeautifulSoup for this type of thing.

lxml.html is much better in my experience. If you want to use CSS selectors, there's pyquery.

staticautomatic · on Feb 25, 2018

I prefer lxml.etree even for HTML on account of the parser. Either way, I don't understand what's not "for humans" about lxml. It provides a ton of simple and useful abstractions, is easy to learn, and is crazy robust and fast under the hood.

Vindicis · on Feb 26, 2018

I remember years ago I needed to parse some HTML (which was about 2-3 million characters) and after a fair bit of time, I had it up and running with beautifulsoup. Now, my use case was likely quite atypical of most HTML parsing, but my god was it ever slow! I forget the exact numbers, but I think it was taking about 150 seconds to complete. So, then I wrote it using lxml, which was an improvement, but that was still taking around 100 seconds.

Now, I very rarely have any need to scrape and parse HTML data, and I was scratching my head at how it was taking these parsers so long to parse a 3.5 MiB HTML page. I mean, it should be able to go through that and get what I want in under a second, right?

So, I said screw it and wrote some regexes. 10-15 seconds was how long it was now taking to parse that html. It actually took 1.5 or so seconds to parse the html; the rest was waiting for it to download the webpage.

Ironically, implementing the regexes was actually quicker than figuring out how to use those HTML parsers and write the code. Of course, that's assuming you know how to craft regexes. Since it was set up to run every 5 minutes, I wanted something that could do it without spending 1/2 the time parsing the data (amongst other tasks the processors were needed for).

YMMV

Giroflex · on Feb 26, 2018

I actually had the same experience. I was scraping a large number of pages and upon profiling my script, I found out that bs4 was really slow. Changing the parser from the default to lxml helped things a bit, but I decided I would just try a regex to check quickly whether things could be better. Lo and behold, it was much faster. It's true that it's impossible to parse HTML in its entirety with regex, but if you're looking to extract only a portion of data from a page with a known structure, a bit of regex might be the way to go.

jcadam · on Feb 26, 2018

You're using regex to parse html? Have you not read: https://stackoverflow.com/a/1732454/1090568 ?

wodenokoto · on Feb 27, 2018

If you have an HTML document you want to extract information from, regexes are fast and easy.

It's when you don't have guarantees about the structure of the html you are working with, that regex will come up short.
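A small sketch of the "known structure" case, with made-up markup and a hypothetical extract_prices helper. The pattern only works because the rows are generated in exactly this shape; any change to attribute order or nesting would break it.

```python
import re

# Invented sample of machine-generated markup with a fixed row layout.
HTML = """
<table id="prices">
  <tr><td class="name">Widget</td><td class="price">9.99</td></tr>
  <tr><td class="name">Gadget</td><td class="price">24.50</td></tr>
</table>
"""

ROW_RE = re.compile(r'<td class="name">(.*?)</td><td class="price">([\d.]+)</td>')


def extract_prices(html):
    """Pull (name, price) pairs out of rows that follow the fixed layout."""
    return [(name, float(price)) for name, price in ROW_RE.findall(html)]


print(extract_prices(HTML))
```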

davidwtbuxton · on Feb 25, 2018

You can also use css selectors with lxml. Works great, in my experience.

http://lxml.de/cssselect.html

masklinn · on Feb 26, 2018

> lxml.html is much better in my experience.

lxml.html has a terrible parser.

> If you want to use CSS selectors, there's pyquery.

CSS selectors are built into lxml through cssselect[0] which is used to convert CSS3 to XPath 1.0 selectors.

[0] If you want to use CSS selectors, there's pyquery.

EmilStenstrom · on Feb 26, 2018

Lxml is C so annoying to install in some environments. Also, the API is inconsistent and verbose.

I’ve found more luck with html5lib and cssselect2.

sitkack · on Feb 25, 2018

> Kenneth Reitz Experience

I heard KRE was going to headline at the upcoming PyCon.

timb07 · on Feb 25, 2018

For old UNIX nerds, this could cause some confusion: KRE is Robert Elz, originator of timezone support in BSD UNIX.

ggm · on Feb 26, 2018

And the quota system, and the existence of '.oz' as a domain name which regrettably ISO3166 refused to allocate.

sixhobbits · on Feb 25, 2018

Really looking forward to using this. BeautifulSoup is great, but it's counterintuitive (I always need to refer to documentation even for very basic aspects that I've used dozens of times before) and it's often slow and has some really weird xml bugs.

Also check out newspaper3k if you haven't seen it. It is high level, but really useful for a bunch of simple scraping related use cases

Momquist · on Feb 25, 2018

> BeautifulSoup is great, but it's counterintuitive (I always need to refer to documentation even for very basic aspects that I've used dozens of times before) and it's often slow and has some really weird xml bugs.

I won't argue the slowness and the occasional bugs, but unlike you I find bs4 to be very intuitive. And this is mainly why I use it, despite its faults. Maybe our use cases are different, but with a basic knowledge of html, I only rarely find myself reading the documentation. Care to give some examples of what you find counter-intuitive?

kcole16 · on Feb 25, 2018

Same! Python requests is one of my favorite libraries of all time. Kenneth Reitz is a treasure.

realhamster · on Feb 25, 2018

A similar library is https://github.com/tryolabs/requestium

Though it adds parsel as a parser (which has a really nice api) to requests. It also integrates with selenium.

friendlydude12 · on Feb 25, 2018

> Render an Element as Markdown:

I prefer my dependencies to be orthogonal and lightweight i.e. do one thing well. Maybe this is better for interactive use in the REPL.

kenneth_reitz · on Feb 25, 2018

I consider it a nice-to-have, but we can definitely remove it if deemed unnecessary. Want to open an issue about it?

friendlydude12 · on Feb 25, 2018

That’s just my preference. Keep the features that are best for your specific use cases. It’s your project after all. I think that’s better than design by committee. Linus didn’t write Linux for other people.

nojvek · on Feb 26, 2018

If you’re doing markdown conversion well, keep it. Apis are not about doing the smallest thing, it’s about designing a clean abstraction of a problem you’re trying to solve.

friendlydude12 · on Feb 26, 2018

Sure, if APIs existed within a vacuum. In the real world we have to worry about ease of packaging, ease of maintenance, and various other practical costs.

3pt14159 · on Feb 25, 2018

For what it is worth, I strongly prefer leaving it in. Sometimes I want to just generate documentation for different things based on different sources and being able to just clean the html by rendering to Markdown would be great.

arikfr · on Feb 26, 2018

I personally find it useful, although I can see why it feels out of place.

If you do decide to remove it, I think it's worth adding as an example in the README.

mixmastamyk · on Feb 26, 2018

Could be made an optional feature in setup.py.

EmilStenstrom · on Feb 26, 2018

Definitely shouldn’t be there.

ivansavz · on Feb 25, 2018

Nice. It would be pretty cool to have a "run javascript and wait for network idle" option for scraping js-requiring websites. Can selenium do this? Headless chrome?

I personally use bs4 for web scraping and it works pretty well, but if there was an option to also do js with a sane API, I'd switch in a heartbeat.

nojvek · on Feb 26, 2018

We use puppeteer for our smoke tests. Ensure all network requests load, no JavaScript errors, take screenshot of page and simple validations so we know our deploys aren’t borked.

I really love puppeteer over selenium. Much deeper control.

bshimmin · on Feb 25, 2018

Puppeteer has "networkidle0" (0 network connections for 500ms) and "networkidle2" (no more than 2 network connections for 500ms). My experiences with it have been very positive.

_bohm · on Feb 26, 2018

Selenium can absolutely be used for this, the caveat being that it is much slower than using a regular HTML parser. In my experience, it's best to milk as much as you possibly can out of a site's plain HTML and APIs, only resorting to Selenium where it's absolutely necessary.

javajosh · on Feb 25, 2018

The output of r.html.absolute_links on the home page looks like it contains errors. For example,

  'https://www.python.org//docs.python.org/3/tutorial/'

  'https://www.python.org//docs.python.org/3/tutorial/controlflow.html#defining-functions'

I think it's important to remember for every single new library that comes out, that you are trading apparent usability for unknown issues that haven't surfaced yet because the library is so new.

kenneth_reitz · on Feb 25, 2018

That bug has been fixed, but the docs hadn't been. Fixed now :)

kenneth_reitz · on Feb 25, 2018

Looks like there's still some room for improvement though!

javajosh · on Feb 25, 2018

There is always room for improvement. I think it's easy to underestimate the time required to dwell in a thing before we really understand it. This is true for runtimes, libraries, even entire programming paradigms. We want to take shortcuts using abstract reasoning, and that works well usually, but sometimes you just gotta use something for years. (Alas, one lifetime may be too short to dwell in all the various paradigms the way they each require. But I digress...)

mixmastamyk · on Feb 25, 2018

Always, very nice work however. Thanks for the pipenv also, makes working with venvs tolerable.

ameliaquining · on Feb 25, 2018

Interesting! How does this compare with MechanicalSoup (which seems to be the current best-in-class solution for scraping in Python)?

kenneth_reitz · on Feb 25, 2018

Very different use cases, imo. MechanicalSoup emulates a web browser experience. This is more for scraping.

nbrempel · on Feb 25, 2018

I love Kenneth’s work.

He’s absolutely focused on user experience – in this case the developer experience – and it absolutely shows.

I’m sure I’ll end up using this utility at some point in the future.

Const-me · on Feb 25, 2018

Same functionality for .NET: http://html-agility-pack.net/

zerkten · on Feb 26, 2018

There is overlap, but this isn't the same as Requests-HTML. It is nice that this library has been developed further over the years.

nishs · on Feb 25, 2018

> Select an element with a jQuery selector.

What is a 'jQuery' selector? Is it the same as CSS selectors, or does jQuery support non-standard syntax?

madeofpalk · on Feb 25, 2018

For what it's worth, yes, JQuery does support non-standard syntax/selectors

kenneth_reitz · on Feb 25, 2018

updated :)

ecthiender · on Feb 25, 2018

Kenneth Reitz comes out with yet another good UI (in terms of library) to accomplish daily mundane tasks, with joy.

jacquesm · on Feb 25, 2018

That's a neat little library. The problem is that the web is rapidly moving away from having HTML as the main information carrier to HTML merely being an envelope to deliver a bunch of JavaScript (and if you're even more unlucky: WebAsm).

Scraping the web will become a lot harder in the future.

madeofpalk · on Feb 25, 2018

In saying that, Javascript is just as parsable, and these 'javascript sites' are probably loading structured data via a JSON API which will probably be easier to scrape than a bunch of layout HTML.

simonw · on Feb 25, 2018

I'm a lot less worried about that thanks to Chrome Headless and the puppeteer library: https://github.com/GoogleChrome/puppeteer

puppeteer makes scripting headless Chrome for scraping-style tasks trivial, and it's supported by the Chrome development team so it's likely to keep on working long into the future.

ivansavz · on Feb 25, 2018

Is there a difference between Chrome+puppeteer vs Chrome+selenium?

Is there anything special that Chrome headless has that ordinary Chrome wouldn't have?

nojvek · on Feb 26, 2018

Chrome headless is just that. Chrome that’s headless. Selenium is a cross-browser API that connects to the browser via some port. But usually it’s a high-level API since it’s a common denominator of browser APIs. Selenium sometimes uses middleware webdriver libs and is usually quite bloated.

Puppeteer uses Chrome Remote Debug Protocol, which is the same protocol Devtools uses. It’s just a simple JSON RPC over websockets. Puppeteer creates a nice library abstraction over this api.
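To give a feel for how thin that layer is, here is a sketch of the message shape: each command is a JSON object with an id, a method and params. The helper cdp_command is made up for illustration; Page.navigate is a real protocol method, but actually sending the message would need a websocket client and a browser started with remote debugging enabled, neither of which is shown.

```python
import itertools
import json

# Each command carries a unique id so the reply can be matched to it.
_ids = itertools.count(1)


def cdp_command(method, **params):
    """Serialize one command as a JSON object with id, method and params."""
    return json.dumps({"id": next(_ids), "method": method, "params": params})


# This string is what would be written to the websocket.
print(cdp_command("Page.navigate", url="https://example.com/"))
```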

The advantage of puppeteer over selenium is that you have a lot more control over chrome. Network, perf, screenshots, coverage, dom & style traversal, etc. It’s a very reliable API too since Chrome Devtools team maintains the backend.

Selenium on the other hand surprises me with all sorts of quirks.

nurettin · on Feb 26, 2018

Been using selenium and webdriver with ruby for the past five years, there is pretty much nothing it cannot do. Performance is a big hit, though.

make3 · on Feb 26, 2018

I really love Selenium actually.. makes me feel.. powerful, somehow, like an evil genius or something

nurettin · on Feb 26, 2018

My evil genius moment with selenium happened when I ran a headless client to visit the login page of a bank, wait for 2fa sms code from my Android application, then use the code to log in and place some automated stock orders.

simonw · on Feb 25, 2018

I know plenty of people who dislike working with Selenium. I haven't yet met anyone who's had enough experience with Puppeteer to hate it yet.

tobyhinloopen · on Feb 26, 2018

> Scraping the web will become a lot harder in the future.

Meh, if you're lucky you can grab raw data from API requests. Otherwise just let the javascript execute in a headless browser and continue scraping.

jpalomaki · on Feb 25, 2018

Or easier? You just need to look at the endpoints the Javascript is using to get the data in JSON.
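As a sketch of why that can be easier: once you have the JSON an endpoint returns, there is no markup to pick apart. The payload and field names below are invented for illustration; a real site's response would have its own shape, and fetching it (for example with requests) is left out.

```python
import json

# Invented example of what a site's JSON endpoint might return.
PAYLOAD = """
{"items": [{"title": "First post", "score": 12},
           {"title": "Second post", "score": 7}]}
"""


def titles(payload):
    """Read the structured data directly instead of scraping layout HTML."""
    return [item["title"] for item in json.loads(payload)["items"]]


print(titles(PAYLOAD))
```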

laktek · on Feb 25, 2018

Even in JS heavy sites, the rendered output is still HTML. You just need to make sure you add a pre-render step. Page.REST, a micro-service I wrote a couple of months back follows this strategy.

MildlySerious · on Feb 25, 2018

Then again, things keep becoming more API-driven, with the JS delivered just containing the views. Whether it gets easier or harder remains to be seen, I think.

darpa_escapee · on Feb 25, 2018

Looks like this is built on top of lxml and parse. I built an adapter over the bs4 interface in lxml, which was much faster than using bs4 with an lxml backend.

This is great news, as this space was dominated by bs4 in the Python ecosystem.

Can't wait to use this in the future :)

coolgoose · on Feb 25, 2018

Also in PHP with Symfony Dom Crawler: https://symfony.com/doc/current/components/dom_crawler.html or Goutte for an easy to use web scraper https://github.com/FriendsOfPHP/Goutte that uses Dom Crawler

brilee · on Feb 25, 2018

This is a nice wrapper around requests, pyquery https://github.com/gawel/pyquery/ and parse https://github.com/r1chardj0n3s/parse of which only requests is Kenneth Reitz. Let's give credit where it's due.

ak217 · on Feb 25, 2018

To give even more credit where it's due, requests is a nice wrapper around urllib3, which is the work of Andrey Petrov, Cory Benfield and contributors. While requests provides good user-friendly defaults and API semantics, urllib3 does a lot of the heavy lifting.

kenneth_reitz · on Feb 25, 2018

It wasn't at first, actually. It was originally a wrapper around urllib2 — Andrey and I collabed early on in both projects' histories to make them what they are today.

halflings · on Feb 25, 2018

requests does much more on top of urllib3 than this new requests-html does on top of requests + pyquery.

It greatly simplifies the most common patterns in doing HTTP requests; things like authentication, passing headers, retry logic, etc.

tty7 · on Feb 25, 2018

Which is written in python, so better give credit for that.

Wow and now python is in C so line up your credit books, we are in for a long night tonight.

Thoughts & prays for all contributors

0172 · on Feb 26, 2018

To quote Carl Sagan, "If you wish to make an apple pie from scratch, you must first invent the universe."

d33 · on Feb 25, 2018

...I know you're trolling, but I seriously wonder where it would stop if we were to go down all that way.

patneedham · on Feb 25, 2018

Gotta give credit to whichever cavemen/cavewomen discovered fire, and all the civilizations that invented the wheel, Ben Franklin for harnessing the power of electricity, and then maybe Ken Thompson and Dennis Ritchie for inventing Unix.

amenod · on Feb 25, 2018

I think most people around here know this, but it doesn't really matter. We love `requests` because of its API, which is what Kenneth Reitz contributed. There are many competing HTTP client libs, but only one (that I know of) with intuitive syntax; I hope others follow suit. APIs are not just something that you slam on top of your lib, they should be developed in the same way every other interface is - UX is usually much more important than just pure performance.

iiv · on Feb 25, 2018

What do you mean "give credit where it's due"? It's not common (in my experience) to thank every dependency's creator, right?

brilee · on Feb 25, 2018

In about half of the examples, the methods are more user-friendly wrappers around underlying libraries - great! love it! In the other half of the examples, the niceness of the API is in fact due to the user-friendly design of the underlying PyQuery/Parse libraries. So I want to give credit to their good APIs, too.

vinceguidry · on Feb 25, 2018

Is there a shell interface, or do I have to call it through python? It would be nice to be able to use it in a Ruby project.

ospider · on Feb 26, 2018

I don't get the point, it's just a nicer alternative to beautifulsoup. HTML pages are always changing; instead of writing the parsing logic in code, I think we should put xpath or css expressions in some config files.
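A rough sketch of that idea, keeping the selectors as data and the code generic. Everything here is hypothetical: the SELECTORS mapping stands in for a config file, and the stdlib ElementTree is used only to keep the example self-contained (it needs well-formed markup and supports just a subset of XPath, so a real scraper would more likely use lxml or similar).

```python
import xml.etree.ElementTree as ET

# Stand-in for a config file: field name -> XPath expression.
SELECTORS = {
    "title": ".//h1",
    "author": ".//span[@class='author']",
}

PAGE = "<html><body><h1>Hello</h1><span class='author'>Ann</span></body></html>"


def scrape(markup, selectors):
    """Apply each configured expression and collect the first match's text."""
    root = ET.fromstring(markup)
    result = {}
    for field, path in selectors.items():
        node = root.find(path)
        result[field] = node.text if node is not None else None
    return result


print(scrape(PAGE, SELECTORS))
```

When the page layout changes, only the SELECTORS mapping needs to be edited.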

closed · on Feb 26, 2018

I'm sure some pages could be represented as css expressions in config files, but in general it seems like scraping is about working around unanticipated idiosyncrasies (e.g. X% of the pages have a different structure for mysterious reasons).

I haven't used xpath much though (and it seems pretty beefy!).

vapemaster · on Feb 25, 2018

So stoked this popped up. Literally woke up this morning debating about moving away from bs4 and wrapping the functionality I needed in lxml. I was just thinking I wish there was a requests equivalent for parsing...

halflings · on Feb 25, 2018

Is there any useful feature in bs4 that is not available in lxml.html?

Lxr · on Feb 25, 2018

This looks awesome, can’t wait to try it. Last time I used pyquery though it was considerably more limited than CSS and jQuery syntax, and I reverted to XPath. Has it improved recently?

fao_ · on Feb 25, 2018

HTML requests are: plain text, self-explanatory ("Content-length", "charset", etc.). What exactly is unhuman about that?

lvh · on Feb 25, 2018

HTML request bodies are arbitrary bytes. There are nice ways to do useful things with those arbitrary bytes and less nice ways. This library purports to collect some of the nice ones.

"XYZ for Humans" is a trope for the author of this library; for example, requests' tagline is "HTTP for humans".

dullgiulio · on Feb 25, 2018

That's HTTP, not HTML.

fao_ · on Feb 25, 2018

Yikes, major misread on my part. Whoops.

acdha · on Feb 25, 2018

That’s HTTP, but both it and HTML share a lot of things which are tedious for humans to deal with – encodings, error handling, handling things like attributes which may but usually do not have multiple values, etc. – and most HTML tooling has user-hostile APIs dating back to the XML era when developer convenience was seen as coddling the weak.