+1 to make nicer APIs! It is always good to have more high-quality API designs to look at.
That said, it looks more like an API experiment, not a practical solution for a day job, at least in its current state:
* response body encoding detection is wrong, as it doesn't take meta tags or BOM into account;
* base url detection is wrong, as it doesn't take the <base> tag into account;
* URL parsing (joining, etc.) is implemented using string operations instead of stdlib, so a careful inspection is required to make sure it works in edge cases. For example, I can see right away that .absolute_links is wrong for protocol-relative urls (e.g. "//ajax.microsoft.com/ajax/jquery/jquery-1.3.2.min.js");
* html2text used for .markdown is GPL - I know people have different opinions on this, but in my book if you import from a GPL package, your package becomes GPL as well;
* each .xpath call parses a chunk of HTML again, even if a tree is already present;
* shortcuts are opinionated and their behavior is unclear, e.g. .links deduplicates URLs by default, and it deduplicates them using string matches (so e.g. different order of GET arguments => URLs are considered unique); it checks for .startswith('#'), which looks arbitrary (what if these links are used in a headless browser? what if someone wants to fetch them using _escaped_fragment_, which many sites still support? why filter out such URLs if they are relative, but not if they are absolute?)
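To make the deduplication point concrete, here is a minimal stdlib sketch of comparing URLs by a canonical key instead of by raw string. The names `canonical_key` and `dedupe` are made up for illustration and are not part of requests-html; treating argument order as insignificant is also an assumption that does not hold for every site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_key(url):
    """Build a comparison key that ignores the order of query arguments."""
    parts = urlsplit(url)
    # Sort the (name, value) pairs so "?a=1&b=2" and "?b=2&a=1" compare equal.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # The fragment is kept on purpose: whether "#..." matters is the caller's call.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


def dedupe(urls):
    """Keep the first occurrence of each URL, comparing by canonical key."""
    seen = set()
    unique = []
    for url in urls:
        key = canonical_key(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

With this, "http://a.com/?x=1&y=2" and "http://a.com/?y=2&x=1" collapse into one entry, while a URL that differs only by fragment is kept.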
TBH I don't see myself using this package: in its current stage it is very little code, and almost every method has an issue either with edge cases or with API; also, it is tied to requests library, unnecessarily IMHO, and in my opinion it is GPL even if setup.py says it is MIT.
Because there is nothing usable code-wise in requests-html from my point of view (it is no better than existing alternatives), I don't feel like raising these issues, advocating for fixing them, discussing alternative solutions with a goal of improving requests-html. Of course, everyone is free to raise these issues in a repo.
I appreciate the work put into requests-html API design, the design is very nice overall. This might be a way to go: create a nice API design, attract people, fix implementation over time, but this battle is not mine, sorry :(
* encoding detection from <meta> tags doesn't normalize encodings - Python doesn't use the same names as HTML;
* I'm still not sure encoding detection is correct, as it is unclear what the priorities are in the current implementation. It should be 1) Content-Type header; 2) BOM marks; 3) encoding in meta tags (or xml declared encoding if you support it); 4) content-based guessing - chardet, etc., or just a default value. I.e. encoding in meta should have less priority than Content-Type header, but more priority than chardet, and if I understand it properly, response.text is decoded both using Content-Type header and chardet.
* lxml's fromstring handles XML (XHTML) encoding declarations, and it may fail in case of unicode data (http://lxml.de/parsing.html#python-unicode-strings), so passing response.text to fromstring is not good. At the same time, relying on lxml to detect encoding is not enough, as http headers should have a higher priority. In parsel we're re-encoding text to utf8, and forcing utf8 parser for lxml to solve it: https://github.com/scrapy/parsel/blob/f6103c8808170546ecf046....
* absolute_links doesn't look correct for base urls which contain a path. It also may have issues with urls like tel:1122333, or mailto:.
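For the URL joining cases mentioned here and earlier (protocol-relative urls, base urls with a path, tel: and mailto:), the stdlib `urljoin` already covers them. A small sketch; the helper name `make_absolute` and the example.com base are placeholders, not requests-html code:

```python
from urllib.parse import urljoin


def make_absolute(base_url, hrefs):
    """Resolve each href against base_url using the stdlib rules."""
    return [urljoin(base_url, href) for href in hrefs]


BASE = "https://example.com/docs/guide/index.html"  # a base url that contains a path

resolved = make_absolute(BASE, [
    "//ajax.microsoft.com/ajax/jquery/jquery-1.3.2.min.js",  # protocol-relative
    "page2.html",                  # relative to the base url's directory
    "../api/",                     # parent directory
    "/about",                      # host-relative
    "mailto:someone@example.com",  # non-hierarchical scheme, left untouched
    "tel:1122333",                 # same
])
```

If the page has a <base> tag, its href would have to be resolved against the response url first and then used as `base_url`; that step is not shown here.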
For encoding detection we're using https://github.com/scrapy/w3lib/blob/c1a030582ec30423c40215f... in Scrapy. It works well overall; its weakness is that it doesn't require an HTML tree, and doesn't parse it, extracting meta information only from the first 4Kb using a regex (the 4Kb limit is not good). Other than that, it does all the right things AFAIK.
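The priority order described above can be sketched with the stdlib alone. This is not the w3lib code, just an illustration under assumptions: `detect_encoding` is a made-up name, the meta lookup is a naive regex, step 4 is a plain default instead of chardet, and `codecs.lookup` only maps names to Python's own codec names (it does not implement the HTML spec's label table).

```python
import codecs
import re

# UTF-32 BOMs are checked before UTF-16 ones, because BOM_UTF32_LE starts
# with the same two bytes as BOM_UTF16_LE.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]
_HEADER_RE = re.compile(r"charset\s*=\s*[\"']?([\w-]+)", re.I)
_META_RE = re.compile(rb"<meta[^>]+charset\s*=\s*[\"']?\s*([\w-]+)", re.I)


def _normalize(name):
    """Map an encoding label to Python's codec name, or None if unknown."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


def detect_encoding(body, content_type=None, default="utf-8"):
    # 1) Content-Type header
    if content_type:
        match = _HEADER_RE.search(content_type)
        if match and _normalize(match.group(1)):
            return _normalize(match.group(1))
    # 2) BOM
    for bom, encoding in _BOMS:
        if body.startswith(bom):
            return encoding
    # 3) <meta> declaration (searched in the whole body, no 4Kb cut-off)
    match = _META_RE.search(body)
    if match:
        encoding = _normalize(match.group(1).decode("ascii", "replace"))
        if encoding:
            return encoding
    # 4) content-based guessing would go here; this sketch just falls back
    return default
```

The returned name can be passed straight to `body.decode(...)`, since it is already a codec name Python knows.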
Oooo... I just finished writing a script with BeautifulSoup. While it wasn't all that bad (it works ;) ), I'm sure the "Kenneth Reitz experience" would be much better. I won't be rewriting the script now, but I can't wait to find an excuse to try this. :)
EDIT: first commit 22 hours ago - goes to show that when you have thought about the idea and know what you're doing it doesn't take long to produce the first version. :)
I prefer lxml.etree even for HTML on account of the parser. Either way, I don't understand what's not "for humans" about lxml. It provides a ton of simple and useful abstractions, is easy to learn, and is crazy robust and fast under the hood.
I remember years ago I needed to parse some html (which was about 2-3 million characters) and after a fair bit of time, I had it up and running with beautifulsoup. Now, my use case was likely quite atypical of most html parsing, but my god was it ever slow! I forget the exact numbers, but I think it was taking about 150 seconds to complete. So, then I wrote it using lxml, which was an improvement, but that was still taking around 100 seconds.
Now, I very rarely have any need to scrape and parse html data, and I was scratching my head at how it was taking these parsers so long to parse a 3.5 MiB html page. I mean, it should be able to go through that and get what I want in under a second, right?
So, I said screw it and wrote some regexes. 10-15 seconds was how long it was now taking to parse that html. It actually took 1.5 or so seconds to parse the html; the rest was waiting for it to download the webpage.
Ironically, implementing the regexes was actually quicker than figuring out how to use those html parsers and write the code. Of course, that's assuming you know how to craft regexes. Since it was set up to run every 5 minutes, I wanted something that could do it without spending 1/2 the time parsing the data (amongst other tasks the processors were needed for).
I actually had the same experience. I was scraping a large number of pages and upon profiling my script, I found out that bs4 was really slow. Changing the parser from the default to lxml helped things a bit, but I decided I would just try a regex to check quickly whether things could be better. Lo and behold, it was much faster. It's true that it's impossible to parse HTML in its entirety with regex, but if you're looking to extract only a portion of data from a page with a known structure, a bit of regex might be the way to go.
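As a rough illustration of the "known structure" case, here is a regex sketch that pulls two cells out of table rows. The markup and the `ROW_RE` pattern are invented for the example; it only works as long as the page keeps exactly this layout, which is the trade-off being discussed.

```python
import re

# Hypothetical row layout: a "name" cell immediately followed by a "price" cell.
ROW_RE = re.compile(
    r'<td class="name">(?P<name>.*?)</td>\s*<td class="price">(?P<price>.*?)</td>',
    re.DOTALL,
)


def extract_rows(html):
    """Return (name, price) pairs for every row matching the known layout."""
    return [
        (m.group("name").strip(), m.group("price").strip())
        for m in ROW_RE.finditer(html)
    ]


SAMPLE = """
<tr><td class="name">Widget</td> <td class="price">9.99</td></tr>
<tr><td class="name">Gadget</td>
    <td class="price">19.50</td></tr>
"""

rows = extract_rows(SAMPLE)
```

No tree is built at all, which is where the speedup comes from; nested tags, reordered attributes or a changed class name would silently break it.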
Really looking forward to using this. BeautifulSoup is great, but it's counterintuitive (I always need to refer to documentation even for very basic aspects that I've used dozens of times before), it's often slow, and it has some really weird xml bugs.
Also check out newspaper3k if you haven't seen it. It is high level, but really useful for a bunch of simple scraping related use cases.
> BeautifulSoup is great, but it's counterintuitive (I always need to refer to documentation even for very basic aspects that I've used dozens of times before) and it's often slow and has some really weird xml bugs)
I won't argue the slowness and the occasional bugs, but unlike you I find bs4 to be very intuitive. And this is mainly why I use it, despite its faults. Maybe our use cases are different, but with a basic knowledge of html, I only rarely find myself reading the documentation. Care to give some examples of what you find counter-intuitive?
That’s just my preference. Keep the features that are best for your specific use cases. It’s your project after all. I think that’s better than design by committee. Linus didn’t write Linux for other people.
If you’re doing markdown conversion well, keep it. APIs are not about doing the smallest thing; they’re about designing a clean abstraction of a problem you’re trying to solve.
Sure, if APIs existed within a vacuum. In the real world we have to worry about ease of packaging, ease of maintenance, and various other practical costs.
For what it is worth, I strongly prefer leaving it in. Sometimes I want to just generate documentation for different things based on different sources and being able to just clean the html by rendering to Markdown would be great.
Nice. It would be pretty cool to have a "run javascript and wait for network idle" option for scraping js-requiring websites. Can selenium do this? Headless chrome?
I personally use bs4 for web scraping and it works pretty well, but if there was an option to also do js with a sane API, I'd switch in a heartbeat.
We use puppeteer for our smoke tests. Ensure all network requests load, no JavaScript errors, take screenshot of page and simple validations so we know our deploys aren’t borked.
I really love puppeteer over selenium. Much deeper control.
Puppeteer has "networkidle0" (0 network connections for 500ms) and "networkidle2" (no more than 2 network connections for 500ms). My experiences with it have been very positive.
Selenium can absolutely be used for this, the caveat being that it is much slower than using a regular HTML parser. In my experience, it's best to milk as much as you possibly can out of a site's plain HTML and APIs, only resorting to Selenium where it's absolutely necessary.
I think it's important to remember for every single new library that comes out, that you are trading apparent usability for unknown issues that haven't surfaced yet because the library is so new.
There is always room for improvement. I think it's easy to underestimate the time required to dwell in a thing before we really understand it. This is true for runtimes, libraries, even entire programming paradigms. We want to take shortcuts using abstract reasoning, and that works well usually, but sometimes you just gotta use something for years. (Alas, one lifetime may be too short to dwell in all the various paradigms the way they each require. But I digress...)
That's a neat little library. The problem is that the web is rapidly moving away from having HTML as the main information carrier to HTML merely being an envelope to deliver a bunch of JavaScript (and if you're even more unlucky: WebAsm).
Scraping the web will become a lot harder in the future.
In saying that, Javascript is just as parsable, and these 'javascript sites' are probably loading structured data via a JSON API which will probably be easier to scrape than a bunch of layout HTML.
puppeteer makes scripting headless Chrome for scraping-style tasks trivial, and it's supported by the Chrome development team so it's likely to keep on working long into the future.
Chrome headless is just that: Chrome that’s headless. Selenium is a cross-browser API that connects to the browser via some port. It’s usually a high-level API, since it’s the common denominator of browser APIs. Selenium sometimes uses middleware webdriver libs, which are usually quite bloated.
Puppeteer uses Chrome Remote Debug Protocol, which is the same protocol Devtools uses. It’s just a simple JSON RPC over websockets. Puppeteer creates a nice library abstraction over this api.
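To show how thin that protocol is, this sketch only builds the JSON text that would be sent over the websocket; it does not open a connection. `cdp_message` is a made-up helper, the url is a placeholder, and `Page.navigate` with a `url` parameter is the one DevTools protocol command used.

```python
import itertools
import json

# Each command carries a client-chosen integer id; the browser echoes it
# back in the matching response so replies can be paired with requests.
_ids = itertools.count(1)


def cdp_message(method, **params):
    """Serialize one DevTools protocol command as a JSON text frame."""
    return json.dumps({"id": next(_ids), "method": method, "params": params})


frame = cdp_message("Page.navigate", url="https://example.com/")
```

A library like puppeteer adds the connection handling, response matching and event dispatch on top of frames like this one.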
The advantage of puppeteer over selenium is that you have a lot more control over chrome. Network, perf, screenshots, coverage, dom & style traversal, etc. It’s a very reliable API too since Chrome Devtools team maintains the backend.
Selenium on the other hand surprises me with all sorts of quirks.
My evil genius moment with selenium happened when I ran a headless client to visit the login page of a bank, wait for 2fa sms code from my Android application, then use the code to log in and place some automated stock orders.
Even in JS heavy sites, the rendered output is still HTML. You just need to make sure you add a pre-render step. Page.REST, a micro-service I wrote a couple of months back, follows this strategy.
Then again, things keep becoming more API driven, with the JS delivered just containing the views. Whether it gets easier or harder remains to be seen, I think.
Looks like this is built on top of lxml and parse. I built an adapter over the bs4 interface in lxml, which was much faster than using bs4 with an lxml backend.
This is great news, as this space was dominated by bs4 in the Python ecosystem.
To give even more credit where it's due, requests is a nice wrapper around urllib3, which is the work of Andrey Petrov, Cory Benfield and contributors. While requests provides good user-friendly defaults and API semantics, urllib3 does a lot of the heavy lifting.
It wasn't at first, actually. It was originally a wrapper around urllib2; Andrey and I collabed early on in both projects' histories to make them what they are today.
Gotta give credit to whichever cavemen/cavewomen discovered fire, and all the civilizations that invented the wheel, Ben Franklin for harnessing the power of electricity, and then maybe Ken Thompson and Dennis Ritchie for inventing Unix.
I think most people around here know this, but it doesn't really matter. We love `requests` because of its API, which is what Kenneth Reitz contributed. There are many competing HTTP client libs, but only one (that I know of) with intuitive syntax; I hope others follow suit. APIs are not just something that you slam on top of your lib, they should be developed in the same way every other interface is - UX is usually much more important than just pure performance.
In about half of the examples, the methods are more user-friendly wrappers around underlying libraries - great! love it! In the other half of the examples, the niceness of the API is in fact due to the user-friendly design of the underlying PyQuery/Parse libraries. So I want to give credit to their good APIs, too.
I don't get the point; it's just a nicer alternative to beautifulsoup. Html pages are always changing, so instead of writing the parsing logic in code, I think we should put xpath or css expressions in some config files.
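A minimal sketch of what "expressions in a config file" could look like, using only the stdlib. Everything here is an assumption for illustration: the config keys, the sample page, and the use of `xml.etree.ElementTree`, which only understands a small XPath subset and only parses well-formed markup (real pages would need lxml or similar).

```python
import json
import xml.etree.ElementTree as ET

# In practice this JSON would live in its own file, so selectors can change
# without touching the code.
CONFIG = json.loads("""
{
  "title": ".//h1",
  "price": ".//span[@class='price']"
}
""")


def extract(markup, selectors):
    """Apply each configured path and return the stripped text, or None."""
    root = ET.fromstring(markup)
    result = {}
    for field, path in selectors.items():
        node = root.find(path)
        result[field] = node.text.strip() if node is not None and node.text else None
    return result


PAGE = "<html><body><h1>Widget</h1><span class='price'> 9.99 </span></body></html>"

record = extract(PAGE, CONFIG)
```

When the site changes its layout, only the two paths in the config need updating, as long as the change can still be expressed as a different path.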
I'm sure some pages could be represented as css expressions in config files, but in general it seems like scraping is about working around unanticipated idiosyncrasies (e.g. X% of the pages have a different structure for mysterious reasons).
I haven't used xpath much though (and it seems pretty beefy!).
So stoked this popped up. Literally woke up this morning debating about moving away from bs4 and wrapping the functionality I needed in lxml. I was just thinking I wish there was a requests equivalent for parsing...
This looks awesome, can’t wait to try it. Last time I used pyquery though it was considerably more limited than CSS and jQuery syntax, and I reverted to XPath. Has it improved recently?
HTML request bodies are arbitrary bytes. There are nice ways to do useful things with those arbitrary bytes and less nice ways. This library purports to collect some of the nice ones.
"XYZ for Humans" is a trope for the author of this library; for example, requests' tagline is "HTTP for humans".
That’s HTTP, but both it and HTML share a lot of things which are tedious for humans to deal with (encodings, error handling, handling things like attributes which may but usually do not have multiple values, etc.), and most HTML tooling has user-hostile APIs dating back to the XML era when developer convenience was seen as coddling the weak.