Using CURL and Screen Scraping To Track Topic Performance in NodeBB
Last edited by scottalanmiller
If you have ever worked with forum screen scraping, you know it can be a handy way to gather data. One way that you might use it is to track the views or comments on a thread that you have been watching. Using curl, the standard tool for this on Linux, and a few simple line-oriented text tools like grep and cut, we can pretty easily grab this information from a NodeBB site like MangoLassi.
Here is an example command that will handle the NodeBB redirects from the RESTful interface and trim the output to make it easy to return a single numerical value from the HTML that comes back.
curl --location --referer ";auto" --netrc -s -D - http://mangolassi.it/topic/8000 | grep human-readable-number | grep -v topic | cut -d'>' -f2 | cut -d'<' -f1
Because MangoLassi uses URL redirects, you cannot use a vanilla curl statement for this, but fixing this is easy: the --location option tells curl to follow them. Using grep to get to the line that we want is not great, because the markup gives us no good label for the views field, but screen scraping is a quick and dirty business anyway, so this works. A couple of cut commands then trim the line down to the number we are looking for.
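To see what each stage of the filter does without hitting the site, here is a small sketch that runs the same grep and cut chain over two made-up lines of markup. The sample HTML and the function names sample_html and extract_views are assumptions for illustration only; the real page will look different, and only the filter chain itself comes from the command above.

```shell
#!/bin/sh
# Hypothetical markup for illustration only; real NodeBB output will differ.
sample_html() {
  cat <<'EOF'
<html><body>
<span component="topic/post-count" class="human-readable-number" title="56">56</span>
<span class="human-readable-number" title="1234">1234</span>
</body></html>
EOF
}

# The same filter chain as the command above, reading the page from stdin.
extract_views() {
  grep human-readable-number |   # keep only lines carrying the counter class
    grep -v topic |              # drop counter lines that mention "topic"
    cut -d'>' -f2 |              # keep what follows the first '>'
    cut -d'<' -f1                # keep what precedes the next '<'
}

sample_html | extract_views
```

With this sample input the first counter line is thrown away by grep -v topic, and the two cut commands strip the tag from around the remaining number.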
The "8000" provided here is an example. Replace "8000" with the number of the thread in which you are interested. The return of this command is just a number, but it represents the number of views of that specific thread.
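If you want to check more than one thread, or keep a history of the numbers, the command can be wrapped in a couple of small shell functions. This is only a rough sketch: the names topic_views and log_views and the CSV layout are my own invention for the example, not anything NodeBB provides, and the pipeline inside is unchanged from the one above.

```shell
#!/bin/sh
# topic_views NUMBER: print the view count for one thread.
# (Hypothetical helper name; the curl pipeline is the one from the post.)
topic_views() {
  curl --location --referer ";auto" --netrc -s -D - "http://mangolassi.it/topic/$1" |
    grep human-readable-number | grep -v topic | cut -d'>' -f2 | cut -d'<' -f1
}

# log_views NUMBER FILE: append a "timestamp,thread,views" line to FILE.
# (Hypothetical helper; the CSV layout is an assumption for this sketch.)
log_views() {
  printf '%s,%s,%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$(topic_views "$1")" >> "$2"
}

# Example usage (needs network access, so it is left commented out):
# log_views 8000 views.csv
```

Running log_views on a schedule would build up a simple file of view counts over time that you can graph or compare later.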