If the source code is all you need, the complete project is available on GitHub: ngreve/web_scraper

What is this Project About?

First of all, let's see what web scraping actually is. Wikipedia says:

"Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites" -Wikipedia: Web scraping on September, 12th 2019

In other words, web scraping is a technique to collect data, for example for big data analysis. We will use it to build a Node.js application that notifies us as soon as a new Linux kernel version is available.

In the first step we will set up our Node.js environment, then we will think about an algorithm to extract the current kernel version from Kernel.org. Once we can successfully extract the kernel version, we will extend our web scraper with a notification service that emails us as soon as the version number of the Linux kernel has changed.

Requirements

It is not hard to build a web scraper, especially if the information you are interested in is on a static website. Nevertheless, to follow this project you should be familiar with a few topics:

  • JavaScript
  • You have to know how websites are built and what the DOM is
  • You have to know how to use the developer tools of your web browser (I will use Firefox for this project).
  • You also need a working installation of Node.js

Rules of Web Scraping

At this point I want to mention some unwritten and written rules when it comes to web scraping.

One big point is that the developer of the web scraper has to ensure that the target site will not be overloaded by the requests coming from the web scraper.

The second point is that the developer must respect the robots.txt file. The robots.txt file tells you what paths a robot (our web scraper) is allowed to visit. You can find this file at http://www.example.com/robots.txt. A working example can be found at https://duckduckgo.com/robots.txt
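
For illustration, a minimal robots.txt could look like this (the paths below are made up and not taken from any real site):

# Illustrative robots.txt with hypothetical paths
User-agent: *          # these rules apply to all robots
Disallow: /admin/      # robots must not visit anything under /admin/
Disallow: /search      # robots must not visit /search
Allow: /               # everything else may be visited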

The third point should be obvious: you have to respect the terms of service of your target site. It doesn't matter whether a robots.txt file exists or not. If the terms do not allow web scraping, your last option is to contact the website owner and ask for explicit permission.

I am not responsible for any scraping project you start with the information provided here.

The Target: Kernel.org

The latest kernel version can be downloaded from Kernel.org; the current version number is displayed in the top right corner. To tell our web scraper where the kernel version can be found, we have to describe its location in a more technical way. So let's take a look at Kernel.org with the browser developer tools. Use the Inspector tool to find the anchor element which contains the kernel version number.

Finding attributes that make the information identifiable

Fortunately, the elements light up when you move the mouse over the source code in the Inspector. You will see that the anchor element that contains the version number is nested within a table cell with the ID #latest_link. This makes the location of the information identifiable.
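
For illustration, the relevant markup looks roughly like this (the version number is only an example, and the real page may differ in detail):

<!-- Simplified, illustrative markup of the Kernel.org page -->
<td id="latest_link">
  <a href="...">5.2.14</a>   <!-- the text of this anchor is the version we want -->
</td>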

So the core algorithm is pretty simple:

  1. After starting the web scraper, it has to get the current kernel version from the element with the ID #latest_link and store it for comparison.
  2. Every X minutes it will make a request to Kernel.org to get the kernel version and compare it against the stored kernel version.
  3. If the kernel version differs from the stored version, it will send a notification email and update the stored kernel version.
  4. Repeat from step 2.
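
In rough JavaScript terms, the loop looks something like this (fetchKernelVersion, sendNotificationMail and INTERVAL_MS are placeholders here; the real implementation follows later in app.js):

// Sketch of the core loop, not the final implementation
let storedVersion = null

async function check () {
  const currentVersion = await fetchKernelVersion() // placeholder for the scraping step
  if (storedVersion !== null && storedVersion !== currentVersion) {
    await sendNotificationMail(currentVersion) // placeholder for the email step
  }
  storedVersion = currentVersion
  setTimeout(check, INTERVAL_MS) // run again after the configured interval
}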

Setting up the project

To set up the coding environment we use npm:

$ mkdir web_scraper
$ cd web_scraper
$ npm init

This utility will walk you through creating a package.json file.
It only covers the most common items, and tries to guess sensible defaults.

See `npm help json` for definitive documentation on these fields
and exactly what they do.

Use `npm install <pkg>` afterwards to install a package and
save it as a dependency in the package.json file.

Press ^C at any time to quit.
package name: (web_scraper)
version: (1.0.0)
description: A web scraper to notify the user when changes in the Linux kernel version appear
entry point: (index.js) app.js
test command:
git repository:
keywords:
author: Nico Greve
license: (ISC)
About to write to /home/nico/prog/package.json:

{
  "name": "web_scraper",
  "version": "1.0.0",
  "description": "A web scraper to notify the user when changes in the Linux kernel version appear",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "Nico Greve",
  "license": "ISC"
}


Is this OK? (yes)

Next we will install some dependencies.

  • nodemon: Restarts our project as soon as we change something in the source code
  • eslint: Linter to error-check our code and to keep it more readable
  • cheerio: HTML parser for static content
  • request-promise: Promise-based HTTP client we use to fetch the page
  • nodemailer: Sends an email by using your email account

Let's start by installing cheerio, request-promise, and nodemailer as production dependencies...

$ npm install --save cheerio request-promise nodemailer

... and nodemon and eslint as development dependencies.

$ npm install --save-dev nodemon eslint

After the installation process, it is very likely that you will see a log message similar to this:

. . .
added 498 packages from 368 contributors and audited 2818 packages in 31.326s
found 4 vulnerabilities (1 moderate, 3 critical)
  run `npm audit fix` to fix them, or `npm audit` for details

This is the audit feature of npm: it compares the installed packages against a vulnerabilities database, finds potential security holes and, in most cases, offers a solution. Run npm audit fix to fix the vulnerabilities.
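
From the project root that is simply:

$ npm audit fix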

Next we need to tell eslint what our preferred code style is:

[21:04:41-nico@linux-user web_scraper]$ node_modules/eslint/bin/eslint.js --init
? How would you like to use ESLint? To check syntax, find problems, and enforce
code style
? What type of modules does your project use? JavaScript modules (import/export)


? Which framework does your project use? None of these
? Where does your code run? Node
? How would you like to define a style for your project? Use a popular style guide
? Which style guide do you want to follow? Standard (https://github.com/standard/standard)
? What format do you want your config file to be in? JavaScript
Checking peerDependencies of eslint-config-standard@latest
The config that you've selected requires the following dependencies:

eslint-config-standard@latest eslint@>=5.0.0 eslint-plugin-import@>=2.13.0 eslint-plugin-node@>=7.0.0 eslint-plugin-promise@>=4.0.0 eslint-plugin-standard@>=4.0.0
? Would you like to install them now with npm? Yes

The last step is to set up our package.json file (in the project root directory) to use eslint and nodemon. Modify your scripts section as shown below. Keep in mind that the versions of your dependencies can vary; don't change them if you do not know what you are doing.

{
  "name": "web_scraper",
  "version": "1.0.0",
  "description": "A web scraper to notify the user when changes in the Linux kernel version appear",
  "main": "src/app.js",
  "scripts": {
    "start": "./node_modules/nodemon/bin/nodemon.js src/app.js --exec 'npm run lint && node'",
    "lint": "./node_modules/.bin/eslint ./src/",
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "Nico Greve",
  "license": "ISC",
  "devDependencies": {
    "eslint": "^5.16.0",
    "eslint-config-standard": "^12.0.0",
    "eslint-plugin-import": "^2.16.0",
    "eslint-plugin-node": "^8.0.1",
    "eslint-plugin-promise": "^4.1.1",
    "eslint-plugin-standard": "^4.0.0",
    "nodemon": "^1.18.10"
  },
  "dependencies": {
    "cheerio": "^1.0.0-rc.2"
  }
}

The entry point for our scraper will be the app.js file in the src directory (see the "main" field on line 5 of the package.json above). So let's create it. Open the freshly created app.js file and enter these two lines to test our setup:

// src/app.js

'use strict'
console.log('Hello World')

After editing the app.js file start your project by entering the $ npm run start command. You should see the following output:

$ npm run start
> web_scraper@1.0.0 start /home/nico/prog/web_scraper
> ./node_modules/nodemon/bin/nodemon.js src/app.js --exec 'npm run lint && node'

[nodemon] 1.18.10
[nodemon] to restart at any time, enter `rs`
[nodemon] watching: *.*
[nodemon] starting `npm run lint && node src/app.js`

> web_scraper@1.0.0 lint /home/nico/prog/web_scraper
> eslint ./src/

Hello World
[nodemon] clean exit - waiting for changes before restart

If everything went well, we can now roll up our sleeves and start with the actual web scraper!

Let's Code

If problems occur during this tutorial, consider taking a look at the GitHub repository.

We already created the app.js file. Let's create the rest of the project structure.
Create the following directory structure:

web_scraper/
  |- src/
    |- app.js
    |- config.js
    |- services/  
      |- ScraperService.js
      |- NotificationService.js
  • app.js - Main entry point of our application
  • config.js - contains configurable parameters
  • ScraperService.js - Performs the actions to fetch the current kernel version
  • NotificationService.js - Sends our notification email when necessary
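
On a Unix-like system, one way to create the services directory and the remaining files from the project root is:

$ mkdir -p src/services
$ touch src/config.js src/services/ScraperService.js src/services/NotificationService.js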

Maybe it's a bit counterintuitive, but we will not start with the app.js file. First we need something that can be called within app.js, so we will create ScraperService.js, which fetches the information about the current kernel version.
To hold some static information, we create config.js first:

// web_scraper/src/config.js
'use strict'

module.exports = {
  uri: 'https://www.kernel.org',
  email: {
    user: 'your_login_user',
    pass: 'your_password'
  },
  interval: 5 * 1000 * 60 // in milliseconds
}
  • uri: contains the URL we are going to monitor
  • email: holds the authentication information about your email account
  • interval: defines the interval at which the target will be visited

To be clear: storing authentication information in plain text is not recommended! But for the sake of this tutorial we keep it simple...
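
If you do not want to hard-code credentials, a simple alternative is to read them from environment variables. A minimal sketch of config.js in that style (the variable names EMAIL_USER and EMAIL_PASS are just examples I picked) could look like this:

// web_scraper/src/config.js (alternative sketch using environment variables)
'use strict'

module.exports = {
  uri: 'https://www.kernel.org',
  email: {
    user: process.env.EMAIL_USER, // e.g. set in the shell before starting the scraper
    pass: process.env.EMAIL_PASS
  },
  interval: 5 * 1000 * 60 // in milliseconds
}

The scraper would then be started with something like EMAIL_USER=... EMAIL_PASS=... npm run start.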
Let's continue with the ScraperService.js file. The ScraperService is responsible for fetching the version number:

// web_scraper/src/services/ScraperService.js

'use strict'
const config = require('../config')
const cheerio = require('cheerio')
const rp = require('request-promise')

module.exports = {
  async getKernelVersion () {
    const response = await rp({
      uri: config.uri
    })
    const $ = cheerio.load(response)
    const kernelVersion = $('#latest_link').text().trim()
    if (kernelVersion === '') throw new Error('Kernel version not found')
    return kernelVersion
  }
}

The module exports exactly one function - getKernelVersion().

  • First, the page defined by config.uri is requested via request-promise.
  • The response is loaded into a cheerio object to make it analyzable.
  • The cheerio object is searched for the tag with the ID #latest_link, and its text content, which is the kernel version, is extracted.
  • If no kernel version was found, an error is thrown.
  • If a kernel version was found, it is returned.
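
As a quick smoke test, you could temporarily call the service from src/app.js before the rest of the application exists, for example:

// Temporary test in src/app.js
'use strict'
const ScraperService = require('./services/ScraperService')

ScraperService.getKernelVersion()
  .then(version => console.log('Current kernel version:', version))
  .catch(err => console.error(err))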

Before we start with the email notification, you have to know that this part has a catch. If you are using Gmail, you will probably run into problems with the authentication process. The reason and possible fixes can be found on the Nodemailer page. Because creating an email account is pretty simple, I simply recommend using another email provider.

// web_scraper/src/services/NotificationService.js

'use strict'

const config = require('../config')
const nodemailer = require('nodemailer')
const transporter = nodemailer.createTransport({
  host: 'smtp.mail.org', // input your smtp information
  port: 587,
  secure: false, // true for 465, false for other ports
  auth: {
    user: config.email.user,
    pass: config.email.pass
  }
})

module.exports = {
  async sendKernelNotification (version) {
    await transporter.sendMail({
      from: '"Your Name" ',
      to: 'your@email.com',
      subject: `Web_Scraper: Kernel ${version} available!`,
      text: `Kernel ${version} available!\n${config.uri}\nMail sent from web_scraper`
    })
  },
  async sendErrorNotification (err) {
    await transporter.sendMail({
      from: '"Your Name" ',
      to: 'your@email.com',
      subject: 'Web_Scraper: An error occurred!',
      text: `Error message:\n\n${err}`
    })
  }
}

The NotificationService has two functions:

  • sendKernelNotification() will be used to tell the user that a new kernel version is available
  • sendErrorNotification() will be used to inform the user about an error occurrence

Don't forget to put your email addresses in the from and to attributes.
Finally, we can use the implemented NotificationService and the ScraperService in the app.js, which is the entry point of our web scraper:

// web_scraper/src/app.js

'use strict'
const config = require('./config')
const ScraperService = require('./services/ScraperService')
const NotificationService = require('./services/NotificationService')

var latestVersion = ''

async function run () {
  try {
    const version = await ScraperService.getKernelVersion()
    console.log(new Date().toString(), `Current Kernel Version: ${version}`)
    if (latestVersion.length === 0) {
      /* first start of the web scraper */
      latestVersion = version
    } else if (latestVersion !== version) {
      /* Kernel version changed. Send email to user. */
      latestVersion = version
      await NotificationService.sendKernelNotification(version)
    }
    /* Call run() again after config.interval milliseconds */
    console.log('Next check: ', new Date(Date.now() + config.interval).toString())
    setTimeout(run, config.interval)
  } catch (err) {
    /* exit application if something went wrong */
    try {
      await NotificationService.sendErrorNotification(err)
    } catch (err) {
      console.error('Sending error notification failed!')
    }
    console.error(err)
    console.log('Program exit')
    process.exitCode = 1
  }
}

run()

At the top we import our three modules: config, ScraperService and NotificationService.

We create the global variable latestVersion, which we will use to compare against the freshly fetched version.

Let's start with the try-block:

  • First, we use our ScraperService to fetch the current kernel version.
  • We display the current time and the kernel version we just fetched.
  • If latestVersion is still empty, this is the first fetch, so we only store the version.
  • If this is not the first fetch and latestVersion differs from the freshly fetched version, we update it and send an email by using the NotificationService.
  • Finally, we display when the next fetching process will start and set the timeout for the next round.

The catch block fires when something goes wrong, in which case we also receive an email about the error before the application exits.

To start the scraper just navigate to the root directory of the project and enter

npm run start

Or just try the sandboxed online editor on Repl.it. The first start takes some time (~1 min) because the packages have to be installed. Changes to the code are only visible to you. Just hit the play button at the top!
