The rampart-robots module

Preface

Acknowledgment

The rampart-robots module uses Google’s Robotstxt library. The authors of Rampart extend their thanks to the authors of and contributors to this library.

License

The Robotstxt library is licensed under the Apache 2.0 License. The rampart-robots module is released under the MIT license.

What does it do?

The rampart-robots module checks a given URL against a robots.txt file to determine whether crawling the content is allowed. For background on the purpose of the robots.txt file, a good primer is available at https://www.robotstxt.org/.

How does it work?

The rampart-robots module exports a single function that takes as its input a robots.txt file, a user agent string, and a URL, and returns a Boolean indicating whether download of the URL by that user agent is allowed by the rules set forth in the robots.txt file.

Loading and Using the Module

Loading

Loading the module is a simple matter of using the require() function:

var robots = require("rampart-robots");

Main Function

The rampart-robots module exports an Object with a single function: isAllowed().

isAllowed()

The isAllowed() function takes three arguments: a user agent String, the text of a robots.txt file (a String or a Buffer), and a URL String.

Usage:

var robots = require("rampart-robots");

var res = robots.isAllowed(user_agent, robotstxt, url);

Where:

  • user_agent is a String, the name of the user agent to check.
  • robotstxt is a String or Buffer, the contents of a robots.txt file.
  • url is a String, the URL of the resource to be accessed.

Return Value:

A Boolean: true if access to the URL is allowed by the robotstxt rules, or false if it is disallowed.
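
For illustration, here is a minimal sketch using an inline robots.txt. The rules, bot name and URLs below are made up for the example; they are not part of the module.

var robots = require("rampart-robots");

/* a small robots.txt that blocks everything under /private/ for every user agent */
var rtxt = "User-agent: *\n" +
           "Disallow: /private/\n";

robots.isAllowed("myBot", rtxt, "https://example.com/index.html");        /* true  */
robots.isAllowed("myBot", rtxt, "https://example.com/private/page.html"); /* false */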

Example

var robots = require("rampart-robots");
var curl = require("rampart-curl");

var agent = "myUniqueBotName";
var rtxt = curl.fetch("https://www.google.com/robots.txt", {"user-agent": agent});
var url1 = "https://www.google.com/";
var url2 = "https://www.google.com/search?q=funny+gifs";

if(rtxt.status == 200) {
    var res1 = robots.isAllowed(agent, rtxt.body, url1);
    var res2 = robots.isAllowed(agent, rtxt.body, url2);

    /* expected results:
        res1 == true
        res2 == false
    */
} else {
    console.log("Failed to download robots.txt file with status: " + rtxt.status);
}
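
When the check permits crawling, the page itself can then be fetched with the same user agent. A minimal sketch of that follow-up, placed inside the if block above (the handling of page.body is omitted):

    if(res1) {
        /* crawling the front page is permitted, fetch it with the same agent */
        var page = curl.fetch(url1, {"user-agent": agent});
        /* ... process page.body ... */
    }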