Crawling GitHub with Promises: ArangoDB Tutorial
The new JavaScript driver no longer imposes any particular promises implementation. It follows the standard callback pattern: every asynchronous function takes a callback with the usual err and res arguments.
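As a minimal sketch of that convention (my own illustration, not taken from the crawler): assuming the driver is initialized via its exported factory function, every call simply takes such a callback as its last argument. The connection URL, the "queue" endpoint and the payload are placeholders borrowed from the queue example further down in this post.
// Minimal sketch of the err/res callback convention (illustration only).
// The connection URL, endpoint and payload are placeholders; the "queue"
// endpoint is the one used later in this post.
var arangojs = require("arangojs");
var db = arangojs("http://localhost:8529");

db.endpoint("queue").post("job", { type: "user", identifier: "username" }, function(err, res) {
  if (err) {
    console.error("request failed: %s", err);
    return;
  }
  console.log("request succeeded:", res);
});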
I wanted to give the new driver a try. A GitHub crawler seemed like a good side project, especially because the node-github driver follows the same callback conventions as the JavaScript driver.
There are a lot of promise libraries out there. The most popular one, according to npm, was promises. It should be possible to use any implementation, so I went with that one.
The full source code can be found on GitHub.
Pagination with Promises made easy
The GitHub driver has a function to get all followers. However, the result is paginated. With two helper functions and promises it is straightforward to implement a function that retrieves all followers of a user.
function extractFollowers (name) {
  'use strict';

  return new Promise(function(resolve, reject) {
    github.user.getFollowers({ user: name }, promoteError(reject, function(res) {
      followPages(resolve, reject, [], res);
    }));
  });
}
The followPages function simply extends the result with the next page until the last page is reached.
function followPages (resolve, reject, result, res) {
  'use strict';
  var i;

  for (i = 0; i < res.length; ++i) {
    result.push(res[i]);
  }

  if (github.hasNextPage(res)) {
    github.getNextPage(res, promoteError(reject, function(res) {
      followPages(resolve, reject, result, res);
    }));
  }
  else {
    resolve(result);
  }
}
The promoteError helper is a convenience function to bridge callbacks and promises.
function promoteError (reject, resolve) {
  'use strict';

  return function(err, res) {
    if (err) {
      if (err.hasOwnProperty("message") && /rate limit exceeded/.test(err.message)) {
        rateLimitExceeded = true;
      }

      console.error("caught error: %s", err);
      reject(err);
    }
    else {
      resolve(res);
    }
  };
}
I’ve decided to stick to the argument order reject (for err) followed by resolve (for res), mirroring the callbacks. The promoteError helper can be used for the GitHub callbacks as well as for the ArangoDB driver.
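As a quick usage sketch of my own (not part of the crawler), the promise returned by extractFollowers can then be consumed like any other promise; the user name below is just a placeholder.
// Usage sketch: consume the promise returned by extractFollowers.
// "octocat" is only a placeholder user name.
extractFollowers("octocat").then(function(followers) {
  console.log("found %d followers", followers.length);
}, function(err) {
  console.error("failed to fetch followers: %s", err);
});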
Queues, Queues, Queues
I only needed a very simple job queue, so queue-it was a good choice. It provides a very simple API for handling job queues:
POST /queue/job
POST /queue/worker
DELETE /queue/job/:key
The new JavaScript driver allows access to arbitrary endpoints. First, install a Foxx app implementing the queue microservice in an ArangoDB instance:
foxx-manager install queue-it /queue
Adding a new job from Node.js is now easy:
function addJob (data) {
  'use strict';

  return new Promise(function(resolve, reject) {
    db.endpoint("queue").post("job", data, promoteError(reject, resolve));
  });
}
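The same pattern works for the other routes. As a sketch of my own (not from the crawler), a removeJob helper for DELETE /queue/job/:key could look as follows, assuming the endpoint object also exposes a delete method analogous to post; the usage example mirrors the job queued at the end of this post.
// Hypothetical helper for DELETE /queue/job/:key. Assumes db.endpoint("queue")
// also exposes a delete method analogous to post.
function removeJob (key) {
  'use strict';

  return new Promise(function(resolve, reject) {
    db.endpoint("queue").delete("job/" + key, promoteError(reject, resolve));
  });
}

// Usage sketch for addJob: the payload matches the job added at the end of this post.
addJob({ type: "user", identifier: "username" }).then(function(job) {
  console.log("job queued:", job);
}, function(err) {
  console.error("could not queue job: %s", err);
});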
Transaction
I wanted to crawl users and their repos. The relations (“follows”, “owns”, “is_member”, “stars”) are stored in an edge collection. I only add an edge if it is not already there. Therefore I check inside a transaction whether the edge exists, and add it if it does not.
createRepoDummy(repo.full_name, data).then(function(dummyData) {
  return db.transaction(
    "relations",
    String(function(params) {
      var me = params[0];
      var you = params[1];
      var type = params[2];
      var db = require("org/arangodb").db;

      if (db.relations.firstExample({ _from: me, _to: you, type: type }) === null) {
        db.relations.save(me, you, { type: type });
      }
    }),
    [ meId, "repos/" + data._key, type ],
    function(err) {
      if (err) {
        throw err;
      }

      return handleDummy(dummyData);
    });
})
Please note that the action function is executed on the server and not in the Node.js client. Therefore we need to pass the relevant data as parameters; it is not possible to use closure variables.
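To make that restriction concrete, here is a small sketch of my own (someUserId is a hypothetical value): the first variant breaks because the stringified action cannot see closure variables on the server, the second one passes the value in explicitly via params.
// Broken variant: someUserId is a closure variable of the Node.js process.
// The stringified action runs inside ArangoDB, where it is undefined.
var someUserId = "users/12345"; // hypothetical example value

db.transaction("relations", String(function() {
  var db = require("org/arangodb").db;
  db.relations.firstExample({ _from: someUserId }); // ReferenceError on the server
}), [], function(err) {
  if (err) { console.error("transaction failed: %s", err); }
});

// Working variant: the value travels to the server as an explicit parameter.
db.transaction("relations", String(function(params) {
  var db = require("org/arangodb").db;
  db.relations.firstExample({ _from: params[0] });
}), [ someUserId ], function(err) {
  if (err) { console.error("transaction failed: %s", err); }
});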
Riding the Beast
Start an ArangoDB instance (e.g. inside a Docker container) and install the simple queue:
foxx-manager install queue-it /queue
Start arangosh and create the collections users, repos and relations:
arangosh> db._create("users");
arangosh> db.users.ensureHashIndex("name");
arangosh> db._create("repos");
arangosh> db.repos.ensureHashIndex("name");
arangosh> db._createEdgeCollection("relations");
Now everything is initialized. Fire up Node.js and start crawling:
node> var crawler = require("./crawler");
node> crawler.github.authenticate({ type: "basic", username: "username", password: "password" })
node> crawler.addJob({ type:"user", identifier:"username" })
node> crawler.runJobs();
Please keep in mind that this is just an experiment. There is no proper error handling and there are no convenience functions for setup and start. It is also not optimized for performance; for instance, it would easily be possible to avoid Node.js / ArangoDB round trips by using more transactions, as sketched below.
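As a rough illustration of that last point (my own sketch, not part of the crawler; meId, data and type are taken from the transaction example above, and I assume the driver also accepts an array of collection names), the repo dummy creation and the relation insert could be folded into a single server-side action, saving one round trip per relation.
// Hypothetical sketch: perform the repo upsert and the relation insert in one
// transaction to save a Node.js / ArangoDB round trip. Assumes db.transaction
// also accepts an array of collection names.
db.transaction(
  [ "repos", "relations" ],
  String(function(params) {
    var db = require("org/arangodb").db;
    var me = params[0];
    var repoKey = params[1];
    var data = params[2];
    var type = params[3];

    // create the repo dummy on the server if it does not exist yet
    if (db.repos.firstExample({ _key: repoKey }) === null) {
      db.repos.save({ _key: repoKey, name: data.name });
    }

    // add the relation edge only if it is not already there
    if (db.relations.firstExample({ _from: me, _to: "repos/" + repoKey, type: type }) === null) {
      db.relations.save(me, "repos/" + repoKey, { type: type });
    }
  }),
  [ meId, data._key, data, type ],
  function(err) {
    if (err) {
      console.error("transaction failed: %s", err);
    }
  });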
Sources used in this example:
- ArangoJS https://github.com/arangodb/arangojs
- Node Github https://github.com/ajaxorg/node-github
- npm promises https://www.npmjs.com/package/promises
- ArangoDB Foxx queue-it https://github.com/arangodb/queue-it
The source code of this example is available on GitHub: https://github.com/fceller/Foxxmender
If you want to continue with other JavaScript-related resources, you should start with ArangoDB NoSQL and JavaScript.