raw database

In a previous article, I've already talked about an optimized way to connect locations in a geographic point of view by using MySQL. In this manner, locations of pubs, drugstores, barbers or even users can be obtained. Communities, or perhaps I should use the newer term Social Networks, make use of the buddy network of indiviual members in addition to the geographical mapping. This has many psychological advantages, because new members can be integrated in an established network very easily and I'm more willing to become involved when I already know some of the members.

When it comes to project such a network on the comupter, every member can be seen as node inside of a graph. I think one could think about a NoSQL graph databases like Neo4j at this point, but a simple adjacency list implemented in MySQL can do the job also pretty well.

Another theoretical question is whether we should use a directed or an undirected graph to store the nodes. I decided to use both. In practice, I use one table as directed graph with some additional meta information, such as the date of when the friendship was sealed and another table which holds an optimized set of undirected relationships. With optimized, I mean a table with maybe a different index configuration or a reduced data set, even if I use the exact same data and configuration for every table in the examples below.

I'll use the following statement to create the first simple table which stores the directed adjacency list with the option for some meta information about the releationships:

CREATE TABLE buddy (
  UID_FROM INT UNSIGNED NOT NULL,
  UID_TO INT UNSIGNED NOT NULL,
  BDate TIMESTAMP NOT NULL,
  BType INT UNSIGNED NOT NULL,
  PRIMARY KEY (UID_FROM, UID_TO),
  KEY(UID_FROM),
  KEY(UID_TO),
  KEY(UID_FROM, BDate),
  KEY(UID_TO, BDate)

);

This table has all possebilities to work with the direct milieu of a member. You can already:

As we see, the buddy table does not cover every problem neatly. There are cases where an directed graph isn't that handy. Every problem with an asterisk needs an UNION which would slow down the whole thing. As I already said, I'll add an additional table and produce a big redundance to be able to write easier queries. The new table is transparently managed with a trigger (which could get some logic to reduce the complexity of the table by not used rows):

CREATE TABLE buddy_0 (
  UID_LVL1 INT UNSIGNED NOT NULL,
  UID_LVL5 INT UNSIGNED NOT NULL,
  PRIMARY KEY (UID_LVL1, UID_LVL5),
  KEY(UID_LVL1),
  KEY(UID_LVL5)
);

The corresponding trigger looks like this:

CREATE TRIGGER inBuddy BEFORE INSERT ON buddy
FOR EACH ROW
  INSERT INTO buddy_0 VALUES
	(NEW.UID_FROM, NEW.UID_TO),
	(NEW.UID_TO, NEW.UID_FROM);

In order to obtain all the friends of my friends, I need a self join on the buddy table. This game could be continued, that people from each of these in turn are the base of a new friendship determination. The theory states that at a level of 6 is enough to know everyone on the world through any friend path. For this reason we stop at level 5, since otherwise we could use a simple cross join and the data would be very meaningless. I put together an own view for each recurrence level, which defines directly the otimization plan:

CREATE VIEW buddy_1 AS
SELECT A.UID_LVL1 UID_LVL1,
       0 UID_LVL2,
       0 UID_LVL3,
       0 UID_LVL4,
       A.UID_LVL5 UID_LVL5,
       1 BLevel
FROM buddy_0 A;

CREATE VIEW buddy_2 AS
SELECT A.UID_LVL1 UID_LVL1,
       B.UID_LVL1 UID_LVL2,
       0 UID_LVL3,
       0 UID_LVL4,
       B.UID_LVL5 UID_LVL5,
       2 BLevel
FROM buddy_0 A
JOIN buddy_0 B ON A.UID_LVL5  = B.UID_LVL1
              AND B.UID_LVL5 <> A.UID_LVL1;

CREATE VIEW buddy_3 AS
SELECT A.UID_LVL1 AS UID_LVL1,
       A.UID_LVL2 AS UID_LVL2,
       B.UID_LVL1 AS UID_LVL3,
       0 AS UID_LVL4,
       B.UID_LVL5 AS UID_LVL5,
       3 AS BLevel
FROM buddy_2 A
JOIN buddy_0 B ON A.UID_LVL5  = B.UID_LVL1
              AND B.UID_LVL5 <> A.UID_LVL1
              AND B.UID_LVL5 <> A.UID_LVL2;

CREATE VIEW buddy_4 AS
SELECT A.UID_LVL1 AS UID_LVL1,
       A.UID_LVL2 AS UID_LVL2,
       A.UID_LVL3 AS UID_LVL3,
       B.UID_LVL1 AS UID_LVL4,
       B.UID_LVL5 AS UID_LVL5,
       4 AS BLevel
FROM buddy_3 A
JOIN buddy_0 B ON A.UID_LVL5  = B.UID_LVL1
              AND B.UID_LVL5 <> A.UID_LVL1
              AND B.UID_LVL5 <> A.UID_LVL3
              AND B.UID_LVL5 <> A.UID_LVL2;

Maybe it's annoying to administrate views or stored procedures on a big cluster, but it helps writing clean and maintainable code and reduces the number of bytes to be transfered for every query. Sure, there are also other problems with views but I used it here for a simpler and transparent demonstration.

Let's have a look on two simple examples. To get the friends of my friends, it is sufficient to use level 2 of our view hierarchy:

SELECT UID, UName, UPic, GETAGE(UBday) UAge
FROM buddy_2
JOIN user ON UID_LVL5 = UID
WHERE UID_LVL1 = 123

The stored function GETAGE(), I used in this example, is one I've already published. Okay, now this query could be extended by an exclude join to find all FOF, I'm not already connected with:

SELECT fof.*
FROM (
   SELECT UID_LVL5
   FROM buddy_2
   WHERE UID_LVL1=123
) fof
LEFT JOIN (
   SELECT UID_LVL1, UID_LVL5
   FROM buddy_0
   WHERE UID_LVL1=123
) friend ON fof.UID_LVL5=UID_LVL1
WHERE friend.UID_LVL5 IS NULL;

Finding a friend path is also no problem with this setup anymore. We loop over the views, from the first to the fifths, and use the first match where a connection could be established. The loop could also be implemented in a stored procedure to avoid unnecessary traffic between the application and database server.

Thanks to Daniel Niedergesäß, who gave the underlaying idea to the approach.