Social Network Analysis (SNA)

Social Network Analysis (SNA) – Architecture

Architecture Overview

The Social Network Analysis architecture has two major aspects:

Analysis Data Management
Analytics Computation

The Analysis Data Management collects the transactional data from the system that manages the Social Network, such as an Enterprise 2.0 system, a discussion group management system, or any other system that manages production data from which a Social Network can be extracted. Well-known examples are Facebook and LinkedIn; however, those systems do not allow access to their complete Social Network or its extraction. In addition, the Analysis Data Management establishes a consistent data set (usually at specified intervals) for the analytics computation, so that all analytics computation is based on the same data. Finally, it manages the results of the analytics computation and makes them available to the software through which users or analysts access those results.

The Analytics Computation is the set of all analytics queries (algorithms) that are executed on a consistent data set. These algorithms can be of various types, can run quickly or take significant time, and can be independent of or dependent on each other (a query may reuse the results of another to avoid re-computing data that is already available).

Performing Analytics Computation at regular intervals is the most common approach because of the amount of data and the computing time required. Computing analytics queries with low latency at the moment a user requests them is not realistic in the general case. In specific circumstances, however, it might be feasible to execute some analysis algorithms dynamically on user access. In that case it must be clear to the user that the data set used for the instantaneous computation might differ from the one used in the last periodic analytics computation. Discrepancies between the two results are possible once the data set has evolved enough.

Architecture

The Social Network Analysis architecture consists of several interacting components. These are shown in the following figure. The various phases and modes of analytics are introduced afterwards.

Execution Life Cycle

Social Network Analysis is performed on historical transactional data coming from production systems, such as Social Network systems or Enterprise 2.0 systems. The data managed by these systems (e.g., social relationships, user accounts, ratings, blogs, wikis) changes constantly, and the amount of data usually increases. The data of the production systems is the source for the Social Network Analysis, and in general more than one production system can feed data into the Social Network Analysis system.

The overall execution life cycle of Social Network Analysis (SNA) starts with collecting the data updates coming from production systems:

Phase 1: Data Collection. This phase is ongoing: the stream of data from the production systems is continuous, and the SNA system needs to ensure that all incoming data is reliably stored (independent of the arrival rate).
Phase 2: Data Preparation. This phase prepares a snapshot of the collected production data. The snapshot is the basis for a complete execution of the various analytics queries. Since all analytics queries are run on the same snapshot, their outcomes are consistent with each other.
Phase 3: Analysis Execution. As soon as the snapshot is made available, the analysis starts and executes all analytics queries completely.
Phase 4: Result Finalization. After the analysis execution phase, the results are made available to the user interface so that end users and analysts can examine them.

Phases 2, 3 and 4 are iterative and are repeated many times. The data preparation phase can create snapshots on a fixed schedule (for example, once a day) or following an "as soon as possible" strategy. In principle, as soon as one snapshot is prepared, another one can be prepared in parallel, and consequently the analysis execution phase can run in parallel, too. The schedule and the degree of parallelization are determined by the needs of the company. However, parallelism requires significantly more storage space and computational power.

The final SNA Analysis Data produced by Phase 4 is replaced with the result of each subsequent scheduled execution, so that end users and analysts always operate on the latest data set.
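
The four phases can be illustrated with a minimal Python sketch. It is not the project's implementation: the class SnaSystem, its methods, and the two example queries are hypothetical names chosen for illustration, and the in-memory lists stand in for whatever storage a real system would use.

```python
from dataclasses import dataclass, field


@dataclass
class SnaSystem:
    """Hypothetical in-memory stand-in for the SNA system (illustration only)."""
    collected: list = field(default_factory=list)   # Phase 1 store of updates
    published: dict = field(default_factory=dict)   # Phase 4 results for users

    def collect(self, edge):
        # Phase 1: store every incoming update; this runs continuously.
        self.collected.append(edge)

    def prepare_snapshot(self):
        # Phase 2: freeze an immutable copy of the data collected so far.
        return tuple(self.collected)

    def analyze(self, snapshot, queries):
        # Phase 3: run every query against the same snapshot.
        return {name: query(snapshot) for name, query in queries.items()}

    def finalize(self, results):
        # Phase 4: replace the previously published results as a whole.
        self.published = results


def edge_count(snapshot):
    return len(snapshot)


def user_count(snapshot):
    return len({user for edge in snapshot for user in edge})


QUERIES = {"edges": edge_count, "users": user_count}

sna = SnaSystem()
sna.collect(("alice", "bob"))
sna.collect(("bob", "carol"))

snapshot = sna.prepare_snapshot()
sna.collect(("carol", "dave"))      # arrives after the snapshot was taken
sna.finalize(sna.analyze(snapshot, QUERIES))
first_run = sna.published

# The next cycle picks up the update that arrived in the meantime.
sna.finalize(sna.analyze(sna.prepare_snapshot(), QUERIES))
second_run = sna.published
```

Because both queries in a cycle read the same frozen snapshot, the update that arrives during the first cycle affects neither of them and only becomes visible after the second cycle.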

Execution Mode

Traditionally, analytics is performed on historic data: a snapshot of the production data, or the historic data up to a point in time, is the basis for the analytics. This results in "off-line" analytics, where the results of the analytics queries are based on a data set that is not completely up to date compared with the transactional data of the production systems. In Social Network Analysis, however, this is not necessarily the only appropriate strategy because of user behavior.

For example, users who add a friend relationship often want to check immediately afterwards how their network reach has increased. In this use case it does not make sense to compute the network reach off-line, as the user would have to wait until the next analysis execution phase. Instead, users demand "on-line" analytics that is performed on the current state of the production data set.

It is therefore important to distinguish two modes of analysis:

Off-line Analytics. The analytics queries are executed based on a consistent and historic data set.
On-line Analytics. The analytics queries are executed based on a current data set that is consistent with the current state of the production data.

For each Social Network Analytics query it must be decided whether the query is available in off-line mode, in on-line mode, or in both. This decision has implications for the computational power and storage space required.
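
The difference between the two modes can be sketched in Python using the network reach example from above. The sketch assumes that "network reach" means the number of users reachable within two hops; this definition, the function names, and the sample data are assumptions made for illustration and are not taken from the project.

```python
from collections import defaultdict


def build_adjacency(edges):
    # Friend relationships are treated as undirected.
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def network_reach(adjacency, user, max_hops=2):
    """Count users reachable from `user` within `max_hops` (assumed definition)."""
    seen = {user}
    frontier = {user}
    for _ in range(max_hops):
        frontier = {n for u in frontier for n in adjacency.get(u, ())} - seen
        seen |= frontier
    return len(seen) - 1  # do not count the user itself


production_edges = [("alice", "bob"), ("bob", "carol"), ("dave", "erin")]

# Off-line mode: reach is precomputed for all users from a frozen snapshot.
snapshot = build_adjacency(production_edges)
offline_reach = {user: network_reach(snapshot, user) for user in list(snapshot)}

# The production data changes after the snapshot: alice befriends dave.
production_edges.append(("alice", "dave"))

# On-line mode: reach is computed on request from the current production state.
online_reach_alice = network_reach(build_adjacency(production_edges), "alice")
```

In this sketch the off-line result for alice still reflects the snapshot, while the on-line result already includes the new relationship. The on-line variant pays for this by rebuilding and traversing the graph on every request.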

Technology

The technology deployed changed during the course of the project. There were two phases:

Phase 1: Semantic Technologies

In the initial phase, a stack of Semantic Technologies was used. It comprised a semantic database, SPARQL, and a development environment specialized for modeling semantic structures and SPARQL queries on RDF and RDF/S.

The network data were represented as RDF triples and transferred over from a transaction system. The network analysis was implemented as a series of SPARQL queries in the semantic database. The result was made available as various types of graphs that were incorporated into the transaction system.

Phase 2: Relational Technologies

The second phase re-implemented the social network analysis algorithms on a relational database management system. The network data was already in relational form in the transaction system. While the semantic technologies provided strong support for the functionality, integrating them into the existing technology landscape was considered too costly. The results from phase 1 were therefore transferred to the existing relational technology stack.
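
As an illustration of what a relational formulation of a network analysis query can look like, the following Python sketch uses the standard sqlite3 module. The table friendship, its columns, and the choice of the degree (number of relationships per user) as the metric are assumptions for this example; the project's actual schema and queries are not described here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per undirected relationship.
conn.execute("CREATE TABLE friendship (user_a TEXT NOT NULL, user_b TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO friendship VALUES (?, ?)",
    [("alice", "bob"), ("alice", "carol"), ("bob", "carol"), ("carol", "dave")],
)

# Each relationship is stored once, so both columns are combined
# before counting how many relationships each user takes part in.
DEGREE_QUERY = """
    SELECT user_name, COUNT(*) AS degree
    FROM (
        SELECT user_a AS user_name FROM friendship
        UNION ALL
        SELECT user_b AS user_name FROM friendship
    )
    GROUP BY user_name
    ORDER BY degree DESC, user_name
"""
degrees = conn.execute(DEGREE_QUERY).fetchall()
conn.close()
```

The result is a list of (user, degree) rows that could be stored as an analysis result and handed to the user interface.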

Research Questions

In the context of Social Network Analysis, several interesting research questions arise:

Social Network Analytics Language. Is it possible and necessary to have a Social Network Analytics execution language that allows analytics queries to be formulated declaratively instead of using a programming model and language?
Continuous Social Network Analytics. Is it possible to have ongoing analytics in Social Networking, so that the analytics queries are executed continuously instead of on a snapshot-based data state of the past?
Delta Computation. Instead of re-running analytics queries over changed data, is it possible to determine the change in analytics results from the data changes?
Separation of Concerns. Instead of executing all analytics queries in the analysis execution phase, is it possible to separate concerns so that only specific data sets are pre-computed and the specific queries are run on demand?

For example, instead of computing the set of mutually linked users for each given user, it is possible to compute only the mutual links for the whole network and to derive the mutually linked network for a given user on demand.
The research question here lies in determining the right trade-off between what is pre-computed and what is computed on-line.
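
The mutual links example can be sketched in Python as follows. The sketch assumes that links are directed and that two users are mutually linked when a link exists in both directions; this reading, the function names, and the sample data are assumptions for illustration.

```python
def mutual_links(directed_links):
    """Pre-computation: all mutually linked pairs in the whole network."""
    links = set(directed_links)
    return {frozenset((a, b)) for a, b in links if a != b and (b, a) in links}


def mutual_links_of(user, mutual):
    """On-demand step: derive one user's mutual links from the pre-computed set."""
    return {other for pair in mutual if user in pair for other in pair if other != user}


directed_links = [
    ("alice", "bob"), ("bob", "alice"),    # mutual
    ("alice", "carol"),                    # one direction only
    ("carol", "dave"), ("dave", "carol"),  # mutual
]

# Computed once per snapshot in the analysis execution phase.
mutual = mutual_links(directed_links)

# Computed only when a user or analyst asks for it.
alice_mutual = mutual_links_of("alice", mutual)
carol_mutual = mutual_links_of("carol", mutual)
```

Only the network-wide set is stored. The per-user view costs one pass over that set at request time, which is one point on the trade-off between pre-computation and on-line computation.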