Advanced Scalability

Introduction

Hdiv supports scalability and high performance by providing a versatile framework for configuring the persistence of security-related data. The framework is designed as a chain of caches which can be configured and selected to suit the requirements of each application.

Depending on the client’s requirements, the number of Hdiv enabled applications and the hardware available, different approaches can be taken by configuring faster options or others that are more memory-optimized.

A more detailed explanation will be provided later but briefly, the following cache types can be used:

  • Memory caches
  • In-Memory Database caches
  • Relational Database caches

They can also be classified depending on their scope:

  • User caches
  • Single application caches
  • Shared caches

Finally, they can also be divided into those that keep the information in raw format and those that compress it to save memory:

  • Raw format caches
  • Compressed caches

Architecture

Requirements & Constraints

Hdiv may handle an enormous range of applications, each with different requirements and constraints. The architecture therefore needs to handle different scenarios, optimizing throughput while keeping the memory footprint as low as possible. The following considerations were important while designing the cache implementation:

Persistence requirements

Depending on the application, the requirements for security data persistence could be different. By default, Hdiv promotes the persistence of all data, but other approaches are also available if the client adopts a simpler architecture and accepts losing obsolete/old data that is (probably) no longer needed.

There is also room for configuring the persistence in different terms. For example, a client may choose to have a reserved cache size for each user, while another might prefer to have a reliable total size by configuring a persistence that is shared between the different users.

Memory/CPU requirements

Different applications may be more CPU bound or memory bound. Hdiv is designed to handle this by providing a fine-grained configuration.

In-Memory caches offer first-class performance at the cost of higher memory usage. Compressed in-memory caches keep most of those advantages while reducing memory usage by a factor of 3-4. Finally, external database caches move the data persistence to another (probably shared) resource.

Scope of the data

Traditional monolithic applications do not usually have jumps between different applications. However, new architectures, like those based on microservices, promote this. Depending on the situation, security data could be private to one application or may need to be shared between all of them, so that each application knows that a particular link was legitimately created by another.

The same applies if the application is deployed in a cluster and no session affinity is configured, so that a request from a user can be handled by one server and the next request handled by another.

Top level design

Hdiv caches work as a linked list of caches and the number of cache levels is configurable. Whitelist data is pushed by the application through the first level, and it progresses depending on the current status of each cache level and the characteristics of the data itself.

While the data moves forward through the caches, it may be compressed or sent to a shared cache. The design guarantees that:

  • Data is only present in one cache level: This is true in general, although data may be briefly duplicated while it progresses to the next cache level.
  • Data only moves forward to the deepest levels of the cache chain: For simplicity, the cache chain is designed so that data always moves forward to the next deeper level. The application may retrieve data for validation, but the data stays in the same level.
  • Once a cache level is compressed, the following ones will also be compressed: There can be only one compression point. Caches before that point keep the data in raw format and those after it store the data in compressed format.
  • Once a cache level is shared, the following ones will also be shared: In the same way, when a cache is shared, those after it will also be shared. A shared cache is one that is accessible by different applications.
  • Shared data is inserted in the first shared cache level: When shared data is inserted in a cache chain (a link that jumps between applications), it is pushed to the first shared cache level so that it is directly available for other applications to use.
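The forward-only movement described above can be illustrated with a minimal sketch. The class names, the in-memory map and the eviction order (oldest first) are assumptions made for illustration and are not Hdiv's actual implementation.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of one level in a cache chain; not Hdiv's actual classes. */
class CacheLevel {
    final String name;
    final int size;        // maximum number of pages kept in this level
    final int batch;       // number of pages moved to the next level on overflow
    final CacheLevel next; // null for the last level
    final Map<String, byte[]> pages = new LinkedHashMap<>(); // insertion order = age

    CacheLevel(String name, int size, int batch, CacheLevel next) {
        this.name = name;
        this.size = size;
        this.batch = batch;
        this.next = next;
    }

    /** Inserts a page; on overflow the oldest batch moves one level deeper. */
    void put(String id, byte[] page) {
        pages.put(id, page);
        if (pages.size() > size) {
            Iterator<Map.Entry<String, byte[]>> it = pages.entrySet().iterator();
            for (int i = 0; i < batch && it.hasNext(); i++) {
                Map.Entry<String, byte[]> oldest = it.next();
                if (next != null) {
                    next.put(oldest.getKey(), oldest.getValue());
                }
                it.remove(); // the page now lives in one level only (or is dropped at the end of the chain)
            }
        }
    }

    /** Looks a page up through the chain without moving it back to an earlier level. */
    byte[] get(String id) {
        byte[] page = pages.get(id);
        if (page != null || next == null) {
            return page;
        }
        return next.get(id);
    }
}

class CacheChainDemo {
    public static void main(String[] args) {
        CacheLevel second = new CacheLevel("second", 100, 10, null);
        CacheLevel first = new CacheLevel("first", 4, 2, second);
        for (int i = 1; i <= 5; i++) {
            first.put("page-" + i, new byte[] {(byte) i});
        }
        System.out.println(first.pages.keySet());  // [page-3, page-4, page-5]
        System.out.println(second.pages.keySet()); // [page-1, page-2]
    }
}
```

In this sketch a lookup for page-2 is answered by the second level, and the page stays there, which mirrors the rule that retrieved data is kept in the same level.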

An application is properly configured when:

  • Most of its data accesses are handled by the first cache level: The data required for validation should ideally be retrieved from an in-memory first cache level so that no conversions or queries are needed.
  • Memory usage is kept as low as possible: It is important to conform with the previous point while keeping the memory usage as low as possible for the current application.
  • Deepest levels are rarely accessed: Later cache levels should be a backup for unusual situations and very rarely used. Ideally, only shared data should work in deep cache levels.

While these are the desired characteristics, real-life applications do not usually conform to an ideal situation, so caches should be configured to obtain the optimum performance for the application requirements.

User vs Per application vs Shared caches

Hdiv’s scalable design is able to handle both application and shared caches. An application cache is accessible only to the application that created the information. Shared caches, on the other hand, make the data accessible to different applications transparently.

If only a single application is present, shared caches may not be needed, but in most cases they provide the required communication mechanism between the different apps. For example, a link created in one application that jumps to another is considered valid by the target application.

A special case is User caches. In these caches the data is stored for each user, allowing a fine-grained configuration of the persistence level, taking into account each user. Unfortunately, although it works perfectly for small applications, it is not as scalable as other approaches.

Raw vs Compressed caches

Caches also differ in how they store the security data, and can be divided into raw and compressed caches accordingly.

  • Raw caches: The information stored in them is kept in raw format and is directly usable by the application. Their performance is optimal but their memory usage is higher. As already explained, only the first cache levels keep the data in raw format.
  • Compressed caches: These caches optimize memory usage at a small CPU cost. The data inside them is stored in a compressed format, which reduces memory usage by a factor of 3-4. The deepest cache levels use this approach.
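The trade-off between raw and compressed storage can be sketched as a round trip through a compressor. GZIP is used here only as an assumption for illustration: this document does not state which algorithm Hdiv uses, and the class and method names are hypothetical. The ratio obtained depends on the data, so the sketch prints the sizes rather than promising a figure.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper: GZIP is an assumption, not necessarily what Hdiv uses. */
class PageCompression {

    /** Compresses a raw page before it enters a compressed cache level. */
    static byte[] compress(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Restores the raw page when the application needs it for validation. */
    static byte[] decompress(byte[] compressed) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int read;
            while ((read = gzip.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // Repetitive sample content standing in for a page of security data.
        StringBuilder page = new StringBuilder();
        for (int i = 0; i < 50; i++) {
            page.append("/app/orders/view?id=").append(i).append(';');
        }
        byte[] raw = page.toString().getBytes(StandardCharsets.UTF_8);
        byte[] compressed = compress(raw);
        System.out.println("raw=" + raw.length + " bytes, compressed=" + compressed.length + " bytes");
        System.out.println("round trip ok: " + Arrays.equals(raw, decompress(compressed)));
    }
}
```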

Fault recovery

Advanced scalability caches are designed for fault recovery. All kinds of persistence layers (SQL, NoSQL and in-memory databases) are fault tolerant, so if a database suddenly becomes unavailable, Hdiv will not stop working. In that situation Hdiv continues working in the same way, but because some of the validation details are not available, validation is disabled while the outage lasts.

Later, when the persistent storage is available again, Hdiv resumes normal validation transparently, without application users even noticing.
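The behaviour during an outage can be pictured as a "fail open" wrapper around the storage lookup. The following is a minimal sketch under assumptions: the class, the enum and the idea of signalling an outage with a runtime exception are all hypothetical and do not describe Hdiv's internal API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical names throughout; this is not Hdiv's API. */
class FailOpenValidator {
    enum Result { VALID, INVALID, SKIPPED }

    // Looks up the expected state for a page id; may throw if the storage is down.
    private final Function<String, Optional<String>> store;

    FailOpenValidator(Function<String, Optional<String>> store) {
        this.store = store;
    }

    Result validate(String pageId, String receivedState) {
        Optional<String> expected;
        try {
            expected = store.apply(pageId);
        } catch (RuntimeException storageOutage) {
            // Storage unavailable: keep serving requests and skip validation.
            return Result.SKIPPED;
        }
        boolean matches = expected.isPresent() && expected.get().equals(receivedState);
        return matches ? Result.VALID : Result.INVALID;
    }

    public static void main(String[] args) {
        Map<String, String> states = new HashMap<>();
        states.put("page-1", "state-1");
        boolean[] outage = {false}; // toggled to simulate the database going away

        FailOpenValidator validator = new FailOpenValidator(pageId -> {
            if (outage[0]) {
                throw new IllegalStateException("storage unavailable");
            }
            return Optional.ofNullable(states.get(pageId));
        });

        System.out.println(validator.validate("page-1", "state-1"));  // VALID
        outage[0] = true;
        System.out.println(validator.validate("page-1", "tampered")); // SKIPPED
        outage[0] = false;
        System.out.println(validator.validate("page-1", "tampered")); // INVALID
    }
}
```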

Cache Types

User session caches

A special type of in-memory cache which keeps the information in data structures associated with the user who created them. They allow fine-grained user-related configuration, but their scalability problems make them unsuitable for high load environments.

Their behaviour is slightly different than the rest of the caches, as they break the rule of information being stored in only one cache. Instead, they handle the data in parallel with the main cache chain.

In-Memory caches (IMC)

In-memory caches are a versatile type of cache that support several working modes, allowing fine-grained configuration for optimizing application performance.

The following parameters are designed to be customizable in this type of cache:

  • Data format: In-Memory caches can work both with raw data or compressed data
  • Scope: Usually In-Memory caches work as a first level application cache. However, they can be configured to work as a shared resource throughout the application server. Note that they cannot be used as a shared resource in cluster environments
  • Size: It is possible to select the amount of information that can be retained in these caches
  • Batch size: When a cache reaches its maximum size, a batch of information is moved to the next cache level. By default, the batch size is the same as the cache size, but it should be customized. A good ratio, for example, is a batch size an order of magnitude smaller than the cache size (a size of 1000 with a batch size of 100)

In-Memory database caches (IMDC)

In many cases, the persistence of security-related data is better handled outside the application server itself. One of the current implementations uses a cache to store data in a REDIS in-memory database. This cache is always:

  • Shared: Several applications can access the same database, in contrast to In-Memory caches. This allows real shared caches that can work even in cluster environments

  • Compressed: The data stored in the database is always compressed

REDIS is a database engine that stores data in key-value format, in memory, saving its state periodically. REDIS works in client/server mode as a service handling requests and allows connections from any other hosts. It can also work in a cluster of REDIS servers, so it is easily scalable and can be configured to automatically delete data when a particular limit is reached.

As REDIS has been designed for speed, it obtains good performance (<0.5ms).

If present, it is usually configured as the last level of cache. However, it can sometimes be configured to overflow into the next level of cache (a relational database).

The only disadvantage is that it is a relatively new technology, and it may be difficult to introduce in some production environments as a certification process may be needed before installation is possible.

NoSQL database caches (NSDC)

When flexible scalability is required, NoSQL databases are the better option due to their built-in cluster features. These caches are always:

  • Shared: Several applications can access the same database, in contrast to In-Memory caches. This allows real shared caches that can even work in cluster environments

  • Compressed: The data stored in the database is always compressed

Two types of NoSQL databases are currently supported: MongoDB and Cassandra. Their performance is usually worse than that of REDIS-backed caches but better than that of relational database caches, and they adapt flexibly to scalability changes. If present, this cache is the last level of the chain.

Relational database caches (RDC)

The last type of cache is backed by a relational database, which may be one the application already uses for business data or one that stores Hdiv data exclusively. Its features are the same as those of a REDIS-backed cache:

  • Shared: Several applications can access the same database, in contrast to In-Memory caches. This allows real shared caches that can even work in cluster environments
  • Compressed: The data stored in the database is always compressed

If present, it is always the last cache level. Any database that provides a JDBC driver is supported as the backing store of this cache level, and database sharding is supported for improving the overall performance. By default this cache saves the information in a single table, but a group of tables can be configured to improve performance (the resulting tables will be smaller).

Cache configuration examples

The cache chain is completely configurable as none of the cache types is mandatory and they can be combined freely. Some example configurations are provided below.

Pure memory I: Single IMC

If only one application runs on the server and its usage is not extensive, the simplest configuration can be used: enable a single, raw, in-memory cache.

Pure memory II: Raw & Compressed IMC

Another valid combination is to mix a small raw IMC with a bigger compressed IMC to allow more storage of old data without using more memory.

Simplest shared configuration: Single Database Cache

If several applications are present or we do not want to lose any data, and provided the workload is not too high, a single database cache can be configured. REDIS or a traditional relational database can be used (the second option is more common in general).

Average configuration: IMCs + RDC

The most common configuration is a mixture of IMCs and an RDC. The memory cache provides fast performance, while the RDC keeps all the required data without losing any record and allows information sharing between different applications.

A similar configuration replaces the RDC with an IMDC, which allows better performance.

Average configuration: IMC (Shared) + RDC

In this example, traditional IMCs are replaced by one shared IMC for the whole server. This approach could be interesting if many Hdiv applications are deployed in the same application server. The main advantage is that a lot of memory can be saved by keeping a single cache for all the applications.

Activating user caches

User caches can work in parallel with the cache chain. They can be activated with a defined number of pages per user, according to their individual data storage needs.
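A user cache with a fixed number of pages per user can be sketched with one bounded map per user. This is only an illustration: the class name, the eviction rule (drop the oldest page for that user) and the use of LinkedHashMap are assumptions, not a description of how Hdiv implements its USER cache.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical per-user cache; not Hdiv's actual implementation. */
class UserPageCache {
    private final int pagesPerUser;
    private final Map<String, Map<String, byte[]>> byUser = new ConcurrentHashMap<>();

    UserPageCache(int pagesPerUser) {
        this.pagesPerUser = pagesPerUser;
    }

    void put(String userId, String pageId, byte[] page) {
        Map<String, byte[]> pages = byUser.computeIfAbsent(userId, u -> newBoundedMap());
        synchronized (pages) {
            pages.put(pageId, page);
        }
    }

    byte[] get(String userId, String pageId) {
        Map<String, byte[]> pages = byUser.get(userId);
        if (pages == null) {
            return null;
        }
        synchronized (pages) {
            return pages.get(pageId);
        }
    }

    /** Creates a map that drops its oldest entry once the per-user limit is exceeded. */
    private Map<String, byte[]> newBoundedMap() {
        return new LinkedHashMap<String, byte[]>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > pagesPerUser;
            }
        };
    }

    public static void main(String[] args) {
        UserPageCache cache = new UserPageCache(2);
        cache.put("1278AB12", "1", new byte[] {1});
        cache.put("1278AB12", "2", new byte[] {2});
        cache.put("1278AB12", "3", new byte[] {3}); // evicts page 1 of this user only
        cache.put("5464CB12", "1", new byte[] {9});
        System.out.println(cache.get("1278AB12", "1") == null); // true
        System.out.println(cache.get("5464CB12", "1") != null); // true
    }
}
```

Because every user reserves its own space, total memory grows with the number of active users, which is consistent with the scalability limits mentioned for this cache type.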

Complex configuration (IMC + IMDC + RDC)

The most complex cache configuration combines an IMC with a REDIS database that overflows into the last cache level, an RDC.

IMDC details

REDIS configuration

This cache type uses REDIS as its storage system. A REDIS server can be started in this way:

./redis-server redis-1.conf

In the redis-1.conf configuration file, the following values can be modified:

Property                       Description

port 6379                      REDIS listener port
daemonize no                   Whether REDIS should run as a daemon or not
databases 1                    Number of databases to be created
save 10 1                      Create a backup if at least one key was modified within 10 seconds
dbfilename file.rdb            Backup file name
requirepass hdiv               Password for database clients
maxmemory                      Maximum amount of memory to be used by REDIS
maxmemory-policy volatile-ttl  Cleanup policy to be applied when the database reaches its maximum allowed memory usage
maxmemory-samples 10           Number of samples to be retrieved before applying the automatic cleanup algorithm

Data storage

REDIS servers store data as key-value pairs. Hdiv saves its data in two types of registers:

Key            Value
user:1278AB12  user:1278AB
user:5464CB12  user:5464CB
1278AB12:1     Byte(Page(1278AB12-1))
1278AB12:2     Byte(Page(1278AB12-2))
1278AB12:3     Byte(Page(1278AB12-3))
5464CB12:1     Byte(Page(5464CB12-1))
5464CB12:2     Byte(Page(5464CB12-2))
...            ...

As we can see, two types of registers are present:

  • User registers: Saved for storing active user information if the REDIS cache level is the last one
  • Page/Data registers: Associated with one particular user, they store information about a single security data object, usually called a Page. The size of the value column for each object depends on the information present, but a typical size is around 1 KB
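The key layout shown in the table can be expressed as two small helper functions. This sketch only reproduces the key patterns visible above (user:<userId> and <userId>:<pageNumber>); the class and method names are hypothetical, and how the values are serialized is not covered.

```java
/** Hypothetical helper that mirrors the key patterns listed in the table above. */
class RedisKeyLayout {

    /** Key of the register that marks a user as active, e.g. user:1278AB12. */
    static String userKey(String userId) {
        return "user:" + userId;
    }

    /** Key of a page register owned by a user, e.g. 1278AB12:1. */
    static String pageKey(String userId, int pageNumber) {
        return userId + ":" + pageNumber;
    }

    public static void main(String[] args) {
        System.out.println(userKey("1278AB12"));    // user:1278AB12
        System.out.println(pageKey("1278AB12", 2)); // 1278AB12:2
    }
}
```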

Clustering

Hdiv’s In-Memory database caches support clustering for higher scalability needs. REDIS clustering is done using Twemproxy (nutcracker), a proxy that adds clustering and load balancing features to REDIS server clusters.

It provides a GUI to monitor server instances: nutcracker-web

Configuration:

./nutcracker -c nutcracker.yml -d

  • -d, --daemonize : run as a daemon
  • -c, --conf-file=S : set configuration file (default: conf/nutcracker.yml)

The nutcracker.yml configuration file looks like this:

alpha:
  listen: 127.0.0.1:22121
  hash: fnv1a_
  distribution: ketama
  auto_eject_hosts: true
  redis: true
  server_retry_timeout: 2000
  server_failure_limit: 1
  servers:
   - 127.0.0.1:6379:
   - 127.0.0.1:9379:

In this file, two REDIS instances are configured, 127.0.0.1:6379 and 127.0.0.1:9379, and clients connect to the proxy on port 22121.

RDC details

This cache type uses a relational database system as its storage system.

Supported databases

Any database that can be accessed using a JDBC driver should work with Hdiv without problems. Nevertheless, the following database types have already been tested:

  • Oracle
  • PostgreSQL
  • MySQL
  • SQLServer
  • DB2
  • H2

Although SSD drives are preferable, Hdiv was also tested on magnetic drives without a significant impact on performance.

Performance

The Hdiv data model is optimized to be as simple as possible by being stored in a single database table. This allows it to be easily integrated in existing business logic databases while providing fast performance.

This approach works well for most applications, but for those with a high number of requests, activating the sharding features is recommended.

Hdiv supports database sharding by splitting the information between different tables, which provides smaller tables that are faster to access. The number of database tables to be used by Hdiv can be configured from 1 to 256.
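The idea of spreading pages over several tables can be sketched as a function from a user id to a table name. The hashing rule below (String.hashCode modulo the number of tables) is purely an assumption for illustration, since this document does not describe how Hdiv picks a table; only the table-name pattern (a prefix followed by an index, such as hdiv_pages_0) and the 1 to 256 range come from the text. The class name is hypothetical.

```java
/** Hypothetical shard selector; the hashing rule is an assumption, not Hdiv's algorithm. */
class ShardTableResolver {
    private final String tablesSubjectName;
    private final int numberOfTables;

    ShardTableResolver(String tablesSubjectName, int numberOfTables) {
        if (numberOfTables < 1 || numberOfTables > 256) {
            throw new IllegalArgumentException("numberOfTables must be between 1 and 256");
        }
        this.tablesSubjectName = tablesSubjectName;
        this.numberOfTables = numberOfTables;
    }

    /** Maps a user id to one of the sharding tables; the same user always gets the same table. */
    String tableFor(String userId) {
        int index = Math.floorMod(userId.hashCode(), numberOfTables);
        return tablesSubjectName + index;
    }

    public static void main(String[] args) {
        ShardTableResolver resolver = new ShardTableResolver("hdiv_pages_", 2);
        // Prints hdiv_pages_0 or hdiv_pages_1 depending on the hash of each id.
        System.out.println(resolver.tableFor("1278AB12"));
        System.out.println(resolver.tableFor("5464CB12"));
    }
}
```

Keeping all pages of one user in the same table means a lookup by (iduser, idpage) only ever touches a single, smaller table.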

Database sample schemas

Oracle 11G

CREATE TABLE HDIV_PAGES (
    "IDUSER" VARCHAR2(8) NOT NULL ,
    "IDPAGE" VARCHAR2(48) NOT NULL ,
    "PAGE" BLOB ,
    "TIMESTAMP" TIMESTAMP NOT NULL ,
    CONSTRAINT "PK_HDIV_PAGES" PRIMARY KEY ("IDUSER", "IDPAGE")
)

PostgreSQL

CREATE TABLE HDIV_PAGES (
    iduser character varying (8) NOT NULL ,
    idpage character varying(48) NOT NULL ,
    page bytea ,
    "timestamp" timestamp without time zone NOT NULL ,
    CONSTRAINT pk_HDIV_PAGES PRIMARY KEY (iduser, idpage)
)

Hsqldb

create table HDIV_PAGES (
    iduser char varying (8) not null ,
    idpage char varying (48) not null ,
    page LONGVARBINARY ,
    timestamp timestamp not null ,
    constraint pk_HDIV_PAGES primary key (iduser, idpage)
)

DB2

CREATE TABLE HDIV_PAGES (
    iduser VARCHAR (8) not null ,
    idpage VARCHAR (48) not null ,
    page BLOB ,
    timestamp TIMESTAMP not null ,
    constraint pk_HDIV_PAGES primary key (iduser, idpage)
)

MySQL

CREATE TABLE HDIV_PAGES (
    iduser VARCHAR (8) not null ,
    idpage VARCHAR (48) not null ,
    page BLOB ,
    timestamp TIMESTAMP not null ,
    constraint pk_HDIV_PAGES primary key (iduser, idpage)
)

SQLServer

CREATE TABLE HDIV_PAGES (
    iduser VARCHAR (8) not null ,
    idpage VARCHAR (48) not null ,
    page varbinary(max) ,
    timestamp DATETIME2 not null ,
    constraint pk_HDIV_PAGES primary key (iduser, idpage)
)

Hdiv Configuration

Hdiv advanced scalability can be configured in three ways:

XML

Several examples are provided:

<hdiv:externalStateStorage>
    <hdiv:cacheConfig>
        <hdiv:cache type="SHARED" size="1000" batch="100" />
        <hdiv:cache type="COMPRESSED" size="2000" batch="200" />
        <hdiv:cache type="EXT_DB" />
    </hdiv:cacheConfig>
    <hdiv:databaseExternalStateStore
        numberOfTables="2" tablesSubjectName="hdiv_pages_"
        jndiDataSourceLookup="java:/comp/env/jdbc/SampleDS" />
</hdiv:externalStateStorage>

In this case, 3 cache levels are configured:

  • SHARED (raw IMC): Space for 1000 pages; on overflow it moves a batch of 100 pages to the next level
  • COMPRESSED (compressed IMC): Space for 2000 pages stored as byte arrays, with a batch size of 200
  • EXT_DB (RDC): Database with 2 sharding tables, hdiv_pages_0 and hdiv_pages_1, from the java:/comp/env/jdbc/SampleDS JNDI Datasource

<hdiv:externalStateStorage>
    <hdiv:cacheConfig>
        <hdiv:cache type="EXT_MEMORY" />
    </hdiv:cacheConfig>
    <hdiv:redisExternalStateStore host="localhost" port="6379" password="redis" maxPool="15" expireTime="1200" />
</hdiv:externalStateStorage>

A single level is selected in this case: an IMDC with a REDIS server installed on localhost.

<hdiv:cacheConfig>
    <hdiv:cache type="USER" />
    <hdiv:cache type="SHARED" size="1000" />
    <hdiv:cache type="EXT_DB" />
</hdiv:cacheConfig>

<hdiv:externalStateStorage>
    <hdiv:databaseExternalStateStore
        dataSourceRef="datasource" numberOfTables="4" tablesSubjectName="hdiv_pages_table_" />
</hdiv:externalStateStorage>

<bean id="datasource" class="org.apache.commons.dbcp.BasicDataSource"
    destroy-method="close">
    <property name="driverClassName" value="org.postgresql.Driver" />
    <property name="url" value="jdbc:postgresql://localhost/hdiv-ee-external" />
    <property name="username" value="hdiv" />
    <property name="password" value="hdiv-enterprise" />
</bean>

Two cache levels are configured: first an IMC with 1000 raw pages and a batch size of 1000 (the same value as size by default), then an RDC with a JDBC datasource configuration and 4 sharding tables named hdiv_pages_table_[0-3]. Additionally, the USER cache is enabled.

CacheConfig Properties:

<hdiv:cache type="SHARED" size="1000" batch="100" />
  • Cache.type: USER, SHARED, COMPRESSED, SERVER_SHARED, EXT_MEMORY, EXT_DB
  • Cache.size: Size of the cache. Optional attribute
  • Cache.batch: Size of the batch window. Optional attribute, by default the same value as size

RedisExternalStateStore Properties:

<hdiv:redisExternalStateStore host="localhost" port="6379" password="redis"
    maxPool="15" expireTime="1200" />
  • host: REDIS server host
  • port: REDIS server port
  • password: password used to log in to the REDIS server
  • maxPool: connection pool size for REDIS
  • expireTime: default expiry time for keys

DatabaseExternalStateStore Properties:

<hdiv:databaseExternalStateStore dataSourceRef="datasource"
    numberOfTables="4" tablesSubjectName="hdiv_pages_table_" />
  • dataSourceRef: Optional attribute. Datasource name as it is defined below
  • jndiDataSourceLookup: Optional attribute. Datasource JNDI name
  • numberOfTables: Number of sharding tables
  • tablesSubjectName: Table name in the database

If the dataSourceRef property is present, then a datasource bean should be defined:

<bean id="datasource" class="org.apache.commons.dbcp.BasicDataSource" destroy-method="close">
    <property name="driverClassName" value="org.postgresql.Driver" />
    <property name="url" value="jdbc:postgresql://localhost/hdiv-ee-external" />
    <property name="username" value="hdiv" />
    <property name="password" value="hdiv-enterprise" />
</bean>

Master Configuration

Master configuration uses a properties file called hdiv-ee-config.properties for configuring the application.

The properties file is searched in two locations:

  • Hdiv general configuration folder, configured with system property hdiv.config.dir
  • In the root directory of the classpath
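The two-location lookup can be sketched as follows. The search order (configuration folder first, classpath root second) is an assumption, since the text only lists the locations, and the ConfigLocator class is hypothetical. The file name and the hdiv.config.dir system property are taken from the text above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

/** Hypothetical loader; the lookup order is an assumption. */
class ConfigLocator {
    static final String FILE_NAME = "hdiv-ee-config.properties";

    static Properties load() {
        Properties props = new Properties();
        try {
            // 1. Folder pointed to by the hdiv.config.dir system property, if set.
            String dir = System.getProperty("hdiv.config.dir");
            if (dir != null) {
                Path candidate = Paths.get(dir, FILE_NAME);
                if (Files.isRegularFile(candidate)) {
                    try (InputStream in = Files.newInputStream(candidate)) {
                        props.load(in);
                        return props;
                    }
                }
            }
            // 2. Root directory of the classpath.
            try (InputStream in = ConfigLocator.class.getClassLoader().getResourceAsStream(FILE_NAME)) {
                if (in != null) {
                    props.load(in);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props; // empty when the file is found in neither location
    }

    public static void main(String[] args) {
        Properties props = load();
        System.out.println("EXTERNAL_STORAGE.TYPE=" + props.getProperty("EXTERNAL_STORAGE.TYPE", "NONE"));
    }
}
```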

The following properties are related to the external storage part:

Key                                                     Value (Examples)

EXTERNAL_STORAGE.TYPE                                   NONE/EMBEDDED_RDDBB/RDDBB/REDIS
EXTERNAL_STORAGE.DATABASE.URL                           jdbc:postgresql://localhost/hdiv-ee-external
EXTERNAL_STORAGE.DATABASE.DRIVER_CLASS                  org.postgresql.Driver
EXTERNAL_STORAGE.DATABASE.USERNAME                      hdiv
EXTERNAL_STORAGE.DATABASE.PASSWORD                      hdiv-enterprise
EXTERNAL_STORAGE.DATABASE.DATASOURCE_JNDI_NAME          java:/comp/env/jdbc/SampleDS
EXTERNAL_STORAGE.DATABASE.SHARDING.NUMBER_OF_TABLES     4
EXTERNAL_STORAGE.DATABASE.SHARDING.TABLES_SUBJECT_NAME  Hdiv_Pages_
EXTERNAL_STORAGE.REDIS.HOST                             localhost
EXTERNAL_STORAGE.REDIS.PORT                             6379
EXTERNAL_STORAGE.REDIS.PASSWORD                         hdiv
EXTERNAL_STORAGE.REDIS.EXPIRE_TIME                      12000 (seconds)
EXTERNAL_STORAGE.REDIS.MAXPOOL                          15
EXTERNAL_STORAGE.CACHE_CONFIG                           See below
EXTERNAL_STORAGE.CACHE_FLUSH_TIMEOUT_SECONDS            -1
ENABLE_METRICS                                          False by default (enable internal metrics)
EXTERNAL_LINKS_PROTECTION                               True by default

  • EXTERNAL_LINKS_PROTECTION: If external links protection is disabled, absolute links and links to another domain or application are not protected by Hdiv, so the links will not contain an HDIV_STATE parameter.
  • EXTERNAL_STORAGE.CACHE_CONFIG: A special property that handles all the configuration of the caches by itself. The format of the value is:

CACHE_TYPE[,CACHE_SIZE][,BATCH_SIZE];CACHE_TYPE[,CACHE_SIZE][,BATCH_SIZE];...

  • CACHE_TYPE: Same values as the ones previously enumerated in Cache.type XML field
  • CACHE_SIZE: Optional value, size of the cache
  • BATCH_SIZE: Optional value, size of the batch persistence window
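To make the value format concrete, the sketch below parses a string such as SHARED,1000,100;COMPRESSED,2000,200;EXT_DB into one entry per cache level. The parser and its class names are hypothetical and written only to illustrate the format; the example value mirrors the first XML example. The batch size falls back to the cache size when omitted, following the default described for the XML attribute.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser for the CACHE_CONFIG value format; not Hdiv's own code. */
class CacheConfigParser {

    static final class Level {
        final String type;
        final Integer size;  // null when CACHE_SIZE is omitted
        final Integer batch; // defaults to size when BATCH_SIZE is omitted

        Level(String type, Integer size, Integer batch) {
            this.type = type;
            this.size = size;
            this.batch = batch;
        }

        @Override
        public String toString() {
            return type + "(size=" + size + ", batch=" + batch + ")";
        }
    }

    /** Splits levels on ';' and the fields of each level on ','. */
    static List<Level> parse(String value) {
        List<Level> levels = new ArrayList<>();
        for (String entry : value.split(";")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] parts = trimmed.split(",");
            if (parts.length > 3) {
                throw new IllegalArgumentException("Too many fields in: " + trimmed);
            }
            String type = parts[0].trim();
            Integer size = parts.length > 1 ? Integer.valueOf(parts[1].trim()) : null;
            Integer batch = parts.length > 2 ? Integer.valueOf(parts[2].trim()) : size;
            levels.add(new Level(type, size, batch));
        }
        return levels;
    }

    public static void main(String[] args) {
        for (Level level : parse("SHARED,1000,100;COMPRESSED,2000,200;EXT_DB")) {
            System.out.println(level);
        }
        // SHARED(size=1000, batch=100)
        // COMPRESSED(size=2000, batch=200)
        // EXT_DB(size=null, batch=null)
    }
}
```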

Java Configuration

Several Java configuration examples are presented below:

@Override
public void configureExternalStateStorage(final ExternalStateStorageConfigurer externalConfigurer) {
    externalConfigurer.databaseExternalStateStore(externalStorageDataSource()).numberOfTables(4)
        .tablesSubjectName("hdiv_pages_table");
}

@Bean(destroyMethod = "close")
public DataSource externalStorageDataSource() {
    final BasicDataSource dataSource = new BasicDataSource();
    dataSource.setDriverClassName("org.postgresql.Driver");
    dataSource.setUrl("jdbc:postgresql://localhost/postgres");
    dataSource.setUsername("postgres");
    dataSource.setPassword("postgres");
    return dataSource;
}

This sets a datasource connection using 4 sharding tables in a relational database. The default cache configuration is used (SHARED IMC + RDC).

@Override
public void configureExternalStateStorage(ExternalStateStorageConfigurer externalConfigurer) {
    externalConfigurer.redisExternalStateStore().host("localhost").port(1231).password("pass")
        .maxpool(121).expireTime(123123);
}

This is a REDIS configuration with the default caches.

@Override
public void configureExternalStateStorage(ExternalStateStorageConfigurer externalConfigurer) {
    externalConfigurer.redisExternalStateStore().host("localhost").port(1231).password("fernando");

    List<SingleCacheConfig> config = new ArrayList<>();
    config.add(new SingleCacheConfig(CacheType.SHARED));
    SingleCacheConfig sconfig = new SingleCacheConfig(CacheType.COMPRESSED);
    sconfig.setProperty(SingleCacheConfig.CACHE_SIZE, Integer.toString(10000));
    config.add(sconfig);
    config.add(new SingleCacheConfig(CacheType.EXT_MEMORY));

    externalConfigurer.cacheConfig(config);
}

This is a REDIS configuration with three cache levels: one SHARED IMC with default properties, one COMPRESSED IMC with a size of 10000, and finally a REDIS IMDC.