Solaris Resource Management

See Resource Management in the System Administration Guide: Solaris Containers-Resource Management and Solaris Zones

Below is an extract, mostly word for word, from the above document.

Overview

Definitions

A workload is an aggregation of all processes of an application or group of applications.

Without resource management, all activities on a system are given equal access to resources. Solaris resource management features enable you to treat workloads individually and to control how each workload consumes system resources.

 The ability to minimize cross-workload performance compromises, along with the facilities that monitor resource usage and utilization, is referred to as resource management. 
 Resource management is implemented through a collection of algorithms. 
 The algorithms handle the series of capability requests that an application presents in the course of its execution.
 Resource management facilities permit you to modify the default behavior of the operating system with respect to different workloads.
 Behavior primarily refers to the set of decisions that are made by operating system algorithms when an application presents one or more resource requests to the system.

Implementing resource management can help you meet a number of business and performance goals.

When planning resource management, it is important to identify cooperating and conflicting workloads and create a configuration that presents the least compromise to the service goals of the business, within the limitations of the system’s capabilities.

Use resource management to ensure that your applications have the required response time.

Resource Management Control Mechanisms

Constraint Mechanisms

Applications can be written to be aware of their resource constraints, but not all application writers will choose to do this. Constraints that are set inappropriately may therefore prevent an application from functioning.

Scheduling Mechanisms

 Scheduling refers to making a sequence of allocation decisions at specific intervals. 
 The decision that is made is based on a predictable algorithm. 
 An application that does not need its current allocation leaves the resource available for another application's use.
 Scheduling-based resource management enables full utilization of an undercommitted configuration,
 while providing controlled allocations in a critically committed or overcommitted scenario. 
 The underlying algorithm defines how the term “controlled” is interpreted. 
 In some instances, the scheduling algorithm might guarantee that all applications have some access to the resource.
 The fair share scheduler (FSS), for example, manages application access to CPU resources in a controlled way.

Partitioning Mechanisms

 Partitioning is used to bind a workload to a subset of the system's available resources.
 This binding guarantees that a known amount of resources is always available to the workload.
 The resource pools functionality enables you to limit workloads to specific subsets of the machine.
 Configurations that use partitioning can avoid system-wide overcommitment.
 However, in avoiding this overcommitment, the ability to achieve high utilizations can be reduced.
 A reserved group of resources, such as processors, is not available for use by another workload when the workload bound to them is idle.

Projects and Tasks

Projects and tasks are used to label workloads and separate them from one another.

To optimize workload response, you must first be able to identify the workloads that are running on the system. This information can be difficult to obtain by using either a purely process-oriented or a user-oriented method alone. In the Solaris system, you have two additional facilities that can be used to separate and identify workloads: the project and the task.

The project provides a network-wide administrative identifier for related work.

The task collects a group of processes into a manageable entity that represents a workload component.

Projects

Projects are defined in a project database, which can be a local file (/etc/project), NIS, or LDAP. Each project record consists of the following fields:

 projname:projid:comment:user-list:group-list:attributes

The project identifier, projid, can be thought of as a workload tag equivalent to the user and group identifiers. A user or group can belong to one or more projects. Although a user must be assigned to a default project, the processes that the user launches can be associated with any of the projects of which that user is a member. The system follows ordered steps to determine the default project. If no default project is found, the user’s login, or request to start a process, is denied.

The system sequentially follows a set of documented steps to determine a user's default project.

The attributes field is specified as name=value pairs, where name is a resource control described in Available Resource Controls.
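For illustration, a hypothetical entry (all names and values here are invented) that assigns the user oracle and group dba to a project carrying a resource control attribute might look like:

```
user.oracle:100:Oracle workload:oracle:dba:project.max-lwps=(privileged,2000,deny)
```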

Tasks

Each successful login into a project creates a new task that contains the login process. The task is a collection of processes that represents a set of work over time. A task can also be viewed as a workload component. Each task is automatically assigned a task ID.

Each process is a member of one task, and each task is associated with one project.

A task is created whenever a project is joined. You can also create a task by using the newtask command, and you can move a running process into a new task.

All operations on process groups, such as signal delivery, are also supported on tasks. You can also bind a task to a processor set and set a scheduling priority and class for a task, which modifies all current and subsequent processes in the task.
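As a sketch (assuming a project named user.oracle exists), the newtask command can create tasks and move processes between them:

```shell
# Start a command in a new task under the user.oracle project;
# -v prints the ID of the newly created task.
newtask -v -p user.oracle sleep 60

# Move a running process (here, the current shell) into a
# new task in that project.
newtask -v -p user.oracle -c $$
```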

Process commands with options

Most process commands, such as id, ps, prstat, pgrep, and pkill, support options to display the task ID or project ID. See the man pages for details.
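A few illustrative invocations (the IDs shown are placeholders):

```shell
# Print the current user's default project.
id -p

# List processes together with their task and project IDs.
ps -e -o pid,taskid,project,comm

# Summarize activity per project (-J) or per task (-T).
prstat -J
prstat -T

# Match or signal processes by project or task ID.
pgrep -J 100
pkill -T 123
```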

Administering Projects

You can edit the /etc/project file manually or with the following commands: projects, projadd, projdel, and projmod.

Resource Controls

Resource Control Values and Privilege Levels

A threshold value on a resource control constitutes an enforcement point at which local actions can be triggered or global actions, such as logging, can occur.

Each threshold value on a resource control must be associated with a privilege level. The privilege level must be one of the following three types: basic, privileged, or system.

A resource control is guaranteed to have one system value.

Any number of privileged values can be defined, and only one basic value is allowed. Operations that are performed without specifying a privilege value are assigned a basic privilege by default.

You can use the prctl command to modify values that are associated with basic and privileged levels.

To display the default resource control values for the current shell:

 prctl $$

Global Actions on Resource Control Values

You can use the rctladm command to display and modify the global state of resource controls, for example to enable or disable the global syslog action.

For example,

 rctladm -e syslog project.max-locked-memory project.max-shm-memory 

Local Actions on Resource Control Values

Local actions are taken on a process that attempts to exceed the control value. For each threshold value that is placed on a resource control, you can associate one or more actions. There are three types of local actions: none, deny, and signal=. For example,

 projadd -K 'project.max-locked-memory=((priv,6442450944,none),(priv,7516192768,deny))' user.oracle
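A signal= action can be attached in the same way. The following sketch (thresholds are illustrative) sends SIGTERM when a task in the project reaches 100 LWPs and denies further LWP creation at 110:

```shell
projmod -s -K 'task.max-lwps=((privileged,100,signal=TERM),(privileged,110,deny))' user.oracle
```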

Resource Control Flags and Properties

Each resource control on the system has a certain set of associated properties. This set of properties is defined as a set of flags, which are associated with all controlled instances of that resource. Global flags cannot be modified, but they can be retrieved by using the rctladm command.

 rctladm
 
 process.max-port-events     syslog=off     [ deny count ]
 process.max-msg-messages    syslog=off     [ deny count ]
 process.max-msg-qbytes      syslog=off     [ deny bytes ]
 process.max-sem-ops         syslog=off     [ deny count ]
 process.max-sem-nsems       syslog=off     [ deny count ]
 process.max-address-space   syslog=off     [ lowerable deny no-signal bytes ]
 process.max-file-descriptor syslog=off     [ lowerable deny count ]
 process.max-core-size       syslog=off     [ lowerable deny no-signal bytes ]
 process.max-stack-size      syslog=off     [ lowerable deny no-signal bytes ]
 process.max-data-size       syslog=off     [ lowerable deny no-signal bytes ]
 process.max-file-size       syslog=off     [ lowerable deny file-size bytes ]
 process.max-cpu-time        syslog=off     [ lowerable no-deny cpu-time inf seconds ]
 task.max-cpu-time           syslog=off     [ no-deny cpu-time no-obs inf seconds ]
 task.max-lwps               syslog=off     [ count ]
 project.max-contracts       syslog=off     [ no-basic deny count ]
 project.max-device-locked-memory syslog=off     [ no-basic deny bytes ]
 project.max-locked-memory   syslog=notice  [ no-basic deny bytes ]
 project.max-port-ids        syslog=off     [ no-basic deny count ]
 project.max-shm-memory      syslog=notice  [ no-basic deny bytes ]
 project.max-shm-ids         syslog=notice  [ no-basic deny count ]
 project.max-msg-ids         syslog=notice  [ no-basic deny count ]
 project.max-sem-ids         syslog=notice  [ no-basic deny count ]
 project.max-crypto-memory   syslog=off     [ no-basic deny bytes ]
 project.max-tasks           syslog=off     [ no-basic count ]
 project.max-lwps            syslog=notice  [ no-basic count ]
 project.cpu-cap             syslog=off     [ no-basic deny no-signal inf count ]
 project.cpu-shares          syslog=n/a     [ no-basic no-deny no-signal no-syslog count ]
 zone.max-swap               syslog=off     [ no-basic deny bytes ]
 zone.max-locked-memory      syslog=off     [ no-basic deny bytes ]
 zone.max-shm-memory         syslog=off     [ no-basic deny bytes ]
 zone.max-shm-ids            syslog=off     [ no-basic deny count ]
 zone.max-sem-ids            syslog=off     [ no-basic deny count ]
 zone.max-msg-ids            syslog=off     [ no-basic deny count ]
 zone.max-lwps               syslog=off     [ no-basic count ]
 zone.cpu-cap                syslog=off     [ no-basic deny no-signal inf count ]
 zone.cpu-shares             syslog=n/a     [ no-basic no-deny no-signal no-syslog count ]

Local flags define the default behavior and configuration for a specific threshold value of that resource control on a specific process or a task. The local flags for one threshold value do not affect the behavior of other defined threshold values for the same resource control. However, the global flags affect the behavior for every value associated with a particular control. Local flags can be modified, within the constraints supplied by their corresponding global flags, by the prctl command. For example,

 prctl -n project.max-lwps -i project user.oracle
 prctl -s -t privileged -n project.max-lwps -v 1024 -e deny -i project user.oracle

Resource Control Enforcement

More than one resource control can exist on a resource. A resource control can exist at each containment level in the process model. If resource controls are active on the same resource at different container levels, the smallest container’s control is enforced first. Thus, action is taken on process.max-cpu-time before task.max-cpu-time if both controls are encountered simultaneously.

Fair Share Scheduler

The analysis of workload data can indicate that a particular workload or group of workloads is monopolizing CPU resources. If these workloads are not violating resource constraints on CPU usage, you can modify the allocation policy for CPU time on the system. The fair share scheduling class enables you to allocate CPU time based on shares instead of the priority scheme of the timesharing (TS) scheduling class.

The term “share” is used to define a portion of the system’s CPU resources that is allocated to a project. If you assign a greater number of CPU shares to a project, relative to other projects, the project receives more CPU resources from the fair share scheduler.

CPU shares are not equivalent to percentages of CPU resources. Shares are used to define the relative importance of workloads in relation to other workloads. When you assign CPU shares to a project, your primary concern is not the number of shares the project has. Knowing how many shares the project has in comparison with other projects is more important. You must also take into account how many of those other projects will be competing with it for CPU resources.

Processes in projects with zero shares always run at the lowest system priority (0). These processes only run when projects with nonzero shares are not using CPU resources.

Shares serve to limit CPU usage only when there is competition from other projects. Regardless of how low a project’s allocation is, it always receives 100 percent of the processing power if it is running alone on the system. Available CPU cycles are never wasted. They are distributed between projects.
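To make the arithmetic concrete, suppose user.oracle is assigned 20 shares and a hypothetical project user.batch is assigned 10. When both projects fully compete for CPU, each receives its shares divided by the total outstanding shares:

```shell
# Shares would be assigned via the project database, e.g.:
#   projmod -a -K 'project.cpu-shares=(privileged,20,none)' user.oracle
#   projmod -a -K 'project.cpu-shares=(privileged,10,none)' user.batch
# Under full competition, the CPU fraction is shares/total:
awk 'BEGIN {
  oracle = 20; batch = 10; total = oracle + batch
  printf "user.oracle: %.1f%%\n", 100 * oracle / total
  printf "user.batch:  %.1f%%\n", 100 * batch / total
}'
```

Running alone, either project would still receive 100 percent of the CPU, because shares only matter under contention.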

Users can be members of multiple projects that have different numbers of shares assigned. By moving processes from one project to another project, processes can be assigned CPU resources in varying amounts. You can also move processes from one scheduling class into another:

 priocntl -s -c FSS -i class TS

FSS and Processor Sets

The CPU allocations of projects running in one processor set are not affected by the CPU shares or activity of projects running in another processor set because the projects are not competing for the same resources. Projects only compete with each other if they are running within the same processor set.

Combining FSS With Other Scheduling Classes

By default, the FSS scheduling class uses the same range of priorities (0 to 59) as the timesharing (TS), interactive (IA), and fixed priority (FX) scheduling classes. Therefore, you should avoid having processes from these scheduling classes share the same processor set. A mix of processes in the FSS, TS, IA, and FX classes could result in unexpected scheduling behavior.

You can mix processes in the TS and IA classes in the same processor set, or on the same system without processor sets.

The Solaris system also offers a real-time (RT) scheduler to users with superuser privileges. By default, the RT scheduling class uses system priorities in a different range (usually from 100 to 159) than FSS. Because RT and FSS are using disjoint, or non-overlapping, ranges of priorities, FSS can coexist with the RT scheduling class within the same processor set. However, the FSS scheduling class does not have any control over processes that run in the RT class.

Setting the Scheduling Class for the System

You can use the dispadmin command to set the default scheduling class for the system:

 dispadmin -d FSS
 priocntl -s -c FSS

Physical Memory Control Using the Resource Capping Daemon

The resource capping daemon, rcapd, enables you to regulate physical memory consumption by processes running in projects that have the rcap.max-rss resource cap defined.

Like resource controls, resource caps can be defined by using attributes of project entries in the project database. However, while resource controls are synchronously enforced by the kernel, resource caps are asynchronously enforced at the user level by the resource capping daemon. With asynchronous enforcement, a small delay occurs as a result of the sampling interval used by the daemon. The sampling interval is specified by the administrator.

The daemon manages physical memory by regulating the size of a project workload’s resident set relative to the size of its working set. The resident set is the set of pages that are resident in physical memory. The working set is the set of pages that the workload actively uses during its processing cycle. The working set changes over time, depending on the process’s mode of operation and the type of data being processed. Ideally, every workload has access to enough physical memory to enable its working set to remain resident. However, when the working set does not fit in physical memory, secondary disk storage is used to hold the remainder.

To configure a system to use the rcapd daemon, use the rcapadm -E command or svcadm enable rcap.

To monitor the resource utilization of capped projects, use the rcapstat command.
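An end-to-end sketch (the cap value and interval settings are illustrative):

```shell
# Cap the project's resident set size at 10 GB.
projmod -s -K rcap.max-rss=10GB user.oracle

# Enable rcapd, setting the scan, sample, report, and
# configuration-reread intervals (in seconds).
rcapadm -E -i scan=15,sample=5,report=5,config=60

# Report capping statistics every 5 seconds, 10 times.
rcapstat 5 10
```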

Static and Dynamic Resource Pools

Resource pools enable you to separate workloads so that workload consumption of certain resources does not overlap. Resource pools provide a persistent configuration mechanism for processor set (pset) configuration and, optionally, scheduling class assignment.

A pool can be thought of as a specific binding of the various resource sets that are available on your system. You can create pools that represent different kinds of possible resource combinations. By grouping multiple partitions, pools provide a handle to associate with labeled workloads. Each project entry in the /etc/project file can have a single pool associated with that entry, which is specified using the project.pool attribute.

Dynamic resource pools provide a mechanism for dynamically adjusting each pool’s resource allocation in response to system events and application load changes. DRPs simplify and reduce the number of decisions required from an administrator. Adjustments are automatically made to preserve the system performance goals specified by an administrator. The changes made to the configuration are logged. These features are primarily enacted through the resource controller poold, a system daemon that should always be active when dynamic resource allocation is required. Periodically, poold examines the load on the system and determines whether intervention is required to enable the system to maintain optimal performance with respect to resource consumption.

Static and dynamic pools are represented by separate SMF services: pools and pools/dynamic (poold).

Static resource pools can be enabled and disabled with the pooladm command as well as with svcadm. The configuration is stored in /etc/pooladm.conf, which can be created and updated with the poolcfg command.

The /etc/pooladm.conf configuration file describes the static pools configuration. The kernel holds information about the disposition of resources within the resource pools framework. This is known as the dynamic configuration, and it represents the resource pools functionality for a particular system at a point in time. The dynamic configuration can be viewed by using the pooladm command. Modifications to the dynamic configuration are made either indirectly, by committing a static configuration file with pooladm -c, or directly, by using poolcfg with the -d option.
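Both paths appear in the examples later in this document; in brief:

```shell
# Indirectly: commit the static configuration file.
pooladm -c

# Directly: operate on the dynamic configuration with -d.
poolcfg -dc 'info'
```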

More than one static pools configuration file can exist, for activation at different times. You can alternate between multiple pools configurations by invoking pooladm from a cron job.

By default, the resource pools framework is not active. Resource pools must be enabled to create or modify the dynamic configuration. Static configuration files can be manipulated with the poolcfg command even if the resource pools framework is disabled, but they cannot be activated while the pools facility is inactive.

The project.pool attribute can be added to a project entry in the /etc/project file to associate a single pool with that entry. New work that is started on a project is bound to the appropriate pool. For example,

 projmod -a -K project.pool=mypool sales

Managing Dynamic Resource Pools

To enable poold daemon:

 svcadm enable pools/dynamic
 svcs pools/dynamic
 STATE          STIME    FMRI
 online          3:15:08 svc:/system/pools/dynamic:default

The following properties can be set:

 Property name                  Type    Category             Description
 system.poold.log-level         string  Configuration        Logging level
 system.poold.log-location      string  Configuration        Logging location
 system.poold.monitor-interval  uint64  Configuration        Monitoring sample interval
 system.poold.history-file      string  Configuration        Decision history location
 pset.max                       uint64  Constraint           Maximum number of CPUs for this processor set
 pset.min                       uint64  Constraint           Minimum number of CPUs for this processor set
 cpu.pinned                     bool    Constraint           CPUs pinned to this processor set
 system.poold.objectives        string  Objective            Formatted string following poold's objective expression syntax
 pset.poold.objectives          string  Objective            Formatted string following poold's expression syntax
 pool.importance                int64   Objective parameter  User-assigned importance

The pset.min and pset.max constraints place lower and upper limits on the number of processors that can be allocated to a processor set.

The cpu.pinned property indicates that a particular CPU should not be moved by DRP (dynamic resource pools) from the processor set in which it is located. You can set this property to maximize cache utilization for a particular application that is executing within a processor set.

The pool.importance property describes the relative importance of a pool as defined by the administrator.

A workload-dependent objective is an objective that will vary according to the nature of the workload running on the system. An example is the utilization objective. The utilization figure for a resource set will vary according to the nature of the workload that is active in the set.

A workload-independent objective is an objective that does not vary according to the nature of the workload running on the system. An example is the CPU locality objective. The evaluated measure of locality for a resource set does not vary with the nature of the workload that is active in the set.

Three types of objectives can be defined:

 Name         Valid elements  Operators  Values
 wt-load      system          N/A        N/A
 locality     pset            N/A        loose | tight | none
 utilization  pset            < > ~      0%-100%

All objectives take an optional importance prefix. The importance acts as a multiplier for the objective and thus increases the significance of its contribution to the objective function evaluation. The range is from 0 to INT64_MAX (9223372036854775807). If not specified, the default importance value is 1.

Some element types support more than one type of objective. An example is pset. You can specify multiple objective types for these elements. You can also specify multiple utilization objectives on a single pset element.
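For example, importance prefixes can weight one objective over another. In this sketch (the pset name and values are illustrative), locality is given twice the importance of utilization:

```shell
poolcfg -c 'modify pset pset_oracle (string pset.poold.objectives="5: utilization > 30; 10: locality tight")'
```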

The wt-load objective favors configurations that match resource allocations to resource utilizations. A resource set that uses more resources will be given more resources when this objective is active. wt-load means weighted load. Use this objective when you are satisfied with the constraints you have established using the minimum and maximum properties, and you would like the daemon to manipulate resources freely within those constraints.

The locality objective influences the impact that locality, as measured by locality group (lgroup) data, has upon the selected configuration. An alternate definition for locality is latency. An lgroup describes CPU and memory resources. The lgroup is used by the Solaris system to determine the distance between resources, using time as the measurement. This objective can take one of the following three values: tight, loose, or none.

In general, the locality objective should be set to tight. However, to maximize memory bandwidth or to minimize the impact of DR operations on a resource set, you could set this objective to loose or keep it at the default setting of none.

The utilization objective favors configurations that allocate resources to partitions that are not meeting the specified utilization objective. This objective is specified by using operators and values. The operators are < (less than), > (greater than), and ~ (about).

A pset can only have one utilization objective set for each type of operator. If the ~ operator is set, then the < and > operators cannot be set. If the < and > operators are set, then the ~ operator cannot be set. You can set both a < and a > operator together to create a range. The values will be validated to make sure that they do not overlap. For example,

 pset.poold.objectives "utilization > 30; utilization < 80; locality tight"

system.poold.monitor-interval specifies the monitoring sample interval in milliseconds. system.poold.log-level specifies the logging level. If this property is not specified, the default logging level is NOTICE. The levels are hierarchical: setting a log level of DEBUG causes poold to log all defined messages, while the INFO level provides a useful balance of information for most administrators.

The system.poold.log-location property is used to specify the location for poold logged output. You can specify a location of SYSLOG for poold output. If this property is not specified, the default location for poold logged output is /var/log/pool/poold. When poold is invoked from the command line, this property is not used. Log entries are written to stderr on the invoking terminal. If poold is active, the logadm.conf file includes an entry to manage the default file /var/log/pool/poold:

 /var/log/pool/poold -N -s 512k 
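To direct poold output to syslog instead of the default file, the property can be set with poolcfg (a sketch):

```shell
poolcfg -c 'modify system (string system.poold.log-location="SYSLOG")'
```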

poolbind can be used for the manual binding of projects, tasks, and processes to a resource pool.

While the pools facility is active, psrset can no longer be used to manage processor sets. Use pooladm or poolcfg instead.

poolstat can be used to monitor pool statistics and resource utilization.

Pools Examples

To enable the resource pools facility:

 pooladm -e
 svcs pools
 STATE          STIME    FMRI
 online          2:35:01 svc:/system/pools:default
 svcs pools/dynamic 
 STATE          STIME    FMRI
 disabled        4:34:52 svc:/system/pools/dynamic:default

To build the default resource pools and psets using all available system resources:

 pooladm -s
 poolcfg -c 'info'
 system default
        string  system.comment
        int     system.version 1
        boolean system.bind-default true
        string  system.poold.objectives wt-load
        pool pool_default
                int     pool.sys_id 0
                boolean pool.active true
                boolean pool.default true
                int     pool.importance 1
                string  pool.comment
                pset    pset_default
        pset pset_default
                int     pset.sys_id -1
                boolean pset.default true
                uint    pset.min 1
                uint    pset.max 65536
                string  pset.units population
                uint    pset.load 29
                uint    pset.size 96
                string  pset.comment
                cpu
                        int     cpu.sys_id 117
                        string  cpu.comment
                        string  cpu.status on-line
                cpu
                        int     cpu.sys_id 116
                        string  cpu.comment
                        string  cpu.status on-line
 ...

To create a pset

 poolcfg -c 'create pset pset_oracle (uint pset.min = 2; uint pset.max = 10)'

To create a pool

 poolcfg -c 'create pool pool_oracle'

To associate the pool with the pset above

 poolcfg -c 'associate pool pool_oracle (pset pset_oracle)'

To change the scheduler for the pool

 poolcfg -c 'modify pool pool_oracle (string pool.scheduler="FSS")'

To protect certain processors from DRP

 poolcfg -c 'modify cpu <cpuid> (boolean cpu.pinned = true)'

To define pset objectives

 poolcfg -c 'modify pset pset_oracle (string pset.poold.objectives="utilization > 20; utilization < 80; locality tight")'

To change the logging level

 poolcfg -c 'modify system (string system.poold.log-level="INFO")'

To dynamically transfer two CPUs from one pset to another

 poolcfg -dc 'transfer 2 from pset pset_default to pset_oracle'

or to transfer specific CPUs

 poolcfg -dc 'transfer to pset pset_oracle (cpu 0; cpu 2)'

To review the current config

 poolcfg -c 'info'

To validate the current config

 pooladm -n -c

To apply the current config

 pooladm -c

To bind a process to a pool

 poolbind -p pool_oracle <oracle_pid>

To verify the binding

 poolbind -q <oracle_pid>

To bind a project user.oracle to a pool pool_oracle dynamically

 poolbind -i project -p pool_oracle user.oracle

or statically

 projmod -a -K project.pool=pool_oracle user.oracle

To remove the current active configuration:

 pooladm -x

To monitor pools utilization

 poolstat -o id,pool,type,rid,rset,min,max,size,used,load -r all
 id pool                 type rid rset                  min  max size used load
  0 pool_default         pset  -1 pset_default            1  66K   96 0.00 0.04

For a more complex example, see Chapter 14, Resource Management Configuration Example, in Solaris Resource Management.