See Resource Management in the System Administration Guide: Solaris Containers-Resource Management and Solaris Zones
Below is an extract, mostly word for word, from the above document.
A workload is an aggregation of all processes of an application or group of applications.
Without resource management, all activities on a system are given equal access to resources. Solaris resource management features enable you to treat workloads individually, controlling how resources are allocated to each workload and monitoring how those resources are used.
The ability to minimize cross-workload performance compromises, along with the facilities that monitor resource usage and utilization, is referred to as resource management. Resource management is implemented through a collection of algorithms. The algorithms handle the series of capability requests that an application presents in the course of its execution.
Resource management facilities permit you to modify the default behavior of the operating system with respect to different workloads. Behavior primarily refers to the set of decisions that are made by operating system algorithms when an application presents one or more resource requests to the system.
By implementing resource management, you can achieve several business and technical goals.
When planning resource management, it is important to identify cooperating and conflicting workloads and create a configuration that presents the least compromise to the service goals of the business, within the limitations of the system’s capabilities.
Use resource management to ensure that your applications have the required response time.
Applications can be written to be aware of their resource constraints, but not all application writers will choose to do this. For an application that is not constraint-aware, hitting a constraint may prevent it from functioning.
Scheduling refers to making a sequence of allocation decisions at specific intervals. The decision that is made is based on a predictable algorithm. An application that does not need its current allocation leaves the resource available for another application's use. Scheduling-based resource management enables full utilization of an undercommitted configuration, while providing controlled allocations in a critically committed or overcommitted scenario. The underlying algorithm defines how the term “controlled” is interpreted. In some instances, the scheduling algorithm might guarantee that all applications have some access to the resource. The fair share scheduler (FSS), for example, manages application access to CPU resources in a controlled way.
Partitioning is used to bind a workload to a subset of the system's available resources. This binding guarantees that a known amount of resources is always available to the workload. The resource pools functionality enables you to limit workloads to specific subsets of the machine.
Configurations that use partitioning can avoid system-wide overcommitment. However, in avoiding this overcommitment, the ability to achieve high utilizations can be reduced. A reserved group of resources, such as processors, is not available for use by another workload when the workload bound to them is idle.
Projects and tasks are used to label workloads and separate them from one another.
To optimize workload response, you must first be able to identify the workloads that are running on the system. This information can be difficult to obtain by using either a purely process-oriented or a user-oriented method alone. In the Solaris system, you have two additional facilities that can be used to separate and identify workloads: the project and the task.
The project provides a network-wide administrative identifier for related work.
The task collects a group of processes into a manageable entity that represents a workload component.
Projects are defined in a project database, which can be a local file (/etc/project), NIS, or LDAP. Each project record consists of the following fields:
projname:projid:comment:user-list:group-list:attributes
The project identifier, projid, can be thought of as a workload tag equivalent to the user and group identifiers. A user or group can belong to one or more projects. Although a user must be assigned to a default project, the processes that the user launches can be associated with any of the projects of which that user is a member. The system follows ordered steps to determine the default project. If no default project is found, the user’s login, or request to start a process, is denied.
The system sequentially follows these steps to determine a user’s default project:
The attributes are specified as name=value pairs, where name is a resource control described in Available Resource Controls.
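For example, a hypothetical /etc/project entry for an oracle workload, following the projname:projid:comment:user-list:group-list:attributes layout (all names and values here are illustrative, not from the source document):

```
user.oracle:4001:Oracle workload:oracle::project.max-lwps=(privileged,1000,deny)
```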
Each successful login into a project creates a new task that contains the login process. The task is a collection of processes that represents a set of work over time. A task can also be viewed as a workload component. Each task is automatically assigned a task ID.
Each process is a member of one task, and each task is associated with one project.
A task is created whenever a project is joined. You can also create a task with the newtask command, or move a running process into a new task.
All operations on process groups, such as signal delivery, are also supported on tasks. You can bind a task to a processor set, or set a scheduling priority and class for a task, which modifies all current and subsequent processes in the task.
Most process commands, such as id, ps, prstat, pgrep, and pkill, support options to display the task ID or project ID. See the man pages for details.
You can edit the /etc/project file manually, or manage it with the projects, projadd, projdel, and projmod commands.
A threshold value on a resource control constitutes an enforcement point where local actions can be triggered or global actions, such as logging, can occur.
Each threshold value on a resource control must be associated with a privilege level. The privilege level must be one of three types: basic, privileged, or system.
A resource control is guaranteed to have one system value.
Any number of privileged values can be defined, and only one basic value is allowed. Operations that are performed without specifying a privilege value are assigned a basic privilege by default.
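The rules above can be illustrated with a small Python sketch. This is not a Solaris interface; it simply encodes the constraints stated in the text: exactly one system value, at most one basic value, any number of privileged values.

```python
# Validate a list of (privilege, value) threshold tuples for one resource
# control, per the rules above. Illustrative sketch only, not a Solaris API.
def validate_thresholds(thresholds):
    counts = {"basic": 0, "privileged": 0, "system": 0}
    for priv, _value in thresholds:
        if priv not in counts:
            raise ValueError(f"unknown privilege level: {priv}")
        counts[priv] += 1
    if counts["system"] != 1:
        return False          # every control has exactly one system value
    if counts["basic"] > 1:
        return False          # only one basic value is allowed
    return True

ok = validate_thresholds([("basic", 100), ("privileged", 500),
                          ("privileged", 800), ("system", 2**31)])
bad = validate_thresholds([("basic", 100), ("basic", 200), ("system", 2**31)])
```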
You can use the prctl command to modify values that are associated with basic and privileged levels.
To display the default resource control values for the current shell:
prctl $$
You can use the rctladm command to display or modify the global state of resource controls, for example to enable global syslog logging when a control's action is triggered:
rctladm -e syslog project.max-locked-memory project.max-shm-memory
Local actions are taken on a process that attempts to exceed the control value. For each threshold value that is placed on a resource control, you can associate one or more actions. There are three types of local actions: none, deny, and signal=. For example,
projadd -K 'project.max-locked-memory=(priv,6442450944,none)' -K 'project.max-locked-memory=(priv,7516192768,deny)' user.oracle
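To reason about which actions fire when a process exceeds a threshold, the following Python sketch (illustrative only, not Solaris code) walks the threshold values in ascending order. It mirrors the projadd example above: the 6 GB threshold has action none, the 7 GB threshold has action deny.

```python
# Given threshold values on one resource control, each with a local action
# ("none", "deny", or "signal=<SIG>"), report the actions triggered by a
# request for a given total amount. Illustrative sketch, not a Solaris API.
def triggered_actions(thresholds, requested):
    """thresholds: list of (limit, action); requested: total usage asked for."""
    fired = []
    for limit, action in sorted(thresholds):
        if requested > limit:
            fired.append(action)
    return fired

caps = [(6442450944, "none"), (7516192768, "deny")]
print(triggered_actions(caps, 7 * 2**30))   # exceeds only the 6 GB threshold
print(triggered_actions(caps, 8 * 2**30))   # exceeds both; "deny" fires
```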
Each resource control on the system has a set of associated global properties, expressed as flags that apply to all controlled instances of that resource. Global flags cannot be modified, but they can be retrieved by using the rctladm command.
rctladm
process.max-port-events          syslog=off     [ deny count ]
process.max-msg-messages         syslog=off     [ deny count ]
process.max-msg-qbytes           syslog=off     [ deny bytes ]
process.max-sem-ops              syslog=off     [ deny count ]
process.max-sem-nsems            syslog=off     [ deny count ]
process.max-address-space        syslog=off     [ lowerable deny no-signal bytes ]
process.max-file-descriptor      syslog=off     [ lowerable deny count ]
process.max-core-size            syslog=off     [ lowerable deny no-signal bytes ]
process.max-stack-size           syslog=off     [ lowerable deny no-signal bytes ]
process.max-data-size            syslog=off     [ lowerable deny no-signal bytes ]
process.max-file-size            syslog=off     [ lowerable deny file-size bytes ]
process.max-cpu-time             syslog=off     [ lowerable no-deny cpu-time inf seconds ]
task.max-cpu-time                syslog=off     [ no-deny cpu-time no-obs inf seconds ]
task.max-lwps                    syslog=off     [ count ]
project.max-contracts            syslog=off     [ no-basic deny count ]
project.max-device-locked-memory syslog=off     [ no-basic deny bytes ]
project.max-locked-memory        syslog=notice  [ no-basic deny bytes ]
project.max-port-ids             syslog=off     [ no-basic deny count ]
project.max-shm-memory           syslog=notice  [ no-basic deny bytes ]
project.max-shm-ids              syslog=notice  [ no-basic deny count ]
project.max-msg-ids              syslog=notice  [ no-basic deny count ]
project.max-sem-ids              syslog=notice  [ no-basic deny count ]
project.max-crypto-memory        syslog=off     [ no-basic deny bytes ]
project.max-tasks                syslog=off     [ no-basic count ]
project.max-lwps                 syslog=notice  [ no-basic count ]
project.cpu-cap                  syslog=off     [ no-basic deny no-signal inf count ]
project.cpu-shares               syslog=n/a     [ no-basic no-deny no-signal no-syslog count ]
zone.max-swap                    syslog=off     [ no-basic deny bytes ]
zone.max-locked-memory           syslog=off     [ no-basic deny bytes ]
zone.max-shm-memory              syslog=off     [ no-basic deny bytes ]
zone.max-shm-ids                 syslog=off     [ no-basic deny count ]
zone.max-sem-ids                 syslog=off     [ no-basic deny count ]
zone.max-msg-ids                 syslog=off     [ no-basic deny count ]
zone.max-lwps                    syslog=off     [ no-basic count ]
zone.cpu-cap                     syslog=off     [ no-basic deny no-signal inf count ]
zone.cpu-shares                  syslog=n/a     [ no-basic no-deny no-signal no-syslog count ]
Local flags define the default behavior and configuration for a specific threshold value of that resource control on a specific process or a task. The local flags for one threshold value do not affect the behavior of other defined threshold values for the same resource control. However, the global flags affect the behavior for every value associated with a particular control. Local flags can be modified, within the constraints supplied by their corresponding global flags, by the prctl command. For example,
prctl -n project.max-lwps -i project user.oracle
prctl -s -t privileged -n project.max-lwps -v 1024 -e deny -i project user.oracle
More than one resource control can exist on a resource. A resource control can exist at each containment level in the process model. If resource controls are active on the same resource at different container levels, the smallest container’s control is enforced first. Thus, action is taken on process.max-cpu-time before task.max-cpu-time if both controls are encountered simultaneously.
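The smallest-container-first rule can be sketched in Python (the containment levels are taken from the control names used in this document; this is an illustration, not a kernel interface):

```python
# Enforce the smallest container's control first: process before task,
# task before project, project before zone. Illustrative sketch only.
LEVEL_ORDER = {"process": 0, "task": 1, "project": 2, "zone": 3}

def enforcement_order(active_controls):
    """active_controls: control names such as 'process.max-cpu-time'."""
    return sorted(active_controls, key=lambda c: LEVEL_ORDER[c.split(".")[0]])

order = enforcement_order(["task.max-cpu-time", "process.max-cpu-time"])
# process.max-cpu-time is acted on before task.max-cpu-time
```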
The analysis of workload data can indicate that a particular workload or group of workloads is monopolizing CPU resources. If these workloads are not violating resource constraints on CPU usage, you can modify the allocation policy for CPU time on the system. The fair share scheduling class enables you to allocate CPU time based on shares instead of the priority scheme of the timesharing (TS) scheduling class.
The term “share” is used to define a portion of the system’s CPU resources that is allocated to a project. If you assign a greater number of CPU shares to a project, relative to other projects, the project receives more CPU resources from the fair share scheduler.
CPU shares are not equivalent to percentages of CPU resources. Shares are used to define the relative importance of workloads in relation to other workloads. When you assign CPU shares to a project, your primary concern is not the number of shares the project has. Knowing how many shares the project has in comparison with other projects is more important. You must also take into account how many of those other projects will be competing with it for CPU resources.
Processes in projects with zero shares always run at the lowest system priority (0). These processes only run when projects with nonzero shares are not using CPU resources.
Shares serve to limit CPU usage only when there is competition from other projects. Regardless of how low a project’s allocation is, it always receives 100 percent of the processing power if it is running alone on the system. Available CPU cycles are never wasted. They are distributed between projects.
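The arithmetic behind shares can be sketched in Python: a project's CPU entitlement is its shares divided by the total shares of all projects currently competing for CPU. This is an illustration of the rules above, not the FSS implementation; the project names and share counts are hypothetical.

```python
def cpu_entitlement(shares, active):
    """shares: {project: share count}; active: projects demanding CPU now.
    Returns each active project's fraction of CPU. Zero-share projects get
    CPU only when no nonzero-share project is active."""
    competing = {p: s for p, s in shares.items() if p in active and s > 0}
    total = sum(competing.values())
    if total == 0:                      # only zero-share projects are active
        n = len(active)
        return {p: 1.0 / n for p in active} if n else {}
    return {p: s / total for p, s in competing.items()}

shares = {"user.oracle": 3, "user.web": 1, "user.batch": 0}
print(cpu_entitlement(shares, {"user.oracle", "user.web"}))
# oracle gets 3/4 of the CPU, web gets 1/4
print(cpu_entitlement(shares, {"user.oracle"}))
# running alone, oracle gets 100% regardless of its share count
```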
Users can be members of multiple projects that have different numbers of shares assigned. By moving processes from one project to another project, processes can be assigned CPU resources in varying amounts. You can also move processes from one scheduling class into another:
priocntl -s -c FSS -i class TS
The CPU allocations of projects running in one processor set are not affected by the CPU shares or activity of projects running in another processor set because the projects are not competing for the same resources. Projects only compete with each other if they are running within the same processor set.
By default, the FSS scheduling class uses the same range of priorities (0 to 59) as the timesharing (TS), interactive (IA), and fixed priority (FX) scheduling classes. Therefore, you should avoid having processes from these scheduling classes share the same processor set. A mix of processes in the FSS, TS, IA, and FX classes could result in unexpected scheduling behavior.
You can mix processes in the TS and IA classes in the same processor set, or on the same system without processor sets.
The Solaris system also offers a real-time (RT) scheduler to users with superuser privileges. By default, the RT scheduling class uses system priorities in a different range (usually from 100 to 159) than FSS. Because RT and FSS are using disjoint, or non-overlapping, ranges of priorities, FSS can coexist with the RT scheduling class within the same processor set. However, the FSS scheduling class does not have any control over processes that run in the RT class.
You can use the dispadmin command to set the default scheduling class for the system:
dispadmin -d FSS
priocntl -s -c FSS
The resource capping daemon, rcapd, enables you to regulate physical memory consumption by processes running in projects that have the resource cap rcap.max-rss defined.
Like the resource control, the resource cap can be defined by using attributes of project entries in the project database. However, while resource controls are synchronously enforced by the kernel, resource caps are asynchronously enforced at the user level by the resource capping daemon. With asynchronous enforcement, a small delay occurs as a result of the sampling interval used by the daemon. The sampling interval is specified by the administrator.
The daemon manages physical memory by regulating the size of a project workload's resident set relative to the size of its working set. The resident set is the set of pages that are resident in physical memory. The working set is the set of pages that the workload actively uses during its processing cycle. The working set changes over time, depending on the process's mode of operation and the type of data being processed. Ideally, every workload has access to enough physical memory to enable its working set to remain resident. When it does not, secondary disk (swap) storage holds the portion that does not fit in physical memory.
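The asynchronous, sampled nature of this enforcement can be shown with a toy Python model (not rcapd's actual algorithm; project names and sizes are hypothetical): at each sampling interval, a project's resident set is compared to its rcap.max-rss and the excess is paged out.

```python
# Toy model of sampled RSS capping: once per sampling interval, any project
# whose resident set exceeds its cap has the excess "paged out". Between
# samples the RSS may exceed the cap -- the enforcement delay noted above.
def sample_and_cap(rss, caps):
    """rss: {project: resident set bytes}; caps: {project: rcap.max-rss}."""
    paged_out = {}
    for proj, resident in rss.items():
        cap = caps.get(proj)
        if cap is not None and resident > cap:
            paged_out[proj] = resident - cap
            rss[proj] = cap
    return paged_out

rss = {"user.oracle": 6 * 2**30, "user.web": 1 * 2**30}
caps = {"user.oracle": 4 * 2**30}
out = sample_and_cap(rss, caps)
# user.oracle is scaled back to its 4 GB cap; user.web is uncapped
```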
To configure a system to use the rcapd daemon, use the rcapadm -E command or svcadm enable rcap.
To monitor the resource utilization of capped projects, use the rcapstat command.
Resource pools enable you to separate workloads so that workload consumption of certain resources does not overlap. Resource pools provide a persistent configuration mechanism for processor set (pset) configuration and, optionally, scheduling class assignment.
A pool can be thought of as a specific binding of the various resource sets that are available on your system. You can create pools that represent different kinds of possible resource combinations. By grouping multiple partitions, pools provide a handle to associate with labeled workloads. Each project entry in the /etc/project file can have a single pool associated with that entry, which is specified using the project.pool attribute.
Dynamic resource pools provide a mechanism for dynamically adjusting each pool’s resource allocation in response to system events and application load changes. DRPs simplify and reduce the number of decisions required from an administrator. Adjustments are automatically made to preserve the system performance goals specified by an administrator. The changes made to the configuration are logged. These features are primarily enacted through the resource controller poold, a system daemon that should always be active when dynamic resource allocation is required. Periodically, poold examines the load on the system and determines whether intervention is required to enable the system to maintain optimal performance with respect to resource consumption.
Static and dynamic pools are represented by separate SMF services: svc:/system/pools and svc:/system/pools/dynamic (poold).
Static resource pools can be enabled and disabled with the pooladm command as well as with svcadm. The configuration is stored in /etc/pooladm.conf, which can be created and updated with the poolcfg command.
The /etc/pooladm.conf configuration file describes the static pools configuration. The kernel holds information about the disposition of resources within the resource pools framework. This is known as the dynamic configuration, and it represents the resource pools functionality for a particular system at a point in time. The dynamic configuration can be viewed by using the pooladm command. It can be modified indirectly, by committing a static configuration file with pooladm -c, or directly, by operating on the running configuration with poolcfg -d.
More than one static pools configuration file can exist, for activation at different times. You can alternate between multiple pools configurations by invoking pooladm from a cron job.
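One way to alternate configurations from cron is to copy a prepared configuration file over /etc/pooladm.conf and then commit it with pooladm -c. The schedule and file names below are hypothetical:

```
# Switch between daytime and nighttime pools configurations (illustrative)
0 6  * * 1-5  cp /etc/pooladm.day.conf /etc/pooladm.conf && /usr/sbin/pooladm -c
0 18 * * 1-5  cp /etc/pooladm.night.conf /etc/pooladm.conf && /usr/sbin/pooladm -c
```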
By default, the resource pools framework is not active. Resource pools must be enabled to create or modify the dynamic configuration. Static configuration files can be manipulated with the poolcfg command even if the resource pools framework is disabled; however, a static configuration file cannot be created from the running system (pooladm -s) unless the pools facility is active.
The project.pool attribute can be added to a project entry in the /etc/project file to associate a single pool with that entry. New work that is started on a project is bound to the appropriate pool. For example,
projmod -a -K project.pool=mypool sales
To enable the poold daemon:

svcadm enable pools/dynamic
svcs pools/dynamic
STATE          STIME    FMRI
online          3:15:08 svc:/system/pools/dynamic:default
The following properties can be set:
Property Name | Type | Category | Description
system.poold.log-level | string | Configuration | Logging level
system.poold.log-location | string | Configuration | Logging location
system.poold.monitor-interval | uint64 | Configuration | Monitoring sample interval
system.poold.history-file | string | Configuration | Decision history location
pset.max | uint64 | Constraint | Maximum number of CPUs for this processor set
pset.min | uint64 | Constraint | Minimum number of CPUs for this processor set
cpu.pinned | bool | Constraint | CPUs pinned to this processor set
system.poold.objectives | string | Objective | Formatted string following poold's objective expression syntax
pset.poold.objectives | string | Objective | Formatted string following poold's expression syntax
pool.importance | int64 | Objective parameter | User-assigned importance
The pset.min and pset.max constraints place lower and upper limits on the number of processors that can be allocated to a processor set.
The cpu.pinned property indicates that a particular CPU should not be moved by DRP (dynamic resource pools) from the processor set in which it is located. You can set this property to maximize cache utilization for a particular application that is executing within a processor set.
The pool.importance property describes the relative importance of a pool as defined by the administrator.
A workload-dependent objective is an objective that will vary according to the nature of the workload running on the system. An example is the utilization objective. The utilization figure for a resource set will vary according to the nature of the workload that is active in the set.
A workload-independent objective is an objective that does not vary according to the nature of the workload running on the system. An example is the CPU locality objective. The evaluated measure of locality for a resource set does not vary with the nature of the workload that is active in the set.
Three types of objectives can be defined:
Name | Valid Elements | Operators | Values
wt-load | system | N/A | N/A
locality | pset | N/A | loose, tight, none
utilization | pset | < > ~ | 0-100%
All objectives take an optional importance prefix. The importance acts as a multiplier for the objective and thus increases the significance of its contribution to the objective function evaluation. The range is from 0 to INT64_MAX (9223372036854775807). If not specified, the default importance value is 1.
Some element types support more than one type of objective. An example is pset. You can specify multiple objective types for these elements. You can also specify multiple utilization objectives on a single pset element.
The wt-load objective favors configurations that match resource allocations to resource utilizations. A resource set that uses more resources will be given more resources when this objective is active. wt-load means weighted load. Use this objective when you are satisfied with the constraints you have established using the minimum and maximum properties, and you would like the daemon to manipulate resources freely within those constraints.
The locality objective influences the impact that locality, as measured by locality group (lgroup) data, has upon the selected configuration. An alternate definition for locality is latency. An lgroup describes CPU and memory resources. The lgroup is used by the Solaris system to determine the distance between resources, using time as the measurement. This objective can take one of three values: tight, loose, or none.
In general, the locality objective should be set to tight. However, to maximize memory bandwidth or to minimize the impact of DR operations on a resource set, you could set this objective to loose or keep it at the default setting of none.
The utilization objective favors configurations that allocate resources to partitions that are not meeting the specified utilization objective. This objective is specified by using operators and values. The operators are < (less than), > (greater than), and ~ (about).
A pset can only have one utilization objective set for each type of operator. If the ~ operator is set, then the < and > operators cannot be set. If the < and > operators are set, then the ~ operator cannot be set. You can set both a < and a > operator together to create a range. The values will be validated to make sure that they do not overlap. For example,
pset.poold.objectives "utilization > 30; utilization < 80; locality tight"
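The consistency rules can be sketched as a small Python validator for a pset objective string like the one above. This is an illustration of the stated rules, not poold's actual parser:

```python
import re

# Validate utilization objectives in a pset objective string per the rules
# above: "~" may not be mixed with "<" or ">", at most one objective per
# operator type, and a "<"/">" pair must form a non-overlapping range.
def validate_utilization(objectives):
    ops = {}
    for m in re.finditer(r"utilization\s*([<>~])\s*(\d+)", objectives):
        op, val = m.group(1), int(m.group(2))
        if op in ops:
            return False            # one objective per operator type
        ops[op] = val
    if "~" in ops and ("<" in ops or ">" in ops):
        return False                # "~" cannot be combined with < or >
    if "<" in ops and ">" in ops and ops[">"] >= ops["<"]:
        return False                # "> 30; < 80" must be a valid range
    return True

ok = validate_utilization("utilization > 30; utilization < 80; locality tight")
bad = validate_utilization("utilization ~ 50; utilization > 30")
```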
system.poold.monitor-interval specifies a value in milliseconds. system.poold.log-level specifies the logging level; if this property is not specified, the default is NOTICE. The levels are hierarchical: setting a log level of DEBUG causes poold to log all defined messages, while the INFO level provides a useful balance of information for most administrators.
The system.poold.log-location property is used to specify the location for poold logged output. You can specify a location of SYSLOG for poold output. If this property is not specified, the default location for poold logged output is /var/log/pool/poold. When poold is invoked from the command line, this property is not used. Log entries are written to stderr on the invoking terminal. If poold is active, the logadm.conf file includes an entry to manage the default file /var/log/pool/poold:
/var/log/pool/poold -N -s 512k
poolbind can be used for the manual binding of projects, tasks, and processes to a resource pool.
Once the pools facility is enabled, psrset can no longer be used for managing processor sets. Use pooladm or poolcfg instead.
poolstat can be used to monitor pool statistics and resource utilization.
To enable the resource pools facility:

pooladm -e
svcs pools
STATE          STIME    FMRI
online          2:35:01 svc:/system/pools:default

svcs pools/dynamic
STATE          STIME    FMRI
disabled        4:34:52 svc:/system/pools/dynamic:default
To build the default resource pools and psets using all available system resources:
pooladm -s
poolcfg -c 'info'
system default
        string  system.comment
        int     system.version 1
        boolean system.bind-default true
        string  system.poold.objectives wt-load

        pool pool_default
                int     pool.sys_id 0
                boolean pool.active true
                boolean pool.default true
                int     pool.importance 1
                string  pool.comment
                pset    pset_default

        pset pset_default
                int     pset.sys_id -1
                boolean pset.default true
                uint    pset.min 1
                uint    pset.max 65536
                string  pset.units population
                uint    pset.load 29
                uint    pset.size 96
                string  pset.comment

                cpu
                        int     cpu.sys_id 117
                        string  cpu.comment
                        string  cpu.status on-line

                cpu
                        int     cpu.sys_id 116
                        string  cpu.comment
                        string  cpu.status on-line
...
To create a pset
poolcfg -c 'create pset pset_oracle (uint pset.min = 2; uint pset.max = 10)'
To create a pool
poolcfg -c 'create pool pool_oracle'
To associate the pool with the pset above
poolcfg -c 'associate pool pool_oracle (pset pset_oracle)'
To change the scheduler for the pool
poolcfg -c 'modify pool pool_oracle (string pool.scheduler="FSS")'
To protect certain processors from DRP
poolcfg -c 'modify cpu <cpuid> (boolean cpu.pinned = true)'
To define pset objectives
poolcfg -c 'modify pset pset_oracle (string pset.poold.objectives="utilization > 20; utilization < 80; locality tight")'
To change the logging level
poolcfg -c 'modify system (string system.poold.log-level="INFO")'
To dynamically transfer two CPUs from one pset to another
poolcfg -dc 'transfer 2 from pset pset_default to pset_oracle'
or to transfer specific CPUs
poolcfg -dc 'transfer to pset pset_oracle (cpu 0; cpu 2)'
To review the current config
poolcfg -c 'info'
To validate the current config
pooladm -n -c
To apply the current config
pooladm -c
To bind a process to a pool
poolbind -p pool_oracle <oracle_pid>
To verify the binding
poolbind -q <oracle_pid>
To bind a project user.oracle to a pool pool_oracle dynamically
poolbind -i project -p pool_oracle user.oracle
or statically
projmod -a -K project.pool=pool_oracle user.oracle
To remove the current active configuration:
pooladm -x
To monitor pools utilization
poolstat -o id,pool,type,rid,rset,min,max,size,used,load -r all
 id pool            type rid rset            min  max size used load
  0 pool_default    pset  -1 pset_default      1  66K   96 0.00 0.04
For a more complex example, see Chapter 14, Resource Management Configuration Example, in Solaris Resource Management.