====== Cache management utility ======
CCU - Cache Control Utility

CCU is designed to load data into the cache, manage its content and collect statistics. The utility supports a flexible configuration that covers all of the parameters described below.

====== CCU configuration ======

The main configuration file is written in [[https://yaml.org/|YAML]] format.

===== YAML format =====
The basic elements of the [[https://yaml.org/|YAML]] format are:
  * "name: value" pairs, for example:\\ ''port: 1600''
  * value sequences, for example:\\ ''- "08:00-13:00"''

It is possible to create complex data structures based on these basic elements.

One of the principles for describing complex structures, laid down in the [[https://yaml.org/|YAML]] format, is that nesting is expressed by indentation. For example, the fragment
<code bash>
#
# Time classes
time_classes:
  default: offpeak
  # All other items define the names of time classes and the rules for their definition
  # Each rule starts with the name of a day category and contains
  # the list of time ranges in the form "HH:MM-HH:MM"

  peak:
    workdays:
      - "08:00-13:00"
      - "17:00-23:00"
    weekend_eve:
      - "08:00-13:00"
      - "17:00-23:59"
    weekend:
      - "10:00-15:00"
      - "17:00-23:00"
    holidays:
      - "10:00-15:00"
      - "17:00-23:00"
</code>

can be treated as the ''time_classes'' element, which contains the ''default'' parameter and the nested ''peak'' element; ''peak'', in turn, contains the rules for the ''workdays'', ''weekend_eve'', ''weekend'' and ''holidays'' day categories.

Given the above, the main rule when editing [[https://yaml.org/|YAML]] files is to handle indentation carefully: the nesting of an element is determined solely by its indentation, so indent nested elements consistently and use spaces, never tabs.
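As a minimal illustration of why indentation matters, the two fragments below differ only in the indentation of ''port'', yet describe different structures:

<code bash>
# "port" is nested inside "listener":
listener:
  host: 127.0.0.1
  port: 1600

# "port" is a sibling of "listener" (almost certainly a configuration error):
listener:
  host: 127.0.0.1
port: 1600
</code>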

===== Conventions to specify the parameters =====
  * To specify the name of a particular field in a complex nested structure, the full field name is used: an enumeration of all the field names from the root to the field, separated by slashes, for example: ''load/online/analyzing/parallel_workers''.
  * To specify the name of a user-defined element, or the name of an element that refers to another element, a symbolic element name is given between the less-than and greater-than signs, for example: ''load/online/exporters/<exporter_name>''.
  * Unless otherwise specified, the full name is given relative to the root of the configuration file, for example: ''statistics/collectors/<collector_name>/port'' (see the sketch below).
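For instance, the full name ''logging/levels/load'' denotes the following nesting (a fragment of the logging configuration shown later on this page):

<code bash>
logging:
  levels:
    load: "info"   # full name: logging/levels/load
</code>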

===== Parameters =====
==== Working directories ====
The following two parameters specify the directories where the files with the identifiers of the running processes (''pid_files_path'') and the working files (''work_files_path'') are stored:
<code bash>
#
# PID files path
pid_files_path: /var/run/ccu

#
# Work files directory
work_files_path: /var/lib/ccu
</code>

==== Events ====
When certain events occur, **ccu** can perform actions specified by the user. The actions to be executed must be written according to the rules of the command-line interpreter.\\
Only one event is currently supported: the creation of a text file containing information about the objects stored in the caches. To set the response to this event, define the ''events/on_after_enumeration_creation'' parameter:

<code bash>
#
# Events
events:
  # Command which should be executed after the caches' enumeration file is created
  on_after_enumeration_creation:
</code>
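A hypothetical handler: the script path below is an assumption, not part of the product; such a command would be invoked each time the enumeration file is (re)created:

<code bash>
events:
  # hypothetical example: run a user-supplied script
  on_after_enumeration_creation: "/usr/local/bin/publish_enumeration.sh"
</code>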

==== Logging ====
The logging options include the path to the log files, as well as the levels of the messages to be written to the file. The most detailed journaling level is //''debug''//, at which all diagnostic messages are written to the log.\\
The logging levels are specified separately for each **ccu** command.

<code bash>
#
# Logging parameters
logging:
  path: /var/log/ccu
  levels:
    # Possible levels are:
    #   "debug", "info", "warning", "error"
    # default logging level is "info"
    load:    "info"
    purge:   "info"
    remove:  "info"
    online:  "info"
    monitor: "info"
</code>

==== Statistics collection ====
The fields of the ''statistics'' element describe the statistics collectors: each element under ''statistics/collectors'' defines a named collector, and its ''host'' and ''port'' fields specify the address to which the statistics are sent.

<code bash>
#
# Statistics parameters
statistics:
  collectors:
    local:
      host: 127.0.0.1   # collector address (example)
      port: 9996        # collector port (example)
</code>

==== Day categories and time classes ====
Day categories and time classes are used to specify the restrictions for loading data into the cache.
To specify a day category, create the ''day_categories/<category_name>'' element whose value is the list of days belonging to that category.
A day-list element can be either the name of a day of the week, or a partially or fully defined date.
When determining the category of a specific date, the more specific definitions are examined first, and the more general ones after that. For example, if ''05-09'' is listed under ''holidays'', that date is treated as a holiday even when it falls on a workday.

<code bash>
#
# Day categories
day_categories:
  # Items under "day_categories" define the names of day categories
  # To define a day category it is necessary to specify a list of values,
  # a value can be one of the following:
  #   - a name of a day of the week:
  #       "Monday" ... "Sunday"
  #   - a fully defined date:
  #       "YYYY-MM-DD"
  #   - a partially defined date:
  #       "MM-DD"

  workdays:    ["Monday", "Tuesday", "Wednesday", "Thursday"]
  weekend_eve: ["Friday"]
  weekend:     ["Saturday", "Sunday"]

  holidays:
    - "01-01"
    - "01-07"
    - "02-23"
    - "03-08"
    - "05-01"
    - "05-09"
    - "06-12"
    - "11-04"
</code>

Time classes divide the day of a certain category into time intervals. To specify a time class, create the ''time_classes/<class_name>'' element; the ''time_classes/default'' parameter names the class used for all times of day not covered by any rule.
\\
In the example below, the ''peak'' time class is defined by explicit rules for each day category, and all remaining time falls into the default class:

<code bash>
#
# Time classes
time_classes:
  default: offpeak
  # All other items define the names of time classes and the rules for their definition
  # Each rule starts with the name of a day category and contains
  # the list of time ranges in the form "HH:MM-HH:MM"

  peak:
    workdays:
      - "08:00-13:00"
      - "17:00-23:00"

    weekend_eve:
      - "08:00-13:00"
      - "17:00-23:59"

    weekend:
      - "10:00-15:00"
      - "17:00-23:00"

    holidays:
      - "10:00-15:00"
      - "17:00-23:00"
</code>
==== Jobs ====
The job parameters are specified in the nested ''monitor'', ''load'' and ''scan'' elements:
  * ''monitor'' - monitoring of the processes operating with the cache and of the system;
  * ''load'' - loading data into the cache (offline and online);
  * ''scan'' - scanning the cache contents.

Further on, the full parameter names are given relative to the corresponding job element.

=== Monitoring ===
For the monitoring job you should specify the parameters of incoming connections (''listener/host'' and ''listener/port'').
\\
In other words, the ''monitor'' job listens on the specified address and port and reports the gathered statistics over the accepted connections.
\\
The monitoring job, in addition to gathering statistics of the processes operating with the cache, also gathers system information on the network interfaces. In order to limit the list of network interfaces used to gather statistics, specify them as a list in the ''network_interfaces'' parameter.

<code bash>
monitor:
  listener:
    # incoming/management connection parameters
    host: 127.0.0.1
    port: 1600

  network_interfaces:
    # list of the network interface names to be monitored;
    # if the list is empty, all interfaces except "lo" are used
    -
</code>
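For example, to restrict the gathering of interface statistics to two specific interfaces (the names are illustrative):

<code bash>
monitor:
  network_interfaces:
    - eth0
    - eth1
</code>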

=== Load ===
The load parameters are divided into generic ones, which are specified as fields of the ''load'' element itself, and the parameters of the offline and online loading, which are specified in the nested ''offline'' and ''online'' elements.

Generic load parameters:
<code bash>
#
# Load jobs parameters
load:
  ip_binding:
    # list of local IP addresses
    # to bind to while loading
    -

  ignored_clients:
    # list could be defined internally in the configuration file as a list of CIDRs
    # under the key "cidr_list" or externally in files listed under "cidr_files";
    # if a list is defined externally in a file, each line of that file must contain
    # only one CIDR
    cidr_list:
      -
    cidr_files:
      -

  rate_limits:
    # value can be one of the following:
    #   - a number with an optional suffix (k, m, g)
    #   - "unlimited"
    offpeak: unlimited
    peak: 1m

  # ... skipping ...
</code>

The ''ip_binding'' parameter lists the local IP addresses to be used as the source addresses of the loading connections.
\\
The ''ignored_clients'' parameter lists the clients (as CIDRs) whose requests are not taken into account; the list can be given inline (''cidr_list'') or in external files (''cidr_files''), one CIDR per line - see the example below.
\\
The ''rate_limits'' parameter sets the data loading rate limit separately for each time class.
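A minimal example of an external file referenced from ''cidr_files'' (the path is hypothetical); as the comment above states, each line contains exactly one CIDR:

<code bash>
# /etc/dpi/ccu_ignored_clients.cidr (hypothetical path)
10.0.0.0/8
192.168.0.0/16
</code>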

Offline-load parameters:
<code bash>
#
# Load jobs parameters
load:
  # ... skipping ...
  offline:
    parallel_workers: 4  # number of parallel loading processes
    job_awaiting_time: 60
      # if after this time not a single job is in the queue,
      # the worker finishes its execution
</code>

Online-load parameters can be divided into the following logical groups:
  * descriptions of the IPFIX stream sources;
  * description of the input IPFIX flow analysis;
  * description of the requested objects collectors;
  * description of the data loading processes.

Descriptions of the IPFIX stream sources are specified in the ''online/exporters'' element:
<code bash>
#
# Load jobs parameters
load:
  # ... skipping ...

  online:
    exporters:
      main:
        queue_size: 1000 # optional, max number of messages from the exporter that are sent to the analyzing queue,
                         # must be between 1 and 100000, default: 1000

        host: 127.0.0.1
        port: 1500
        protocol: tcp

        information_elements:
          # names are reserved, values must be in form "…"
          timestamp:
          host:
          path:
          login:
          source_ip4:
          destination_ip4:
          referal:
          user_agent:
          cookie:

  # ... skipping ...
</code>

To create a description of an IPFIX stream source, create the ''exporters/<exporter_name>'' element and specify in it:
  * ''host'', ''port'' and ''protocol'' - the connection parameters of the stream source;
  * ''information_elements'' - the mapping of the reserved field names to the identifiers of the information elements in the incoming IPFIX stream.

The input IPFIX flow analysis description is specified in the ''online/analyzing'' element:
<code bash>
#
# Load jobs parameters
load:
  # ... skipping ...

  online:
    # ... skipping ...

    analyzing:
      # analyzing processes parse the IPFIX messages
      # and bind the URLs received from the exporters to the caches
      # each analyzing process sends valid URLs to the collecting queue
      parallel_workers: 2
      queue_size: 1000  # max number of messages from one analyzing process
                        # which are sent to the collecting queue, must be between 1 and 100000,
                        # default: 1000

  # ... skipping ...
</code>

When specifying the analyzer parameters, you should focus primarily on the number of parallel processes (''parallel_workers''), matching it to the intensity of the incoming IPFIX stream.

The description of the requested objects collectors is specified in the ''online/collectors'' element:
<code bash>
#
# Load jobs parameters
load:
  # ... skipping ...

  online:
    # ... skipping ...

    collectors:
      default:
        # the collector accumulates events of object requests
        # and distributes them into time windows
        # each event could hit no more than 1 window according to the event's time
        # the time window is a period of time which defines an interval of time in the past
        # it is possible to have many time windows, in this case each window is
        # connected to the previous one

        slots: 24
        window: "1h"

      week_by_4_hours:
        slots: 42
        window: "4h"

  # ... skipping ...
</code>

The collector of requested objects is the main factor in deciding whether to start downloading an object. Logically, a collector is a set of counters over sliding time intervals stretching back from the present. When an object is requested, the counter of the interval the request falls into is incremented, and as soon as the total value of all the object's counters reaches the threshold specified in the [[#Cache description|cache description]], the object is queued for loading.
\\
In order to create a collector description, create the ''collectors/<collector_name>'' element and specify in it:
  * ''slots'' - the number of time windows;
  * ''window'' - the duration of a single time window.

In the example above two collectors are described: ''default'' and ''week_by_4_hours''. The latter covers the last week in 4-hour steps: 42 slots of 4 hours each add up to 7 days. A collector is referenced by name from the [[#Cache description|cache description]]; a further example is sketched below.
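A hedged illustration: a collector covering the last day at one-hour resolution could be declared as follows. The name ''day_by_hours'' is hypothetical, and the ''"1h"'' duration format is assumed by analogy with ''week_by_4_hours'':

<code bash>
load:
  online:
    collectors:
      day_by_hours:
        slots: 24      # 24 windows ...
        window: "1h"   # ... of 1 hour each = the last 24 hours
</code>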

The description of the data loading processes is specified in the ''online/loading'' element:
<code bash>
#
# Load jobs parameters
load:
  # ... skipping ...

  online:
    # ... skipping ...

    loading:
      parallel_workers: 4
      queue_size: 100   # max number of URLs from the collecting process
                        # which are sent to the loading queue, must be between "unbuffered_queue_size" and 100000,
                        # default: 100
      unbuffered_queue_size: 10
        # while the loading queue contains fewer URLs than this value, the
        # collecting process will put them into the loading queue immediately. The messages will be buffered
        # and ordered before putting into the loading queue according to the URLs' weight;
        # must be between 0 and "queue_size"

  # ... skipping ...
</code>

When specifying the load process parameters, pay attention to the number of parallel processes. Do not significantly increase the queue size (the ''queue_size'' parameter): a long queue only postpones the loading of the collected URLs without increasing throughput.

=== Scanning ===
Scanning is an auxiliary job that is started when loading and when deleting data from the cache.

<code bash>
#
# Scanning jobs parameters
scan:
  workers:
    # each scanning process sends information about found items to the scanning queue
    parallel_workers: 2
    job_queue_size: 5000
      # max number of jobs in the scanning queue,
      # default: 5000
    result_queue_size: 100000
      # max number of messages from the scanning processes
      # which are sent to the result scanning queue, must be between 1000 and 1000000,
      # default: 100000
</code>

The functions performed by the scan job include the following:
  * initial directory scanning when the load process starts or when old objects are removed;
  * checking the lifetime of an object in the cache and removing it, if necessary;
  * removal of objects on request from the control process;
  * checking the compliance of an object stored in the cache with its original.

==== Storage ====
The storage parameters are specified in the ''storage_parameters'' element. They consist of two groups:
  * general storage characteristics - the ''general'' element;
  * the descriptions of the individual caches - the ''caches'' element.

The general parameters specify the path to the storage root, as well as the maximum size that can be allocated to the objects of all the caches:
<code bash>
storage_parameters:
  general:
    path: "/storage"
    max_size: 1Tb
      # maximum data size for all caches, possible suffixes are:
      # Kb, Mb, Gb, Tb, Pb or without trailing "b" (K, M, G, T, P)
</code>

The parameters of a particular cache are specified in the ''caches/<cache_name>'' element; the fragment
<code bash>
storage_parameters:
  # ... skipping ...

  caches:
    youtube.com:
      is_enabled: true  # if the cache is disabled, it
                        # will not be used for loading data

  # ... skipping ...
</code>
specifies the [[#Cache description|cache description]] using the ''youtube.com'' element.

=== Cache description ===
The parameter names below are given relative to the ''caches/<cache_name>'' element.
\\
The ''is_enabled'' parameter allows to temporarily disable a particular cache: a disabled cache is not used for loading data.

== Statistics ==
The cache statistics parameters are specified in the ''statistics'' element:
<code bash>
statistics:
  group: youtube
    # all statistics in the same group are aggregated together
  collector: local
</code>

The ''group'' parameter assigns the cache to a statistics group (all statistics within one group are aggregated together), and the ''collector'' parameter refers, by name, to one of the collectors defined in ''statistics/collectors''.

== Online-loading ==
For online-loading the following cache parameters are specified:
<code bash>
online:
  collector: default
    # reference to one of the collectors defined in "load/online/collectors"
    # for online processing
  validating:
    interval: "1d"
      # interval of validation of the cached objects;
      # could be used only for caches with the "general" algorithm
</code>

The ''collector'' parameter refers, by name, to one of the requested objects collectors defined in ''load/online/collectors''; the ''validating/interval'' parameter sets how often a cached object is re-checked against its original.

== Loading ==
The following parameters are used to specify the load rules:
<code bash>
loading:
  algorithm: youtube.com
  required_weight: 3

  urls:
    matching:
      # - the "key" is a regular expression which the requested URL is matched against;
      # - the "weight" defines the weight added to the object's counters
      #   when the requested URL matches the rule;
      # - the "sources" define the URLs used to download the object;
      # all items under "matching" are processed in the order of definition;
      # the "sources" URLs will not be rewritten.
      # it is necessary to have at least one URL in sources.
      # It is possible to use the RE's capture groups in the key and in the target's sources.
      # See RE syntax on https://…

      - '<regular expression>':
          weight: 1
          sources:
            - '<source URL>'
            - '<source URL>'

    ignoring:
      -  # default: all URLs matched by the "matching" rules will be processed

    loadable_rejecting:
      # list of REs for rejecting the content
      # which is obtained according to a particular URL
      # default: all "loadable" URLs will be processed
      # See RE syntax on https://…

      - "<regular expression>"
</code>

  * The ''algorithm'' parameter selects the algorithm used to download objects. The possible values include:
    * //youtube.com// - to download video files from youtube.com;
    * //vk.com// - to download video files from vk.com;
    * //general// - the general algorithm for loading the source URL.
  * The ''required_weight'' parameter sets the threshold weight: once the total value of the object's counters in the collector reaches it, the object is queued for loading.
  * The ''urls/matching'' parameter is the list of URL rewriting rules; for every rule:
    * the rule's key is a regular expression that the requested URL is matched against;
    * ''weight'' is the weight added to the object's counters when the URL matches;
    * ''sources'' is the list of URLs (at least one) used to download the object; capture groups of the key's regular expression may be substituted into them.
  * The ''urls/ignoring'' parameter is the list of regular expressions for requested URLs that should be ignored; by default all URLs matched by the matching rules are processed.
  * The ''urls/loadable_rejecting'' parameter is the list of regular expressions used to reject the content obtained for a particular URL; by default all loadable URLs are processed. A sketch of a concrete matching rule is shown below.
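A hedged sketch of a concrete matching rule. The regular expression, the source URL and the ''$1'' capture-group substitution syntax are assumptions made for illustration, as is the rule form in which the YAML key of the item is the regular expression itself:

<code bash>
loading:
  urls:
    matching:
      # hypothetical rule: capture the request path ($1)
      # and substitute it into the source URL
      - 'https?://video\.example\.com/(.+)':
          weight: 1
          sources:
            - 'https://origin.example.com/$1'
</code>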

== Storage ==
The following parameters are used to specify the rules for storing objects:
<code bash>
storage:
  path: "youtube.com"
  levels: "1:2"
    # the cache directory tree is built from the leading characters
    # of the md5 sum of the file name
    # possible values:
    #   "1"
    #   "2"
    #   "1:1"
    #   "1:2"
    #   "2:1"
    #   "2:2"
    #   "1:1:1"
    #   "2:2:2"
  max_size: 100Gb
    # maximum data size, possible suffixes are:
    # Kb, Mb, Gb, Tb, Pb or without trailing "b" (K, M, G, T, P)
  expiry_time: 30d
    # default in seconds (without suffix)
    # possible suffixes are:
    #   m - minutes
    #   h - hours
    #   d - days
</code>

  * The ''path'' parameter sets the directory of the cache relative to the storage root (''storage_parameters/general/path'').
  * The ''levels'' parameter defines the depth and branching of the cache directory tree; see the sketch after this list.
  * The ''max_size'' parameter limits the total size of the objects stored in the cache.
  * The ''expiry_time'' parameter sets the lifetime of an object in the cache; expired objects are removed by the scanning job.
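A hedged sketch of how ''levels'' shapes object paths, assuming a squid/nginx-like layout in which each level value is the number of consecutive characters of the md5 sum used as a directory name (the file name and digest below are invented):

<code bash>
# compute the md5 sum of an object's file name:
echo -n 'somefile.mp4' | md5sum
# suppose the digest starts with "ab3f"; with levels "1:2"
# the object would then be stored under:
#   <storage root>/youtube.com/a/b3/somefile.mp4
#     level 1: first hex character    -> "a"
#     level 2: next 2 hex characters  -> "b3"
</code>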

== Constraints ==
For cache objects you can set constraints; an object that violates them is considered invalid and is not kept in the cache:
<code bash>
constraints:
  min_file_size: 1Mb
    # minimum file size, possible suffixes are:
    # Kb, Mb, Gb, Tb, Pb or without trailing "b" (K, M, G, T, P)
  max_file_size: 2Gb
    # maximum file size, possible suffixes are:
    # Kb, Mb, Gb, Tb, Pb or without trailing "b" (K, M, G, T, P)

  # optional, command which should be executed after file loading;
  # if the command returns a non-zero result, the loaded file will be assumed invalid and will be removed;
  # the next variables could be used in the command:
  #   …
  post_load_validation:
</code>
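A hypothetical post-load validator. The script path and the ''%path%'' placeholder are assumptions (the actual substitution variable names are elided on this page), and ''ffprobe'' serves only as an example check:

<code bash>
constraints:
  # hypothetical: %path% stands in for the elided substitution variable
  post_load_validation: "/usr/local/bin/validate_media.sh %path%"
</code>

The validator itself could then be as simple as:
<code bash>
#!/bin/bash
# /usr/local/bin/validate_media.sh (hypothetical)
# exits non-zero if the downloaded file is not a readable media file
ffprobe -v error "$1" >/dev/null 2>&1
</code>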

==== SSD caching ====
It is possible to cache the most frequently requested storage objects on an SSD; to do so, define the following parameters:

<code bash>
#
# SSD caching parameters
ssd_caching:
  is_enabled: true
  path: "/ssd"
    # NOTES:
    # - for SSD caching you have to mount an SSD on the specified path (or create a link to it)
    # - changes of this parameter should be performed in cooperation with changes in the web server configuration
    #   (it is recommended not to change this parameter)
  max_size: 100Gb
    # maximum data size, possible suffixes are:
    # Kb, Mb, Gb, Tb, Pb or without trailing "b" (K, M, G, T, P)
  required_weight: 10

  uri_prefixes:
    ssd_cache_requests:
    main_storage_requests:

  frozen_time:
    # time during which a file placed in the SSD cache
    # cannot be replaced by another one, default: 3 * collecting window
    # frozen time cannot be less than the collecting window

  collector:
    # the collector accumulates events of object requests
    # and distributes them into time windows
    slots: 24
    window: "1h"

  workers:
    # each process sends information about found items to the result queue
    parallel_workers: 2
    job_queue_size: 5000
      # max number of jobs in the queue,
      # default: 5000
    result_queue_size: 100000
      # max number of messages from the processes
      # which are sent to the result queue, must be between 1000 and 1000000,
      # default: 100000
</code>

When setting the parameters, you should pay attention to the number of working processes and to the collector parameters: together with ''required_weight'' they determine how quickly an object becomes eligible for the SSD cache.

===== Initial configuration =====
When installing **ccu**, a default configuration file is created. Before the first start, check at least the following parameters and adjust them to the particular installation:
  * ''pid_files_path'' and ''work_files_path'' - the working directories;
  * ''logging/path'' - the log files directory;
  * ''load/online/exporters'' - the descriptions of the IPFIX stream sources;
  * ''statistics/collectors'' - the statistics collectors;
  * ''storage_parameters/general/path'' - the storage root;
  * ''storage_parameters/general/max_size'' - the maximum total size of the cached data.