Back to index

4.9.0-0.okd-2022-01-29-035536

Jump to: Complete Features | Incomplete Features | Complete Epics | Incomplete Epics | Other Complete | Other Incomplete |

Changes from 4.8.0-0.okd-2023-07-28-033101

Note: this page shows the Feature-Based Change Log for a release

Complete Features

These features were completed when this image was assembled

Problem Alignment

The Problem

Customers typically run more than one cluster and/or applications deployed across different regions. In such a hybrid cloud environment, aggregating metrics is a key requirement to avoid admins and or applications owners to drop in into individual clusters to troubleshoot specific problems. And since Red Hat does not offer a standalone metrics aggregation service, customers have started to use existing, home-grown technologies based on, for example, InfluxDB or Kafka to achieve that.

In summary:

  • OpenShift Monitoring is optimized for short-term retention only.
  • Red Hat does not offer a central metrics aggregation service yet.
  • Customers use existing, home-grown technologies to distribute information across other stakeholders in their company.

High-Level Approach

Expose Prometheus remote-write configuration via our OpenShift Monitoring (Cluster and User Workload) ConfigMap to allow customers to push time-series data to a remote location.

Please note that we do not plan to support certain third party “receivers” with this solution. Customers will be responsible to ensure an appropriate receiving component is up and running that implements the “remote-write” API. Here is a list of possible “receiver” plugins.

Goal & Success

  • Introduce some “ease of use” features to configure certain parts for remote-write to decrease possible misconfigurations.
  • Allow customers to push metrics off the cluster to allow aggregation use cases and more options for our partners to integrate into OpenShift - e.g. to allow long-term retention or security/analytics scenarios.

Solution Alignment

Key Capabilities

  • As an OpenShift administrator, I want to configure remote-write for both the OOTB infrastructure bundle and the user workload stack, so that time-series data will be available on the system of my choice.
  • As an OpenShift administrator, I want to easily build an allow list of metrics that should be pushed externally.

Key Flows

User configures one of the available ConfigMaps to allow node_cpu_seconds_total to be written into a remote Thanos system.

  • Administrator opens the cluster-monitoring-config ConfigMap.
  • They add a new field to configure remote write.
  • They add the node_cpu_seconds_total metric to the allow list.
  • They add the remote URL for the Thanos receiver.
  • They add a Secret to configure authentication against the remote service.

Additional resources

Remote write allows to replicate time-series data to a remote location. This is important for several scenarios like you want to use "remote-write enabled" systems (e.g. InfluxDB) for long-term storage and historical analysis; as well as for aggregating metrics across multiple clusters.

Currently, remote-write is in an experimental stage in Prometheus[1] but the chances are high that it will be stable some time this year. Furthermore, we are using remote-write pretty extensively already for Telemetry as well as ACM in the near future. With that in mind, we think that we are in a perfect spot to move what we already have[2] from dev preview to at least tech preview.

Acceptance criteria

  • mTLS support (important for positioning Red Hat's Advanced Cluster Manager (ACM) as they will need it for pushing metrics from OpenShift clusters into their central management solution backed by Observatorium.
  • Default configurations coming from Red Hat (such as Telemetry and ACM) should not be overridden. ACM for example may inject their configuration automatically post installation (mechanism to be discussed).

Non-goals

  • Configuration isolation for cluster and user workload monitoring ConfigMap to allow separating remote-write configuration per "tenant" or "user".
  • Remote write for Thanos Ruler (this isn't supported yet, see https://github.com/thanos-io/thanos/issues/1724).

Open questions

  • Do we want to expose a different API to make configuring an allow list easier for everyone rather than exposing relabeling configuration directly? Reason is that we want to avoid validating "syntax" requests in a BZ or internal.

Documentation

  • New section inside the configuration chapter that describes how to setup remote-write with an example on how it looks like for standard remote write implementation. For both CMO and UWM.
  • A small note about implications on setting up remote-write to the overall Prometheus cluster.
  • How to configure security/auth (e.g. (m)TLS).
  • The API.
  • Tuning.
  • Proxy configuration (if not supported, then we need a statement).

Other resources

[1] https://github.com/coreos/prometheus-operator/blob/master/Documentation/api.md#prometheusspec - "If specified, the remote_write spec. This is an experimental feature, it may change in any upcoming release in a breaking way." The experimental flag was removed.

We'll want to give user the option to add remote_write configs to both the cluster monitoring and UWM.
AC:

  • decide what features we want to give users
  • decide what API we want to expose to users, i.e. basic rw config, low-level relabel-config, or high-level-streamlined API
  • implement the API

https://issues.redhat.com/browse/MON-1069?focusedCommentId=16252560&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16252560

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Incomplete Features

When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release

Feature Overview

  • This Section:* High-Level description of the feature ie: Executive Summary
  • Note: A Feature is a capability or a well defined set of functionality that delivers business value. Features can include additions or changes to existing functionality. Features can easily span multiple teams, and multiple releases.

 

Goals

  • This Section:* Provide high-level goal statement, providing user context and expected user outcome(s) for this feature

 

Requirements

  • This Section:* A list of specific needs or objectives that a Feature must deliver to satisfy the Feature.. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.

 

Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

 

(Optional) Use Cases

This Section: 

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

 

Questions to answer…

  • ...

 

Out of Scope

 

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

 

Assumptions

  • ...

 

Customer Considerations

  • ...

 

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?  
  • New Content, Updates to existing content,  Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

Epic Goal

  • Enable Image Registry to use Azure Blob Storage from AzureStackCloud

Why is this important?

  • While certifying Azure Stack Hub as OCP provider we need to ensure all the required components for UPI/IPI deployments are ready to be used

Scenarios

  1. Create an OCP cluster is Azure Stack Hub and use Internal Registry with Azure Blob Storage from AzureStackCloud

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Story: As an OpenShift admin I want the internal registry of the cluster use storage from Azure Stack Hub so that I can run a fully supported OpenShift environment on that infrastructure provider.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

As a cluster administrator,

I want OpenShift to include a recent CoreDNS version,

so that I have the latest available performance and security fixes.

 

We should strive to follow upstream CoreDNS releases by bumping openshift/coredns with every OpenShift 4.y release, so that OpenShift benefits from upstream performance and security fixes, and so that we avoid large version-number jumps when an urgently needed change necessitates bumping CoreDNS to the latest upstream release. This bump should happen as early as possible in the OpenShift release cycle, so as to maximize soak time.

 

For OpenShift 4.9, this means bumping from CoreDNS 1.8.1 to 1.8.3, or possibly a later release should one ship before we do the bump.

 

Note that CoreDNS upstream does not maintain release branches—that is, once CoreDNS is released, there will be no further 1.8.z releases—so we may be better off updating to 1.9 as soon as it is released, rather than staying on the 1.8 series which would then be unmaintained.

 

We may consider bumping CoreDNS again during the OpenShift 4.9 release cycle if upstream ships additional releases during the 4.9 development cycle. However, we will need to weigh the risks and available remaining soak time in the release schedule before doing so, should that contingency arise.

 

Feature Overview

As a OpenShift administrator, I would like a solution that allows me to upgrade from one EUS version to another with very few steps and only minimum disruption to application workloads while still allowing new application services to be deployed.

Goals

4.8

  • Spike, Design, and Scope
  • Begin foundational development if possible

4.9

  • Foundational items delivered and back ported as necessary

4.10

  • Remaining delivery artifacts complete
  • Documentation and enablement complete
  • Full testing complete

Requirements

Functional requirements break down into the following prioritized list:

 

  1. Make serial upgrades safe
    1. Prevent upgrades before the core components are ready (version skewing, incompatible APIs)
    2. Prevent upgrades before operators or ready
      1. Ensure Operators have a way to express max version
      2. Ensure OLM policy is clear on what happens if max version is not specified
    3. Make back pressure items (reasons you cannot upgrade) clear to administrators along with the actions to resolve
    4. CI MUST be running with test automation
    5. Note: Forcing an upgrade is still possible
  2. Make updates faster
    1. Optimize where possible to increase speed of upgrade for core components (SDN/Daemonsets)
  3. Reduce the amount of workload disruption
    1. Work load disruption is not just reboots it is any disruption to workloads during the upgrade, of which a reboot is likely the worst case scenario.  This may also include things like rescheduling of workloads.
    2. We will not change the model of how components are deployed, changes to the host still require a reboot
    3. Discover and document any necessary guidelines to reduce the number of items that are developed which would cause a reboot between EUS releases where possible (4.8, 4.9).  
    4. As a stretch goal, discover if it is possible to reduce the reboots between 4.6 and 4.7 
  4. Should take into consideration clusters with RHEL workers

 

Non-Functional Requirements

Requirement Notes isMvp?
Release Technical Enablement Provide necessary release enablement details and documents. YES
Documentation This is a requirement for ALL end user facing features YES

Questions to answer…

Out of Scope

  • It is not intended to support version skews that fall outside the upstream version skew policy
  • It is not intended to eliminate all reboots
  • It is not intended to skip releases at this time

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

EUS to EUS Focus Area Discussion: https://docs.google.com/document/d/17I1Wd7-R1wRxmboyv1jUFHFkqQcBTorJccdGi1ZqjQE/edit?usp=sharing

EUS Feature: https://issues.redhat.com/browse/OCPPLAN-5484

Epic Goal

  • Ensure the user experience for upgrades in console supports EUS -> EUS upgrades.

Why is this important?

  • This is a product-wide initiative.

Scenarios

  1. The console cluster settings page should inform administrators of upgrade requirements prior to the first upgrade step.
  2. The console cluster settings page should report problems during an upgrade.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

  1. CVO - Sufficient APIs (ClusterVersion, Alerts) for console to show requirements before an upgrade and problems during an upgrade to an administrator.

Previous Work (Optional):

Open questions::

  1. We have an R&D story to investigate what the console experience should be and what APIs might be necessary.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Use case
As an Admin, one of my operators says it can't be upgraded. An action is required, as I will be unable to upgrade to a .y minor release until I fix the problem.
 
Possible Design Solution 
Create a message saying you can upgrade to .z patch releases even when one of your cluster operators says it's not upgradeable.

Ideally, the message string on the condition explains what the admin needs to resolve , and until they resolve the issue they can only update within their current z stream.

 
Questions
Need to do a little R&D to find out when this happens and what happens when you're in this state.

Designs (WIP)
Doc: https://docs.google.com/document/d/1iUZlHbv5nTYtb7Cq4rn_bYPqD4Jtie59xIogxN-2Eyc/edit#heading=h.5eoflxvaj1m4

Feature Overview

We drive OpenShift cross-market customer success and new customer adoption with constant improvements and feature additions to the existing capabilities of our OpenShift Core Networking (SDN and Network Edge). This feature captures that natural progression of the product.

Goals

  • Feature enhancements (performance, scale, configuration, UX, ...)
  • Modernization (incorporation and productization of new technologies)

Requirements

  • Core Networking Stability
  • Core Networking Performance and Scale
  • Core Neworking Extensibility (Multus CNIs)
  • Core Networking UX (Observability)
  • Core Networking Security and Compliance

In Scope

  • Network Edge (ingress, DNS, LB)
  • SDN (CNI plugins, openshift-sdn, OVN, network policy, egressIP, egress Router, ...)
  • Networking Observability

Out of Scope

There are definitely grey areas, but in general:

  • CNV
  • Service Mesh
  • CNF

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

Feature Overview

Plugin teams need a mechanism to extend the OCP console that is decoupled enough so they can deliver at the cadence of their projects and not be forced in to the OCP Console release timelines.

The OCP Console Dynamic Plugin Framework will enable all our plugin teams to do the following:

  • Extend the Console
  • Deliver UI code with their Operator
  • Work in their own git Repo
  • Deliver at their own cadence

Goals

    • Operators can deliver console plugins separate from the console image and update plugins when the operator updates.
    • The dynamic plugin API is similar to the static plugin API to ease migration.
    • Plugins can use shared console components such as list and details page components.
    • Shared components from core will be part of a well-defined plugin API.
    • Plugins can use Patternfly 4 components.
    • Cluster admins control what plugins are enabled.
    • Misbehaving plugins should not break console.
    • Existing static plugins are not affected and will continue to work as expected.

Out of Scope

    • Initially we don't plan to make this a public API. The target use is for Red Hat operators. We might reevaluate later when dynamic plugins are more mature.
    • We can't avoid breaking changes in console dependencies such as Patternfly even if we don't break the console plugin API itself. We'll need a way for plugins to declare compatibility.
    • Plugins won't be sandboxed. They will have full JavaScript access to the DOM and network. Plugins won't be enabled by default, however. A cluster admin will need to enable the plugin.
    • This proposal does not cover allowing plugins to contribute backend console endpoints.

 

Requirements

 

Requirement Notes isMvp?
 UI to enable and disable plugins    YES 
 Dynamic Plugin Framework in place    YES 
Testing Infra up and running   YES 
 Docs and read me for creating and testing Plugins    YES 
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

 
 Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?  
  • New Content, Updates to existing content,  Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

We need to support localization of dynamic plugins. The current proposal is to have one i18n namespace per dynamic plugin with a fixed name: `${plugin-name}-plugin`. Since console will know the list of plugins on startup, it can add these namespaces to the i18next config.

The console backend will need to implement an endpoint at the i18next load path. The endpoint will see if the namespace matches the known plugin namespaces. If so, it will proxy to the plugin. Otherwise it will serve the static file from the local filesystem.

We need a UI for enabling and disabling dynamic plugins. The plugins will be discovered either through a custom resource or an annotation on the operator CSV. The enabled plugins will be persisted through the operator config (consoles.operator.openshift.io).

This story tracks enabling and disabling the plugin during operator install through Cluster Settings. This is needed in the future if a plugin is installed outside of an OLM operator.

UX design: https://github.com/openshift/openshift-origin-design/pull/536 

The dynamic plugins enhancement describes a `disable-plugins` query parameter for disabling specific console plugins.

  • ?disable-plugins or ?disable-plugins= prevents loading of any dynamic plugins (disable all)
  • ?disable-plugins=foo,bar prevents loading of dynamic plugins named foo or bar (disable selectively)

This has no effect on static plugins, which are built into the Console application.

https://github.com/openshift/enhancements/blob/master/enhancements/console/dynamic-plugins.md#error-handling

Feature Overview

  • This Section:* High-Level description of the feature ie: Executive Summary
  • Note: A Feature is a capability or a well defined set of functionality that delivers business value. Features can include additions or changes to existing functionality. Features can easily span multiple teams, and multiple releases.

 

Goals

  • This Section:* Provide high-level goal statement, providing user context and expected user outcome(s) for this feature

 

Requirements

  • This Section:* A list of specific needs or objectives that a Feature must deliver to satisfy the Feature.. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.

 

Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

 

(Optional) Use Cases

This Section: 

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

 

Questions to answer…

  • ...

 

Out of Scope

 

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

 

Assumptions

  • ...

 

Customer Considerations

  • ...

 

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?  
  • New Content, Updates to existing content,  Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

As a admin, I want to be able to access the node logs from the nodes detail page in order to troubleshoot what is going on with the node.

We should support getting node logs for different units for node journal logs and evaluate the other CLI flags.

We currently have a gap with the CLI:

  •   oc adm node-logs [-l LABELS] [NODE...] [flags]

We need to investigate whether the k8s API supports WebSockets for streaming node logs.

Goal

Currently we are showing system projects within the list view of the Projects page. As stated here https://issues.redhat.com/browse/RFE-185, there are many projects that are considered as system projects that are not important to the user. The value should be remember across sessions, but it something we should be able to toggle directly from the list.

Design assets

Design doc

Marvel

Requirements

  • The user should be able to hide/show system projects within the project list page (and namespace list page)
  • The user should be able to hide/show system projects from the project selector
  • The same capability should work from the project list page in the developer perspective

In OpenShift, reserved namespaces are `default`, `openshift`, and those that start with `openshift-`, `kubernetes-`, or `kube-`.

Edge case scenarios

  • If the user filters out system projects from the projects or namespaces list view, then filters and there are no results, an empty state will be surfaced with ability to clear filters. (see design assets)
  • If the user has hidden system projects from the project selector and has favorited or defaulted system projects in the project selector, those favorited or defaulted system projects will NOT appear in the project selector list. (see design assets)
  • If the user has hidden system projects from the project selector, then navigates to some resource page where a system project is selected, the system project name will still appear in the project selector toggle but not within the list of projects in the selector. (see design assets)

Goal
By default the Cluster Utilization card should not include metrics from `master` nodes in its queries for CPU, Memory, Filesystem, Network, and Pod count.

A new filter option should allow users to toggle between a combined view of what is seen on the Cluster Utilization card today, which is mostly useful on small clusters where masters are schedulable for user workloads.

Assets

  • Marvel with two scenarios:
    • Windows nodes exist
    • Windows nodes do not exist

Background

As discussed in this thread, the`kube_node_role` metric available since 4.3 should allow us to filter the card's PromQL queries to not include master node metrics.

This filtered view would likely make the card's data more useful for users who aren't running their workloads on masters, like OpenShift Dedicated users.

As noted by some folks during design discussions, this filter isn't perfect, and wouldn't filter out the data from "Infra" nodes that users may have set up using labels/taints. Until we determine a good way to provide more advanced filtering, this basic "Include masters" checkbox is still more flexible than what the card offers today.

Requirements

  • When windows nodes exist in the cluster:
    • Node type filter will be added to the Cluster Utilization card that lists all node types available
    • It will be pre-filtered to only show Worker nodes
    • The filter will be single select and will display the selected item in the toggle.
  • When windows nodes do not exist in the cluster:
    • Node type filter will be added to the Cluster Utilization card that lists node types available, plus an "all types" item.
    • It will be pre-filtered to only show Worker nodes
    • The filter will be multi-select
    • The badge in the toggle will update as more items are selected
    • If the "all nodes" is selected, the other items will automatically become deselected, and the badge will update to "All".

Feature Overview

OpenShift console supports new features and elevated experience for Operator Lifecycle Manager (OLM) Operators and Cluster Operators.

Goal:

OCP Console improves the controls and visibility for managing vendor-provided software in customers’ infrastructure and making these solutions available for customers' internal users.

 

To achieve this, 

  • Operator Lifecycle Manager (OLM) teams have been introducing new features aiming towards simplification and ease of use for both developers and cluster admins.
  • On the Cluster Operators side, the console iteratively improves the visibilities to the resources being associated with the Operators to improve the overall managing experience.

We want to make sure OLM’s and Cluster Operators' new features are exposed in the console so admin console users can benefit from them.

Benefits:

  • Cluster admin/Operator consumers:
    • Able to see, learn, and interact with OLM managed and/or Cluster Operators associated resources in openShift console.

Requirements

Requirement Notes isMvp?
OCP console supports the latest OLM APIs and features This is a requirement for ALL features. YES
OCP console improves visibility to Cluster Operators related resources and features. This is a requirement for ALL features. YES
     

 


(Optional) Use Cases
<--- Remove this text when creating a Feature in Jira, only for reference --->
* Main success scenarios - high-level user stories
* Alternate flow/scenarios - high-level user stories
* ...

Questions to answer...
How will the user interact with this feature?
Which users will use this and when will they use it?
Is this feature used as part of the current user interface?

Out of Scope
<--- Remove this text when creating a Feature in Jira, only for reference --->
# List of non-requirements or things not included in this feature
# ...

Background, and strategic fit
<--- Remove this text when creating a Feature in Jira, only for reference --->
What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions
<--- Remove this text when creating a Feature in Jira, only for reference --->
* Are there assumptions being made regarding prerequisites and dependencies?
* Are there assumptions about hardware, software or people resources?
* ...

Customer Considerations
<--- Remove this text when creating a Feature in Jira, only for reference --->
* Are there specific customer environments that need to be considered (such as working with existing h/w and software)?
...

Documentation Considerations
<--- Remove this text when creating a Feature in Jira, only for reference --->
Questions to be addressed:
* What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
* Does this feature have doc impact?
* New Content, Updates to existing content, Release Note, or No Doc Impact
* If unsure and no Technical Writer is available, please contact Content Strategy.
* What concepts do customers need to understand to be successful in [action]?
* How do we expect customers will use the feature? For what purpose(s)?
* What reference material might a customer want/need to complete [action]?
* Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
* What is the doc impact (New Content, Updates to existing content, or Release Note)?

Epic Goal

  • OCP console supports devs to easier focus and create Operand/CR instances on the creation form page.
  • OCP console supports cluster admins to better see/understand the Operator installation status in the OperatorHub page.

Why is this important?

  • OperatorHub page currently shows an Operator as Installed as long as a Subscription object exists for that operator in the current namespace, which can be misleading because the installation could be stalled or require additional interactions from the user (e.g. "manual upgrade approval") in order to complete the installation.
  • Some Operator managed services use these advanced properties in their CRD validation schema, but the current form generator in the console ignores/skips them. Hence, those fields on the creation form are missing.

Scenarios

  1. As a user of OperatorHub, I'd like to have an improved "status display" for Operators being installed before so I can better understand if those Operators actually being successfully installed or require additional actions from me to complete the installation.
  2. As a user of the OCP console, I'd like to Operand/CR creation form that covers advanced JSONSchema validation properties so I can create a CR instance solely with the form view.

Acceptance Criteria

  • Console improves the visibility of Operator installation status on OperatorHub page
  • Console operand creation form adds support for `allOf`, `anyOf`, `oneOf`, and `additionalProperties` JSONSchema validation keywords so the creation form UI can render them and not skipping those properties/fields.
  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

  • Options

 

OLM is adding a property to the CSV to signal that the operator should clean up the operand on operator uninstall. See https://github.com/operator-framework/enhancements/pull/46

Console will need to add a checkbox to the UI to prompt ask the user if the operand should be cleaned up (with strong warnings about what this means). On delete, console should set the `spec.cleanup` property on the CSV to indicate whether cleanup should happen.

Additionally, console needs to be able to show proper status for CSVs that are terminating in the UI so it's clear the operator is being deleted and cleanup is in progress. If there are errors with cleanup, those should be surfaced back through the UI.

Depends on OLM-1733

cc Ali Mobrem Tony Wu Daniel Messer Peter Kreuser

User Story

As a user of OperatorHub, I'd like to have an improved "status display" for Operators being installed before so I can better understand if those Operators actually being successfully installed or require additional actions from me to complete the installation.

Desired Outcome

Improve visibility of Operator installation status on OperatorHub page

Why this is important?

OperatorHub page currently shows an Operator as Installed as long as a Subscription object exists for that operator in the current namespace.

This can be misleading because the installation could be stalled or require additional interactions from the user (e.g. "manual upgrade approval") in order to complete the installation.

The console could potentially have some indication of an "in-between" or "requires attention" state for Operators that are in these states + links to the actual "Installed Operators" page for more details.

Related Info:

1. BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1899359
2. RFE: https://issues.redhat.com/browse/RFE-1691

Key Objective
Providing our customers with a single simplified User Experience(Hybrid Cloud Console)that is extensible, can run locally or in the cloud, and is capable of managing the fleet to deep diving into a single cluster. 
Why customers want this?

  1. Single interface to accomplish their tasks
  2. Consistent UX and patterns
  3. Easily accessible: One URL, one set of credentials

Why we want this?

  • Shared code -  improve the velocity of both teams and most importantly ensure consistency of the experience at the code level
  • Pre-built PF4 components
  • Accessibility & i18n
  • Remove barriers for enabling ACM

Phase 1 Goal: Get something to market (OCP 4.8, ACM 2.3)
Phase 1 —> OCP deploys ACM Hub Operator —> ACM Perspective becomes available —> User can switch between ACM multi-cluster view and local OCP Console —> No SSO user has to login in twice

Phase 2 Goal: Productization of the united Console (OCP 4.9, ACM 2.4)

  1. Enable user to quickly change context from fleet view to single cluster view
    1. Add Cluster selector with “All Cluster” Option. “All Cluster” = ACM
    2. Shared SSO across the fleet
    3. Hub OCP Console can connect to remote clusters API
    4. When ACM Installed the user starts from the fleet overview aka “All Clusters”
  2. Share UX between views
    1. ACM Search —> resource list across fleet -> resource details that are consistent with single cluster details view
    2. Add Cluster List to OCP —> Create Cluster

Phase 2  Use Cases:

  1.  As a user, I want to be able to quickly switch context from the Fleet view(ACM) to any spoke cluster Console view all from the same web browser tab.
    1. ACM Hub Operator deployed to OCP—> Cluster picker become available, with “All cluster option”= ACM —> Single cluster user will get perspective picker(Admin, Dev) —> User needs the ability to quickly change context to single cluster —> All clusters should be linked via shared SSO
    2.   
  2. As a user, I should be able to drill down into resources in the ACM view and get the OCP resource details page
    1. ACM Hub Operator deployed to OCP—> User Searches for pods from the ACM view("All clusters")--> Single pod is selected --> OCP pod detail page

We need to coordinate with the ACM team so that the masthead looks the same when switching between contexts. This might require us to consume a common masthead component in OCP console.

The ACM team will need to honor our custom branding configuration so that the logo does not change when switching contexts.

Known differences:

  • Branding customization
  • Console link CRDs
  • Global notifications
  • Import button
  • Notification drawer
  • Language preferences
  • Search link (ACM only)
  • Web terminal (ACM only)

Open questions:

  • How do we handle alerts in the notification drawer across cluster contexts?

OCP/Telco Definition of Done
Feature Template descriptions and documentation.
Feature Overview

  • Connect OpenShift workloads to Google services with Google Workload Identity

Goals

  • Customers want to be able to manage and operate OpenShift on Google Cloud Platform with workload identity, much like they do with AWS + STS or Azure + workload identity.
  • Customers want to be able to manage and operate operators and customer workloads on top of OCP on GCP with workload identity.

Requirements

  • Add support to CCO for the Installation and Upgrade using both UPI and IPI methods with GCP workload identity.
  • Support install and upgrades for connected and disconnected/restriction environments.
  • Support the use of Operators with GCP workload identity with minimal friction.
  • Support for HyperShift and non-HyperShift clusters.
  • This Section:* A list of specific needs or objectives that a Feature must deliver to satisfy the Feature.. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.
Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

Questions to answer…

  • ...

Out of Scope

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

 

Epic Goal

  • Complete the implementation for GCP  workload identity, including support and documentation.

Why is this important?

  • Many customers want to follow best security practices for handling credentials.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Investigate if this will work for OpenShift components similar to how we implemented STS.

Can we distribute credentials fashion that is transparent to the callers (as to whether it is normal service account of a short lived token) like we did for AWS?

What changes would be required for operators?

Can ccoctl do the heavy lifting as we did for AWS?

Goal:

Enable and support Multus CNI for microshift.

Background:

Customers with advanced networking requirement need to be able to attach additional networks to a pod, e.g. for high-performance requirements using SR-IOV or complex VLAN setups etc.

Requirements:

  1. opt-in approach: customers can add multus if needed, e.g. by installing/adding "microshift-networking-multus" rpm package to their installation.
  2. if possible, it would be good to be able to add multus  an existing installation. If that requires a restart/reboot, that is acceptable. If not possible, it has to be clearly documented. 
  3. it is acceptable that once multus has been added to an installation, it can not be removed. If removal can be implemented easily, that would be good. If not possible, then it has to be clearly documented.  
  4. Regarding additional networks:
    1. As part of the MVP, the Bridge plugin must be fully supported  
    2. As stretch goal,  macvlan and ipvlan plugins  should be supported
    3. Other plugins, esp. host device and sr-iov are out of scope for the MVP, but will be added with a later version.
    4. Multiple additional networks need to be configurable, e.g. two different bridges leading to two different networks, each consumed by different pods.
  5. IP V6 with bridge plugin. Secondary NICs passed to a container via the bridge plugin should work with IP V6, if the consuming pod does support V6. See also "out of scope"
  6. Regarding IPAM CNI support for IP address provisioning, static and DHCP must be supported.

Documentation:

  1. In the existing "networking" book, we need a new chapter "4. Multiple networks". It can re-use a lot content from OCP doc "https://docs.openshift.com/container-platform/4.14/networking/multiple_networks/understanding-multiple-networks.html", but needs an extra chapter in the beginning "Installing support for multiple networks"{}

Testing:

  1. A simple "smoke test" that multus can be added, and a 2nd nic added to a pod (e.g. using host device) is sufficient. No need to replicate all the multus tests from OpenShift, as we assume that if it works there, it works with MicroShift.

Customer Considerations:

  • This document contains the MVP requirements of a MicroShift EAP customer that need to be considered.

Out of scope:

  • Other plugins, esp. host device and sr-iov are out of scope for the MVP, but will be added with a later version.
  • IP V6 support with OVN-K. That is scope of  feature OCPSTRAT-385 

 

Epic Goal

  • Provide optional Multus CNI for MicroShift

Why is this important?

  • Customers need to add extra interfaces directly to Pods

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

It should include all CNIs (bridge, macvlan, ipvlan, etc.) - if we decide to support something, we'll just update scripting to copy those CNIs

It needs to include IPAMs: static, dynamic (DHCP), host-local (we might just not copy it to host)

RHEL9 binaries only to save space

 

Feature Overview

Enable sharing ConfigMap and Secret across namespaces

Requirements

Requirement Notes isMvp?
Secrets and ConfigMaps can get shared across namespaces   YES

Questions to answer…

NA

Out of Scope

NA

Background, and strategic fit

Consumption of RHEL entitlements has been a challenge on OCP 4 since it moved to a cluster-based entitlement model compared to the node-based (RHEL subscription manager) entitlement mode. In order to provide a sufficiently similar experience to OCP 3, the entitlement certificates that are made available on the cluster (OCPBU-93) should be shared across namespaces in order to prevent the need for cluster admin to copy these entitlements in each namespace which leads to additional operational challenges for updating and refreshing them. 

Documentation Considerations

Questions to be addressed:
 * What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
 * Does this feature have doc impact?
 * New Content, Updates to existing content, Release Note, or No Doc Impact
 * If unsure and no Technical Writer is available, please contact Content Strategy.
 * What concepts do customers need to understand to be successful in [action]?
 * How do we expect customers will use the feature? For what purpose(s)?
 * What reference material might a customer want/need to complete [action]?
 * Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
 * What is the doc impact (New Content, Updates to existing content, or Release Note)?

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Allow ConfigMaps and Secrets (resources) to be mounted as volumes in a build

Why is this important?

  • Secrets and ConfigMaps can be added to builds as "source" code that can leak into the resulting container image
  • When using sensitive credentials in a build, accessing secrets as a mounted volume ensure that these credentials are not present in the resulting container image.

Scenarios

  1. Access private artifact repositories (Artifactory, jFrog, Mavein)
  2. Download RHEL packages in a build

Acceptance Criteria

  • Builds can mount a Secret or ConfigMap in a build
  • Content in the secret or ConfigMap are not present in the resulting container image.

Dependencies (internal and external)

  1. Buildah - support mounting of volumes when building with a Dockerfile

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview

Reduce the OpenShift platform and associated RH provided components to a single physical core on Intel Sapphire Rapids platform for vDU deployments on SingleNode OpenShift.

Goals

  • Reduce CaaS platform compute needs so that it can fit within a single physical core with Hyperthreading enabled. (i.e. 2 CPUs)
  • Ensure existing DU Profile components fit within reduced compute budget.
  • Ensure existing ZTP, TALM, Observability and ACM functionality is not affected.
  • Ensure largest partner vDU can run on Single Core OCP.

Requirements

Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES
 
Provide a mechanism to tune the platform to use only one physical core. 
Users need to be able to tune different platforms.  YES 
Allow for full zero touch provisioning of a node with the minimal core budget configuration.   Node provisioned with SNO Far Edge provisioning method - i.e. ZTP via RHACM, using DU Profile. YES 
Platform meets all MVP KPIs   YES

(Optional) Use Cases

  • Main success scenario: A telecommunications provider uses ZTP to provision a vDU workload on Single Node OpenShift instance running on an Intel Sapphire Rapids platform. The SNO is managed by an ACM instance and it's lifecycle is managed by TALM.

Questions to answer...

  • N/A

Out of Scope

  • Core budget reduction on the Remote Worker Node deployment model.

Background, and strategic fit

Assumptions

  • The more compute power available for RAN workloads directly translates to the volume of cell coverage that a Far Edge node can support.
  • Telecommunications providers want to maximize the cell coverage on Far Edge nodes.
  • To provide as much compute power as possible the OpenShift platform must use as little compute power as possible.
  • As newer generations of servers are deployed at the Far Edge and the core count increases, no additional cores will be given to the platform for basic operation, all resources will be given to the workloads.

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
    • Administrators must know how to tune their Far Edge nodes to make them as computationally efficient as possible.
  • Does this feature have doc impact?
    • Possibly, there should be documentation describing how to tune the Far Edge node such that the platform uses as little compute power as possible.
  • New Content, Updates to existing content, Release Note, or No Doc Impact
    • Probably updates to existing content
  • If unsure and no Technical Writer is available, please contact Content Strategy. What concepts do customers need to understand to be successful in [action]?
    • Performance Addon Operator, tuned, MCO, Performance Profile Creator
  • How do we expect customers will use the feature? For what purpose(s)?
    • Customers will use the Performance Profile Creator to tune their Far Edge nodes. They will use RHACM (ZTP) to provision a Far Edge Single-Node OpenShift deployment with the appropriate Performance Profile.
  • What reference material might a customer want/need to complete [action]?
    • Performance Addon Operator, Performance Profile Creator
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
    • N/A
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?
    • Likely updates to existing content / unsure

Goals

  • Expose a mechanism to allow the Monitoring stack to be more a "collect and forward" stack instead of a full E2E Monitoring solution.
  • Expose a corresponding configuration to allow sending alerts to a remote Alertmanager in case a local Alertmanager is not needed.
    • Support proxy environments with also proxy envs.
  • The overall goal is to fit all platform components into 1 core (2 HTs, 2 CPUs) for single node openshift deployments. The monitoring stack is one of the largest cpu consumers on single node openshift consuming ~ 200 mc at steady state, primarilty prometheus and the node exporter. This epic would track optimizations to the monitoring stack to reduce this usage as much as possible. Two items to be explored: 
    • Reducing the scrape interval 
    • Reducing the number of series to be scraped 

Non-Goals

  • Switching off all Monitoring components.
  • Reducing metrics from any component not owned by the Monitoring team.

Motivation

Currently, OpenShift Monitoring is a full E2E solution for monitoring infrastructure and workloads locally inside a single cluster. It comes with everything that an SRE needs from allowing to configure scraping of metrics to configuring where alerts go.

With deployment models like Single Node OpenShift and/or resource restricted environments, we now face challenges that a lot of the functions are already available centrally or are not necessary due to the nature of a specific cluster (e.g. Far Edge). Therefore, you don't need to deploy components that expose these functions.

Also, Grafana is not FIPS compliant, because it uses PBKDF2 from x/crypto to derive a 32 byte key from a secret and salt, which is then used as the encryption key. Quoting https://bugzilla.redhat.com/show_bug.cgi?id=1931408#c10 "it may be a problem to sell Openshift into govt agencies if
grafana is a required component."

Alternatives

We could make the Monitoring stack as is completely optional and provide a more "agent-like" component. Unfortunately, that would probably take much more time and in the end just reproduce what we already have just with fewer components. It would also not reduce the amount of samples scraped which has the most impact on CPU usage.

Acceptance Criteria

  • Verify that all alerts fire against a remote Alertmanager when a user configures that option.
  • Verify that Alertmanager is not deployed when a user configures that option in the cluster-monitoring-operator configmap.
  • Verify that if you have a local Alertmanager deployed and a user decides to use a remote Alertmanager, the Monitoring stack sends alerts to both destinations.
  • Verify that Grafana is not deployed when a user configures that option in the cluster-monitoring-operator configmap.
  • Verify that Prometheus fires alerts against an external Alertmanager in proxy environments (1) configure proxy settings inside CMO and (2) cluster-wide proxy settings through ENV.

Risk and Assumptions

Documentation Considerations

  • Any additions to our ConfigMap API and their possible values.

Open Questions

  • If we set a URL for a remote Alertmanager, how are we handle authentication?
  • Configuration of remote Alertmanagers would support whatever Prometheus supports (basic auth, client TLS auth and bearer token)

Additional Notes

Use cases like single-node deployments (e.g. far-edge) don't need to deploy a local Alertmanager cluster because alerts are centralized at the core (e.g. hub cluster), running Alertmanager locally takes resources from user workloads and adds management overhead. Cluster admins should be able to not deploy Alertmanager as a day-2 operation.

DoD

  • Alertmanager isn't deployed when switched off in the CMO configmap.
  • The OCP console handles the situation gracefully when Alertmanager isn't installed informing users that it can't manage the local Alertmanager configuration and resources.
    • Silences page
    • Alertmanager config editor

Complete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled

 

Monitoring needs to be reliable and is the very useful when trying to debug clusters in an already degraded state. We want to ensure that metrics scraping can always work if the scraper can reach the target, even if the kube-apiserver is unavailable or unreachable. To do this, we will combine a local authorizer (already merged in many binaries and the rbac-proxy) and client-cert based authentication to have a fully local authentication and authorization path for scraper targets.

If networking (or part of networking) is down and a scraper target cannot reach the kube-apiserver to verify a token and a subjectaccessreview, then the metrics scraper can be rejected. The subjectaccessreview (authorization) is already largely addressed, but service account tokens are still used for scraping targets. Tokens require an external network call that we can avoid by using client certificates. Gathering metrics, especially client metrics, from partially functionally clusters helps narrow the search area between kube-apiserver, etcd, kubelet, and SDN considerably.

In addition, this will significantly reduce the load on the kube-apiserver. We have observed in the CI cluster that token and subject access reviews are a significant percentage of all kube-apiserver traffic.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User story:

As cluster-policy-controller I automatically approve cert signing requests issued by monitoring.

DoD:

  • cert signing requests issued by the cluster-monitoring-operator service account are approved automatically.

Implementation hints: leverage approving logic implemented in https://github.com/openshift/library-go/pull/1083.

An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.

ListPage is still JSX, we should convert to TSX, add proper types and make sure rest of the code is passing correct props.

resources.js contains functions to work with k8s API (CRUD). It would be good to convert to TS and add proper types. We will want to expose these functions in some form to dynamic plugins too so proper types is a must

Table component is Class component currently, we want to update to function component.

There's also many properties with `any`  type, we will want to reduce those and be more strict.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Rebase OpenShift components to k8s v1.22
  • Rebase Jenkins and plugins to latest long term support versions

Why is this important?

  • Rebasing ensures components work with the upcoming release of Kubernetes
  • Address tech debt related to upstream deprecations and removals.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. k8s 1.22 release - expected August 4th 2021

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story

Rebase samples operator to k8s 1.22

Acceptance Criteria

  • Samples operator deploys with k8s 1.22 libraries
  • Core components continue to function (CI tests pass, including build suite).

Docs Impact

None

Notes

Problem:

This epic is mainly focused to track the dev console QE activities for 4.9 Release

Goal:

1. Identify the scenarios for automation
2. Segregate the test Scenarios into smoke, Regression and other user stories
a. Update the https://docs.jboss.org/display/ODC/Automation+Status+Report
3. Align with layered operator teams for updating scripts
3. Work closely with dev team for epic automation
4. Create the automation scripts using cypress
5. Implement CI for nightly builds
6. Execute scripts on sprint basis

Why is it important?

To the track the QE progress at one place

Acceptance criteria:

  1. <criteria>

Dependencies (External/Internal):

  1. External [For reviewing existing gherkin scripts]
  2. Internal [Planned tasks for 4.9 release]

Design Artifacts:

Exploration:

Note:

While executing the script "Yarn run gherkin-lint", error is displaying due to ","

Scenario length increased to 20. To avoid couple of quick start scenarios errors

max scenarios increased to 20 in feature files, because for few features this is needed

Description

Automation of Application grouping under display options
As a user,

Acceptance Criteria

  1. Display dropdown
  2. Consuption mode

Additional Details:

Description

Automate the quick starts - quick-start-devperspective.feature

 

Acceptance Criteria

  1. Execute the test scenarios manually
  2. Execute them on chrome browser
  3. Remove the @to-do tags once it is done
  4. Execute them on remote cluster

Additional Details:

Update the OWNERS file in all plugin folder. As Gaja and praveen left from the org

Description

Topology chart view automation
As a user,

Acceptance Criteria

  1. Empty state of Topology
  2. Topology with workloads
  3. Filters in chart view

Additional Details:

Problem:

This epic is mainly focused on the 4.10 Release QE activities

Goal:

1. Identify the scenarios for automation
2. Segregate the test Scenarios into smoke, Regression and other user stories
a. Update the https://docs.jboss.org/display/ODC/Automation+Status+Report
3. Align with layered operator teams for updating scripts
3. Work closely with dev team for epic automation
4. Create the automation scripts using cypress
5. Implement CI for nightly builds
6. Execute scripts on sprint basis

Why is it important?

To the track the QE progress at one place in 4.10 Release Confluence page

Use cases:

  1. <case>

Acceptance criteria:

  1. <criteria>

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Epic Goal

  • Teach the cluster-version operator how to remove in-cluster objects.

Why is this important?

  • OpenShift releases can remove components.  For example, two service-catalog operators were removed in 4.5.  Teaching the CVO how to remove manifests will allow us to clean up these resources without leaving cleanup jobs and associated RBAC behind.  We may also benefit from it in some rollback cases, where the presence of a new-in 4.(y+1) object makes the application of 4.y difficult.

Scenarios

  1. Born-before-4.5 clusters currently have dangling resources from cleaning up the service-catalog operators (like the openshift-service-catalog-removed namespace in this job).  This enhancement would provide a mechanism for removing them, and any other cruft that we accumulate which lack in-cluster operators.
  2. We can add removal manifests to the 4.(y-1) before growing a new component in 4.y, to make 4.y->4.(y-1) rollback more convenient for customers.

Acceptance Criteria

  • CVO CI - MUST be running successfully with automated e2e-operator tests covering this functionality.
  • Release Technical Enablement - Not required.  This is a purely OCP-internal change.

Dependencies (internal and external)

  1. None

Previous Work (Optional):

  1. Enhancement from OTA-279 has been accepted.  Remaining work is just implementing the accepted proposal.

Open questions::

  1. None

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Incomplete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled

Other Complete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled

Description of problem:

User expects a shadow on the form footer when the add page content is longer then the viewport shows. This was shown in 4.8.

Prerequisites (if any, like setup, operators/versions):

None

Steps to Reproduce

  1. Switch to developer perspective
  2. Add page
  3. Import from Git for example

Actual results:

No shadow at the top of the form footer when the content view is scrollable.

Expected results:

A shadow at the top of the form footer when the content view is scrollable.

Reproducibility (Always/Intermittent/Only Once):

Always

Build Details:

Happen on a cluster (4.9.0-0.nightly-2021-07-07-021823)
and local development (4.9 master, tested with 0588bc0f0b838ae448a68f35c5424f9bbfc65bc9)

Additional info:

None

Description of problem:

The QuickStart content shows a shadow above and/or below the content when the user can scroll into that direction. This feature is missing now.

Prerequisites (if any, like setup, operators/versions):

None

Steps to Reproduce

  1. Open a quick start
  2. Reduce the window so that the content of the quick start is scrollable.

Actual results:

No shadow when the user can scroll the content into the content direction.

Expected results:

A shadow when the user can scroll the content into the content direction.

Reproducibility (Always/Intermittent/Only Once):

Always

Build Details:

4.9 (tested on master 2860e58114b4f811ac2ebf6ce34dd99263920e17)

Additional info:

Quick start is now extracted into PatternFly

Create README based on shared Google doc and add it to OpenShift so we have documentation for i18n work.

Description

All test scenarios related to add flow should get executed on CI

Acceptance Criteria

  • Update the devconsole/package.json
  • Verify the scripts on remote cluster

Additional Details:

Description of problem:

The topology use a PatternFly toolbar component to render its toolbar. It looks like the latest version uses a grid and flex layout which adds an addtional spacing below the toolbar.

Prerequisites (if any, like setup, operators/versions):

None

Steps to Reproduce

  1. Open developer console
  2. Check topology toolbar design

Actual results:

There is a spacing below the toolbar (Display options, Filter by resource, Filter by name), see attached screenshot.

Expected results:

No additional spacing below the toolbar. (See screenshot of 4.8)

Reproducibility (Always/Intermittent/Only Once):

Always

Build Details:

4.9 master, tested with commit 3c6537eec4f5c165cf214c4100bddeccc104ed44

Additional info:

Maybe this is a patternfly issue, or we should check if the second div in the toolbar should not be rendered. See attached screenshot.

Description of problem:

.NET builder image is not getting detected

Prerequisites (if any, like setup, operators/versions):

Install Openshift Pipelines Operator

Steps to Reproduce

  1. Follow steps of the test case

Actual results:

When git url of a .NET project is provided, the builder image is not getting detected

Expected results:

When git url of a .NET project is provided, the builder image should automatically get detected

Reproducibility (Always/Intermittent/Only Once):

Always

Build Details:

Additional info:

Description of problem:

When the user opens the topology sidebar for a Deployment with a BuildConfig the string Build #x was complete (and others) are not translated. The browser log shows also an error

Missing i18n key "Build <1></1> was complete" in namespace "public" and language "en."

The strings are translated but are saved as:

Build {link} was complete

Prerequisites (if any, like setup, operators/versions):

None

Steps to Reproduce

  1. Open developer perspective
  2. Add page
  3. Import from Git
  4. Select the deployment in the topology graph
  5. Switch the language

Actual results:

Build #x ... string is not translated. An error is logged in the browser console.

Expected results:

String is shown in the selected language and no error is logged in the browser console.

Reproducibility (Always/Intermittent/Only Once):

Always

Build Details:

Happen on a cluster (4.9.0-0.nightly-2021-07-07-021823)
and local development (4.9 master, tested with 0588bc0f0b838ae448a68f35c5424f9bbfc65bc9)

Additional info:

None

Other Incomplete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were not completed when this image was assembled

Please review the following PR: https://github.com/openshift/containernetworking-plugins/pull/150

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Following https://issues.redhat.com/browse/CNV-28040
On CNV, when virtual machine, with secondary interfaces connected with bridge CNI, is live migrated we observe disruption at the VM inbound traffic.

The root cause for it is the migration target bridge interface advertise before the migration is completed.

When the migration destination pod is created an IPv6 NS (Neighbor Solicitation)
and NA (Neighbor Advertisement) are sent automatically by the kernel.
The switches at the endpoints (e.g.: migration destination node) tables
get updated and the traffic is forwarded to the migration destination before
the migration is completed [1].

The solution is to have the bridge CNI create the pod interface in "link-down" state [2], the IPv6 NS/NA packets are avoided, CNV in turn, set the pod interface to "link-up" [3].

CNV depends on bridge CNI with [2] bits, which is deployed by cluster-network-operator.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2186372#c6
[2] https://github.com/kubevirt/kubevirt/pull/11069
[3] https://github.com/containernetworking/plugins/pull/997  

Version-Release number of selected component (if applicable):

4.16.0

How reproducible:

100%

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

CNO deploys CNI bridge w/o an option to set the bridge interface down.

Expected results:

CNO to deploy bridge CNI with [1] changes, from release-4.16 branch. 

[1] https://github.com/containernetworking/plugins/pull/997

Additional info:

More https://issues.redhat.com/browse/CNV-28040    

 
 
 
 

 

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/containernetworking-plugins/pull/122

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

One multus case always fail in QE e2e testing. Using same net-attach-def and pod configure files, testing passed in 4.11 but failed in 4.12 and 4.13

Version-Release number of selected component (if applicable):

4.12 and 4.13

How reproducible:

All the times

Steps to Reproduce:

[weliang@weliang networking]$ oc create -f https://raw.githubusercontent.com/weliang1/verification-tests/master/testdata/networking/multus-cni/NetworkAttachmentDefinitions/runtimeconfig-def-ipandmac.yaml
networkattachmentdefinition.k8s.cni.cncf.io/runtimeconfig-def created
[weliang@weliang networking]$ oc get net-attach-def -o yaml
apiVersion: v1
items:
- apiVersion: k8s.cni.cncf.io/v1
  kind: NetworkAttachmentDefinition
  metadata:
    creationTimestamp: "2023-01-03T16:33:03Z"
    generation: 1
    name: runtimeconfig-def
    namespace: test
    resourceVersion: "64139"
    uid: bb26c08f-adbf-477e-97ab-2aa7461e50c4
  spec:
    config: '{ "cniVersion": "0.3.1", "name": "runtimeconfig-def", "plugins": [{ "type":
      "macvlan", "capabilities": { "ips": true }, "mode": "bridge", "ipam": { "type":
      "static" } }, { "type": "tuning", "capabilities": { "mac": true } }] }'
kind: List
metadata:
  resourceVersion: ""
[weliang@weliang networking]$ oc create -f https://raw.githubusercontent.com/weliang1/verification-tests/master/testdata/networking/multus-cni/Pods/runtimeconfig-pod-ipandmac.yaml
pod/runtimeconfig-pod created
[weliang@weliang networking]$ oc get pod
NAME                READY   STATUS              RESTARTS   AGE
runtimeconfig-pod   0/1     ContainerCreating   0          6s
[weliang@weliang networking]$ oc describe pod runtimeconfig-pod
Name:         runtimeconfig-pod
Namespace:    test
Priority:     0
Node:         weliang-01031-bvxtz-worker-a-qlwz7.c.openshift-qe.internal/10.0.128.4
Start Time:   Tue, 03 Jan 2023 11:33:45 -0500
Labels:       <none>
Annotations:  k8s.v1.cni.cncf.io/networks: [ { "name": "runtimeconfig-def", "ips": [ "192.168.22.2/24" ], "mac": "CA:FE:C0:FF:EE:00" } ]
              openshift.io/scc: anyuid
Status:       Pending
IP:           
IPs:          <none>
Containers:
  runtimeconfig-pod:
    Container ID:   
    Image:          quay.io/openshifttest/hello-sdn@sha256:c89445416459e7adea9a5a416b3365ed3d74f2491beb904d61dc8d1eb89a72a4
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-k5zqd (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kube-api-access-k5zqd:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age   From               Message
  ----     ------                  ----  ----               -------
  Normal   Scheduled               26s   default-scheduler  Successfully assigned test/runtimeconfig-pod to weliang-01031-bvxtz-worker-a-qlwz7.c.openshift-qe.internal
  Normal   AddedInterface          24s   multus             Add eth0 [10.128.2.115/23] from openshift-sdn
  Warning  FailedCreatePodSandBox  23s   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_runtimeconfig-pod_test_7d5f3e7a-846d-4cfb-ac78-fd08b27102ae_0(cff792dbd07e8936d04aad31964bd7b626c19a90eb9d92a67736323a1a2303c4): error adding pod test_runtimeconfig-pod to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [test/runtimeconfig-pod/7d5f3e7a-846d-4cfb-ac78-fd08b27102ae:runtimeconfig-def]: error adding container to network "runtimeconfig-def": Interface name contains an invalid character /
  Normal   AddedInterface          7s    multus             Add eth0 [10.128.2.116/23] from openshift-sdn
  Warning  FailedCreatePodSandBox  7s    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_runtimeconfig-pod_test_7d5f3e7a-846d-4cfb-ac78-fd08b27102ae_0(d2456338fa65847d5dc744dea64972912c10b2a32d3450910b0b81cdc9159ca4): error adding pod test_runtimeconfig-pod to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [test/runtimeconfig-pod/7d5f3e7a-846d-4cfb-ac78-fd08b27102ae:runtimeconfig-def]: error adding container to network "runtimeconfig-def": Interface name contains an invalid character /
[weliang@weliang networking]$ 
 

Actual results:

Pod is not running

Expected results:

Pod should be in running state

Additional info:

 

Currently we see this issue:

 
Aug 28 00:02:20.755103 ip-10-0-131-145 hyperkube[1366]: I0828 00:02:20.755067 1366 prober.go:116] "Probe failed" probeType="Readiness" pod="openshift-etcd/etcd-quorum-guard-588ff9b55d-8lhb7" podUID=5b79def2-9e56-4c93-b8ab-1d04db0f552f containerName="guard" probeResult=failure output=""
then few seconds later
Aug 28 00:02:25.797258 ip-10-0-131-145 hyperkube[1366]: I0828 00:02:25.797231 1366 kubelet.go:2175] "SyncLoop (probe)" probe="readiness" status="ready" pod="openshift-etcd/etcd-quorum-guard-588ff9b55d-8lhb7"
Try to improve the clustermembercontroller sync loop for health status or just improve to not fail there on probe quard during install at least or scale. Instead of maybe operator status use metrics to track this.

Slack for more context https://coreos.slack.com/archives/C027U68LP/p1630506922034600

 

AC: 

  • come up with a solution which approach we want to take and present in the team meeting
  • implement the proposed solution

Description of problem:

clusteroperator/network is degraded after running

    FEATURES_ENVIRONMENT="ci" make feature-deploy-on-ci

from openshift-kni/cnf-features-deploy against IPI clusters with OCP 4.13 and 4.14 in CI jobs from Telco 5G DevOps/CI.

Details for a 4.13 job:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/42141/rehearse-42141-periodic-ci-openshift-release-master-nightly-4.13-e2e-telco5g/1689935408508440576

Details for a 4.14 job:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/42141/rehearse-42141-periodic-ci-openshift-release-master-nightly-4.14-e2e-telco5g/1689935408541995008

For example, got to artifacts/e2e-telco5g/telco5g-gather-pao/build-log.txt and it will report:

Error from server (BadRequest): container "container-00" in pod "cnfdu5-worker-0-debug" is waiting to start: ContainerCreating
Running gather-pao for T5CI_VERSION=4.13
Running for CNF_BRANCH=master
Running PAO must-gather with tag pao_mg_tag=4.12
[must-gather      ] OUT Using must-gather plug-in image: quay.io/openshift-kni/performance-addon-operator-must-gather:4.12-snapshot
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
ClusterID: 60503edf-ecc6-48f7-b6a6-f4dc34842803
ClusterVersion: Stable at "4.13.0-0.nightly-2023-08-10-021434"
ClusterOperators:
	clusteroperator/network is degraded because DaemonSet "/openshift-multus/dhcp-daemon" rollout is not making progress - pod dhcp-daemon-7lmlq is in CrashLoopBackOff State
DaemonSet "/openshift-multus/dhcp-daemon" rollout is not making progress - pod dhcp-daemon-95tzb is in CrashLoopBackOff State
DaemonSet "/openshift-multus/dhcp-daemon" rollout is not making progress - pod dhcp-daemon-hfxkd is in CrashLoopBackOff State
DaemonSet "/openshift-multus/dhcp-daemon" rollout is not making progress - pod dhcp-daemon-mhwtp is in CrashLoopBackOff State
DaemonSet "/openshift-multus/dhcp-daemon" rollout is not making progress - pod dhcp-daemon-q7gfb is in CrashLoopBackOff State
DaemonSet "/openshift-multus/dhcp-daemon" rollout is not making progress - last change 2023-08-11T10:54:10Z

Version-Release number of selected component (if applicable):

branch release-4.13 from https://github.com/openshift-kni/cnf-features-deploy.git for OCP 4.13
branch master from https://github.com/openshift-kni/cnf-features-deploy.git for OCP 4.14

How reproducible:

Always.

Steps to Reproduce:

1. Install OCP 4.13 or OCP 4.14 with IPI on 3x masters, 2x workers.
2. Clone https://github.com/openshift-kni/cnf-features-deploy.git
3. FEATURES_ENVIRONMENT="ci" make feature-deploy-on-ci
4. oc wait nodes --all --for=condition=Ready=true --timeout=10m
5. oc wait clusteroperators --all --for=condition=Progressing=false --timeout=10m

Actual results:

See above.

Expected results:

All clusteroperators have finished progressing.

Additional info:

Without 'FEATURES_ENVIRONMENT="ci" make feature-deploy-on-ci' the steps to reproduce above work as expected.

Please review the following PR: https://github.com/openshift/containernetworking-plugins/pull/142

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Multus mac-vlan/ipvlan/vlan cni panics when master interface in container is missing

Version-Release number of selected component (if applicable):

metallb-operator.v4.13.0-202304190216   MetalLB Operator   4.13.0-202304190216 Succeeded

How reproducible:

Create pod with multiple vlan interfaces connected to missing master interface.

Steps to Reproduce:

1. Create pod with multiple vlan interfaces connected to missing master interface in container
2. Make sure that pod stuck in ContainerCreating state 
3. Run oc describe pod PODNAME and read crash message:

 Normal   Scheduled               22s   default-scheduler  Successfully assigned cni-tests/pod-one to worker-0
  Normal   AddedInterface          21s   multus             Add eth0 [10.128.2.231/23] from ovn-kubernetes
  Normal   AddedInterface          21s   multus             Add ext0 [] from cni-tests/tap-one
  Normal   AddedInterface          21s   multus             Add ext0.1 [2001:100::1/64] from cni-tests/mac-vlan-one
  Warning  FailedCreatePodSandBox  18s   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-one_cni-tests_2e831519-effc-4502-8ea7-749eda95bf1d_0(321d7181626b8bbfad062dd7c7cc2ef096f8547e93cb7481a18b7d3613eabffd): error adding pod cni-tests_pod-one to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [cni-tests/pod-one/2e831519-effc-4502-8ea7-749eda95bf1d:mac-vlan]: error adding container to network "mac-vlan": plugin type="macvlan" failed (add): netplugin failed: "panic: runtime error: invalid memory address or nil pointer dereference\n[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x54281a]\n\ngoroutine 1 [running, locked to thread]:\npanic({0x560b00, 0x6979d0})\n\t/usr/lib/golang/src/runtime/panic.go:987 +0x3ba fp=0xc0001ad8f0 sp=0xc0001ad830 pc=0x433d7a\nruntime.panicmem(...)\n\t/usr/lib/golang/src/runtime/panic.go:260\nruntime.sigpanic()\n\t/usr/lib/golang/src/runtime/signal_unix.go:835 +0x2f6 fp=0xc0001ad940 sp=0xc0001ad8f0 pc=0x449cd6\nmain.getMTUByName({0xc00001a978, 0x4}, {0xc00002004a, 0x33}, 0x1)\n\t/usr/src/plugins/plugins/main/macvlan/macvlan.go:167 +0x33a fp=0xc0001ada00 sp=0xc0001ad940 pc=0x54281a\nmain.loadConf(0xc000186770, {0xc00001e009, 0x19e})\n\t/usr/src/plugins/plugins/main/macvlan/macvlan.go:120 +0x155 fp=0xc0001ada80 sp=0xc0001ada00 pc=0x5422d5\nmain.cmdAdd(0xc000186770)\n\t/usr/src/plugins/plugins/main/macvlan/macvlan.go:287 +0x47 fp=0xc0001adcd0 sp=0xc0001ada80 pc=0x543b07\ngithub.com/containernetworking/cni/pkg/skel.(*dispatcher).checkVersionAndCall(0xc0000bdec8, 0xc000186770, {0x5c02b8, 0xc0000e4e40}, 0x592e80)\n\t/usr/src/plugins/vendor/github.com/containernetworking/cni/pkg/skel/skel.go:166 +0x20a fp=0xc0001add60 sp=0xc0001adcd0 pc=0x5371ca\ngithub.com/containernetworking/cni/pkg/skel.(*dispatcher).pluginMain(0xc0000bdec8, 0x698320?, 0xc0000bdeb0?, 0x44ed89?, {0x5c02b8, 0xc0000e4e40}, {0xc0000000f0, 0x22})\n\t/usr/src/plugins/vendor/github.com/containernetworking/cni/pkg/skel/skel.go:219 +0x2ca fp=0xc0001ade68 sp=0xc0001add60 pc=0x53772a\ngithub.com/containernetworking/cni/pkg/skel.PluginMainWithError(...)\n\t/usr/src/plugins/vendor/github.com/containernetworking/cni/pkg/skel/skel.go:273\ngithub.com/containernetworking/cni/pkg/skel.PluginMain(0x588e01?, 0x10?, 0xc0000bdf50?, {0x5c02b8?, 0xc0000e4e40?}, {0xc0000000f0?, 0x0?})\n\t/usr/src/plugins/vendor/github.com/containernetworking/cni/pkg/skel/skel.go:288 +0xd1 fp=0xc0001adf18 sp=0xc0001ade68 pc=0x537d51\nmain.main()\n\t/usr/src/plugins/plugins/main/macvlan/macvlan.go:432 +0xb6 fp=0xc0001adf80 sp=0xc0001adf18 pc=0x544b76\nruntime.main()\n\t/usr/lib/golang/src/runtime/proc.go:250 +0x212 fp=0xc0001adfe0 sp=0xc0001adf80 pc=0x436a12\nruntime.goexit()\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc0001adfe8 sp=0xc0001adfe0 pc=0x462fc1\n\ngoroutine 2 [force gc (idle)]:\nruntime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)\n\t/usr/lib/golang/src/runtime/proc.go:363 +0xd6 fp=0xc0000acfb0 sp=0xc0000acf90 pc=0x436dd6\nruntime.goparkunlock(...)\n\t/usr/lib/golang/src/runtime/proc.go:369\nruntime.forcegchelper()\n\t/usr/lib/golang/src/runtime/proc.go:302 +0xad fp=0xc0000acfe0 sp=0xc0000acfb0 pc=0x436c6d\nruntime.goexit()\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc0000acfe8 sp=0xc0000acfe0 pc=0x462fc1\ncreated by runtime.init.6\n\t/usr/lib/golang/src/runtime/proc.go:290 +0x25\n\ngoroutine 3 [GC sweep wait]:\nruntime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)\n\t/usr/lib/golang/src/runtime/proc.go:363 +0xd6 fp=0xc0000ad790 sp=0xc0000ad770 pc=0x436dd6\nruntime.goparkunlock(...)\n\t/usr/lib/golang/src/runtime/proc.go:369\nruntime.bgsweep(0x0?)\n\t/usr/lib/golang/src/runtime/mgcsweep.go:278 +0x8e fp=0xc0000ad7c8 sp=0xc0000ad790 pc=0x423e4e\nruntime.gcenable.func1()\n\t/usr/lib/golang/src/runtime/mgc.go:178 +0x26 fp=0xc0000ad7e0 sp=0xc0000ad7c8 pc=0x418d06\nruntime.goexit()\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc0000ad7e8 sp=0xc0000ad7e0 pc=0x462fc1\ncreated by runtime.gcenable\n\t/usr/lib/golang/src/runtime/mgc.go:178 +0x6b\n\ngoroutine 4 [GC scavenge wait]:\nruntime.gopark(0xc0000ca000?, 0x5bf2b8?, 0x1?, 0x0?, 0x0?)\n\t/usr/lib/golang/src/runtime/proc.go:363 +0xd6 fp=0xc0000adf70 sp=0xc0000adf50 pc=0x436dd6\nruntime.goparkunlock(...)\n\t/usr/lib/golang/src/runtime/proc.go:369\nruntime.(*scavengerState).park(0x6a0920)\n\t/usr/lib/golang/src/runtime/mgcscavenge.go:389 +0x53 fp=0xc0000adfa0 sp=0xc0000adf70 pc=0x421ef3\nruntime.bgscavenge(0x0?)\n\t/usr/lib/golang/src/runtime/mgcscavenge.go:617 +0x45 fp=0xc0000adfc8 sp=0xc0000adfa0 pc=0x4224c5\nruntime.gcenable.func2()\n\t/usr/lib/golang/src/runtime/mgc.go:179 +0x26 fp=0xc0000adfe0 sp=0xc0000adfc8 pc=0x418ca6\nruntime.goexit()\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc0000adfe8 sp=0xc0000adfe0 pc=0x462fc1\ncreated by runtime.gcenable\n\t/usr/lib/golang/src/runtime/mgc.go:179 +0xaa\n\ngoroutine 5 [finalizer wait]:\nruntime.gopark(0x0?, 0xc0000ac670?, 0xab?, 0x61?, 0xc0000ac770?)\n\t/usr/lib/golang/src/runtime/proc.go:363 +0xd6 fp=0xc0000ac628 sp=0xc0000ac608 pc=0x436dd6\nruntime.goparkunlock(...)\n\t/usr/lib/golang/src/runtime/proc.go:369\nruntime.runfinq()\n\t/usr/lib/golang/src/runtime/mfinal.go:180 +0x10f fp=0xc0000ac7e0 sp=0xc0000ac628 pc=0x417e0f\nruntime.goexit()\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc0000ac7e8 sp=0xc0000ac7e0 pc=0x462fc1\ncreated by runtime.createfing\n\t/usr/lib/golang/src/runtime/mfinal.go:157 +0x45\n"

Actual results:

The readable error message should be provided instead.

Expected results:

We should handle such scenario without crash and The following log should be used instead. 

Error: Failed to create container due to the missing master interface XXX.

Additional info: