Back to index

4.14.0-0.okd-scos-2023-05-23-224540

Jump to: Complete Features | Incomplete Features | Complete Epics | Incomplete Epics | Other Complete | Other Incomplete |

Changes from 4.13.0-0.okd-scos-2024-02-13-152021

Note: this page shows the Feature-Based Change Log for a release

Complete Features

These features were completed when this image was assembled

Feature Overview

At the moment, HyperShift relies on an older etcd operator (i.e., the CoreOS etcd operator). However, this operator is basic and does not support HA as required.

Goals

Introduce a reliable component to operate etcd that:

  • Is backed by a stable operator
  • Supports images pinned by digest (hash)
  • Supports backups
  • Local persistent volumes for persistent data?
  • Encryption.
  • HA and scalability.

 

Following on from https://issues.redhat.com/browse/HOSTEDCP-444, we need to add the steps to enable migration of the Node/CAPI resources so that workloads can continue running during control plane migration.

This will be a manual process during which control plane downtime will occur.

 

This must satisfy the criteria for a successful migration:

  • All HC conditions are positive.
  • All NodePool conditions are positive.
  • All service endpoints (kas/oauth/ignition server...) are reachable.
  • Ability to create/scale NodePools remains operational.
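A first pass at checking these criteria manually could look like the following (a sketch; the HostedCluster/NodePool names and the clusters namespace are assumptions based on common HyperShift defaults):

$ # All HC conditions are positive / the control plane is available
$ oc wait --for=condition=Available hostedcluster/example -n clusters --timeout=10m
$ # All NodePool conditions are positive
$ oc get nodepools -n clusters -o wide
$ # NodePool create/scale remains operational after migration
$ oc scale nodepool/example -n clusters --replicas=3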

We need to validate and document this manually for starters.

Eventually this should be automated in the upcoming e2e test.

We could even have a job running conformance tests over a migrated cluster.

Background and Goal

Currently in OpenShift we do not support adding 3rd party agents and other software to cluster nodes. While rpm-ostree supports adding packages, we have no way today to do that in a sane, scalable way across machineconfigpools and clusters. Some customers may not be able to meet their IT policies due to this.

In addition to third party content, some customers may want to use the layering process as a point to inject configuration. The build process allows for simple copying of config files and the ability to run arbitrary scripts to set user config files (e.g. through an Ansible playbook). This should be a supported use case, except where it conflicts with OpenShift (for example, the MCO must continue to manage CRI-O and kubelet configs).

Example Use Cases

  • Bare metal firmware update software that is packaged as an RPM
  • Host security monitors
  • Forensic tools
  • SIEM logging agents
  • SSH Key management
  • Device Drivers from OEM/ODM partners

Acceptance Criteria

  1. Administrators can deploy 3rd party repositories and packages to MachineConfigPools.
  2. Administrators can easily remove added packages and repository files.
  3. Administrators can manage system configuration files by copying files into the RHCOS build. [Note: if the same file is managed by the MCO, the MachineConfig version of the file is expected to "win" over the OS image version.]
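For illustration, the layering flow for third-party RPMs would look roughly like this Containerfile (a sketch; the base image digest, repo file, and package name are placeholders):

FROM quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:<digest>
# Add a third-party repository and package, then commit the result as a new layer
COPY vendor.repo /etc/yum.repos.d/vendor.repo
RUN rpm-ostree install vendor-agent && \
    rpm-ostree cleanup -m && \
    ostree container commit

The resulting image is pushed to a registry and rolled out per MachineConfigPool (for example via the MachineConfig osImageURL override shown later on this page).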

Background

As part of enabling OCP CoreOS Layering for third party components, we will need to allow package installation to /opt. Many OEMs and ISVs install to /opt, and it would be difficult for them to make that change only for RHCOS. Meanwhile, changing their RHEL packaging to use a different install target would also be problematic, as their customers expect these tools to install in a certain way. Not having to worry about this path will provide the best ecosystem partner and customer experience.

Requirements

  • Document how 3rd party vendors can be compatible with our current offering.
  • Provide a mechanism for 3rd party vendors or their customers to provide information for exceptions that require an RPM to install binaries to /opt as an install target path.

oc-mirror is a GA product as of OpenShift 4.11.

The goal of this feature is to address future customer requests for new features or capabilities in oc-mirror.

In the 4.12 release, a new feature was introduced to oc-mirror allowing it to use OCI FBC catalogs as a starting point for mirroring operators.

Overview

As an oc-mirror user, I would like the OCI FBC feature to be stable
so that I can use it in a production-ready environment
and so that the new feature works seamlessly with all existing features of oc-mirror

Current Status

This feature is ring-fenced in the oc-mirror repository; it uses the following flags so as not to cause any breaking changes in the current oc-mirror functionality.

  • --use-oci-feature
  • --oci-feature-action (copy or mirror)
  • --oci-registries-config
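For example, a mirroring run exercising these flags might look like the following (illustrative; the catalog config, registries config, and target registry are placeholders):

$ oc-mirror --config ./imageset-config.yaml \
    --use-oci-feature \
    --oci-feature-action=mirror \
    --oci-registries-config ./registries.conf \
    docker://registry.example.com/mirror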

The OCI FBC (file-based catalog) format was delivered as Tech Preview in 4.12.

Tech Enablement slides can be found here https://docs.google.com/presentation/d/1jossypQureBHGUyD-dezHM4JQoTWPYwiVCM3NlANxn0/edit#slide=id.g175a240206d_0_7

Design doc is in https://docs.google.com/document/d/1-TESqErOjxxWVPCbhQUfnT3XezG2898fEREuhGena5Q/edit#heading=h.r57m6kfc2cwt (also contains latest design discussions around the stories of this epic)

Link to previous working epic https://issues.redhat.com/browse/CFE-538

Contacts for the OCI FBC feature

 

Feature Overview

Goals

  • Support deploying OpenShift on AWS Local Zones from day 0
  • Support deploying compute nodes on AWS Local Zones in an existing OpenShift cluster (day 2)

AWS Local Zones support - feature delivered in phases:

  • Phase 0 (OCPPLAN-9630): Document how to create compute nodes on AWS Local Zones on day 0 (SPLAT-635)
  • Phase 1 (OCPBU-2): Create an edge compute pool to generate MachineSets for nodes with NoSchedule taints when installing a cluster in an existing VPC with AWS Local Zone subnets (SPLAT-636)
  • Phase 2 (OCPBU-351): Installer automates network resource creation on Local Zones based on the edge compute pool (SPLAT-657)
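For reference, the edge compute pool from phase 1 is expressed in install-config.yaml roughly as follows (a sketch; the zone name is an example Local Zone):

compute:
- name: edge
  platform:
    aws:
      zones:
      - us-east-1-nyc-1a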

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

 


USER STORY:


As a [type of user], I want [an action] so that [a benefit/a value].

DESCRIPTION:


Required:

...

Nice to have:

...

ACCEPTANCE CRITERIA:


ENGINEERING DETAILS:


Feature Overview

Testing is one of the main pillars of production-grade software. It helps validate and flag issues early on, before the code is shipped into production landscapes. Code changes, no matter how small, might lead to bugs and outages. The best way to catch bugs is to write proper tests, and to run those tests we need a foundation for a test infrastructure. Finally, to close the circle, automating these tests and their corresponding builds helps reduce errors and saves a lot of time.

Goal(s)

  • How do we get infrastructure, what infrastructure accounts are required?
  • Build e2e integration with openshift-release on AWS.
  • Define MVP CI jobs to validate (e.g., conformance). What tests are failing? Are we skipping any, and why?

Note: A sync with the Developer Productivity teams might be required to understand infra requirements, especially for our first HyperShift infrastructure backend, AWS.

Context:

This is a placeholder epic to capture all the e2e scenarios that we want to test in CI in the long term. Anything which is a TODO here should at minimum be validated by QE as it is developed.

DoD:

Every supported scenario is e2e CI tested.

Scenarios:

  • Hypershift deployment with services as routes.
  • Hypershift deployment with services as NodePorts.

 

DoD:

Refactor the E2E tests to follow the new pattern with a single HostedCluster and targeted NodePools:

  • nodepool_upgrade_test.go

 

Feature Overview

Allow users to interactively adjust the network configuration for a host after booting the agent ISO.

Goals

Configure network after host boots

The user has Static IPs, VLANs, and/or bonds to configure, but has no idea of the device names of the NICs. They don't enter any network config in agent-config.yaml. Instead they configure each host's network via the text console after it boots into the image.

Epic Goal

  • Allow users to interactively adjust the network configuration for a host after booting the agent ISO, before starting processes that pull container images.

Why is this important?

  • Configuring the network prior to booting a host is difficult and error-prone. Not only is the nmstate syntax fairly arcane, but the advent of 'predictable' interface names means that interfaces retain the same name across reboots but it is nearly impossible to predict what they will be. Applying configuration to the correct hosts requires correct knowledge and input of MAC addresses. All of these present opportunities for things to go wrong, and when they do the user is forced to return to the beginning of the process and generate a new ISO, then boot all of the hosts in the cluster with it again.
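For context, the up-front configuration being avoided here looks roughly like this in agent-config.yaml, where each host must be matched by a correctly entered MAC address (a sketch with illustrative values):

hosts:
- hostname: master-0
  interfaces:
  - name: eno1
    macAddress: 00:ef:44:21:e6:a5
  networkConfig:
    interfaces:
    - name: eno1
      type: ethernet
      state: up
      ipv4:
        enabled: true
        dhcp: false
        address:
        - ip: 192.168.111.80
          prefix-length: 23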

Scenarios

  1. The user has Static IPs, VLANs, and/or bonds to configure, but has no idea of the device names of the NICs. They don't enter any network config in agent-config.yaml. Instead they configure each host's network via the text console after it boots into the image.
  2. The user has Static IPs, VLANs, and/or bonds to configure, but makes an error entering the configuration in agent-config.yaml so that (at least) one host will not be able to pull container images from the release payload. They correct the configuration for that host via the text console before proceeding with the installation.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

When the UI is active in the console, event messages that are generated will distort the interface and make it difficult for the user to view the configuration and select options. An example is shown in the attached screenshot.

The openshift-install agent create image command will need to fetch the agent-tui executable so that it can be embedded within the agent ISO. For this reason the agent-tui must be available in the release payload, so that it can be retrieved even when the command is invoked in a disconnected environment.

Currently the agent-tui always displays the additional checks (nslookup/ping/HTTP GET), even when the primary check (pull image) passes. This may confuse the user, because the additional checks do not prevent the agent-tui from completing successfully; they are purely informative, to allow better troubleshooting of a failure (and so are not needed in the positive case).

The additional checks should therefore be shown only when the primary check fails for any reason.

When the agent-tui is shown during the initial host boot, if the pull release image check fails then an additional checks box is shown along with a details text view.
The content of the details view gets continuously updated with the details of the failed check, but the user cannot move the focus over the details box (using the arrow/tab keys), and thus cannot scroll its content (using the up/down arrow keys).

  1. Proposed title of this feature request:

Update ETCD datastore encryption to use AES-GCM instead of AES-CBC

2. What is the nature and description of the request?

The current ETCD datastore encryption solution uses the aes-cbc cipher. This cipher is now considered "weak" and is susceptible to a padding oracle attack. Upstream recommends using the AES-GCM cipher. AES-GCM will require automation to rotate secrets for every 200k writes.

The cipher used is hard-coded.

3. Why is this needed? (List the business requirements here).

Security-conscious customers will not accept the presence and use of weak ciphers in an OpenShift cluster. Continuing to use the AES-CBC cipher will create friction in sales and, for existing customers, may result in OpenShift being blocked from being deployed in production.

4. List any affected packages or components.

Epic Goal*

What is our purpose in implementing this?  What new capability will be available to customers?

The kube-apiserver is used to set the encryption of data stored in etcd. See https://docs.openshift.com/container-platform/4.11/security/encrypting-etcd.html

 

Today with OpenShift 4.11 or earlier, only aescbc is allowed as the encryption field type. 

 

RFE-3095 asks that aesgcm (which is an updated and more recent type) be supported. Furthermore, RFE-3338 asks for more customizability, which brings us to how we have implemented cipher customization with tlsSecurityProfile. See https://docs.openshift.com/container-platform/4.11/security/tls-security-profiles.html
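Once supported, opting in would presumably follow the same pattern used for aescbc today, for example (a sketch; aesgcm as an accepted value is exactly what this epic adds):

$ oc patch apiserver cluster --type merge -p '{"spec":{"encryption":{"type":"aesgcm"}}}'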

 

 
Why is this important? (mandatory)

AES-CBC is considered a weak cipher.

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

AES-GCM encryption was enabled in cluster-openshift-apiserver-operator and cluster-openshift-authentication-operator, but not in cluster-kube-apiserver-operator. When trying to enable aesgcm encryption in the apiserver config, the kube-apiserver operator will produce an error saying that the aesgcm provider is not supported.

Epic Goal

  • Enable the migration from a storage intree driver to a CSI based driver with minimal impact to the end user, applications and cluster
  • These migrations would include, but are not limited to:
    • CSI driver for Azure (file and disk)
    • CSI driver for VMware vSphere

Why is this important?

  • OpenShift needs to maintain its ability to enable PVCs and PVs of the main storage types
  • CSI migration is getting close to GA; we need to have the feature fully tested and enabled in OpenShift
  • Upstream in-tree drivers are being deprecated to make way for the CSI drivers prior to in-tree driver removal

Scenarios

  1. User-initiated move from in-tree to CSI driver
  2. Upgrade-initiated move from in-tree to CSI driver
  3. Upgrade from EUS to EUS

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal*

Kubernetes upstream has chosen to allow users to opt out from CSI volume migration in Kubernetes 1.26 (1.27 PR, 1.26 backport). It is still GA there, but opt-out is allowed due to non-trivial risk with late CSI driver availability.

We want a similar capability in OCP: a cluster admin should be able to opt in to CSI migration on vSphere in 4.13. Once they opt in, they can't opt out (at least in this epic).

Why is this important? (mandatory)

See the internal OCP doc discussing whether and how we should allow a similar opt-in/opt-out in OCP.

 
Scenarios (mandatory) 

Upgrade

  1. Admin upgrades 4.12 -> 4.13 as usual.
  2. Storage CR has CSI migration disabled (or nil); the in-tree volume plugin handles in-tree PVs.
  3. At the same time, external CCM runs; however, due to kubelet running with --cloud-provider=vsphere, it does not do kubelet's job.
  4. Admin can opt in to CSI migration by editing the Storage CR. That enables the OPENSHIFT_DO_VSPHERE_MIGRATION env. var. everywhere and runs kubelet with --cloud-provider=external.
    1. If we have time, it should not be hard to opt out: just remove the env. var. and update the kubelet cmdline. The in-tree volume plugin will handle in-tree PVs again; not sure about implications for external CCM.
  5. Once opted in, it's not possible to opt out.
  6. Both with opt-in and without it, the cluster is Upgradeable=true. Admin can upgrade to 4.14, where CSI migration will be forced.
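As a sketch of the opt-in UX (the field name vsphereStorageDriver and its value are assumptions pending the final API review), opting in would be an edit of the Storage CR, for example:

$ oc patch storage cluster --type merge -p '{"spec":{"vsphereStorageDriver":"CSIWithMigrationDriver"}}'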

 

New install

  1. Admin installs a new 4.13 vSphere cluster, with UPI, IPI, Assisted Installer, or Agent-based Installer.
  2. During installation, the Storage CR is created with CSI migration enabled.
  3. (We want to have it enabled for a new cluster to enable external CCM and have zonal support. This avoids new clusters having in-tree as the default and then having to go through migration later.)
  4. The resulting cluster has the OPENSHIFT_DO_VSPHERE_MIGRATION env. var. set, kubelet with --cloud-provider=external, and topology support.
  5. Admin cannot opt out after installation; we expect that they use CSI volumes for everything.
    1. If the admin really wants, they can opt out before installation by adding a Storage install manifest with CSI migration disabled.

 

EUS to EUS (4.12 -> 4.14)

  • Will have CSI migration enabled once in 4.14
  • During the upgrade, a cluster will have 4.13 masters with CSI migration disabled (see regular upgrade to 4.13 above) + 4.12 kubelets.
  • Once the masters are 4.14, CSI migration is force-enabled there, still, 4.14 KCM + in-tree volume plugin in it will handle in-tree volume attachments required by kubelets that still have 4.12 (that’s what kcm --external-cloud-volume-plugin=vsphere does).
  • Once both masters + kubelets are 4.14, CSI migration is force enabled everywhere, in-tree volume plugin + cloud provider in KCM is still enabled by --external-cloud-volume-plugin, but it’s not used.
  • Keep the in-tree storage class as the default
  • A CSI storage class has already been available since 4.10
  • Recommend switching the default to CSI
  • Can't opt out from migration

Dependencies (internal and external) (mandatory)

  • We need a new FeatureSet in openshift/api that disables the CSIMigrationvSphere feature gate.
  • We need kube-apiserver-operator, kube-controller-manager-operator, kube-scheduler-operator, and MCO to reconfigure their operands to use the in-tree vSphere cloud provider when they see the CSIMigrationvSphere FeatureGate disabled.
  • We need the cloud controller manager operator to disable its operand when it sees the CSIMigrationvSphere FeatureGate disabled.

Contributing Teams(and contacts) (mandatory) 

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

When CSIMigrationvSphere is disabled, cluster-storage-operator must re-create the in-tree StorageClass.

vmware-vsphere-csi-driver-operator's StorageClass must not be marked as the default there (IMO we already have code for that).

This also means we need to fix the Disable SC e2e test to ignore StorageClasses for the in-tree driver. Otherwise we will reintroduce OCPBUGS-7623.
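For reference, which class is the default is controlled by the standard annotation, so the re-created in-tree class would look roughly like this (illustrative; "thin" is the usual in-tree vSphere class name):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: thin
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/vsphere-volume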

Feature Overview

RHEL CoreOS should be updated to RHEL 9.2 sources to take advantage of newer features, hardware support, and performance improvements.

 

Requirements

  • RHEL 9.x sources for RHCOS builds starting with OCP 4.13 and RHEL 9.2.

 

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

(Optional) Use Cases

  • 9.2 Preview via Layering: no longer necessary, assuming we stay the course of going all in on 9.2

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

This is the Epic to track the work to add RHCOS 9 in OCP 4.13 and to make OCP use it by default.

 

CURRENT STATUS: Landed in 4.14 and 4.13

 

Testing with layering

 

Another option, given an existing (e.g. 4.12) cluster, is to use layering. First, get a digested pull spec for the current build:

$ skopeo inspect --format "{{.Name}}@{{.Digest}}" -n docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev:4.13-9.2
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b4cc3995d5fc11e3b22140d8f2f91f78834e86a210325cbf0525a62725f8e099

Create a MachineConfig that looks like this:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: worker-override
spec:
  osImageURL: <digested pull spec>

If you want to also override the control plane, create a similar one for the master role.
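Once created, apply the MachineConfig and watch the pool roll out (the filename here is illustrative):

$ oc apply -f worker-override.yaml
$ oc get machineconfigpool worker -w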
 
We don't yet have auto-generated release images. However, if you want one, you can ask cluster bot to e.g. "launch https://github.com/openshift/machine-config-operator/pull/3485" with options you want (e.g. "azure" etc.) or just "build https://github.com/openshift/machine-config-operator/pull/3485" to get a release image.

STATUS:  Code is merged for 4.13 and is believed to largely solve the problem.

 


 

Description of problem:

Upgrades from OpenShift 4.12 to 4.13 will also upgrade the underlying RHCOS from 8.6 to 9.2. As part of that, the names of the network interfaces may change. For example `eno1` may be renamed to `eno1np0`. If a host is using NetworkManager configuration files that rely on those names then the host will fail to connect to the network when it boots after the upgrade. For example, if the host had static IP addresses assigned it will instead boot using IP addresses assigned via DHCP.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always.

Steps to Reproduce:

1. Select hardware (or VMs) that will have different network interface names in RHCOS 8 and RHCOS 9, for example `eno1` in RHCOS 8 and `eno1np0` in RHCOS 9.

2. Install a 4.12 cluster with static network configuration using the `interface-name` field of NetworkManager interface configuration files to match the configuration to the network interface.

3. Upgrade the cluster to 4.13.

Actual results:

The NetworkManager configuration files are ignored because they no longer match the NIC names. Instead the NICs get new IP addresses from DHCP.

Expected results:

The NetworkManager configuration files are updated as part of the upgrade to use the new NIC names.

Additional info:

Note this is a hypothetical scenario. We have detected this potential problem in a slightly different scenario where we install a 4.13 cluster with the assisted installer. During the discovery phase we use RHCOS 8 and we generate the NetworkManager configuration files. Then we reboot into RHCOS 9, and the configuration files are ignored due to the change in the NICs. See MGMT-13970 for more details.
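For illustration, a keyfile pinned to the old interface name stops matching after the rename, while matching on the MAC address instead survives it (a sketch with illustrative values):

# /etc/NetworkManager/system-connections/static.nmconnection
[connection]
id=static
type=ethernet
# Fragile: breaks when eno1 is renamed to eno1np0 on RHCOS 9
interface-name=eno1

[ethernet]
# More robust alternative: match the NIC by MAC address instead
mac-address=00:EF:44:21:E6:A5

[ipv4]
method=manual
address1=192.168.111.80/23,192.168.111.1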

Epic Goal

  • The Kernel API was updated for RHEL 9, so the old approach of setting the `sched_domain` in `/sys/kernel` is no longer available. Instead, cgroups have to be worked with directly.
  • Both CRI-O and PAO need to be updated to set the cpuset of containers and other processes correctly, as well as set the correct value for sched_load_balance
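On cgroup v1, the mechanism involved is the per-cgroup cpuset flag, along the lines of the following (a sketch of the kernel interface only, not the final CRI-O/PAO implementation; the cgroup path is a placeholder):

$ # Disable load balancing for an exclusive cpuset cgroup (cgroup v1)
$ echo 0 > /sys/fs/cgroup/cpuset/kubepods.slice/<pod-cgroup>/cpuset.sched_load_balance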

Why is this important?

  • CPU load balancing is a vital piece of real time execution for processes that need exclusive access to a CPU. Without this, CPU load balancing won't work on RHEL 9 with OpenShift 4.13

Scenarios

  1. As a developer on Openshift, I expect my pods to run with exclusive CPUs if I set the PAO configuration correctly

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Part of setting up CPU load balancing on RHEL 9 involves disabling sched_load_balance on cgroups that contain a cpuset that should be exclusive. The PAO may need to be responsible for this piece.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.


Epic Goal

  • Create a new platform type, working name "External", that will signify when a cluster is deployed on a partner infrastructure where core cluster components have been replaced by the partner. “External” is different from our current platform types in that it will signal that the infrastructure is specifically not “None” or any of the known providers (eg AWS, GCP, etc). This will allow infrastructure partners to clearly designate when their OpenShift deployments contain components that replace the core Red Hat components.

This work will require updates to the core OpenShift API repository to add the new platform type, and then a distribution of this change to all components that use the platform type information. For components that partners might replace, per-component action will need to be taken, with the project team's guidance, to ensure that the component properly handles the "External" platform. These changes will look slightly different for each component.

To integrate these changes more easily into OpenShift, it is possible to take a multi-phase approach which could be spread over a release boundary (eg phase 1 is done in 4.X, phase 2 is done in 4.X+1).

Phase 1

  • Write platform “External” enhancement.
  • Evaluate changes to cluster capability annotations to ensure coverage for all replaceable components.
  • Meet with component teams to plan specific changes that will allow for supplement or replacement under platform "External".

Phase 2

  • Update OpenShift API with new platform and ensure all components have updated dependencies.
  • Update capabilities API to include coverage for all replaceable components.
  • Ensure all Red Hat operators tolerate the "External" platform and treat it the same as "None" platform.

Phase 3

  • Update components based on identified changes from phase 1
    • Update Machine API operator to run core controllers in platform "External" mode.

Why is this important?

  • As partners begin to supplement OpenShift's core functionality with their own platform specific components, having a way to recognize clusters that are in this state helps Red Hat created components to know when they should expect their functionality to be replaced or supplemented. Adding a new platform type is a significant data point that will allow Red Hat components to understand the cluster configuration and make any specific adjustments to their operation while a partner's component may be performing a similar duty.
  • The new platform type also helps with support to give a clear signal that a cluster has modifications to its core components that might require additional interaction with the partner instead of Red Hat. When combined with the cluster capabilities configuration, the platform "External" can be used to positively identify when a cluster is being supplemented by a partner, and which components are being supplemented or replaced.

Scenarios

  1. A partner wishes to replace the Machine controller with a custom version that they have written for their infrastructure. Setting the platform to "External" and advertising the Machine API capability gives a clear signal to the Red Hat created Machine API components that they should start the infrastructure generic controllers but not start a Machine controller.
  2. A partner wishes to add their own Cloud Controller Manager (CCM) written for their infrastructure. Setting the platform to "External" and advertising the CCM capability gives a clear signal to the Red Hat created CCM operator that the cluster should be configured for an external CCM that will be managed outside the operator. Although the Red Hat operator will not provide this functionality, it will configure the cluster to expect a CCM.

Acceptance Criteria

Phase 1

  • Partners can read "External" platform enhancement and plan for their platform integrations.
  • Teams can view jira cards for component changes and capability updates and plan their work as appropriate.

Phase 2

  • Components running in cluster can detect the “External” platform through the Infrastructure config API
  • Components running in cluster react to “External” platform as if it is “None” platform
  • Partners can disable any of the platform specific components through the capabilities API

Phase 3

  • Components running in cluster react to the “External” platform based on their function.
    • for example, the Machine API Operator needs to run a set of controllers that are platform agnostic when running in platform “External” mode.
    • the specific component reactions are difficult to predict currently, this criteria could change based on the output of phase 1.

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. Identifying OpenShift Components for Install Flexibility

Open questions::

  1. Phase 1 requires talking with several component teams, the specific action that will be needed will depend on the needs of the specific component. At the least the components need to treat platform "External" as "None", but there could be more changes depending on the component (eg Machine API Operator running non-platform specific controllers).

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal

  • As defined in the parent feature (OCPBU-5), this epic is about adding the new "External" platform type and ensuring that the OpenShift operators which react to platform types treat the "External" platform as if it were a "None" platform.
  • Add an end-to-end test to exercise the "External" platform type

Why is this important?

  • This work lays the foundation for partners and users to customize OpenShift installations that might replace infrastructure level components.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Background

As described in the external platform enhancement, the cluster-cloud-controller-manager-operator should be modified to react to the external platform type in the same manner as platform none.

Steps

  • add an extra clause to the platform switch that will group "External" with "None"
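A minimal, self-contained sketch of the kind of clause being added (the type and function names here are illustrative, not the actual operator source):

package main

import "fmt"

// platformType mirrors the openshift/api infrastructure platform type strings.
type platformType string

const (
	nonePlatform     platformType = "None"
	externalPlatform platformType = "External"
	awsPlatform      platformType = "AWS"
)

// managesCloudControllers groups "External" with "None": neither platform
// gets cloud controllers managed by the operator.
func managesCloudControllers(p platformType) bool {
	switch p {
	case nonePlatform, externalPlatform:
		return false
	default:
		return true
	}
}

func main() {
	fmt.Println(managesCloudControllers(externalPlatform)) // false
}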

Stakeholders

  • openshift eng

Definition of Done

  • CCCMO behaves as if platform None when External is selected
  • Docs
  • developer docs for CCCMO should be updated
  • Testing

Background

As described in the external platform enhancement, the machine-api-operator should be modified to react to the external platform type in the same manner as platform none.

Steps

  • add an extra clause to the platform switch that will group "External" with "None"

Stakeholders

  • openshift eng

Definition of Done

  • MAO behaves as if platform None when External is selected
  • Docs
  • developer docs for MAO should be updated
  • Testing
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Create a warning-severity alert to notify the admin that packet loss is occurring due to failed OVS vswitchd lookups. This may occur if vswitchd is CPU-constrained and there are also numerous lookups.

Use the metric ovs_vswitchd_netlink_overflow, which shows netlink messages dropped by the vswitchd daemon due to buffer overflow in userspace.

For the kernel equivalent, use the metric ovs_vswitchd_dp_flows_lookup_lost. Both metrics usually have the same value but may differ if vswitchd restarts.

Both these metrics should be aggregated into a single alert that fires if the value has increased recently.
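A sketch of the kind of rule this implies (the alert name, namespace, threshold, and windows are placeholders to be settled during implementation):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ovs-flow-lookup-loss
  namespace: openshift-ovn-kubernetes
spec:
  groups:
  - name: ovs.rules
    rules:
    - alert: OVSPacketLossFromLookupFailures
      expr: |
        increase(ovs_vswitchd_netlink_overflow[5m])
          + increase(ovs_vswitchd_dp_flows_lookup_lost[5m]) > 0
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: OVS vswitchd is dropping packets due to failed flow lookups.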

 

DoD: QE test case, code merged to CNO, metrics document updated ( https://docs.google.com/document/d/1lItYV0tTt5-ivX77izb1KuzN9S8-7YgO9ndlhATaVUg/edit )


Feature Overview

Extend the Workload Partitioning feature to support multi-node clusters.

Goals

Customers running RAN workloads on C-RAN hubs (i.e. multi-node clusters) that want to maximize the cores available to the workloads (DU) should be able to utilize Workload Partitioning to isolate control plane processes to reserved cores.
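On single-node deployments today, the reserved/isolated CPU split is driven by a PerformanceProfile, which this feature would extend to multi-node clusters, for example (illustrative CPU sets and node selector):

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: cran-hub
spec:
  cpu:
    reserved: "0-3"    # control plane and platform processes pinned here
    isolated: "4-63"   # left for DU workloads
  nodeSelector:
    node-role.kubernetes.io/worker: ""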

Requirements

A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts.  If a non MVP requirement slips, it does not shift the feature.

Requirement | Notes | isMvp?

 

Describe Use Cases (if needed)

< How will the user interact with this feature? >

< Which users will use this and when will they use it? >

< Is this feature used as part of current user interface? >

Out of Scope

 

Background, and strategic fit

< What does the person writing code, testing, documenting need to know? >

Assumptions

< Are there assumptions being made regarding prerequisites and dependencies?>

< Are there assumptions about hardware, software or people resources?>

Customer Considerations

< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>

< Are there Upgrade considerations that customers need to account for or that the feature should address on behalf of the customer?>

<Does the Feature introduce data that could be gathered and used for Insights purposes?>

Documentation Considerations

< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >

< What does success look like?>

< Does this feature have doc impact?  Possible values are: New Content, Updates to existing content,  Release Note, or No Doc Impact>

< If unsure and no Technical Writer is available, please contact Content Strategy. If yes, complete the following.>

  • <What concepts do customers need to understand to be successful in [action]?>
  • <How do we expect customers will use the feature? For what purpose(s)?>
  • <What reference material might a customer want/need to complete [action]?>
  • <Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available. >
  • <What is the doc impact (New Content, Updates to existing content, or Release Note)?>

Interoperability Considerations

< Which other products and versions in our portfolio does this feature impact?>

< What interoperability test scenarios should be factored by the layered product(s)?>

Questions

Question | Outcome

 

 

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Create generic validation tests in the Origin and Release repos to check that a cluster is correctly configured. E2E tests running in a CPU-partitioned cluster should run successfully.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Incomplete Features

When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release.

Problem: It takes more than 2 months to get new data enhancements into insights-operator and released to customers.

Goal:

  • Speed up new releases of insights-operator. Use the existing release process and use the fast channel for non-security updates to deliver new enhancements.
  • Measure how fast we are releasing new enhancements and how fast we are getting a significant portion of the customer base to update and start sharing new data. https://issues.redhat.com/browse/CCXDEV-3095
  • Nail down enhancements, ensuring we deliver high-impact enhancements first: https://issues.redhat.com/secure/RapidBoard.jspa?rapidView=5668
  • Update release notes and share updates with teams working on health checks.
  • Notify in nomination cards that new data enhancements have been added to the Insights Operator and released.

 

 

 

 

Description of problem:

We've got a rule nomination that needs the Virtual Machine objects - see INSIGHTOCP-1074.

The rule will apply to all connected clusters.
The data is meant to be used for future rules and as internal data that we can use to analyze our users' behavior with our operator.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Create a cluster
2. Install openshift virtualization operator
3. Create a VM

Actual results:

 

Expected results:

The VMI resources should be gathered into the Insights Operator archive.

Additional info:

 

tldr: three basic claims, the rest is explanation and one example

  1. We cannot improve long term maintainability solely by fixing bugs.
  2. Teams should be asked to produce designs for improving maintainability/debugability.
  3. Specific maintenance items (or investigation of maintenance items), should be placed into planning as peer to PM requests and explicitly prioritized against them.

While bugs are an important metric, fixing bugs is different than investing in maintainability and debugability. Investing in fixing bugs will help alleviate immediate problems, but doesn't improve the ability to address future problems. You (may) get a code base with fewer bugs, but when you add a new feature, it will still be hard to debug problems and interactions. This pushes a code base towards stagnation where it gets harder and harder to add features.

One alternative is to ask teams to produce ideas for how they would improve future maintainability and debugability instead of focusing on immediate bugs. This would produce designs that make problem determination, bug resolution, and future feature additions faster over time.

I have a concrete example of one such outcome of focusing on bugs vs quality. We have resolved many bugs about communication failures with ingress by finding problems with point-to-point network communication. We have fixed the individual bugs, but have not improved the code for future debugging. In so doing, we chase many hard-to-diagnose problems across the stack. The alternative is to create a point-to-point network connectivity capability. This would immediately improve bug resolution and stability (detection) for kuryr, ovs, legacy sdn, network-edge, kube-apiserver, openshift-apiserver, authentication, and console. Bug fixing does not produce the same impact.

We need more investment in our future selves. Saying, "teams should reserve this" doesn't seem to be universally effective. Perhaps an approach that directly asks for designs and impacts and then follows up by placing the items directly in planning and prioritizing against PM feature requests would give teams the confidence to invest in these areas and give broad exposure to systemic problems.


Relevant links:

OCP/Telco Definition of Done
Epic Template descriptions and documentation.


Epic Goal

  • To refactor various unit tests in cluster-ingress-operator to align with the desired unit test standards. The unit tests need various cleanups to meet the standards of the network edge team, such as (see the sketch below):
    • Using t.Run in all unit tests for sub-test capabilities
    • Removing extraneous test cases
    • Fixing incorrect error messages
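The target pattern is the standard Go table-driven test with t.Run sub-tests, for example (illustrative code, not taken from the operator):

package operator

import "testing"

func add(a, b int) int { return a + b }

// TestAdd demonstrates the t.Run sub-test pattern the cleanup standardizes on.
func TestAdd(t *testing.T) {
	tests := []struct {
		name string
		a, b int
		want int
	}{
		{name: "zero values", a: 0, b: 0, want: 0},
		{name: "positive values", a: 2, b: 3, want: 5},
	}
	for _, tc := range tests {
		t.Run(tc.name, func(t *testing.T) {
			if got := add(tc.a, tc.b); got != tc.want {
				t.Errorf("add(%d, %d) = %d, want %d", tc.a, tc.b, got, tc.want)
			}
		})
	}
}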

Why is this important?

  • Maintaining standards in unit tests is important for the debuggability of our code

Scenarios

  1. ...

Acceptance Criteria

  • Unit tests generally meet our software standards

Dependencies (internal and external)

  1.  

Previous Work (Optional):

  1. For shift week, Miciah provided a handful of commits (https://github.com/Miciah/cluster-ingress-operator/commits/gateway-api) that were the motivation for creating this epic.

Open questions::

  1. N/A

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Test_desiredHttpErrorCodeConfigMap contains a section with dead code in the check for expect == nil || actual == nil. Clean this up.

Also replace the Ruby-style #{} string interpolation syntax with Go format verbs.

OCP/Telco Definition of Done

Epic Template descriptions and documentation.

Epic Goal

Why is this important?

Drawbacks

  • N/A

Scenarios

  • CI Testing

Acceptance Criteria

  • CI - MUST be running successfully with tests automated

Dependencies (internal and external)

  1. SDN Team

Previous Work (Optional):

  1. N/A

Open questions::

  1. N/A

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub
    Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub
    Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

 

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Complete during New status.

 

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

 

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

 

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

 

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

 

Background

Provide any additional context that is needed to frame the feature.  Initial completion during Refinement status.

 

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

 

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  Initial completion during Refinement status.

 

Interoperability Considerations

Which other projects and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

Design Doc:

https://docs.google.com/document/d/1m6OYdz696vg1v8591v0Ao0_r_iqgsWjjM2UjcR_tIrM/

Problem:

Goal

As a developer, I want to be able to test my serverless function after it's been deployed.

Why is it important?

Use cases:

  1. As a developer, I want to test my serverless function 

Acceptance criteria:

  1. This feature needs to work in ACM (multi-cluster environments where the console is run on the hub cluster)

Dependencies (External/Internal):

Please add a spike to see if there are dependencies.

Design Artifacts:

Exploration:

Developers can use the kn func invoke CLI to accomplish this. According to Naina, there is an API, but it's in Go.
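For reference, the CLI flow looks roughly like the following (the exact flag set may vary by func version; values are illustrative):

$ # Invoke the deployed function with a CloudEvent payload
$ kn func invoke --target remote \
    --format cloudevent \
    --content-type application/json \
    --data '{"message": "test"}'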

Note:

Description

As a user, I want to invoke a Serverless function from the developer console. This action should be available as a page and as a modal.

This story is to evaluate a good UI for this and check this with our PM (Serena) and the Serverless team (Naina and Lance).

Acceptance Criteria

  1. Add a new page with title "Invoke Serverless function {function-name}" and should be available via a new URL (/serverless/ns/:ns/invoke-function/:function-name/).
  2. Implement a form with Formik to "invoke" (console.log for now) Serverless functions, without implementing the network call yet. Focus on the UI to get feedback as early as possible. Use reusable, well-named components anyway.
  3. The page should be also available as a modal. Add a new action to all Serverless Services with the label (tbd) to open this modal from the Topology graph or from the Serverless Service list view.
  4. The page should have two tabs or two panes for the request and response. Each of these tabs/panes should again have two tabs, "similar" to the browser network inspector. See below for what we know currently.
  5. Get confirmation from Christoph, Serena, Naina, and Lance.
  6. Disable the action until we implement the network communication in ODC-7275 with the serverless function.
  7. No e2e tests are needed for this story.

Additional Details:

Information the form should show:

  1. Request tab shows "Body" and "Options" tab
    1. Body is just a full size editor. We should reuse our code editor.
    2. Options contains:
      1. Auto complete text field “Content type” with placeholder “application/json”, that will be used when nothing is entered
      2. Dropdown “Format” with values “cloudevent” (default) and “http”
      3. Text field “Type” with placeholder text “boson.fn”, that will be used when nothing is entered
      4. Text field “Source” with placeholder “/boson/fn”, that will be used when nothing is entered
  2. Response tab shows Body and Info tab
    1. Body is a full size editor that shows the response. We should format a JSON string with JSON.stringify(data, null, 2)
    2. Info contains:
      1. Id (id)
      2. Type (type)
      3. Source (source)
      4. Time (time) (formatted)
      5. Content-Type: (datacontenttype)

Description

The current YAMLEditor also supports other languages, like JSON, so the component needs to be renamed.

Acceptance Criteria

  1. Rename all instances of YAMLEditor to CodeEditor

Additional Details:

Feature Overview  

Much like core OpenShift operators, a standardized flow exists for OLM-managed operators to interact with the cluster in a specific way to leverage AWS STS authorization when using AWS APIs, as opposed to insecure static, long-lived credentials. OLM-managed operators can implement integration with the CloudCredentialOperator in a well-defined way to support this flow.

Goals:

Enable customers to easily leverage OpenShift's capabilities around AWS STS with layered products, for an increased security posture. Enable OLM-managed operators to implement support for this in a well-defined pattern.

Requirements:

  • CCO gets a new mode in which it can reconcile STS credential request for OLM-managed operators
  • A standardized flow is leveraged to guide users in discovering and preparing their AWS IAM policies and roles with permissions that are required for OLM-managed operators 
  • A standardized flow is defined in which users can configure OLM-managed operators to leverage AWS STS
  • An example operator is used to demonstrate the end2end functionality
  • Clear instructions and documentation for operator development teams to implement the required interaction with the CloudCredentialOperator to support this flow
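For context, in today's manual STS flow the credentials ultimately land in a Secret containing an AWS shared-config file that points at the IAM role and the projected service account token, roughly like the following (names and namespace are illustrative):

apiVersion: v1
kind: Secret
metadata:
  name: operator-aws-creds
  namespace: example-operator-ns
stringData:
  credentials: |
    [default]
    role_arn = arn:aws:iam::123456789012:role/example-operator-role
    web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token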

Use Cases:

See Operators & STS slide deck.

 

Out of Scope:

  • handling OLM-managed operator updates in which AWS IAM permission requirements might change from one version to another (which requires user awareness and intervention)

 

Background:

The CloudCredentialsOperator already provides a powerful API for OpenShift's core cluster operators to request credentials and acquire them via short-lived tokens. This capability should be expanded to OLM-managed operators, specifically to Red Hat layered products that interact with AWS APIs. The process today ranges from cumbersome to nonexistent depending on the operator in question, and is seen as an adoption blocker for OpenShift on AWS.

 

Customer Considerations

This is particularly important for ROSA customers. Customers are expected to be asked to pre-create the required IAM roles outside of OpenShift, which is deemed acceptable.

Documentation Considerations

  • Internal documentation needs to exists to guide Red Hat operator developer teams on the requirements and proposed implementation of integration with CCO and the proposed flow
  • External documentation needs to exist to guide users on:
    • how to become aware that the cluster is in STS mode
    • how to become aware of operators that support STS and the proposed CCO flow
    • how to become aware of the IAM permissions requirements of these operators
    • how to configure an operator in the proposed flow to interact with CCO

Interoperability Considerations

  • this needs to work with ROSA
  • this needs to work with self-managed OCP on AWS

Market Problem

This Section: High-Level description of the Market Problem ie: Executive Summary

  • As a customer of OpenShift layered products, I need to be able to fluidly, reliably and consistently install and use OpenShift layered product Kubernetes Operators in my ROSA STS clusters, while keeping an STS workflow throughout.
  • As a customer of OpenShift on the big cloud providers, overall I expect OpenShift as a platform to function equally well with tokenized cloud auth as it does with "mint-mode" IAM credentials. I expect the same from the Kubernetes Operators under the Red Hat brand (that need to reach cloud APIs), in that tokenized workflows are equally integrated and workable as with "mint-mode" IAM credentials.
  • As the managed services teams, including the HyperShift teams, offering a downstream opinionated, supported and managed lifecycle of OpenShift (in the forms of ROSA, ARO, OSD on GCP, HyperShift, etc.), the OpenShift platform should have as close to native integration as possible with core platform operators when clusters use tokenized cloud auth, driving the use of layered products.
  • As the HyperShift team, where the only credential mode for clusters/customers is STS (on AWS), the Red Hat branded Operators that must reach the AWS API should be enabled to work with STS credentials in a consistent and automated fashion that allows customers to use those operators as easily as possible, driving the use of layered products.

Why it Matters

  • Adding consistent, automated layered product integrations to OpenShift would provide great added value to OpenShift as a platform, and its downstream offerings in Managed Cloud Services and related offerings.
  • Enabling Kubernetes Operators (at first, Red Hat ones) on OpenShift for the "big 3" cloud providers is a key differentiation and security requirement that our customers have been and continue to demand.
  • HyperShift is an STS-only architecture, which means that if our layered offerings via Operators cannot easily work with STS, then it would be blocking us from our broad product adoption goals.

Illustrative User Stories or Scenarios

  1. Main success scenario - high-level user story
    1. customer creates a ROSA STS or Hypershift cluster (AWS)
    2. customer wants basic (table-stakes) features such as AWS EFS or RHODS or Logging
    3. customer sees necessary tasks for preparing for the operator in OperatorHub from their cluster
    4. customer prepares AWS IAM/STS roles/policies in anticipation of the Operator they want, using what they get from OperatorHub
    5. customer provides a very minimal set of parameters (AWS ARN of role(s) with policy) to the Operator's OperatorHub page
    6. The cluster can automatically set up the Operator, using the provided tokenized credentials, and the Operator functions as expected
    7. Cluster and Operator upgrades are taken into account and automated
    8. The above steps 1-7 should apply similarly for Google Cloud and Microsoft Azure Cloud, with their respective token-based workload identity systems.
  2. Alternate flow/scenarios - high-level user stories
    1. The same as above, but the ROSA CLI would assist with AWS role/policy management
    2. The same as above, but the oc CLI would assist with cloud role/policy management (per respective cloud provider for the cluster)
  3. ...

Expected Outcomes

This Section: Articulates and defines the value proposition from a user's point of view

  • See SDE-1868 as an example of what is needed, including design proposed, for current-day ROSA STS and by extension Hypershift.
  • Further research is required to accommodate the AWS STS equivalent systems of GCP and Azure
  • Order of priority at this time is
    • 1. AWS STS for ROSA and ROSA via HyperShift
    • 2. Microsoft Azure for ARO
    • 3. Google Cloud for OpenShift Dedicated on GCP

Effect

This Section: Effect is the expected outcome within the market. There are two dimensions of outcomes; growth or retention. This represents part of the “why” statement for a feature.

  • Growth is the acquisition of net new usage of the platform. This can be new workloads not previously able to be supported, new markets not previously considered, or new end users not previously served.
  • Retention is maintaining and expanding existing use of the platform. This can be more effective use of tools, competitive pressures, and ease of use improvements.
  • Both growth and retention are expected effects of this effort.
    • Customers have strict requirements around using only token-based cloud credential systems for workloads in their cloud accounts, which include OpenShift clusters in all forms.
      • We gain new customers both from those that have waited for token-based authentication/authorization from OpenShift and from those that are new to OpenShift with strict requirements around cloud account access
      • We retain customers that are going through both cloud-native and hybrid-cloud journeys, which all inevitably see security requirements driving them towards token-based authentication/authorization.

References

As an engineer, I want the capability to implement CI test cases that run at different intervals (daily, weekly, etc.) so that downstream operators that depend on certain capabilities are not negatively impacted when the systems CCO interacts with change behavior.

Acceptance Criteria:

Create a stubbed-out e2e test path in CCO and matching e2e calling code in release, such that there is a path to tests that verify a working AWS STS workflow.

Goals

Track goals/requirements for self-managed GA of Hosted control planes on AWS using the AWS Provider.

  • AWS flow via the AWS provider is documented. 
    • Make sure the documentation with HyperShiftDeployment is removed.
    • Make sure the documentation uses the new flow without HyperShiftDeployment 
  • HyperShift has a UI wizard with ACM/MCE for AWS. 

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

Questions to answer…

  • ...

Out of Scope

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

Overview

Today, the upstream and more complete documentation of HyperShift lives at https://hypershift-docs.netlify.app/.

However, product documentation today lives under https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.6/html/multicluster_engine/multicluster_engine_overview#hosted-control-planes-intro

Goal

The goal of this Epic is to extract important docs and establish parity between what is documented and possible upstream and the product documentation.

 

Multiple consumers have not realized that a newer version of a CPO (spec.release) is not guaranteed to work with an older HO.

This is stated here https://hypershift-docs.netlify.app/reference/versioning-support/

but empirical evidence, such as the OCM integration, tells us this is not enough.

We already deploy a ConfigMap in the HO namespace with the supported HC versions.

Additionally, we can add an image label with the latest HC version supported by the operator so you can quickly check it with docker inspect...
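
For example (the label name here is hypothetical, not a settled interface):

docker inspect --format '{{ index .Config.Labels "io.openshift.hypershift.supported-versions" }}' <operator-image>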

Feature Overview

Console enhancements based on customer RFEs that improve customer user experience.

 

Goals

  • This Section: Provide a high-level goal statement, providing user context and expected user outcome(s) for this feature

 

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.

 

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

 

(Optional) Use Cases

This Section: 

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

 

Questions to answer…

  • ...

 

Out of Scope

 

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

 

Assumptions

  • ...

 

Customer Considerations

  • ...

 

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?  
  • New Content, Updates to existing content,  Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

For the console, we would like to have a way for customers to send direct feedback about features like multi cluster.

Acceptance criteria:

  • integrate the pf feedback extension into the console
  • add the ui for rendering the form / launching the feedback url
  • add e2e testing for the customer interaction with the feedback mechanism
  • show/hide the launch mechanism where appropriate (need more info here on this topic)

Testing instructions:

  • Click the help button in the toolbar.
  • From the help button dropdown, click Share Feedback. Previously this was Report a bug, but it has now been replaced with Share Feedback.
  • The Share Feedback modal should appear.
  • Click on the share feedback link; a new tab will appear where you can share feedback.
  • Click on the open a support case link; a new tab will appear where you can report a bug.
  • Click on the inform the direction of Red Hat link; a new tab will appear where you can enter your information to join the Red Hat mailing list.
  • Click Cancel and the modal should close.

oc-mirror is a GA product as of OpenShift 4.11.

The goal of this feature is to address future customer requests for new features or capabilities in oc-mirror.

Overview

This epic is a simple tracker epic for the proposed work and analysis for 4.14 delivery

Description of problem:

The customer was able to limit the nested repository path with "oc adm catalog mirror" by using the argument "--max-components", but there is no equivalent option for the "oc-mirror" binary, even though we recommend using "oc-mirror" for mirroring. For example:
Mirroring will work if we mirror like below
oc mirror --config=./imageset-config.yaml docker://registry.gitlab.com/xxx/yyy
Mirroring will fail with 401 unauthorized if we add one more nested path like below
oc mirror --config=./imageset-config.yaml docker://registry.gitlab.com/xxx/yyy/zzz

Version-Release number of selected component (if applicable):

 

How reproducible:

We can reproduce the issue by using a registry that does not support deep nested paths

Steps to Reproduce:

1. Create an imageset to mirror any operator:

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig:
  local:
    path: ./oc-mirror-metadata
mirror:
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.12
    packages:
    - name: local-storage-operator
      channels:
      - name: stable

2. Do the mirroring to a registry that does not support deep nested repository paths. Here it is GitLab, which does not support nesting beyond 3 levels deep.

oc mirror --config=./imageset-config.yaml docker://registry.gitlab.com/xxx/yyy/zzz

This mirroring will fail with a 401 unauthorized error.
 
3. If you try to mirror the same imageset after removing one path level, it will work without any issues, like below:

oc mirror --config=./imageset-config.yaml docker://registry.gitlab.com/xxx/yyy 

Actual results:

 

Expected results:

An alternative to the "--max-components" option is needed to limit the nested path depth in "oc-mirror".

Additional info:

 

Feature Overview (aka. Goal Summary)  

The Assisted Installer is used to help streamline and improve the install experience of OpenShift UPI. Given the install footprint of OpenShift on IBM Power and IBM zSystems, we would like to bring the Assisted Installer experience to those platforms and ease the installation experience.

 

Goals (aka. expected user outcomes)

Full support of the Assisted Installer for use by IBM Power and IBM zSystems

 

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

 

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

 

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

 

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

 

Background

Provide any additional context that is needed to frame the feature.  Initial completion during Refinement status.

 

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

 

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  Initial completion during Refinement status.

 

Interoperability Considerations

Which other projects and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

As a multi-arch development engineer, I would like to evaluate whether the Assisted Installer is a good fit for simplifying UPI deployments on Power and Z.

Acceptance Criteria

  • Evaluation report of market opportunity/impact by P&Z offering managers
  • Stories filed for delivering Assisted Installer.

 
After doing more tests on staging for Power, I have found that the cluster-managed network does not work for Power. It uses platform.baremetal to define the API VIP/Ingress VIP, and most of the installations have failed at the last step, finalizing. After more digging, I found that the machine-api operator is not able to start successfully and stays in the "Operator is initializing" state. Here is the list of pods with errors:

openshift-kube-controller-manager installer-5-master-1 0/1 Error 0 25m
openshift-kube-controller-manager installer-6-master-2 0/1 Error 0 17m
openshift-machine-api ironic-proxy-kgm9g 0/1 CreateContainerError 0 32m
openshift-machine-api ironic-proxy-nc2lz 0/1 CreateContainerError 0 8m37s
openshift-machine-api ironic-proxy-pp92t 0/1 CreateContainerError 0 32m
openshift-machine-api metal3-69b945c7ff-45hqn 1/5 CreateContainerError 0 33m
openshift-machine-api metal3-image-customization-7f6c8978cf-lxbj7 0/1 CreateContainerError 0 32m

The messages from the failed pod ironic-proxy-nc2lz:

Normal Pulled 11m kubelet Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4f84fd895186b28af912eea42aba1276dec98c814a79310c833202960cf05407" in 1.29310959s (1.293135461s including waiting)
Warning Failed 11m kubelet Error: container create failed: time="2023-04-06T15:16:19Z" level=error msg="runc create failed: unable to start container process: exec: \"/bin/runironic-proxy\": stat /bin/runironic-proxy: no such file or directory"

Similar errors appear for the other failed pods.
The interesting thing is that some of the installations completed successfully in AI, but these pods are still in an error state.
So I asked the AI team to turn off cluster-managed network support for Power.

Description of the problem:

Power and Z features are not displayed in the feature usage dashboard in Elastic because there is a problem in the code;
see https://kibana-assisted.apps.app-sre-prod-04.i5h0.p1.openshiftapps.com/_dashboards/app/dashboards#/view/f75f85d0-989e-11ec-ab6b-650fa8ed1edf?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-2w,to:now))&_a=(description:'',filters:!(('$state':(store:appState),meta:(alias:'internal%20users',disabled:!t,index:bd9dadc0-7bfa-11eb-95b8-d13a1970ae4d,key:cluster.email_domain,negate:!f,params:!(redhat.com,ibm.com),type:phrases,value:'redhat.com,%20ibm.com'),query:(bool:(minimum_should_match:1,should:!((match_phrase:(cluster.email_domain:redhat.com)),(match_phrase:(cluster.email_domain:ibm.com))))))),fullScreenMode:!f,options:(hidePanelTitles:!f,useMargins:!t),panels:!((embeddableConfig:(),gridData:(h:11,i:c9bf6a4b-3c3a-4ad4-83ea-20b3127dc4a0,w:16,x:0,y:0),id:'44328ca6-de41-4b1e-befd-683bb51cf30f',panelIndex:c9bf6a4b-3c3a-4ad4-83ea-20b3127dc4a0,type:visualization,version:'1.3.2'),(embeddableConfig:(),gridData:(h:11,i:'759bd387-9f4b-45cb-9c9a-b3c412b420ec',w:16,x:16,y:0),id:ffbb52b5-dbd9-47c3-8098-75513cddca8e,panelIndex:'759bd387-9f4b-45cb-9c9a-b3c412b420ec',type:visualization,version:'1.3.2'),(embeddableConfig:(),gridData:(h:15,i:eb038dca-baf4-42d4-8e2c-298e2bbd06f6,w:16,x:32,y:0),id:'49b26e77-a2f3-42f3-8f57-9543669de8b8',panelIndex:eb038dca-baf4-42d4-8e2c-298e2bbd06f6,type:visualization,version:'1.3.2'),(embeddableConfig:(),gridData:(h:4,i:'05c0d27a-f949-42d6-a3ef-15411411fac7',w:16,x:0,y:11),id:'088f04c9-ce46-46d0-a381-1ea822d95440',panelIndex:'05c0d27a-f949-42d6-a3ef-15411411fac7',type:visualization,version:'1.3.2'),(embeddableConfig:(),gridData:(h:4,i:'0d696304-e4a4-4c24-beb4-f04d4af4c8d6',w:16,x:16,y:11),id:'9552a99a-4355-4e14-ad9f-90cd534f70a8',panelIndex:'0d696304-e4a4-4c24-beb4-f04d4af4c8d6',type:visualization,version:'1.3.2'),(embeddableConfig:(vis:!n),gridData:(h:10,i:ee55c626-4b30-4ac3-ad23-9b7efbf1fb04,w:16,x:0,y:15),id:fac35afd-9a6f-4bdc-868a-906a5f1e1894,panelIndex:ee55c626-4b30-4ac3-ad23-9b7efbf1fb04,type:visualization,version:'1.3.2'),(embeddableConfig:(),gridData:(h:10,i:c4c65cab-2675-4575-aa54-d4bb2871804e,w:32,x:16,y:15),id:'3747662f-7c12-4299-b2ac-1038e62ad2f3',panelIndex:c4c65cab-2675-4575-aa54-d4bb2871804e,type:visualization,version:'1.3.2'),(embeddableConfig:(),gridData:(h:13,i:'1ca7879c-a458-4ae3-8dc2-4dd2da59cf32',w:15,x:0,y:25),id:'96e57324-141d-46ea-8096-9e8b1a18ef62',panelIndex:'1ca7879c-a458-4ae3-8dc2-4dd2da59cf32',type:visualization,version:'1.3.2')),query:(language:kuery,query:''),timeRestore:!f,title:'%5BAI%5D%20feature_usage_dashboard',viewMode:edit) 

How reproducible:

100%

Steps to reproduce:

1. install 2 clusters with power and z CPU architectures and check the feature usage dashboard in the elastic

Actual results:

Power and Z features are not displayed in the feature usage dashboard in Elastic

Expected results:

The Power and Z features are displayed in the feature usage dashboard in Elastic

Feature Overview

  • As a Cluster Administrator, I want to opt-out of certain operators at deployment time using any of the supported installation methods (UPI, IPI, Assisted Installer, Agent-based Installer) from UI (e.g. OCP Console, OCM, Assisted Installer), CLI (e.g. oc, rosa), and API.
  • As a Cluster Administrator, I want to opt-in to previously-disabled operators (at deployment time) from UI (e.g. OCP Console, OCM, Assisted Installer), CLI (e.g. oc, rosa), and API.
  • As a ROSA service administrator, I want to exclude/disable Cluster Monitoring when I deploy OpenShift with HyperShift — using any of the supported installation methods including the ROSA wizard in OCM and the rosa cli — since I get cluster metrics from the control plane.  This configuration should be persisted not only through initial deployment but also through cluster lifecycle operations like upgrades.
  • As a ROSA service administrator, I want to exclude/disable the Ingress Operator when I deploy OpenShift with HyperShift — using any of the supported installation methods including the ROSA wizard in OCM and the rosa cli — as I want to use my preferred load balancer (i.e. AWS load balancer).  This configuration should be persisted not only through initial deployment but also through cluster lifecycle operations like upgrades.

Goals

  • Make it possible for customers and Red Hat teams producing OCP distributions/topologies/experiences to enable/disable some CVO components while still keeping their cluster supported.

Scenarios

  1. This feature must consider the different deployment footprints including self-managed and managed OpenShift, connected vs. disconnected (restricted and air-gapped), supported topologies (standard HA, compact cluster, SNO), etc.
  2. Enabled/disabled configuration must persist throughout cluster lifecycle including upgrades.
  3. If there's any risk/impact of data loss or service unavailability (for Day 2 operations), the system must provide guidance on what the risks are and let the user decide if the risk is worth undertaking.

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

Questions to answer…

  • ...

Out of Scope

Background, and strategic fit

This is part of the overall multi-release Composable OpenShift effort (OCPPLAN-9638), which is being delivered in multiple phases:

Phase 1 (OpenShift 4.11): OCPPLAN-7589 Provide a way with CVO to allow disabling and enabling of operators

  • CORS-1873 Installer to allow users to select OpenShift components to be included/excluded
  • OTA-555 Provide a way with CVO to allow disabling and enabling of operators
  • OLM-2415 Make the marketplace operator optional
  • SO-11 Make samples operator optional
  • METAL-162 Make cluster baremetal operator optional
  • OCPPLAN-8286 CI Job for disabled optional capabilities

Phase 2 (OpenShift 4.12): OCPPLAN-7589 Provide a way with CVO to allow disabling and enabling of operators

Phase 3 (OpenShift 4.13): OCPBU-117

  • OTA-554 Make oc aware of cluster capabilities
  • PSAP-741 Make Node Tuning Operator (including PAO controllers) optional

Phase 4 (OpenShift 4.14): OCPSTRAT-36 (formerly OCPBU-236)

  • CCO-186 ccoctl support for credentialing optional capabilities
  • MCO-499 MCD should manage certificates via a separate, non-MC path (formerly IR-230 Make node-ca managed by CVO)
  • CNF-5642 Make cluster autoscaler optional
  • CNF-5643 - Make machine-api operator optional
  • WRKLDS-695 - Make DeploymentConfig API + controller optional
  • CNV-16274 OpenShift Virtualization on the Red Hat Application Cloud (not applicable)
  • CNF-9115 - Leverage Composable OpenShift feature to make control-plane-machine-set optional

Phase 5 (OpenShift 4.15): OCPSTRAT-421 (formerly) OCPBU-519

  • OCPBU-352 Make Ingress Operator optional
  • BUILD-565 - Make Build v1 API + controller optional
  • OBSDA-242 Make Cluster Monitoring Operator optional
  • OCPVE-630 (formerly CNF-5647) Leverage Composable OpenShift feature to make image-registry optional (replaces IR-351 - Make Image Registry Operator optional)
  • CNF-9114 - Leverage Composable OpenShift feature to make olm optional
  • CNF-9118 - Leverage Composable OpenShift feature to make cloud-credential  optional
  • CNF-9119 - Leverage Composable OpenShift feature to make cloud-controller-manager optional

Phase 6 (OpenShift 4.16): OCPSTRAT-731

  • TBD

References

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

 

 

 

Per https://github.com/openshift/enhancements/pull/922 we need `oc adm release new` to parse the resource manifests for `capability` annotations and generate a yaml file that lists the valid capability names, to embed in the release image.

This file can be used by the installer to error or warn when the install config lists capabilities for enable/disable that are not valid capability names.
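
As a minimal sketch of the two ends of this flow, assuming the capability.openshift.io/name annotation described in the enhancement and an illustrative capability name:

# a manifest in the release payload tags itself with a capability
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-operator
  annotations:
    capability.openshift.io/name: example-capability

# install-config.yaml input can then be validated against the generated list
capabilities:
  baselineCapabilitySet: None
  additionalEnabledCapabilities:
  - example-capability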

 

Note: Moved a couple of cards from OTA-554 to this epic, as these cards are relatively lower priority for the 4.13 release and we could not mark them done.

While working on OTA-559, my oc#1237 broke JSON output and needed a follow-up fix. To avoid destabilizing folks who consume the dev-tip oc, we should grow CI presubmits that exercise critical oc adm release ... pathways, so that kind of accidental breakage is caught early.

Epic Goal

Feature Overview (aka. Goal Summary)  

The goal of this initiative is to help boost adoption of OpenShift on ppc64le. This can be further broken down into several key objectives.

  • For IBM, furthering adoption of OpenShift will continue to drive adoption of their Power hardware. In parallel, this can be used by existing customers to migrate their old Power on-prem workloads to a cloud environment.
  • For the Multi-Arch team, this represents our first opportunity to develop an IPI offering on one of the IBM platforms. Right now, we depend on IPI on libvirt to cover our CI needs; however, this is not a supported platform for customers. PowerVS would address this caveat for ppc64le.
  • By bringing in PowerVS, we can provide customers with the easiest possible experience to deploy and test workloads on IBM architectures.
  • Customers already have UPI methods to solve their OpenShift on-prem needs for ppc64le. This gives them an opportunity for a cloud-based option, furthering our hybrid-cloud story.

Goals (aka. expected user outcomes)

  • The goal of this epic is to begin the process of expanding support of OpenShift on ppc64le hardware to include IPI deployments against the IBM Power Virtual Server (PowerVS) APIs.

Requirements (aka. Acceptance Criteria):

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

 

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

 

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

 

Background

Provide any additional context that is needed to frame the feature.  Initial completion during Refinement status.

 

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

 

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  Initial completion during Refinement status.

 

Interoperability Considerations

Which other projects and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

Epic Goal

  • Improve IPI on Power VS in the 4.14 cycle
    • Changes to the installer to handle edge cases, fix bugs, and improve usability.
    • No major changes are anticipated this cycle.

Running doc to describe terminologies and concepts which are specific to Power VS - https://docs.google.com/document/d/1Kgezv21VsixDyYcbfvxZxKNwszRK6GYKBiTTpEUubqw/edit?usp=sharing

Feature Overview (aka. Goal Summary)  

During oc login with a token, pasting the token on the command line with the oc login --token command is insecure. The token is logged in bash history, and appears in "ps" output when run at precisely the time the oc login command runs. Moreover, the token gets logged and is searchable by any sysadmin.

Customers/Users would like either the "--web" option or a command that prompts for a token. There should be no way to pass a secret on the command line with the --token option.

For environments where no web browser is available, an "--ask-token" option should be provided that prompts for a token instead of passing it on the command line.
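
A usage sketch of the two requested flows (the --ask-token flag is the proposed option described above, not a shipped interface):

# browser-based login; the token never touches the command line
oc login --web https://api.example-cluster.example.com:6443

# proposed prompt-based login for environments without a browser
oc login --ask-token https://api.example-cluster.example.com:6443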

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

 

Background

Provide any additional context that is needed to frame the feature.  Initial completion during Refinement status.

 

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

 

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  Initial completion during Refinement status.

 

Interoperability Considerations

Which other projects and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

Epic Goal*

During oc login with a token, pasting the token on the command line with the oc login --token command is insecure. The token is logged in bash history, and appears in "ps" output when run at precisely the time the oc login command runs. Moreover, the token gets logged and is searchable by any sysadmin.

Customers/Users would like either the "--web" option or a command that prompts for a token. There should be no way to pass a secret on the command line with the --token option.

For environments where no web browser is available, an "--ask-token" option should be provided that prompts for a token instead of passing it on the command line.

 
Why is this important? (mandatory)

Pasting the token on the command line with the oc login --token command is insecure.

 
Scenarios (mandatory) 

Customers/Users would like the "--web" option. There should be no way to pass a secret on the command line with the --token option.

For environments where no web browser is available, an "--ask-token" option should be provided that prompts for a token instead of passing it on the command line.

 

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

 

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be "Release Pending" 

In order for the OAuth2 Authorization Code Grant Flow to work in oc browser login, the OSIN server must ignore any port used in the Redirect URIs of the flow when the URIs are loopback addresses. For example, a client registered with http://127.0.0.1/callback may redirect to http://127.0.0.1:39401/callback at login time, because the local listener picks an arbitrary free port. This has already been added to OSIN; we need to update the oauth-server to use the latest version of OSIN in order to make use of this capability.

Key Objective
Providing our customers with a single simplified User Experience (Hybrid Cloud Console) that is extensible, can run locally or in the cloud, and is capable of everything from managing the fleet to deep diving into a single cluster.
Why customers want this?

  1. Single interface to accomplish their tasks
  2. Consistent UX and patterns
  3. Easily accessible: One URL, one set of credentials

Why we want this?

  • Shared code -  improve the velocity of both teams and most importantly ensure consistency of the experience at the code level
  • Pre-built PF4 components
  • Accessibility & i18n
  • Remove barriers for enabling ACM

Phase 2 Goal: Productization of the united Console 

  1. Enable user to quickly change context from fleet view to single cluster view
    1. Add Cluster selector with “All Cluster” Option. “All Cluster” = ACM
    2. Shared SSO across the fleet
    3. Hub OCP Console can connect to remote clusters API
    4. When ACM Installed the user starts from the fleet overview aka “All Clusters”
  2. Share UX between views
    1. ACM Search —> resource list across fleet -> resource details that are consistent with single cluster details view
    2. Add Cluster List to OCP —> Create Cluster

We need a way to show metrics for workloads running on spoke clusters. This depends on ACM-876, which lets the console discover the monitoring endpoints.

  • Console operator must discover the external URLs for monitoring
  • Console operator must pass the URLs and CA files as part of the cluster config to the console backend
  • Console backend must set up proxies for each endpoint (as it does for the API server endpoints)
  • Console frontend must include the cluster in metrics requests

Open Issues:

We will depend on ACM to create a route on each spoke cluster for the prometheus tenancy service, which is required for metrics for normal users.

 

Openshift console backend should proxy managed cluster monitoring requests through the MCE cluster proxy addon to prometheus services on the managed cluster. This depends on https://issues.redhat.com/browse/ACM-1188

 

This epic contains all the OLM related stories for OCP release-4.14

Epic Goal

  • Track all the stories under a single epic

1. Proposed title of this feature request

    Add a scroll bar for the resource list in the Uninstall Operator pop-up window
2. What is the nature and description of the request?

   To make it easy for users to check the list of all resources
3. Why does the customer need this? (List the business requirements here)

   For customers, one operator may have multiple resources; it would be easy for them to check them all in the Uninstall Operator pop-up window with a scroll bar
4. List any affected packages or components.

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

The console operator should build up a set of cluster node OS types, which it should supply to the console so that the console renders only operators that can be installed on the cluster.

This will be needed when we support different OS types on the cluster.

We need to scan through the compute nodes and build a set of supported OSes from them. Each node on the cluster has a label for its operating system, e.g. kubernetes.io/os=linux.

 

AC:

  1. Implement logic in the console repo (see the sketch below)
    1. Add an additional flag
    2. Populate the supported OS types into SERVER_FLAGS
    3. Update the filtering logic in OperatorHub
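
As a quick illustration: the per-node OS label can be listed with oc, and the value handed to the frontend could look roughly like the console-config.yaml fragment below (the nodeOperatingSystems field name is hypothetical, not an existing key):

oc get nodes -L kubernetes.io/os

apiVersion: console.openshift.io/v1
kind: ConsoleConfig
clusterInfo:
  nodeOperatingSystems:   # hypothetical field populated by the console operator
  - linux
  - windows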

The console operator should build up a set of cluster node OS types, which it should supply to the console so that the console renders only operators that can be installed on the cluster.

This will be needed when we support different OS types on the cluster.

We need to scan through the compute nodes and build a set of supported OSes from them. Each node on the cluster has a label for its operating system, e.g. kubernetes.io/os=linux.

 

AC:

  1. Implement logic in the console-operator that will scan through all the nodes, build a set of all the OS types that the cluster nodes run on, and pass it to console-config.yaml. This set of OS types will then be used by the console frontend.
  2. Add unit and e2e test cases in the console-operator repository.

BU Priority Overview

Initiative: Improve etcd disaster recovery experience (part1)

Goals

The current etcd backup and recovery process is described in our docs https://docs.openshift.com/container-platform/4.12/backup_and_restore/control_plane_backup_and_restore/backing-up-etcd.html

The current process leaves it up to the cluster-admin to figure out a way to do consistent backups following the documented procedure.

This feature is part of a progressive delivery to improve the cluster-admin experience for backup and restore of etcd clusters to a healthy state.

Scope of this feature:

  • etcd quorum loss (2 node failure) on a 3 nodes OCP control plane
  • etcd degradation (1 node failure) on a 3 nodes OCP control plane

Execution Plans

  • Improve etcd disaster recovery e2e test coverage
  • Design an automated backup API. The initial target is a local destination (the currently documented manual procedure is sketched below)
  • Should provide a way (e.g. script or tool) for the cluster-admin to validate that backup files remain valid over time (e.g. account for disk failures corrupting the backup)
  • Should document updated manual steps to restore from a local backup. These steps should be part of the e2e test coverage.
  • Should document manual steps to copy backup files to a destination outside the cluster (e.g. an ssh copy a cluster admin can use in a CronJob)
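
For reference, the currently documented manual backup is a single script run on one control plane node, roughly (node name is a placeholder):

oc debug node/<master-node> -- chroot /host /usr/local/bin/cluster-backup.sh /home/core/assets/backup

This produces the etcd snapshot plus the static pod resources that the documented restore procedure consumes.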

For testing the automated backups feature we will require an e2e test that validates the backups by ensuring the restore procedure works for a quorum loss disaster recovery scenario.

See the following doc for more background:
https://docs.google.com/document/d/1NkdOwo53mkNBCktV5tkUnbM4vi7bG4fO5rwMR0wGSw8/edit?usp=sharing

This story targets the first milestone of the restore test: ensuring we have a platform-agnostic way to SSH into all masters in a test cluster so that we can perform the necessary backup, restore and validation workflows.

The suggested approach is to create a static pod that can do those ssh checks and actions from within the cluster but other alternatives can also be explored as part of this story. 

For testing the automated backups feature we will require an e2e test that validates the backups by ensuring the restore procedure works for a quorum loss disaster recovery scenario.

See the following doc for more background:
https://docs.google.com/document/d/1NkdOwo53mkNBCktV5tkUnbM4vi7bG4fO5rwMR0wGSw8/edit?usp=sharing

This story targets the milestone 2,3 and 4 of the restore test to ensure that the test has the ability to perform a backup and then restore from that backup in a disaster recovery scenario.

While the automated backups API is still in progress, the test will rely on the existing backup script to trigger a backup. Later on when we have a functional backup API behind a feature gate, the test can switch over to using that API to trigger backups.

We're starting with a basic crash-looping member restore first. The quorum loss scenario will be done in ETCD-423.

Feature Overview

Enable release managers/Operator authors to manage Operator releases in the file-based catalog (FBC) based on the existing catalog (in sqlite) and distribute them to multiple OCP versions with ease.

Goals

  • Operator releases can be managed declaratively in a canonical source of truth and automated via git in the context of the OpenShift release lifecycle.
  • A file-based catalog (FBC) can be converted back to sqlite format in order to be distributed to those OCP versions that do not support file-based catalogs yet.
  • An existing catalog image in sqlite format can be converted to the basic template of the file-based catalog (FBC) for easy adoption.
  • An existing catalog image in sqlite format can be converted to the semver template of the file-based catalog (FBC) when possible, and/or the uncompleted sections are highlighted so users can more easily identify the gaps.

Requirements

Requirement | Notes | isMvp?
A declarative mechanism to automate the catalog update process in the file-based catalog (FBC) with newly-published bundle references. |  | Yes
A declarative mechanism to publish Operator releases in the file-based catalog (FBC) to multiple OCP releases. |  | Yes
A declarative mechanism to convert the file-based catalog (FBC) to sqlite database format so it can be published to OCP versions without FBC support. |  | Yes
A declarative mechanism to convert an existing catalog from sqlite database to the file-based catalog (FBC) basic template. |  | Yes
A declarative mechanism to convert an existing catalog from sqlite database to the file-based catalog (FBC) semver template when possible and/or highlight the uncompleted sections so users can more easily identify the gaps. |  | NO
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | Yes
Release Technical Enablement | Provide necessary release enablement details and documents. | Yes

Use Cases

  • Operator authors/release managers can manage releases (i.e., edit the update paths) in a canonical source of truth (in FBC) and automate it via git to simplify the bundle release process.
  • Operator authors/release managers can manage and publish Operator releases from a canonical source of truth (in FBC) to multiple OCP versions.
  • Operator authors/release managers can manage and publish Operator releases from a canonical source of truth (in FBC) to older OCP versions without FBC support yet.
  • Operator authors/release managers can convert their existing catalog images in sqlite format to the basic template of the file-based catalog (FBC) to jumpstart the catalog migration process.
  • Operator authors/release managers can convert their existing catalog images in sqlite format to the semver template of the file-based catalog (FBC), when possible to drive adoption, and/or have the uncompleted sections highlighted so users can more easily identify the gaps.

Definition of Done / Acceptance criteria

  • All use cases above are implemented and meet the requirements.

Background, and strategic fit

A catalog maintainer frequently needs to make changes to an OLM catalog whenever a new software version is released, promoting an existing version and releasing it to a different channel, or deprecating an existing version.  All these often require non-trivial changes to the update graph of an Operator package.  The maintainers need a git- and human-friendly maintenance approach that allows reproducing the catalog at all times and is decoupled from the release of their individual software versions.  

The original imperative catalog maintenance approach, which relies on `replaces`, `skips`, and `skipRange` attributes at the bundle level to define the relationships between versions and the update channels, is perceived as complicated by the Red Hat internal developer community.  Hence, the new file-based catalog (FBC) is introduced in a declarative, GitOps-friendly fashion.

Furthermore, the concept of a "template", as an abstraction layer over the FBC, is introduced to simplify interacting with FBCs.  While the "basic template" serves as a simplified abstraction of an FBC with all the `replaces`, `skips`, and `skipRange` attributes supported and configurable at the package level, the "semver template" provides the capability to auto-generate an entire upgrade graph adhering to Semantic Versioning (semver) guidelines and consistent with best practices on channel naming.

Based on the feedback at KubeCon NA 2022, folks were generally excited about the features introduced with FBC and the UX provided by the templates.  What is still missing is the tooling to enable the adoption.

Therefore, it is important to allow users to:

  • convert the existing catalog image in sqlite format to the basic template of the file-based catalog (FBC) for easy adoption
  • convert the existing catalog image in sqlite format to the semver template of the file-based catalog (FBC) when possible and/or highlight the uncompleted sections so users can more easily identify the gaps
  • automate the catalog update process using FBC with newly-published bundle references
  • publish Operator releases in the file-based catalog (FBC) to multiple OCP releases
  • convert the file-based catalog (FBC) back to sqlite database format so it can be published to OCP versions without FBC support

to help users adopt this novel file-based catalog approach and deliver value to customers with a faster release cadence and higher confidence. 

Documentation Considerations

  • The way "to automate the catalog update process in FBC with newly-published bundle references" needs to be documented (in the context of "Developing Operators").
  • The way "to publish Operator releases in the file-based catalog (FBC) to multiple OCP releases" needs to be documented (in the context of "Developing Operators" and "Administrator Tasks").
  • The way "to convert the file-based catalog (FBC) to sqlite database format so it can be published to OCP versions without FBC support" needs to be documented (in the context of "Developing Operators" and "Administrator Tasks").
  • The way "to convert an existing catalog from sqlite database to the file-based catalog (FBC) basic template" needs to be documented (in the context of "Developing Operators").
  • The way "to convert an existing catalog from sqlite database to the file-based catalog (FBC) semver template when possible and/or highlight the uncompleted sections so users can more easily identify the gaps" needs to be documented (in the context of "Developing Operators").


 

Epic Goal

  • SQLite catalog maintainers need a solution to facilitate veneer adoption.  The easiest capability to provide is migration to the basic veneer.  In addition, the mechanism needs to omit any properties from the original source which are no longer relevant in the new format.

Why is this important?

  • Minimizing friction to veneer adoption is key to speeding the FBC transition

Scenarios

  1. Maintainer wants to update legacy catalog to veneer
  2. operator author wants to update their catalog contribution to veneer

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Documentation - MUST have supporting documentation easily available to catalog maintainers & operator authors

Open questions::

  1. For the migration path, is documentation of the current solution (opm render + yq/jq, sketched below) sufficient, or do we need to support it in formal tooling (e.g. opm migrate + flag)?
  2. are there any other obsolete properties we need to omit from rendered FBC?
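
For context on open question 1, the current ad-hoc path looks roughly like the following; the image name is a placeholder and the yq expression is only a sketch of dropping the obsolete olm.deprecated property:

# render the sqlite-based catalog to FBC yaml
opm render registry.example.com/example-catalog:v4.12 -o yaml > catalog.yaml

# strip olm.deprecated properties from the rendered bundles (yq v4 syntax)
yq eval 'del(.properties[]? | select(.type == "olm.deprecated"))' -i catalog.yaml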

 

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

 

Previous bundle deprecation was handled by assigning a property to the olm.bundle object of `olm.deprecated`.  SQLite DBs had to have all valid upgrade edges supported by olm.bundle information in order to prevent foreign key violations.  This property meant that the bundle was to be ignored & never installed.

FBC has a simpler method for achieving the same goal:  don't include the bundle.  Upgrade edges from it may still be specified, and the bundle will not be installable.

Likely an update to opm code base in the neighborhood of https://github.com/operator-framework/operator-registry/blob/249ae621bb8fa6fc8a8e4a5ae26355577393f127/pkg/sqlite/conversion.go#L80
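
A minimal FBC sketch of this simpler method (package, versions, and image are hypothetical): the channel keeps the upgrade edge from v1.0.0, but no olm.bundle blob is rendered for v1.0.0, so it can never be installed:

---
schema: olm.channel
package: example-operator
name: stable
entries:
- name: example-operator.v1.1.0
  replaces: example-operator.v1.0.0  # deprecated bundle: edge kept, bundle omitted
---
schema: olm.bundle
name: example-operator.v1.1.0
package: example-operator
image: registry.example.com/example-operator-bundle:v1.1.0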

A/C:

  • CI/utest/e2e passes without flakes
  • appropriate documentation (all upstream) updated/reviewed

Feature Overview (aka. Goal Summary):

 

This feature will allow an x86 control plane to operate with compute nodes of type Arm in a HyperShift environment.

 

Goals (aka. expected user outcomes):

 

Enable an x86 control plane to operate with an Arm data-plane in a HyperShift environment.

 

Requirements (aka. Acceptance Criteria):

 

  • The feature must allow an x86 control plane and an Arm data-plane to be used together in a HyperShift environment.
  • The feature must provide documentation on how to set up and use the x86 control plane with an Arm data-plane in a HyperShift environment.
  • The feature must be tested and verified to work reliably and securely in a production environment.

 

Customer Considerations:

 

Customers who require a mix of x86 control plane and Arm data-plane for their HyperShift environment will benefit from this feature.

 

Documentation Considerations:

 

  • Documentation should include clear instructions on how to set up and use the x86 control plane with an Arm data-plane in a HyperShift environment.
  • Documentation will live on docs.openshift.com

 

Interoperability Considerations:

 

This feature should not impact other OpenShift layered products and versions in the portfolio.

Upstream K8s deprecated PodSecurityPolicy and replaced it with a new built-in admission controller that enforces the Pod Security Standards (see here for the motivations for deprecation). There is an OpenShift-specific dedicated pod admission system called Security Context Constraints. Our aim is to keep the Security Context Constraints pod admission system while also allowing users to have access to Kubernetes Pod Security Admission.

With OpenShift 4.11, we turned on Pod Security Admission with global "privileged" enforcement. Additionally, we set the "restricted" profile for warnings and audit. This configuration made it possible for users to opt their namespaces in to Pod Security Admission with the per-namespace labels. We also introduced a new mechanism that automatically synchronizes the Pod Security Admission "warn" and "audit" labels.

With OpenShift 4.15, we intend to move the global configuration to enforce the "restricted" pod security profile globally. With this change, the label synchronization mechanism will also switch into a mode where it synchronizes the "enforce" Pod Security Admission label rather than "audit" and "warn".
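
For illustration, the per-namespace opt-in uses the standard Pod Security Admission labels (namespace name is a placeholder):

apiVersion: v1
kind: Namespace
metadata:
  name: example
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted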

Feature Overview

  • Customers want to create and manage OpenShift clusters using managed identities for Azure resources for authentication.

Goals

  • A customer using ARO wants to spin up an OpenShift cluster with "az aro create" without needing additional input, i.e. without the need for an AD account or service principal credentials, and the identity used is never visible to the customer and cannot appear in the cluster.
  • As an administrator, I want to deploy OpenShift 4 and run Operators on Azure using access controls (IAM roles) with temporary, limited privilege credentials.

Requirements

  • Azure managed identities must work for installation with all install methods including IPI and UPI, work with upgrades, and day-to-day cluster lifecycle operations.
  • Support HyperShift and non-HyperShift clusters.
  • Support use of Operators with Azure managed identities.
  • Support in all Azure regions where Azure managed identity is available. Note: Federated credentials are associated with Azure Managed Identity and are not available in all Azure regions.

More details at ARO managed identity scope and impact.

 

This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

Questions to answer…

  • ...

Out of Scope

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

References

Epic Overview

  • Enable customers to create and manage OpenShift clusters using managed identities for Azure resources for authentication.
  • A customer using ARO wants to spin up an OpenShift cluster with "az aro create" without needing additional input, i.e. without the need for an AD account or service principal credentials, and the identity used is never visible to the customer and cannot appear in the cluster.

Epic Goal

  • A customer creates an OpenShift cluster ("az aro create") using Azure managed identity.
  • Azure managed identities must work for installation with all install methods including IPI and UPI, work with upgrades, and day-to-day cluster lifecycle operations.
  • After Azure failed to implement workable golang API changes following the deprecation of their old API, we removed mint mode and work entirely in passthrough mode. Azure has plans to implement pod/workload identity similar to how it has been implemented in AWS and GCP; when this feature is available, we should implement permissions similar to AWS/GCP.
  • This work cannot start until Azure has implemented this feature - as such, this Epic is a placeholder to track the effort when available.

Why is this important?

  • Microsoft and the customer would prefer that we use Managed Identities vs. Service Principal (which requires putting the Service Principal and principal password in clear text within the azure.conf file).

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

 

 

This effort is dependent on the completion of work for CCO-187, and effort in dependent modules is planned to be worked on by the CCO team unless individual repo owners can help. Operator owners/teams will be expected to review merge requests and complete appropriate QE effort for an OpenShift release.

  • azure-sdk-for-go module dependency updated to support workload identity federation.
  • Mount the OIDC token in the operator pod; this needs to go in the deployment. See the example added to the cluster-image-registry-operator here, and the sketch below.
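A minimal sketch of the deployment change, assuming the projected service-account-token volume pattern; the audience, expiration, and mount path are illustrative placeholders, not the exact values from the cluster-image-registry-operator change:

# Sketch only: audience, expiration, and mount path are assumed values.
volumes:
- name: bound-sa-token
  projected:
    sources:
    - serviceAccountToken:
        audience: openshift           # assumed audience
        expirationSeconds: 3600
        path: token
volumeMounts:
- name: bound-sa-token
  mountPath: /var/run/secrets/openshift/serviceaccount
  readOnly: true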


BU Priority Overview

Enable installation and lifecycle support of OpenShift 4 on Oracle Cloud Infrastructure (OCI) with VMs

Goals

  • Enable installation of OpenShift 4 on Oracle Cloud Infrastructure (OCI) with VMs using the platform-agnostic flow with the Assisted Installer.
  • OpenShift 4 on OCI (with VMs) can be updated such that the resulting cluster and applications are in a healthy state when the update is completed.
  • Telemetry reports back on clusters using OpenShift 4 on OCI for connected OpenShift clusters (e.g. platform=none using the Oracle CSI).

State of the Business

Currently, we don't yet support OpenShift 4 on Oracle Cloud Infrastructure (OCI), and we know from initial attempts that installing OpenShift on OCI requires the use of a qcow image (the OpenStack qcow seems to work fine) plus networking and routing changes, and surfaces storage issues, potential MTU and registry issues, etc.

Execution Plans

TBD based on customer demand.

 

Why is this important

  • OCI is starting to gain momentum.
  • In the Middle East (e.g. Saudi Arabia), only OCI and Alibaba Cloud are approved hyperscalers.

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.

 

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

RFEs:

  • RFE-3635 - Supporting Openshift on Oracle Cloud Infrastructure(OCI) & Oracle Private Cloud Appliance (PCA)

Questions to answer…

  • ...

Out of Scope

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

 

Other

 

 

 

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

The taint here: https://github.com/openshift/assisted-installer/pull/629/files#diff-1046cc2d18cf5f82336bbad36a2d28540606e1c6aaa0b5073c545301ef60ffd4R593

should only be removed when the platform is Nutanix or vSphere, because the credentials for these platforms are passed after cluster installation.

By contrast, with Oracle Cloud the instance gets its credentials through the instance metadata, and it should be able to label the nodes from the beginning of the installation without any user intervention.

This feature is the place holder for all epics related to technical debt associated with Console team 

Outcome Overview

Once all Features and/or Initiatives in this Outcome are complete, what tangible, incremental, and (ideally) measurable movement will be made toward the company's Strategic Goal(s)?

 

Success Criteria

What is the success criteria for this strategic outcome?  Avoid listing Features or Initiatives and instead describe "what must be true" for the outcome to be considered delivered.

 

 

Expected Results (what, how, when)

What incremental impact do you expect to create toward the company's Strategic Goals by delivering this outcome? (possible examples: unblocking sales, shifts in product metrics, etc.; provide links to metrics that will be used post-completion for review and pivot decisions). For each expected result, list what you will measure and when you will measure it (e.g. provide links to existing information or metrics that will be used post-completion for review, and specify when you will review the measurement, such as 60 days after the work is complete).

 

 

Post Completion Review – Actual Results

After completing the work (as determined by the "when" in Expected Results above), list the actual results observed / measured during Post Completion review(s).

 

An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.

Goals

  • Make kubelet aware of the underlying node shutdown event and trigger pod termination with a sufficient grace period to shut down properly
  • Handle node shutdown in a cloud-provider-agnostic way
  • Introduce a minimal shutdown delay in order to shut down the node as soon as possible (but not sooner)
  • Focus on handling shutdown on systemd-based machines

Story 1

  • As a cluster administrator, I can configure the nodes in my cluster to allocate X seconds for my pods to terminate gracefully during a node shutdown

Story 2 (https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2000-graceful-node-shutdown#story-2)

  • As a developer I can expect that my pods will terminate gracefully during node shutdowns

 

https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2000-graceful-node-shutdown 
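For illustration, a minimal sketch of how the grace periods might be configured, assuming the upstream KubeletConfiguration fields from the linked KEP can be passed through an OpenShift KubeletConfig resource (the pool selector, name, and durations are illustrative):

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: graceful-shutdown                    # illustrative name
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  kubeletConfig:
    shutdownGracePeriod: "120s"              # total time allotted for pod termination
    shutdownGracePeriodCriticalPods: "30s"   # portion reserved for critical pods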

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

As an OpenShift developer, I want to have confidence that the graceful restart feature works and stays working in the future through various code changes. To that end, please add at least the following 2 E2E tests:

  • A valid pod/workload with timeout that is respected by the system before shutdown.
  • A rogue pod that has an extremely high timeout that is not respected by the system.

Epic Goal

  • Update all images that we ship with OpenShift to the latest upstream releases and libraries.
  • Exact content of what needs to be updated will be determined as new images are released upstream, which is not known at the beginning of OCP development work. We don't know what new features will be included and should be tested and documented. Especially new CSI drivers releases may bring new, currently unknown features. We expect that the amount of work will be roughly the same as in the previous releases. Of course, QE or docs can reject an update if it's too close to deadline and/or looks too big.

Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF). Trying no-feature-freeze in 4.12. We will try to do as much as we can before FF, but we're quite sure something will slip past FF as usual.

Why is this important?

  • We want to ship the latest software that contains new features and bugfixes.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Update all OCP and kubernetes libraries in storage operators to the appropriate version for OCP release.

This includes (but is not limited to):

  • Kubernetes:
    • client-go
    • controller-runtime
  • OCP:
    • library-go
    • openshift/api
    • openshift/client-go
    • operator-sdk

Operators:

  • aws-ebs-csi-driver-operator 
  • aws-efs-csi-driver-operator
  • azure-disk-csi-driver-operator
  • azure-file-csi-driver-operator
  • openstack-cinder-csi-driver-operator
  • gcp-pd-csi-driver-operator
  • gcp-filestore-csi-driver-operator
  • csi-driver-manila-operator
  • vmware-vsphere-csi-driver-operator
  • alibaba-disk-csi-driver-operator
  • ibm-vpc-block-csi-driver-operator
  • csi-driver-shared-resource-operator
  • ibm-powervs-block-csi-driver-operator

 

  • cluster-storage-operator
  • cluster-csi-snapshot-controller-operator
  • local-storage-operator
  • vsphere-problem-detector

EOL, do not upgrade:

  • github.com/oVirt/csi-driver-operator

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Cluster Infrastructure owned components should be running on Kubernetes 1.27
  • This includes
    • The cluster autoscaler (+operator)
    • Machine API operator
      • Machine API controllers for:
        • AWS
        • Azure
        • GCP
        • vSphere
        • OpenStack
        • IBM
        • Nutanix
    • Cloud Controller Manager Operator
      • Cloud controller managers for:
        • AWS
        • Azure
        • GCP
        • vSphere
        • OpenStack
        • IBM
        • Nutanix
    • Cluster Machine Approver
    • Cluster API Actuator Package
    • Control Plane Machine Set Operator

Why is this important?

  • ...

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal*

What is our purpose in implementing this?  What new capability will be available to customers?

 
Why is this important? (mandatory)

What are the benefits to the customer or Red Hat?   Does it improve security, performance, supportability, etc?  Why is work a priority?

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Feature Overview

The Agent-based installer requires booting the generated ISO on the target nodes manually. Support for PXE booting will allow customers to automate their installations via their DHCP/PXE infrastructure.

This feature allows generating installation artifacts ready to add to a customer-provided DHCP/PXE infrastructure.

Goals

As an OpenShift installation admin I want to PXE-boot the image generated by the openshift-install agent subcommand

Why is this important?

We have customers requesting this booting mechanism to make it easier to automate the booting of the nodes without having to actively place the generated image in a bootable device for each host.

Epic Goal

As an OpenShift installation admin I want to PXE-boot the image generated by the openshift-install agent subcommand

Why is this important?

We have customers requesting this booting mechanism to make it easier to automate the booting of the nodes without having to actively place the generated image in a bootable device for each host.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Goals

Track goals/requirements for self-managed GA of Hosted control planes on BM using the agent provider. Mainly make sure: 

  • BM flow via the Agent is documented. 
    • Make sure the documentation with HyperShiftDeployment is removed.
    • Make sure the documentation uses the new flow without HyperShiftDeployment 
  • We have a reference architecture on the best way to deploy. 
  • UI for provisioning BM via MCE/ACM is complete (with host inventory).

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

Background, and strategic fit

Customers are looking at HyperShift to deploy self-managed clusters on bare metal. We have positioned the Agent flow as the way to get BM clusters due to its ease of use (it automates many of the rather mundane tasks required to set up BM clusters), and it's planned for GA with MCE 2.3 (in the OCP 4.13 timeframe).

 

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

Feature goal (what are we trying to solve here?)

Group all tasks for CAPI-provider-agent GA readiness

Does it need documentation support?

no

Feature origin (who asked for this feature?)

  •  

Reasoning (why it’s important?)

  • In order for the HyperShift Agent platform to be GA in ACM 2.9, we need to improve our coverage and fix the bugs in this epic.

Feature usage (do we have numbers/data?)

  • We have no data - the feature doesn’t exist anywhere

Feature availability (why should/shouldn't it live inside the UI/API?)

  • Please describe the reasoning behind why it should/shouldn't live inside the UI/API
  • Does this feature exist in the UI of other installers?

The test waits until all pods in the control plane namespace report ready status, but collect-profiles is a job that sometimes completes before other pods are ready.

Once the collect-profiles pod completes, it terminates and its status moves to ready=false, and from there onwards the test is stuck.
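One possible direction for a fix, sketched as a shell command: exclude pods that have run to completion when polling for readiness (the namespace variable is a placeholder):

# Sketch: ignore Succeeded pods (e.g. completed collect-profiles job pods)
oc get pods -n "${HCP_NAMESPACE}" --field-selector=status.phase!=Succeeded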

To run a HyperShift management cluster in disconnected mode we need to document which images need to be mirrored and potentially modify the images we use for OLM catalogs.

ICSP mapping only happens for image references with a digest, not a regular tag. We need to address this for images we reference by tag:
CAPI, CAPI provider, OLM catalogs
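For context, a minimal ICSP sketch (names are hypothetical). The repositoryDigestMirrors field applies only to pulls by digest, which is why images referenced by tag, such as CAPI, the CAPI provider, and the OLM catalogs, are not remapped:

apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: hypershift-mirror            # hypothetical name
spec:
  repositoryDigestMirrors:           # only matches image@sha256:... references
  - source: quay.io/example/capi-provider
    mirrors:
    - mirror.registry.local/example/capi-provider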

< High-Level description of the feature ie: Executive Summary >

Goals

< Who benefits from this feature, and how? What is the difference between today’s current state and a world with this feature? >

Requirements

Requirements | Notes | IS MVP

(Optional) Use Cases

< What are we making, for who, and why/what problem are we solving?>

Out of scope

<Defines what is not included in this story>

Dependencies

< Link or at least explain any known dependencies. >

Background, and strategic fit

< What does the person writing code, testing, documenting need to know? >

Assumptions

< Are there assumptions being made regarding prerequisites and dependencies?>

< Are there assumptions about hardware, software or people resources?>

Customer Considerations

< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>

Documentation Considerations

< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >

What does success look like?

< Does this feature have doc impact? Possible values are: New Content, Updates to existing content, Release Note, or No Doc Impact?>

QE Contact

< Are there assumptions being made regarding prerequisites and dependencies?>

< Are there assumptions about hardware, software or people resources?>

Impact

< If the feature is ordered with other work, state the impact of this feature on the other work>

Related Architecture/Technical Documents

<links>

Done Checklist

  • Acceptance criteria are met
  • Non-functional properties of the Feature have been validated (such as performance, resource, UX, security or privacy aspects)
  • User Journey automation is delivered
  • Support and SRE teams are provided with enough skills to support the feature in production environment

Goal

Additional improvements to segment, to enable the proper gathering of user telemetry and analysis

Problem

Currently, we have no accurate telemetry of the OpenShift Console usage across all fleet clusters. We should be able to utilize the auth and console telemetry to glean details which will allow us to get a picture of console usage by our customers.

There is no way to properly track specific pages

  1. Page titles are localized
  2. Details pages include the project name

Acceptance criteria

  1. The user telemetry page title for all the resource details pages should be changed to the resource · tab-name format. The product name should not be part of the user telemetry page title.
  2. The page title in the UI for all the resource details pages should be changed to the resource-name · resource · tab-name · product-name format (see the example below).
  3. The user telemetry page title should be a non-translated value, for tracking purposes.
  4. The page title in the UI should be a translated value.
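As an illustration (the resource and tab names are made up), for a pod details page this would look like:

UI page title (translated):          my-pod · Pod · Details · OKD
Telemetry page title (untranslated): Pod · Details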

Note:

  • do we need to do anything to be GDPR compliant?

labelKeyForNodeKind now returns a translated value, where it used to return the label key, so rename labelKeyForNodeKind to getTitleForNodeKind.

Description

Update page title to have non-translated title in {resource-name} · {resource} · {tab-name} · OKD format

All page titles of resource details pages are to be added as a non-translated value in the {resource-name} · {resource} · {tab-name} · OKD format inside the <title> component, as an attribute named, for example, data-title-id, and this value is used in fireUrlChangeEvent to send it as the title for the telemetry page event. Refer to spike https://issues.redhat.com/browse/ODC-7269 for more details.

 

Acceptance Criteria

  1. Add a data-title-id attribute for all the resource details page title components
  2. Use data-title-id as the title value when sending the URL change event to telemetry
  3. If the data-title-id attribute value is not present in the title, use the page title value

Additional Details:

Refer spike https://issues.redhat.com/browse/ODC-7269 for more details

Description

Change page title for all resource details pages to {resource-name} · {resource} ·  {tab-name} · OKD

Acceptance Criteria

  1. Page title for all the resource details pages should be changed to  {resource-name} · {resource} · {tab-name} · OKD format
  2. If details page does not have tabs, then {tab-name} can be just "Details"
  3. Page title should be translated value

Additional Details:

Need to check all the resource pages which have details page and change the title.

Feature Overview (aka. Goal Summary)  

One of the steps in doing a disconnected environment install is to mirror the images to a designated system. This feature enhances oc-mirror to handle the multi release payload, that is, the payload that contains all the platform images (x86, Arm, IBM Power, IBM Z). This is a key feature towards supporting disconnected installs in a multi-architecture compute, i.e. mixed architecture, cluster environment.

 

Goals (aka. expected user outcomes)

Customers will be able to use oc-mirror to enable the multi payload in a disconnected environment.

 

Requirements (aka. Acceptance Criteria):

Allow oc-mirror to mirror the multi release payload

 

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

 

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

 

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

 

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

 

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

 

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  Initial completion during Refinement status.

 

Interoperability Considerations

Which other projects and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

Epic Goal

  • Add 'oc new-app' support for creating image streams with manifest list support
  • Add 'oc new-build' support for creating image streams with manifest list support

Why is this important?

  • oc commands that create image streams should work correctly on multi-arch clusters

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. https://issues.redhat.com/browse/IR-289
  2. https://issues.redhat.com/browse/IR-192

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

ACCEPTANCE CRITERIA

  • When creating a build with 'oc new-build' that points to a manifest-listed image, users should be able to set an "--import-mode=" flag to 'PreserveOriginal' to preserve all architectures of the manifest list and let any builder pods build from the manifest-listed image.
  • 'oc new-build' should not cause pods to fail due to being the incorrect architecture 
  • Ensure node scheduling happens properly on a heterogeneous cluster when running 'oc new-build'

 

ImportMode api reference: https://github.com/openshift/api/blob/master/image/v1/types.go#L294

ACCEPTANCE CRITERIA

  • When creating a workload with 'oc new-app' that points to a manifest-listed image, users should be able to set an "--import-mode=" flag to 'PreserveOriginal' in order to preserve all architectures of the manifest list  
  • 'oc new-app --name <name> <manifestlist-image> --import-mode=PreserveOriginal' should not cause pods to fail due to being the incorrect architecture 
  • Ensure node scheduling happens properly on a heterogeneous cluster when running 'oc new-app' with '--import-mode=PreserveOriginal'

 

ImportMode api reference: https://github.com/openshift/api/blob/master/image/v1/types.go#L294
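For illustration, the command from the acceptance criteria with a hypothetical image name:

oc new-app --name myapp quay.io/example/app-manifestlist:latest --import-mode=PreserveOriginal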

Original issue and discussion: https://coreos.slack.com/archives/CFFJUNP6C/p1664890804998069

 

 

Feature Overview (aka. Goal Summary)  

Having additional MCO metrics is helpful to customers who want to closely monitor the state of their Machines and MachineConfigPools.

 

Requirements (aka. Acceptance Criteria):

Add for each MCP:

    - Paused
    - Updated
    - Updating
    - Degraded
    - MachineCount
    - ReadyMachineCount
    - UpdatedMachineCount
    - DegradedMachineCount

Creating this epic to version-scope the improvements merged into 4.14. Since those changes were in a story, they need an epic.

Customers would like to have some MachineConfigOperator metrics in Prometheus. For each MCP:

    - Paused
    - Updated
    - Updating
    - Degraded
    - MachineCount
    - ReadyMachineCount
    - UpdatedMachineCount
    - DegradedMachineCount

Why does the customer need this? (List the business requirements here)

These metrics would be really important, as they could show any MachineConfig action (updating, degraded, ...) and could even trigger an alarm with a PrometheusRule; a sketch follows. Having a MachineConfig dashboard would also be really useful.
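A minimal PrometheusRule sketch; the metric name is hypothetical and should be replaced with whatever name the MCO actually exports:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: machine-config-pool-alerts   # illustrative name
spec:
  groups:
  - name: machine-config-pools
    rules:
    - alert: MachineConfigPoolDegraded
      # hypothetical metric name; substitute the real MCO metric
      expr: mco_machineconfigpool_degraded_machine_count > 0
      for: 15m
      labels:
        severity: warning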

 

Feature Overview

Telecommunications providers continue to deploy OpenShift at the Far Edge. The acceleration of this adoption and the nature of existing Telecommunication infrastructure and processes drive the need to improve OpenShift provisioning speed at the Far Edge site and the simplicity of preparation and deployment of Far Edge clusters, at scale.

Goals

  • Simplicity The folks preparing and installing OpenShift clusters (typically SNO) at the Far Edge range in technical expertise from technician to barista. The preparation and installation phases need to be reduced to a human-readable script that can be utilized by a variety of non-technical operators. There should be as few steps as possible in both the preparation and installation phases.
  • Minimize Deployment Time A telecommunications provider technician or brick-and-mortar employee who is installing an OpenShift cluster, at the Far Edge site, needs to be able to do it quickly. The technician has to wait for the node to become in-service (CaaS and CNF provisioned and running) before they can move on to installing another cluster at a different site. The brick-and-mortar employee has other job functions to fulfill and can't stare at the server for 2 hours. The install time at the far edge site should be in the order of minutes, ideally less than 20m.
  • Utilize Telco Facilities Telecommunication providers have existing Service Depots where they currently prepare SW/HW prior to shipping servers to Far Edge sites. They have asked RH to provide a simple method to pre-install OCP onto servers in these facilities. They want to do parallelized batch installation to a set of servers so that they can put these servers into a pool from which any server can be shipped to any site. They also would like to validate and update servers in these pre-installed server pools, as needed.
  • Validation before Shipment Telecommunications Providers incur a large cost if forced to manage software failures at the Far Edge due to the scale and physical disparate nature of the use case. They want to be able to validate the OCP and CNF software before taking the server to the Far Edge site as a last minute sanity check before shipping the platform to the Far Edge site.
  • IPSec Support at Cluster Boot Some far edge deployments occur on an insecure network, and for that reason access to the host's BMC is not allowed; additionally, an IPSec tunnel must be established before any traffic leaves the cluster once it's at the Far Edge site. It is not possible to enable IPSec on the BMC NIC, and therefore even after OpenShift has booted, the BMC is still not accessible.

Requirements

  • Factory Depot: Install OCP with minimal steps
    • Telecommunications Providers don't want an installation experience, just pick a version and hit enter to install
    • Configuration w/ DU Profile (PTP, SR-IOV, see telco engineering for details) as well as customer-specific addons (Ignition Overrides, MachineConfig, and other operators: ODF, FEC SR-IOV, for example)
    • The installation cannot increase the in-service OCP compute budget (don't install anything other than what is needed for DU)
    • Provide ability to validate previously installed OCP nodes
    • Provide ability to update previously installed OCP nodes
    • 100 parallel installations at Service Depot
  • Far Edge: Deploy OCP with minimal steps
    • Provide site specific information via usb/file mount or simple interface
    • Minimize time spent at far edge site by technician/barista/installer
    • Register with desired RHACM Hub cluster for ongoing LCM
  • Minimal ongoing maintenance of solution
    • Some, but not all telco operators, do not want to install and maintain an OCP / ACM cluster at Service Depot
  • The current IPSec solution requires a libreswan container to run on the host so that all N/S OCP traffic is encrypted. With the current IPSec solution this feature would need to support provisioning host-based containers.

 

A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts.  If a non MVP requirement slips, it does not shift the feature.

Requirement | Notes | isMvp?

 

Describe Use Cases (if needed)

Telecommunications Service Provider Technicians will be rolling out OCP w/ a vDU configuration to new Far Edge sites, at scale. They will be working from a service depot where they will pre-install/pre-image a set of Far Edge servers to be deployed at a later date. When ready for deployment, a technician will take one of these generic-OCP servers to a Far Edge site, enter the site specific information, wait for confirmation that the vDU is in-service/online, and then move on to deploy another server to a different Far Edge site.

 

Retail employees in brick-and-mortar stores will install SNO servers and it needs to be as simple as possible. The servers will likely be shipped to the retail store, cabled and powered by a retail employee and the site-specific information needs to be provided to the system in the simplest way possible, ideally without any action from the retail employee.

 

Out of Scope

Q: how challenging will it be to support multi-node clusters with this feature?

Background, and strategic fit

< What does the person writing code, testing, documenting need to know? >

Assumptions

< Are there assumptions being made regarding prerequisites and dependencies?>

< Are there assumptions about hardware, software or people resources?>

Customer Considerations

< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>

< Are there Upgrade considerations that customers need to account for or that the feature should address on behalf of the customer?>

<Does the Feature introduce data that could be gathered and used for Insights purposes?>

Documentation Considerations

< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >

< What does success look like?>

< Does this feature have doc impact?  Possible values are: New Content, Updates to existing content,  Release Note, or No Doc Impact>

< If unsure and no Technical Writer is available, please contact Content Strategy. If yes, complete the following.>

  • <What concepts do customers need to understand to be successful in [action]?>
  • <How do we expect customers will use the feature? For what purpose(s)?>
  • <What reference material might a customer want/need to complete [action]?>
  • <Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available. >
  • <What is the doc impact (New Content, Updates to existing content, or Release Note)?>

Interoperability Considerations

< Which other products and versions in our portfolio does this feature impact?>

< What interoperability test scenarios should be factored by the layered product(s)?>

Questions

Question | Outcome

 

 

Epic Goal

  • Install SNO within 10 minutes

Why is this important?

  • SNO installation takes around 40+ minutes.
  • This makes SNO less appealing when compared to k3s/microshift.
  • We should analyze the SNO installation, figure out why it takes so long, and come up with ways to optimize it.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

  1. https://docs.google.com/document/d/1ULmKBzfT7MibbTS6Sy3cNtjqDX1o7Q0Rek3tAe1LSGA/edit?usp=sharing

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description of problem:

While trying to figure out why it takes so long to install Single Node OpenShift, I noticed that the kube-controller-manager cluster operator is degraded for ~5 minutes due to:
GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp 172.30.119.108:9091: connect: connection refused
I don't understand how the prometheusClient is successfully initialized, but we get a connection refused once we try to query the rules.
Note that if the client initialization fails, the kube-controller-manager won't set GarbageCollectorDegraded to true.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

100%

Steps to Reproduce:

1. Install SNO with bootstrap-in-place (https://github.com/eranco74/bootstrap-in-place-poc)

2. Monitor the cluster operators' status

Actual results:

GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp 172.30.119.108:9091: connect: connection refused 

Expected results:

Expected the GarbageCollectorDegraded status to be false

Additional info:

It seems that for the PrometheusClient to be successfully initialized it needs to successfully create a connection, yet we get connection refused once we make the query.
Note that installing SNO with this patch (https://github.com/eranco74/cluster-kube-controller-manager-operator/commit/26e644503a8f04aa6d116ace6b9eb7b9b9f2f23f) reduces the installation time by 3 minutes


The openshift-service-ca service-ca pod takes a few minutes to start when installing SNO.

kubectl get events -n openshift-service-ca --sort-by='.metadata.creationTimestamp' -o custom-columns=FirstSeen:.firstTimestamp,LastSeen:.lastTimestamp,Count:.count,From:.source.component,Type:.type,Reason:.reason,Message:.message                      
FirstSeen              LastSeen               Count   From                                                                                              Type      Reason                 Message
2023-01-22T12:25:58Z   2023-01-22T12:25:58Z   1       deployment-controller                                                                             Normal    ScalingReplicaSet      Scaled up replica set service-ca-6dc5c758d to 1
2023-01-22T12:26:12Z   2023-01-22T12:27:53Z   9       replicaset-controller                                                                             Warning   FailedCreate           Error creating: pods "service-ca-6dc5c758d-" is forbidden: error fetching namespace "openshift-service-ca": unable to find annotation openshift.io/sa.scc.uid-range
2023-01-22T12:27:58Z   2023-01-22T12:27:58Z   1       replicaset-controller                                                                             Normal    SuccessfulCreate       Created pod: service-ca-6dc5c758d-k7bsd
2023-01-22T12:27:58Z   2023-01-22T12:27:58Z   1       default-scheduler                                                                                 Normal    Scheduled              Successfully assigned openshift-service-ca/service-ca-6dc5c758d-k7bsd to master1
 

It seems that creating the service-ca namespace early allows it to get the openshift.io/sa.scc.uid-range annotation and start running earlier; the service-ca pod is required for other pods (CVO and all the control plane pods) to start, since it creates the serving-cert. See the sketch below.

  • I'm not sure this is a CVO issue, but I think CVO is the one creating the namespace; CVO also renders some manifests during bootkube, so it seems like the right component.
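For reference, a sketch of what SCC admission needs on the namespace before pods can be created; the range value is illustrative, as in practice it is assigned by the cluster once the namespace is reconciled:

apiVersion: v1
kind: Namespace
metadata:
  name: openshift-service-ca
  annotations:
    openshift.io/sa.scc.uid-range: 1000420000/10000   # example value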

Description of problem:

The bootkube scripts spend ~1 minute failing to apply manifests while waiting for the openshift-config namespace to be created.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

100%

Steps to Reproduce:

1. Run the POC using the Makefile here: https://github.com/eranco74/bootstrap-in-place-poc
2. Observe the bootkube logs (pre-reboot)

Actual results:

Jan 12 17:37:09 master1 cluster-bootstrap[5156]: Failed to create "0000_00_cluster-version-operator_01_adminack_configmap.yaml" configmaps.v1./admin-acks -n openshift-config: namespaces "openshift-config" not found
....
Jan 12 17:38:27 master1 cluster-bootstrap[5156]: "secret-initial-kube-controller-manager-service-account-private-key.yaml": failed to create secrets.v1./initial-service-account-private-key -n openshift-config: namespaces "openshift-config" not found

Here are the logs from another installation showing that it's not 1 or 2 manifests that require this namespace to get created earlier:

Jan 12 17:38:10 master1 bootkube.sh[5121]: "etcd-ca-bundle-configmap.yaml": failed to create configmaps.v1./etcd-ca-bundle -n openshift-config: namespaces "openshift-config" not found
Jan 12 17:38:10 master1 bootkube.sh[5121]: "etcd-client-secret.yaml": failed to create secrets.v1./etcd-client -n openshift-config: namespaces "openshift-config" not found
Jan 12 17:38:10 master1 bootkube.sh[5121]: "etcd-metric-client-secret.yaml": failed to create secrets.v1./etcd-metric-client -n openshift-config: namespaces "openshift-config" not found
Jan 12 17:38:10 master1 bootkube.sh[5121]: "etcd-metric-serving-ca-configmap.yaml": failed to create configmaps.v1./etcd-metric-serving-ca -n openshift-config: namespaces "openshift-config" not found
Jan 12 17:38:10 master1 bootkube.sh[5121]: "etcd-metric-signer-secret.yaml": failed to create secrets.v1./etcd-metric-signer -n openshift-config: namespaces "openshift-config" not found
Jan 12 17:38:10 master1 bootkube.sh[5121]: "etcd-serving-ca-configmap.yaml": failed to create configmaps.v1./etcd-serving-ca -n openshift-config: namespaces "openshift-config" not found
Jan 12 17:38:10 master1 bootkube.sh[5121]: "etcd-signer-secret.yaml": failed to create secrets.v1./etcd-signer -n openshift-config: namespaces "openshift-config" not found
Jan 12 17:38:10 master1 bootkube.sh[5121]: "kube-apiserver-serving-ca-configmap.yaml": failed to create configmaps.v1./initial-kube-apiserver-server-ca -n openshift-config: namespaces "openshift-config" not found
Jan 12 17:38:10 master1 bootkube.sh[5121]: "openshift-config-secret-pull-secret.yaml": failed to create secrets.v1./pull-secret -n openshift-config: namespaces "openshift-config" not found
Jan 12 17:38:10 master1 bootkube.sh[5121]: "openshift-install-manifests.yaml": failed to create configmaps.v1./openshift-install-manifests -n openshift-config: namespaces "openshift-config" not found
Jan 12 17:38:10 master1 bootkube.sh[5121]: "secret-initial-kube-controller-manager-service-account-private-key.yaml": failed to create secrets.v1./initial-service-account-private-key -n openshift-config: namespaces "openshift-config" not found

Expected results:

Expected the resources to be created successfully without having to wait for the namespace to be created.

Additional info:
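One possible mitigation, sketched: render the namespace as an early bootkube manifest so the dependent resources apply on the first attempt:

apiVersion: v1
kind: Namespace
metadata:
  name: openshift-config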

 

Feature Goal

  • Definition of a CU Profile
  • Deployment of the CU profile on multi-node Bare Metal clusters using the RH declarative framework.

Why is this important?

  • Telcos will want minimal hands-on installs of all infrastructure.

Requirements

  1. CU infrastructure deployment and life-cycle management must be performed through the ZTP workflow toolset (SiteConfig, PolicyGen, ACM and ArgoCD)
  2. Performance tuning:
    • Non-RT kernel
    • Huge pages set per NUMA
  3. Day 2 operators:
    • SR-IOV network operator and sample configuration
    • OCS / ODF sample configuration, highly available storage
    • Cluster logging operator and sample configuration
  4. Additional features
    • Disk encryption (which?)
    • SCTP
    • NTP time synchronization
    • IPV4, IPV6 and dual stack options

Scenarios

  1. CU on a Three Node Cluster - zero touch provisioning and configuration
  2. CU can be on SNO, SNO+1 worker or MNO (up to 30 nodes)
  3. Cluster expansion
  4. y-stream and z-stream upgrade
  5. in-service upgrade (progressively update cluster)
  6. EUS to EUS upgrade

Acceptance Criteria

  • Reference configurations released as part of ZTP
  • Scenarios validated and automated (Reference Design Specification)
  • Lifecycle scenarios are measured and optimized
  • Documentation completed

Open questions::

  1. What kind of disk encryption is required?
  2. Should any work be done on ZTP cluster expansion?
  3. What KPIs must be met? CaaS CPU/RAM/disk budget KPIs/targets? Overall upgrade time, cluster downtime, number of reboots per node type targets? oslat/etc targets?

References:

  1. RAN DU/CU Requirements Matrix
  2. CU baseline profile 2020
  3. CU profile - requirements
  4. Nokia blueprints

https://docs.google.com/document/d/13Db7uChVx-2JXqAMJMexzHbhG3XLNLRy9nZ_7g9WbFU/edit#

Epic Goal

  • Enable setting node labels on the spoke cluster during installation
    • Right now we need to add roles; need to check if additional labels are required

Why is this important?

Scenarios

  1. In the ZTP flow, a user would like to mark nodes with additional roles, like rt, storage, etc., in addition to the master/worker roles that we have right now and support by default

Acceptance Criteria

  • CI - MUST be running successfully with tests automated

Open questions::

  1. How master/worker roles are getting to the nodes, maybe we can use the same flow?
  2. Do we need to support only roles or in general supply labels?
  3. Another alternative is to use https://github.com/openshift/assisted-service/blob/d1cde6d398a3574bda6ce356411cba93c74e1964/swagger.yaml#L4071, a remark is that this will work only for day1

 Modify the scripts in assisted-service/deploy/operator/ztp.
The following environment variables will be added:

MANIFESTS: JSON containing the manifests to be added for day1 flow.  The key is the file name, and the value is the content.

NODE_LABELS: Dictionary of dictionaries. The outer dictionary key is the node name and the value is the node labels (key, value) to be applied.

MACHINE_CONFIG_POOL: Dictionary of strings.  The key is the node name and the value is machine config pool name.

SPOKE_WORKER_AGENTS: Number of worker nodes to be added as part of day1 installation.  Default 0
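For illustration, how these variables might be set before invoking the scripts (all names and values here are made up):

export MANIFESTS='{"99-custom-mc.yaml": "<manifest content>"}'
export NODE_LABELS='{"master-0": {"node-role.kubernetes.io/rt": "", "site": "depot-1"}}'
export MACHINE_CONFIG_POOL='{"worker-0": "rt-pool"}'
export SPOKE_WORKER_AGENTS=2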

Complete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Collect on-prem installation data in order to be able to structure similar ELK dashboards as from SaaS deployments
  • Collect info of ZTP/CIM deployments
  • Collect info of BILLI deployments

Why is this important?

  • We want to track trends, and be able to analyze on-prem installations

Scenarios

  1. As a cluster administrator, I can provision and manage my fleet of clusters knowing that every data point is collected and sent to the Assisted Installer team without having to do anything extra. I know my data will be safe and secure and the team will only collect data they need to improve the product.
  2. As a developer on the assisted installer team, I can analyze the customer data to determine if a feature is worth implementing/keeping/improving. I know that the customer data is accurate and up-to-date. All of the data is parse-able and can be easily tailored to the graphs/visualizations that help my analysis.
  3. As a product owner, I can determine if the product is moving in the right direction based on the actual customer data. I can prioritize features and bug fixes based on the data.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. [Internal] MGMT-11244 Decision for which event streaming service used will determine the endpoint we send the data to

Previous Work (Optional):

 

 MGMT-11244: Remodeling of SaaS data pipeline

 

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>
  1. Query assisted service events for this cluster when a cluster reaches an "end state"
    1. End states include when the cluster is in state `error`, `cancelled`, `installed`
  2. Authenticate and send data to data streaming service

 

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • We need a new API that will allow us to skip cluster/host validations.
  • This API should have its own feature flag.

Why is this important?

  • Some customers and partners have very specific HW that doesn't pass our validations, and we want to allow them to install anyway.
  • Sometimes we have bugs in our validations that block people from installing, and we don't want our partners to be stuck because of us.

Scenarios

  1. Example from Kaloom:
    1. Kaloom has a very specific setup where VIPs can be shown as busy even though installation can proceed with them.
    2. Currently they need to override the VIPs in the install config to be able to install the cluster.
    3. After adding the new API they can just use it and skip this specific validation.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • A feature flag for this API should be added to the statistics calculator, and if it was set, cluster failures should not be counted.

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description of the problem:

Documentation for the ignore validation API should be updated with the correct JSON string arrays:

  • The JSON string arrays currently documented (L53 and L62) are:
{ "ignored_host_validations": "[\"all\"]" "ignored_cluster_validations": "[\"all\"]" }

While it should be :

{ "host-validation-ids": "[\"all\"]", "cluster-validation-ids": "[\"all\"]" }

How reproducible:

 

Steps to reproduce:

1.

2.

3.

Actual results:

 

Expected results:

Description of the problem:

In BE 2.16.0 Staging, while a cluster is in the installed or installing state, the ignore validation API changes the validations; this should be blocked.

How reproducible:

100%

Steps to reproduce:

1. Send this call to an installed cluster:

curl -i -X PUT 'https://api.stage.openshift.com/api/assisted-install/v2/clusters/${cluster_id}/ignored-validations'   --header "Authorization: Bearer $(ocm token)"   -H 'accept: application/json'   -H 'Content-Type: application/json' -d '{"host-validation-ids": "[\"all\"]", "cluster-validation-ids": "[\"all\"]"}'
 

2. Cluster validation is changed

3.

Actual results:

 

Expected results:

Description of the problem:

In staging, BE 2.17.0 - the ignore validation API has no validation of the values sent. For example:

curl -X 'PUT' 'https://api.stage.openshift.com/api/assisted-install/v2/clusters/be4cdbef-7ea6-48f6-a30a-d1169eeb38fb/ignored-validations'   --header "Authorization: Bearer $(ocm token)"   -H 'accept: application/json'   -H 'Content-Type: application/json'   -d '{
  "host-validation-ids": "[\"testTest\",\"HasCPUCoresForRole\"]",
  "cluster-validation-ids": "[]"       
}'

Stores:

 {"cluster-validation-ids":"[]","host-validation-ids":"[\"testTest\",\"HasCPUCoresForRole\"]"}

How reproducible:

100%

Steps to reproduce:

1.

2.

3.

Actual results:

 

Expected results:

To help customers with debugging, we need to be able to include the noo pods and resources in the collected must-gather output.

To collect it, use:

oc adm must-gather
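If a dedicated must-gather image is published for these pods, it could presumably be passed explicitly (the image name here is hypothetical):

oc adm must-gather --image=quay.io/example/noo-must-gather:latest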

User Story

As a developer, I want to have my testing and build tooling managed in a consistent way, to reduce the number of context switches while doing maintenance work.

Background

Currently our approach to managing and updating auxiliary tooling (such as envtest, controller-gen, etc.) is inconsistent. A solid pattern was introduced in the CPMS repo, which relies on the Go toolchain to update, vendor, and run this auxiliary tooling.

For CPMS context see: 

https://github.com/openshift/cluster-control-plane-machine-set-operator/blob/main/tools/tools.go

https://github.com/openshift/cluster-control-plane-machine-set-operator/blob/main/go.mod#L24

https://github.com/openshift/cluster-control-plane-machine-set-operator/blob/main/Makefile#L19
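In short, the pattern pins the tool versions in go.mod and runs the tools through the Go toolchain; for example (a sketch, assuming the tools are vendored as in the links above):

# Versions come from go.mod, so dependency bumps also bump the tools.
go run sigs.k8s.io/controller-tools/cmd/controller-gen object paths=./...
go run sigs.k8s.io/controller-runtime/tools/setup-envtest use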

Steps

  • Align envtest, controller-gen and other tooling management with the pattern introduced within the CPMS repo (a minimal sketch follows below)
  • Introduce an additional test that compares the envtest version (if envtest is in use within a particular repo) with the version of the k8s-related libraries in use. This will help us not to forget to update envtest and other aux tools during dependency bumps.
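
A minimal sketch of the pattern, assuming the tools are pinned in go.mod via a tools.go file with blank imports (the exact targets live in the CPMS Makefile linked above); tool versions then ride along with normal go.mod updates:

# Invoke vendored tooling through the Go toolchain instead of separately installed binaries
go run sigs.k8s.io/controller-tools/cmd/controller-gen object paths=./...
go run sigs.k8s.io/controller-runtime/tools/setup-envtest use 1.26.x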

Stakeholders

  • Cluster infra team

Definition of Done

  • All Cluster Infra Team owned repos are updated and use a consistent pattern for auxiliary tools management
    • REPO LIST TBD, raw below
    • MAPI providers
    • MAO
    • CCCMO
    • CMA
  • Testing
  • Existing tests should pass
  • An additional test checking the envtest version should be introduced

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The Kube APIServer has a sidecar to output audit logs. We need similar sidecars for other APIServers that run on the control plane side. We also need to pass the same audit log policy that we pass to the KAS to these other API servers.

The goal of this EPIC is to solve several issues related to PDBs.

PDBs were causing issues during OCP upgrades, especially when new apiservers (which roll out one by one) were wedged (there was an issue with networking on new pods due to RHEL upgrades).

slack thread: https://redhat-internal.slack.com/archives/CC3CZCQHM/p1673886138422059

Incomplete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled

This epic tracks "business as usual" requirements / enhancements / bug fixing of the Insights Operator.

The cluster ID can change in some corner cases (see CCXDEV-10284), but the "insights client" in the Insights Operator caches the cluster ID (see here) in memory and uses it in the "user-agent" header when uploading the data. This can lead to a situation where the cluster reports (to Insights) with a stale/old cluster ID.

Epic Goal

Remove the code that was added through the ACM integration from all of the console's codebase repositories

Why is this important?

Since a decision was made to stop the ACM integration, we as a team decided that it would be better to remove the unused code in order to avoid any confusion or regressions.

Acceptance Criteria

  • Identify all the places from which we need to remove the code that was added during the ACM integration.
  • Come up with a plan for how to remove the code from our repositories and CI
  • Remove the code from the console-operator repo
  • Start with code removal from the console repository

Scour through the console repo and mark all multicluster-related code for removal by adding a "TODO remove multicluster" comment.

 

AC:

  • All multicluster-related console code is marked with a "TODO remove multicluster" comment.

Placeholder epic to track spontaneous tasks which do not deserve their own epic.

AC:

We have the connectDirectlyToCloudAPIs flag in the Konnectivity SOCKS5 proxy to dial cloud providers directly without going through Konnectivity.

This introduces another path for exceptions: https://github.com/openshift/hypershift/pull/1722

We should consolidate both by keeping connectDirectlyToCloudAPIs until there's a reason not to.

 

ServicePublishingStrategy of type LoadBalancer or Route could specify the same hostname, which will result in one of the services not being published, i.e. no DNS records created.
context: https://redhat-internal.slack.com/archives/C04EUL1DRHC/p1678287502260289
 
DOD:
Validate ServicePublishingStrategy and report conflicting services hostnames

Most of our condition statuses are driven by the programmatic output of reconciliation loops.

E.g., the HostedCluster Available condition:

  • depends on kas, etcd and infra conditions.
  • For kas/etcd we check that the Deployment/StatefulSet resources are healthy.

This is a good signal for day 1, but we might be missing the relevant real state of the world for day 2. E.g.:

  • Do we flip the HCAvailable condition if our ingress controller is deleted/unhealthy?
  • Do we flip the HCAvailable condition if a Route resource is deleted?
  • Do we flip the HCAvailable condition if the LB is deleted out of band?

DoD:

Reproduce and review the behaviour of the examples above.

Consider adding additional knowledge for computing the HCAvailable condition. Health-check the expected day-2 holistic e2e behaviour rather than the particular status of subcomponents.

E.g., actually query the kas through the URL we expose.
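
A hedged sketch of such a probe (the hostname and port are placeholders):

# Probe the hosted kube-apiserver through its externally exposed endpoint
curl -sk https://api.my-hosted-cluster.example.com:6443/healthz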

The HyperShift operator deployment fails when we try to deploy it on the RootCI server, which has PSA enabled. So we need to make the HyperShift operator deployment compliant with the restricted PSA profile.

Event:

0s          Warning   FailedCreate        replicaset/operator-66cc5794c9       (combined from similar events): Error creating: pods "operator-66cc5794c9-k2sq7" is forbidden: violates PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "operator" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "operator" must set securityContext.capabilities.drop=["ALL"]), seccompProfile (pod or container "operator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost") 
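
A minimal sketch of a fix, setting exactly the securityContext fields named in the event above (the deployment name and namespace are illustrative):

# Add a restricted-PSA-compliant securityContext to the operator container
oc patch deployment operator -n hypershift --type=json -p '[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/securityContext",
   "value": {"allowPrivilegeEscalation": false,
             "capabilities": {"drop": ["ALL"]},
             "seccompProfile": {"type": "RuntimeDefault"}}}
]'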

DoD:

This feature is supported by ROSA.

To have an e2e to validate publicAndPrivate <-> Private in the presubmits.

This is a placeholder to capture the necessary CI changes to do every release cut.

There are a few places in our CI config which requires pinning to the new release every release cut:

DOD:

Make sure we have this documented in hypershift repo and that all needed is done for current release branch.

Once the HostedCluster and NodePool are paused using the pausedUntil statement, the awsprivatelink controller still continues reconciling.

 

How to test this:

  • Deploy a private cluster
  • Put it in pause once deployed (see the sketch after this list)
  • Delete the AWSEndPointService and the Service from the HCP namespace
  • Wait for a reconciliation; the result is that they should not be recreated
  • Unpause it and wait for recreation.
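
A hedged way to do the pause step, assuming the pausedUntil field of the HostedCluster API (resource name and namespace are placeholders):

# Pause reconciliation of the HostedCluster
oc patch hostedcluster example -n clusters --type=merge -p '{"spec":{"pausedUntil":"true"}}'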

DoD:

If we change a NodePool from having .replicas to autoscaler min/max and set a min beyond the current replicas, that might leave the MachineDeployment in a state not suitable for autoscaling. This requires the consumer to ensure the min is <= current replicas, which is poor UX. Ideally we should automate this.
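
A hedged sketch of the manual workaround the consumer has to perform today (resource names and the autoScaling field paths are assumptions about the NodePool API; it assumes .spec.replicas is currently set):

# Clamp the autoscaler min to the current replica count before enabling autoscaling
current=$(oc get nodepool example -n clusters -o jsonpath='{.spec.replicas}')
min=4
if [ "$min" -gt "$current" ]; then min="$current"; fi
oc patch nodepool example -n clusters --type=merge \
  -p "{\"spec\":{\"replicas\":null,\"autoScaling\":{\"min\":$min,\"max\":8}}}"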

OCP components could change their image key in the release payload, which might not be immediately visible to us and would break Hypershift. 

 
DOD:
Validate that the release contains all the images required by HyperShift and report missing images in a condition
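
For reference, individual image keys in a release payload can be inspected with oc adm release info (the release pullspec and component name below are placeholders):

# Look up the pullspec behind a named image key in the release payload
oc adm release info quay.io/openshift-release-dev/ocp-release:4.13.0-x86_64 --image-for=cluster-autoscaler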

DoD:

At the moment, if the input etcd KMS encryption (key and role) is invalid, we fail without surfacing the error.

We should check that both the key and role are compatible/operational for a given cluster and fail in a condition otherwise.

AWS has a hard limit of 100 OIDC providers globally. 
Currently each HostedCluster created by e2e creates its own OIDC provider, which results in hitting the quota limit frequently and causing the tests to fail as a result.

 
DOD:
Only a single OIDC provider should be created and shared between all e2e HostedClusters. 

Epic Goal

As an OpenShift infrastructure owner, I want to use the Zero Touch Provisioning flow with RHACM, where RHACM is in a dual-stack hub cluster and the deployed cluster is an IPv6-only cluster.

Why is this important?

Currently ZTP doesn't work when provisioning IPv6 clusters from a dual-stack hub cluster. We have customers who aim to deploy new clusters via ZTP that don't have IPv4 and work exclusively over IPv6. To enable this use case, work on the metal platform has been identified as a requirement.

Dependencies

Converge IPI and ZTP Boot Flows: METAL-10

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

     

 

Epic Goal

  • Currently, we poll events from assisted-service, enrich them, and push them to Elastic in the event scrape service.
    In order to also support sending events from on-prem environments, we need to remodel the data pipelines towards a push-based model. Since we'll benefit from this approach in the SaaS environment as well, we'll seek a model as unified as possible.

Why is this important?

  • Support on-prem environments
  • Increase efficiency (we'll stop performing thousands of requests per minute to the SaaS)
  • Enhance resilience (right now if something fails, we have a relatively short time window to fix it before we lose data)

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. Make a decision on what design to implement (internal)
  2. Authorization with pull-secret (TBD, is there a ticket for this? Oved Ourfali )
  3. RH Pipelines RHOSAK implementation

Previous Work (Optional):

  1. First analysis
  2. We then discussed the topic extensively: Riccardo Piccoli Igal Tsoiref Michael Levy liat gamliel Oved Ourfali Juan Hernández 
  3. We explored already existing systems that would support our needs, and we found that RH Pipelines almost exactly matches them:
  • Covers auth needed from on prem to the server
  • Accepts HTTP-based payload and files to be uploaded (very handy for bulk upload from on-prem)
  • Lacks routing: limits our ability to scale data processing horizontally
  • Lacks infinite data retention: the original design has kafka infinite retention as key characteristic
  4. We need to evaluate the requirements and options we have to implement the system. Another analysis with a few alternatives

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Roadmap

  • Stream events from service to kafka
  • Enable feature flag hiding this feature in staging to gather data
  • Read events and project them to elasticsearch
  • Process on-prem events and re-stream them into the kafka stream
  • Adapt CCX export

We are missing event notifications on the creation of some resources. We need to make sure notifications are sent.

1. Proposed title of this feature request

Delete worker nodes using GitOps / ACM workflow

2. What is the nature and description of the request?

We use SiteConfig to deploy a cluster using the GitOps / ACM workflow. We can also use SiteConfig to add worker nodes to an existing cluster. However, today we cannot delete a worker node using the GitOps / ACM workflow. We need to go and manually delete the resources (BMH, NMStateConfig, etc.) and the OpenShift node. We would like to have the node deleted as part of the GitOps workflow.

3. Why does the customer need this? (List the business requirements here)

Worker nodes may need to be replaced for any reason (hardware failures) which may require deletion of a node.

If we are colocating OpenShift and OpenStack control planes on the same infrastructure (using OpenStack director operator to create OpenStack control plane in OCP virtualization), then we also have the use case of assigning baremetal nodes as OpenShift worker nodes or OpenStack compute nodes. Over time we may need to change the role of those baremetal nodes (from worker to compute or from compute to worker). Having the ability to delete worker nodes via GitOps will make it easier to automate that use case.

4. List any affected packages or components.

ACM, GitOps

In order to cleanly remove a node without interrupting existing workloads it should be cordoned and drained before it is powered off.

This should be handled by BMAC and should not interrupt processing of other requests. The best implementation I could find so far is in the kubectl code, but using that directly is a bit problematic as the call waits for all the pods to be stopped or evicted before returning. There is a timeout, but then we have to either give up after one call and remove the node anyway, or track multiple calls to drain across multiple reconciles.

We should come up with a way to drain asynchronously (maybe investigate what CAPI does).
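
For reference, the synchronous CLI equivalent of what BMAC needs to replicate asynchronously is roughly (node name is a placeholder):

# Cordon, then drain: the drain call blocks until pods are evicted or the timeout hits
oc adm cordon worker-1
oc adm drain worker-1 --ignore-daemonsets --delete-emptydir-data --timeout=60s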

We should allow for users to control removing the spoke node using resources on the hub.

For the ZTP GitOps case, this needs to be the BMH, as those users are not aware of the Agent resource.

The user will add an annotation to the BMH to indicate that they want us to manage the lifecycle of the spoke node based on the BMH. Then, when the BMH is deleted we will clean the host and remove it from the spoke cluster.
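
A hedged sketch of what that could look like (the annotation key is an assumption, not a confirmed name):

# Opt the BMH into spoke-node lifecycle management; deleting the BMH then removes the node
oc annotate baremetalhost worker-1 -n spoke-cluster \
  bmac.agent-install.openshift.io/remove-agent-and-node-on-delete=true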

Epic Goal

  • Implement pagination for the events (API ref).

Why is this important?

  • The number of events fetched by clients could be very large, and fetching all of them in a long-polling loop negatively impacts their performance.

Considerations

  • Features in current UI design
  • We must define the semantics of the "Filter by text" field. Right now it executes the filtering on the client.
    Once we have pagination in place, do we want this field to filter only the entries on the active page, similar to what it does today, or should it execute a query so the filtering is performed on the BE? (A hedged request sketch follows after this list.)
  • The API should contain information about the number of pages available, given the number of entries per page the user would like to see.
  • The API should return the current page number.
  • Link to a Patternfly Table demo for reference on what data 
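
A hedged sketch of a paginated request (the limit/offset parameter names are assumptions about the eventual API shape):

# Fetch one page of 10 events for a cluster
curl -s "https://api.stage.openshift.com/api/assisted-install/v2/events?cluster_id=${cluster_id}&limit=10&offset=0" \
  --header "Authorization: Bearer $(ocm token)"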

Description of the problem:

In staging, UI 2.19.6 - In new cluster events - number of events is shown as "1-10 of NaN" instead of the real number

How reproducible:

100%

Steps to reproduce:

1.

2.

3.

Actual results:

 

Expected results:

Epic Goal

  • Today we have an API for feature support per OCP version, but we don't have an API for feature support per architecture.
    For example, ODF is not supported on ARM, so we hard-coded blocking it in the UI, and we return a Bad Request if the user asks for it via the API.
    Now that we have more architectures, such as ppc64le and s390x, this becomes more complicated.

Why is this important?

  • We would like to use the same API for both the BE & UI, which we can maintain instead of hard-coded per-architecture limitations in the UI 

Scenarios

  1. We have 4 architectures: x86_64, arm64, s390x, ppc64le
  2. We have a few features for each architecture:
    1. Static IP
    2. UMN
    3. Dual stack
    4. OLM operators: LVMS/ ODF/ CNV/ LSO/ MCE
    5. Platform type: Vsphere, Nutanix
    6. Disk encryption
    7. CMN
    8. SNO
    9. heterogeneous clusters 
       

Acceptance Criteria

  • Test each feature and architecture combination both via UI & API.
  •  

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. we have this internal doc: https://docs.google.com/spreadsheets/d/1RmU5cMoQgN-5Rk5i13nwrRoqXv3nDZ4uo65PXQnsKNc/edit#gid=0 
  2. we have OpenShift Multi Architecture Component Availability Matrix page

Open questions::

Done Checklist

  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • UI  

Description of the problem:

 

How reproducible:

 

Steps to reproduce:

1.

2.

3.

Actual results:

 

Expected results:

Description of the problem:

The method returns an empty object when calling GET v2/support-levels/features?openshift_version=X

How reproducible:

Call GET v2/support-levels/features?openshift_version=4.13

Steps to reproduce:

1. Call GET v2/support-levels/features?openshift_version=4.13

2.

3.

Actual results:

{}

Expected results:

{ "FEATURE_A": "supported", "FEATURE_B": "supported", ... }
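
For reference, a well-formed request looks like the call below; the cpu_architecture parameter is an assumption based on the per-architecture epic above:

# Query feature support levels, optionally scoped to an architecture
curl -s "https://api.stage.openshift.com/api/assisted-install/v2/support-levels/features?openshift_version=4.13&cpu_architecture=s390x" \
  --header "Authorization: Bearer $(ocm token)"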

Description of the problem:

Returning a Bad Request on feature-support validation collides with the multi-platform feature.

Whenever the user sets the CPU architecture to P or Z, the platform is changed to multi, causing loss of information and not failing the cluster registration/update.

 

How reproducible:

Register a cluster with s390x as CPU architecture on OCP version 4.12 

 

Expected results:

Bad Request 

Description of the problem:

BE 2.17.4 - (using API calls) creating a new cluster, PATCHing it with OLM operators, and then creating a new infra-env with P/Z should be blocked, but is allowed

How reproducible:

100%

Steps to reproduce:

1. Create new cluster

 curl -X 'POST' \
   'https://api.stage.openshift.com/api/assisted-install/v2/clusters/' \
   --header "Authorization: Bearer $(ocm token)" \
   -H 'accept: application/json' \
   -H 'Content-Type: application/json' \
   -d '{
     "name": "s390xsno2413",
     "high_availability_mode": "Full",
     "openshift_version": "4.13",
     "pull_secret": "${pull_secret}",
 "base_dns_domain": "redhat.com",
     "cpu_architecture": "s390x",
     "disk_encryption": {
         "mode": "tpmv2",
         "enable_on": "none"
     },
     "tags": "",
 "user_managed_networking": true
 }'

2. Patch with OLM operators

curl -i -X 'PATCH'   'https://api.stage.openshift.com/api/assisted-install/v2/clusters/c05ba143-cf22-44ec-b1fd-edad5d8ca5a9'   --header "Authorization: Bearer $(ocm token)"   -H 'accept: application/json'   -H 'Content-Type: application/json'   -d '{
    "olm_operators":[{"name":"cnv"},{"name":"lso"},{"name":"odf"}]
}'

3. Create infra-env

curl -X 'POST'   'https://api.stage.openshift.com/api/assisted-install/v2/infra-envs'   --header "Authorization: Bearer $(ocm token)"   -H 'accept: application/json'   -H 'Content-Type: application/json'   -d '{
    "name": "tests390xsno_infra-env2",
    "pull_secret": "${pull_secret}",
    "cluster_id": "c05ba143-cf22-44ec-b1fd-edad5d8ca5a9",
    "openshift_version": "4.13",
    "cpu_architecture": "s390x"
}' 

Actual results:

Infra-env created

Expected results:
Should be blocked

Create a single place in assisted-service (update/register cluster) where we return a Bad Request in case the feature combination is not supported

Description of the problem:

Currently, installing a ppc64le cluster with Cluster Managed Networking enabled and a Minimal ISO is not supported.

 

Steps to reproduce:

1. Create ppc64le cluster with UMN enabled 

 

Actual results:

BadRequest

 

Expected results:

Created successfully 

Feature goal (what are we trying to solve here?)

vSphere platform configuration is a bit different on OCP 4.13.

Changes needed:

DoD (Definition of Done)

  • Update the install-config to remove any deprecated parameters
  • Update the post-installation guide

Does it need documentation support?

Yes

Feature origin (who asked for this feature?)

  • A Customer asked for it

    • Name of the customer(s)
    • How many customers asked for it?
    • Can we have a follow-up meeting with the customer(s)?

 

  • A solution architect asked for it

    • Name of the solution architect and contact details
    • How many solution architects asked for it?
    • Can we have a follow-up meeting with the solution architect(s)?

 

  • Internal request

    • Who asked for it?

 

  • Catching up with OpenShift

Reasoning (why it’s important?)

  • Please describe why this feature is important
  • How does this feature help the product?

Competitor analysis reference

  • Do our competitors have this feature?
    • Yes, they have it and we can have some reference
    • No, it's unique or explicit to our product
    • No idea. Need to check

Feature usage (do we have numbers/data?)

  • We have no data - the feature doesn’t exist anywhere
  • Related data - the feature doesn’t exist but we have info about the usage of associated features that can help us
    • Please list all related data usage information
  • We have the numbers and can relate to them
    • Please list all related data usage information

Feature availability (why should/shouldn't it live inside the UI/API?)

  • Please describe the reasoning behind why it should/shouldn't live inside the UI/API
  • If it's for a specific customer we should consider using AMS
  • Does this feature exist in the UI of other installers?

OCP/Telco Definition of Done
Epic Template descriptions and documentation.


Epic Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Manage the effort for adding jobs for release-ocm-2.8 on assisted installer

https://docs.google.com/document/d/1WXRr_-HZkVrwbXBFo4gGhHUDhSO4-VgOPHKod1RMKng

 

Merge order:

  1. Add temporary image streams for Assisted Installer migration - day before (make sure images were created)
  2. Add Assisted Installer fast forwards for ocm-2.x release <depends on #1> - need approval from test-platform team at https://coreos.slack.com/archives/CBN38N3MW 
  3. Branch-out assisted-installer components for ACM 2.(x-1) - <depends on #1, #2> - At the day of the FF
  4. Prevent merging into release-ocm-2.x - <depends on #3> - At the day of the FF
  5. Update BUNDLE_CHANNELS to ocm-2.x on master - <depends on #3> - At the day of the FF
  6. ClusterServiceVersion for release 2.(x-1) branch references "latest" tag <depends on #5> - After  #5
  7. Update external components to AI 2.x <depends on #3> - After a week, if there are no issues update external branches
  8. Remove unused jobs - after 2 weeks

 

Epic Description

This is the second part of Customizations for Node Exporter, following https://issues.redhat.com/browse/MON-2848
There are the following tasks remaining:

  • On/off switch for these collectors:
    • systemd
    • hwmon
    • mountstats (pending decision which metrics to collect)
    • ksmd
  • General options for node-exporter
    • maxprocs

 

The "mountstats" collector generates 53 high-cardinality metrics by default, we have to refine the story to choose only the necessary metrics to collect.

 

Cluster Monitoring Operator uses the configmap "cluster-monitoring-config" in the namespace "openshift-monitoring" as its configuration. These new configurations will be added into the section  "nodeExporter".

Node Exporter comes with a set of default activated collectors and optional collectors.

To simplify the configuration, we put a config object for each collector that we allow users to activate or deactivate.

If a collector is not present, no change is made to its default on/off status. 

Each collector has a field "enabled" as an on/off switch. If "enabled" is set to "false", other fields can be omitted.

The default values for the new options are:

  • collectors
    • systemd
      • enabled: bool, default: false
    • hwmon
      • enabled: bool, default: true
    • mountstats
      • enabled: bool, default: false
    • ksmd
      • enabled: bool, default: false
  • maxProcs: int, default: 0

Here is an example of what these options look like in CMO configmap:

apiVersion: v1
kind: ConfigMap
metadata: 
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data: 

  config.yaml: |
    nodeExporter: 
      maxProcs: 4
      collectors: 
        hwmon: 
          enabled: true
        mountstats: 
          enabled: true
        systemd: 
          enabled: true
        ksmd: 
          enabled: true


 

If the config for nodeExporter is omitted, Node Exporter should run with the same arguments concerning collectors as those in CMO v4.12:

 
--no-collector.wifi
--collector.filesystem.mount-points-exclude=^/(dev|proc|sys|run/k3s/containerd/.+|var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)
--collector.netclass.ignored-devices=^(veth.*|[a-f0-9]{15}|enP.*|ovn-k8s-mp[0-9]*|br-ex|br-int|br-ext|br[0-9]*|tun[0-9]*)$
--collector.netdev.device-exclude=^(veth.*|[a-f0-9]{15}|enP.*|ovn-k8s-mp[0-9]*|br-ex|br-int|br-ext|br[0-9]*|tun[0-9]*)$
--collector.cpu.info
--collector.textfile.directory=/var/node_exporter/textfile
--no-collector.cpufreq
--no-collector.tcpstat
--collector.netdev
--collector.netclass
--no-collector.buddyinfo

 

 

 

Node Exporter has been upgraded to 1.5.0.
The default value of the argument `--runtime.gomaxprocs` is set to 1, which differs from the old behavior; Node Exporter used to take advantage of multiple processors to accelerate metrics collection.
We are going to add a parameter to set the argument `--runtime.gomaxprocs` and make its default value 0, so that CMO retains the old behavior while allowing users to tune the multiprocessing settings of Node Exporter.

The CMO config will have a new section `nodeExporter`, under which there is the parameter `maxProcs`, accepting an integer as the maximum number of processors Node Exporter uses concurrently. Its default value is 0 if omitted.

 config.yaml: |

    nodeExporter: 
      maxProcs: 1

Epic Goal

Why is this important?

Scenarios
1. …

Acceptance Criteria

  • (Enter a list of Acceptance Criteria unique to the Epic)

Dependencies (internal and external)
1. …

Previous Work (Optional):
1. …

Open questions::
1. …

Done Checklist

  • CI - For new features (non-enablement), existing Multi-Arch CI jobs are not broken by the Epic
  • Release Enablement: <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - If the Epic is adding a new stream, downstream build attached to advisory: <link to errata>
  • QE - Test plans in Test Plan tracking software (e.g. Polarion, RQM, etc.): <link or reference to the Test Plan>
  • QE - Automated tests merged: <link or reference to automated tests>
  • QE - QE to verify documentation when testing
  • DOC - Downstream documentation merged: <link to meaningful PR>
  • All the stories, tasks, sub-tasks and bugs that belong to this epic need to have been completed and indicated by a status of 'Done'.

When resources run short in the management cluster as we deploy new apps, the cloud-controller-manager pod in an existing HC's control plane gets evicted.

 

We get the below error while deleting infra with a failed PowerVS instance:

 

Failed to destroy infrastructure        {"error": "error in destroying infra: provided cloud instance id is not in active state, current state: failed"}

 

We also need to take care of the create-infra process in case the PowerVS instance goes to a failed state; currently it loops printing the same statement while waiting for the instance to become active.

 

2022-11-11T13:03:01+05:30       INFO    hyp-dhar-osa-2  Waiting for cloud instance to up        {"id": "crn:v1:bluemix:public:power-iaas:osa21:a/c265c8cefda241ca9c107adcbbacaa84:cd743ba9-195b-46ba-951e-639f97f443d2::", "state": "failed"}

https://github.com/openshift/cluster-image-registry-operator/commit/eac9584446660721c5a31f54fd342f01415a8e92

 

With the above commit in 4.13, storage is not handled for the PowerVS platform, which causes the cluster image-registry operator to not get installed.

 

We need to handle the PowerVS platform here.

The option discussed is to go with a PVC with CSI.

If that's not feasible, we will try to use the IBM COS used by the Satellite team.

Issue and Design: https://github.com/ovn-org/ovn-kubernetes/blob/master/docs/design/shared_gw_dgp.md 

Upstream PR: https://github.com/ovn-org/ovn-kubernetes/pull/3160 

Document that describes how to use the mgmt port VF rep for hardware offloading: https://docs.google.com/document/d/1yR4lphjPKd6qZ9sGzZITl0wH1r4ykfMKPjUnlzvWji4/edit# 

==========================================================================

After the upstream PR has been merged, we need to find a way to make the user experience of configuring the mgmt port VF rep as streamlined as possible. The basic streamlining we have committed to is improving the config map to only require the DP resource name with the MGMT VF in the pool. OVN-K will also need to make use of DP resources.

Description of problem:

- Add support for Dynamic Creation Of DPU/Smart-NIC Daemon Sets and Device-Plugin Resources For OVN-K
- DPU/Smart-NIC Daemonsets need a way to be dynamically created via specific node labels
- The config map needs to support device plugin resources (namely SR-IOV) to be used for the management port configuration in OVN-K
- This should enhance the performance of these flows (planned to be GA-ed in 4.14) for Smart-NIC
   5-a: Pod -> NodePort Service traffic (Pod Backend - Same Node)
   4-a: Pod -> Cluster IP Service traffic (Host Backend - Same Node)

Version-Release number of selected component (if applicable):

4.14.0 (Merged D/S) 
https://github.com/openshift/ovn-kubernetes/commit/cad6ed35183a6a5b43c1550ceb8457601b53460b
https://github.com/openshift/cluster-network-operator/commit/0bb035e57ac3fd0ef7b1a9451336bfd133fa8c1e 

How reproducible:

Never been supported in the past.

Steps to Reproduce:

Please follow the documentation on how to configure this on NVIDIA Smart-NICs in OvS HWOL mode.
 - https://issues.redhat.com/browse/NHE-550 

Please also check the OVN-K daemon sets. There should be a new "smart-nic" daemon set for OVN-K.
Please check on the nodes that the ovn-k8s-mp0_0 interface exists alongside the ovn-k8s-mp0 interface.

Actual results:

Iperf3 performance:
  5-a: Pod -> NodePort Service traffic (Pod Backend - Same Node)    => ~22.5 Gbits/sec
  4-a: Pod -> Cluster IP Service traffic (Host Backend - Same Node) => ~22.5 Gbits/sec

Expected results:

Iperf3 performance:
 5-a: Pod -> NodePort Service traffic (Pod Backend - Same Node)    => ~29 Gbits/sec
 4-a: Pod -> Cluster IP Service traffic (Host Backend - Same Node) => ~29 Gbits/sec
As you can see we can gain an additional 6.5 Gbits/sec performance with these service flows.

Additional info:

https://docs.google.com/spreadsheets/d/1LHY-Af-2kQHVwtW4aVdHnmwZLTiatiyf-ySffC8O5NM/edit#gid=88193790
https://github.com/ovn-org/ovn-kubernetes/pull/3160

User Story

As an OCM team member, I want to provide support for cluster service and improve the usability and interoperability of HyperShift.

Acceptance Criteria

  • All the things that have to be done for the feature to be ready to
    release.

Default Done Criteria

  • All existing/affected SOPs have been updated.
  • New SOPs have been written.
  • Internal training has been developed and delivered.
  • The feature has both unit and end-to-end tests passing in all test pipelines and through upgrades.
  • If the feature requires QE involvement, QE has signed off.
  • The feature exposes metrics necessary to manage it (VALET/RED).
  • The feature has had a security review.
  • Contract impact assessment.
  • Service Definition is updated if needed.
  • Documentation is complete.
  • Product Manager signed off on staging/beta implementation.

Dates

Integration Testing:
Beta:
GA:

Current Status

GREEN | YELLOW | RED
GREEN = On track, minimal risk to target date.
YELLOW = Moderate risk to target date.
RED = High risk to target date, or blocked and need to highlight potential
risk to stakeholders.

References

Links to Gdocs, github, and any other relevant information about this epic.

Goal

This epic has 3 main goals

  1. Improve the Segment implementation so that we can easily enable additional telemetry pieces (Hotjar, etc.) for particular cluster types (starting with Sandbox, maybe expanding to RHPDS). This will help us better understand where errors and drop-off occur in our trial and workshop clusters, thus being able to (1) help conversion and (2) proactively detect issues before they are "reported" by customers.
  2. Improve telemetry so we can START capturing console usage across the fleet
  3. Additional improvements to segment, to enable proper gathering of user telemetry and analysis

Problem

Currently we have no accurate telemetry of usage of the OpenShift Console across all clusters in the fleet. We should be able to utilize the auth and console telemetry to glean details which will allow us to get a picture of console usage by our customers.

Acceptance criteria

Let's do a spike to validate, and possibly have to update this list after the spike:

Need to verify HOW we define a cluster admin -> listing all namespaces in a cluster? Installing operators? Make sure that we consider OSD cluster admins as well (this should be aligned with how we send people to the dev perspective, in my mind)

Capture additional information via console plugin ( and possibly the auth operator )

  1. Average number of users per cluster
  2. Average number of cluster admin users per cluster
  3. Average number of dev users per cluster
  4. Average # of page views across the fleet
  5. Average # of page views per perspective across the fleet
  6. # of cluster which have disabled the admin perspective for any users
  7. # of cluster which have disabled the dev perspective for any users
  8. # of cluster which have disabled the “any” perspective for any users
  9. # of clusters which have plugin “x” installed
  10. Total number of unique users across the fleet
  11. Total number of cluster admin users across the fleet
  12. Total number of developer users across the fleet

Dependencies (External/Internal):

Understanding how to capture telemetry via the console operator

Exploration:

Note:

We have removed the following ACs for this release:

  1. (p2) Average total active time spent per User in console (per cluster for all users)
    1. per Cluster Admins
    2. per non-Cluster Admins
  2. (p2) Average active time spent in Dev Perspective [implies we can calculate this for admin perspective]
    1. per Cluster Admins
    2. per non-Cluster Admins
  3. (p3) Average # of times they change the perspective (per cluster for all users)

Description of problem:
With 4.13 we added new metrics to the console (Epic ODC-7171 - Improved telemetry (provide new metrics)) that collect different user and cluster metrics.

The cluster metrics include:

  1. which perspectives are customized (enabled, disabled, only available for a subset of users)
  2. which plugins are installed and enabled

These metrics contain the perspective name or plugin name, which was unbounded. Admins could configure any perspective and plugin name, even if the perspective or plugin with that name is not available.

Based on the feedback in https://github.com/openshift/cluster-monitoring-operator/pull/1910 we need to reduce the cardinality and limit the metrics to, for example:

  1. perspectives: admin, dev, acm, other
  2. plugins: redhat, demo, other

Version-Release number of selected component (if applicable):
4.13.0

How reproducible:
Always

Steps to Reproduce:
On a cluster, you must update the console configuration, configure some perspectives or plugins and check the metrics in Admin > Observe > Metrics:

avg by (name, state) (console_plugins_info)

avg by (name, state) (console_customization_perspectives_info)

On a local machine, you can use this console yaml:

apiVersion: console.openshift.io/v1
kind: ConsoleConfig
plugins: 
  logging-view-plugin: https://logging-view-plugin.logging-view-plugin-namespace.svc.cluster.local:9443/
  crane-ui-plugin: https://crane-ui-plugin.crane-ui-plugin-namespace.svc.cluster.local:9443/
  acm: https://acm.acm-namespace.svc.cluster.local:9443/
  mce: https://mce.mce-namespace.svc.cluster.local:9443/
  my-plugin: https://my-plugin.my-plugin-namespace.svc.cluster.local:9443/
customization: 
  perspectives: 
  - id: admin
    visibility: 
      state: Enabled
  - id: dev
    visibility: 
      state: AccessReview
      accessReview: 
        missing: 
          - resource: namespaces
            verb: get
  - id: dev1
    visibility: 
      state: AccessReview
      accessReview: 
        missing: 
          - resource: namespaces
            verb: get
  - id: dev2
    visibility: 
      state: AccessReview
      accessReview: 
        missing: 
          - resource: namespaces
            verb: get
  - id: dev3
    visibility: 
      state: AccessReview
      accessReview: 
        missing: 
          - resource: namespaces
            verb: get

And start the bridge with:

./build-backend.sh
./bin/bridge -config ../config.yaml

After that you can fetch the metrics in a second terminal:

Actual results:

curl -s localhost:9000/metrics | grep ^console_plugins

console_plugins_info{name="acm",state="enabled"} 1
console_plugins_info{name="crane-ui-plugin",state="enabled"} 1
console_plugins_info{name="logging-view-plugin",state="enabled"} 1
console_plugins_info{name="mce",state="enabled"} 1
console_plugins_info{name="my-plugin",state="enabled"} 1
curl -s localhost:9000/metrics | grep ^console_customization

console_customization_perspectives_info{name="dev",state="only-for-developers"} 1
console_customization_perspectives_info{name="dev1",state="only-for-developers"} 1
console_customization_perspectives_info{name="dev2",state="only-for-developers"} 1
console_customization_perspectives_info{name="dev3",state="only-for-developers"} 1

Expected results:
Less cardinality; that means results should be grouped somehow.
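
Purely as a hedged illustration (the exact bucket names are an open design question; "redhat"/"other" follow the grouping suggested above), the output for the config above could collapse to something like:

curl -s localhost:9000/metrics | grep ^console_plugins

console_plugins_info{name="redhat",state="enabled"} 4
console_plugins_info{name="other",state="enabled"} 1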

Additional info:

As Red Hat, we want to understand the usage of the (dev) console, for that, we want to add new Prometheus metrics (how many users have a cluster, etc.) and collect them later (as telemetry data) via cluster-monitoring-operator.

Either the console-operator or cluster-monitoring-operator needs to apply a PrometheusRule to collect the right data and make it available later in Superset DataHat or Tableau.

Goal:

This epic aims to address some of the RFEs associated with the Pipeline user experience.

Why is it important?

Improve the overall user experience when working with OpenShift Pipelines

Acceptance criteria:

  1. Users should be able to visually differentiate between canceled & failed pipelines in the Pipeline metrics tab
  2. Users should be able to see the duration of TaskRuns in the list view, this can be achieved via the Column management feature in the TaskRuns list page
  3. Users should be able to see the duration of TaskRuns in the TaskRun details view
  4. Users should be able to see the PipelineRun duration on the PipelineRun details page
  6. Users should be able to easily view webhook information on the Repository details page
  6. Users should be able to easily view webhook informations on Repository details page
  7. Users should be able to easily view webhook information on the summary page

Dependencies (External/Internal):

None

Exploration:

Exploration is available in this Miro board

Description

As a user, I want to see the PipelineRuns present in the current namespace from the Dev perspective

Acceptance Criteria

  1. Should add a PipelineRuns tab after the Pipeline tab on the dev perspective Pipeline page
  2. Should list all the PipelineRuns present in the namespace
  3. Should add the Create PipelineRun option in the  Create action menu

Additional Details:

Description

As a user, I want to see the duration on the details page of PipelineRun and TaskRun

Acceptance Criteria

  1. should show PipelineRun duration on the details page
  2. should show TaskRun duration on the details page

Additional Details:

Description

As a user, I want to manage the columns available on the TaskRuns list page

Acceptance Criteria

  1. should provide a manage columns option on the TaskRuns list page
  2. By default, the Duration column should not be present and the user can make it visible by using manage columns option

Additional Details:

Description

As a user, I want to see the information about the cancelled pipeline on the Pipeline metrics page

Acceptance Criteria

  1. should show the cancelled status in a different color in the Pipeline Success Ratio donut chart.

Additional Details:

Description

As a user, I want to see the webhook link and webhook secret on the Repository details page and the webhook link on the Repository summary page

Acceptance Criteria

  1. Should add the webhook link on the Repository details page
  2. should add the webhook secret on the Repository details page
  3. should show the webhook link and secret only if the Repository has been created using the Setup a Webhook option
  4. should add the webhook link on the Repository  summary page   

Additional Details:

We will use this to address tech debt in OLM in the 4.10 timeframe.

 

Items to prioritize are:

CI e2e flakes

 

The client cert/key pair is a way of authenticating that will function even without live kube-apiserver connections, so we can collect metrics if the kube-apiserver is unavailable.

During a PerfScale 80 HC test in stage we found that the OBO prometheus monitoring stack was consuming 50G of memory (enough to cause OOMing on the m5.4xlarge instance it was residing on). Additionally, during this time it would also consume over 10 CPU cores. 

Snapshot of the time leading up to (effectively idle) and during the test: https://snapshots.raintank.io/dashboard/snapshot/2K5s0PzaN1U2JE1jrxTPZ5jX0fifBuRC 

As a SRE, I want to have the ability to filter metrics exposed from the Management Clusters.

Context:
RHOBS resources allocated to HCP are scarce. Currently, we push every single metric to the RHOBS instance.
However, in https://issues.redhat.com/browse/OSD-13741, we've identified a subset of metrics that are important to SRE.

The ability to only export those metrics to RHOBS will reduce significantly the cost of monitoring as well as increase our ability to scale RHOBS.

As discussed in this Slack thread, most of the CPU and memory consumption of the OBO operator occurs at scraping time.

The idea here is to make sure the hypershift & control-plane-operator operators no longer specify the scrape interval in ServiceMonitor & PodMonitor scrape configs (unless there is a very good reason to do so).

Indeed, when the scrape interval is not specified at scrape config level, the global scrape interval specified at the root of the config is used. This offers the following benefits:

  • The interval can be set for all scrape configs at once.
  • The interval is no longer hard-coded in HyperShift code.
  • The interval can be set to a higher value.
    This will allow reducing the quantity of data scraped by Prometheus and consequently lower its memory consumption.
    See the next sub-task, which will set the global scrape interval to 60 sec.

This is part of solution #1 described here.
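
A minimal sketch of the desired shape, assuming the standard monitoring.coreos.com/v1 ServiceMonitor API (names and namespace are illustrative):

# No `interval` on the endpoint, so Prometheus's global scrape interval applies
cat <<'EOF' | oc apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kube-apiserver
  namespace: hcp-namespace
spec:
  selector:
    matchLabels:
      app: kube-apiserver
  endpoints:
  - port: client
EOF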

Epic Goal

OpenShift Container Platform is shipping a finely tuned set of alerts to inform the cluster's owner and/or operator of events and bad conditions in the cluster.

Runbooks are associated with alerts and help SREs take action to resolve an alert. This is critical to share engineering best practices following an incident.

Goal 1: Current alerts/runbooks for hypershift needs to be evaluated to ensure we have sufficient coverage before hypershift hits GA.

Goal 2: Actionable runbooks need to be provided for all alerts; therefore, we should attempt to cover as many as possible in this epic.

Goal 3: Continue adding alerts/runbooks to cover existing OVN-K functionality.

This epic will NOT cover refactors needed to alerts/runbooks due to new arch (OVN IC).

Why is this important?

In order to scale, we (engineering) must share our institutional knowledge.

In order for SREs to respond to alerts, they must have the knowledge to do so.

SD needs to have actionable runbooks to respond to alerts; otherwise, they will require engineering to engage more frequently.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

OCP/Telco Definition of Done
Epic Template descriptions and documentation.


Epic Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

As an administrator of a cluster utilizing AWS STS with a public S3 bucket OIDC provider, I would like a documented procedure with steps that can be followed to migrate to a private S3 bucket with CloudFront Distribution so that I do not have to recreate my cluster.

ccoctl documentation including parameter `--create-private-s3-bucket`: https://github.com/openshift/cloud-credential-operator/blob/a8ee8a426d38cca3f7339ecd0eac88f922b6d5a0/docs/ccoctl.md

Existing manual procedure for configuring private S3 bucket with CloudFront Distribution: https://github.com/openshift/cloud-credential-operator/blob/master/docs/sts-private-bucket.md

https://coreos.slack.com/archives/CE3ETN3J8/p1666174054230389?thread_ts=1665496599.847459&cid=CE3ETN3J8
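
For reference, creating the identity provider with a private bucket might look like the following (flags per the ccoctl docs linked above; all values are placeholders):

# Create the OIDC identity provider backed by a private S3 bucket + CloudFront
ccoctl aws create-identity-provider \
  --name=my-cluster-oidc \
  --region=us-east-1 \
  --public-key-file=./serviceaccount-signer.public \
  --create-private-s3-bucket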

Goal:

The participation on SPLAT will be:

 

ACCEPTANCE CRITERIA

  • Document created on CCO repo, reviewed, approved by QE and merged
  • KCS/Article created

 

REFERENCES:

Supporting document: https://github.com/openshift/cloud-credential-operator/blob/master/docs/sts.md#steps-to-in-place-migrate-an-openshift-cluster-to-sts

NOTE: we should add that this step is not supported or recommended.

 

Other Complete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled

Description of problem:

In the control plane machine set operator we perform e2e periodic tests that check the ability to do a rolling update of an entire OCP control plane.

This is quite an involved test, as we need to drain and replace all the master machines/nodes, alter operators, wait for machines to come up + bootstrap, and wait for nodes to drain and move their workloads to others while respecting PDBs and etcd quorum.

As such we need to make sure we are robust to transient issues, occasional slow-downs, and network errors.

We have investigated these timeout issues and identified some common culprits that we want to address, see: https://redhat-internal.slack.com/archives/GE2HQ9QP4/p1678966522151799

Description of problem:

Test failed: Auth test logs in as 'test' user via htpasswd identity provider

CI-search
Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

The "sufficient-masters-count' failed" test is intermittently failing due to a suspected race condition that causes as duplicate cluster event.

"Cluster validation 'sufficient-masters-count' that used to succeed is now failing"

The aim of this ticket is to ensure that this test does not flake

Please review the following PR: https://github.com/openshift/cluster-kube-scheduler-operator/pull/478

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

When a HostedCluster is configured as `Private`, annotate the necessary hosted CP components (API and OAuth) so that External DNS can still create public DNS records (pointing to private IP resources).

The External DNS record should be pointing to the resource for the PrivateLink VPC Endpoint. "We need to specify the IP of the A record. We can do that with a cluster IP service."

Context: https://redhat-internal.slack.com/archives/C01C8502FMM/p1675432805760719
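
A hedged sketch of the annotation side of this, using the upstream External DNS hostname convention (whether HyperShift uses exactly this annotation, and the service/namespace names, are assumptions):

# Ask External DNS to publish a record for the (private) API service
oc annotate service kube-apiserver -n clusters-example \
  external-dns.alpha.kubernetes.io/hostname=api.example.hypershift.example.com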

Description of the problem:

Events search should not be case sensitive 

 

How reproducible:

100%

 

Steps to reproduce:

1. On UI View Cluster Events

2. Enter text on "Filter by text" field. (eg. "success" or "Success" )

 

Actual results:

Events filter is case sensitive. 

See screenshots enclosed

 

Expected results:

Events filter should not be case sensitive

Description of problem:

The MCO must have compatibility in place one OCP version in advance if we want to bump the Ignition spec version; otherwise downgrades will fail.

This is NOT needed in 4.14, only 4.13

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1. None atm, this is preventative for the future
2.
3.

Actual results:

N/A

Expected results:

N/A

Additional info:

 

Please review the following PR: https://github.com/openshift/cloud-provider-azure/pull/59

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The agent-tui should show before the installation, but it shows again during the installation, and when it quits again, the installation fails to continue.

Version-Release number of selected component (if applicable):

4.13.0-0.ci-2023-03-14-045458

How reproducible:

always

Steps to Reproduce:

1. Make sure the primary check passes, and boot the agent.x86_64.iso file; we can see the agent-tui show before the installation

2. Track the installation via both the wait-for output and the console output

3. The agent-tui shows again during the installation; wait for the agent-tui to quit automatically without any user interruption. The installation quits with failure, and we have the following wait-for output:

DEBUG asset directory: .                           
DEBUG Loading Agent Config...                      
...
DEBUG Agent Rest API never initialized. Bootstrap Kube API never initialized 
INFO Waiting for cluster install to initialize. Sleeping for 30 seconds 
DEBUG Agent Rest API Initialized                   
INFO Cluster is not ready for install. Check validations 
DEBUG Cluster validation: The pull secret is set.  
WARNING Cluster validation: The cluster has hosts that are not ready to install. 
DEBUG Cluster validation: The cluster has the exact amount of dedicated control plane nodes. 
DEBUG Cluster validation: API virtual IPs are not required: User Managed Networking 
DEBUG Cluster validation: API virtual IPs are not required: User Managed Networking 
DEBUG Cluster validation: The Cluster Network CIDR is defined. 
DEBUG Cluster validation: The base domain is defined. 
DEBUG Cluster validation: Ingress virtual IPs are not required: User Managed Networking 
DEBUG Cluster validation: Ingress virtual IPs are not required: User Managed Networking 
DEBUG Cluster validation: The Machine Network CIDR is defined. 
DEBUG Cluster validation: The Cluster Machine CIDR is not required: User Managed Networking 
DEBUG Cluster validation: The Cluster Network prefix is valid. 
DEBUG Cluster validation: The cluster has a valid network type 
DEBUG Cluster validation: Same address families for all networks. 
DEBUG Cluster validation: No CIDRS are overlapping. 
DEBUG Cluster validation: No ntp problems found    
DEBUG Cluster validation: The Service Network CIDR is defined. 
DEBUG Cluster validation: cnv is disabled          
DEBUG Cluster validation: lso is disabled          
DEBUG Cluster validation: lvm is disabled          
DEBUG Cluster validation: odf is disabled          
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Valid inventory exists for the host 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Sufficient CPU cores 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Sufficient minimum RAM 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Sufficient disk capacity 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Sufficient CPU cores for role master 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Sufficient RAM for role master 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Hostname openshift-qe-049.arm.eng.rdu2.redhat.com is unique in cluster 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Hostname openshift-qe-049.arm.eng.rdu2.redhat.com is allowed 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Speed of installation disk has not yet been measured 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host is compatible with cluster platform none 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: VSphere disk.EnableUUID is enabled for this virtual machine 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host agent compatibility checking is disabled 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: No request to skip formatting of the installation disk 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: All disks that have skipped formatting are present in the host inventory 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host is connected 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Media device is connected 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: No Machine Network CIDR needed: User Managed Networking 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host belongs to all machine network CIDRs 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host has connectivity to the majority of hosts in the cluster 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Platform PowerEdge R740 is allowed 
WARNING Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host couldn't synchronize with any NTP server 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host clock is synchronized with service 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: All required container images were either pulled successfully or no attempt was made to pull them 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Network latency requirement has been satisfied. 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Packet loss requirement has been satisfied. 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host has been configured with at least one default route. 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Domain name resolution for the api.zniusno.arm.eng.rdu2.redhat.com domain was successful or not required 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Domain name resolution for the api-int.zniusno.arm.eng.rdu2.redhat.com domain was successful or not required 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Domain name resolution for the *.apps.zniusno.arm.eng.rdu2.redhat.com domain was successful or not required 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host subnets are not overlapping 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: No IP collisions were detected by host 7a9649d8-4167-a1f9-ad5f-385c052e2744 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: cnv is disabled 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: lso is disabled 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: lvm is disabled 
DEBUG Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: odf is disabled 
WARNING Host openshift-qe-049.arm.eng.rdu2.redhat.com: updated status from discovering to insufficient (Host cannot be installed due to following failing validation(s): Host couldn't synchronize with any NTP server) 
INFO Host openshift-qe-049.arm.eng.rdu2.redhat.com validation: Host NTP is synced 
INFO Host openshift-qe-049.arm.eng.rdu2.redhat.com: updated status from insufficient to known (Host is ready to be installed) 
INFO Cluster is ready for install                 
INFO Cluster validation: All hosts in the cluster are ready to install. 
INFO Preparing cluster for installation           
INFO Host openshift-qe-049.arm.eng.rdu2.redhat.com: updated status from known to preparing-for-installation (Host finished successfully to prepare for installation) 
INFO Host openshift-qe-049.arm.eng.rdu2.redhat.com: New image status registry.ci.openshift.org/ocp/4.13-2023-03-14-045458@sha256:b0d518907841eb35adbc05962d4b2e7d45abc90baebc5a82d0398e1113ec04d0. result: success. time: 1.35 seconds; size: 401.45 Megabytes; download rate: 312.54 MBps 
INFO Host openshift-qe-049.arm.eng.rdu2.redhat.com: updated status from preparing-for-installation to preparing-successful (Host finished successfully to prepare for installation) 
INFO Cluster installation in progress             
INFO Host openshift-qe-049.arm.eng.rdu2.redhat.com: updated status from preparing-successful to installing (Installation is in progress) 
INFO Host: openshift-qe-049.arm.eng.rdu2.redhat.com, reached installation stage Starting installation: bootstrap 
INFO Host: openshift-qe-049.arm.eng.rdu2.redhat.com, reached installation stage Installing: bootstrap 
INFO Host: openshift-qe-049.arm.eng.rdu2.redhat.com, reached installation stage Failed: failed executing nsenter [--target 1 --cgroup --mount --ipc --pid -- podman run --net host --pid=host --volume /:/rootfs:rw --volume /usr/bin/rpm-ostree:/usr/bin/rpm-ostree --privileged --entrypoint /usr/bin/machine-config-daemon registry.ci.openshift.org/ocp/4.13-2023-03-14-045458@sha256:f85a278868035dc0a40a66ea7eaf0877624ef9fde9fc8df1633dc5d6d1ad4e39 start --node-name localhost --root-mount /rootfs --once-from /opt/install-dir/bootstrap.ign --skip-reboot], Error exit status 255, LastOutput "...  to initialize single run daemon: error initializing rpm-ostree: Error while ensuring access to kublet config.json pull secrets: symlink /var/lib/kubelet/config.json /run/ostree/auth.json: file exists" 
INFO Cluster has hosts in error                   
INFO cluster has stopped installing... working to recover installation 
INFO cluster has stopped installing... working to recover installation 
INFO cluster has stopped installing... working to recover installation 
INFO cluster has stopped installing... working to recover installation 
INFO cluster has stopped installing... working to recover installation 
INFO cluster has stopped installing... working to recover installation 
INFO cluster has stopped installing... working to recover installation 
INFO cluster has stopped installing... working to recover installation   

4. During the installation, NetworkManager-wait-online.service failed for a while:
-- Logs begin at Wed 2023-03-15 03:06:29 UTC, end at Wed 2023-03-15 03:27:30 UTC. --
Mar 15 03:18:52 openshift-qe-049.arm.eng.rdu2.redhat.com systemd[1]: Starting Network Manager Wait Online...
Mar 15 03:19:55 openshift-qe-049.arm.eng.rdu2.redhat.com systemd[1]: NetworkManager-wait-online.service: Main process exited, code=exited, status=1/FAILURE
Mar 15 03:19:55 openshift-qe-049.arm.eng.rdu2.redhat.com systemd[1]: NetworkManager-wait-online.service: Failed with result 'exit-code'.
Mar 15 03:19:55 openshift-qe-049.arm.eng.rdu2.redhat.com systemd[1]: Failed to start Network Manager Wait Online.

Expected results:

The TUI should be shown only once, before the installation starts.

Please review the following PR: https://github.com/openshift/cluster-autoscaler-operator/pull/271

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/router/pull/453

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

API fields that are defaulted by a controller should document what their default is for each release version.
Currently the field documents that "if empty, subject to platform chosen default", but it does not state what that is.

To fix this, please add, after the platform chosen default prose:
// The current default is XYZ.

This will allow users to track the platform defaults over time from the API documentation.

I would like to see this fixed before 4.13 and 4.14 are released, please; it should be quick to fix once we know what those defaults are.
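For illustration, a minimal sketch of what such a doc comment could look like on a hypothetical defaulted field (the type, field, and default value are placeholders, not taken from the actual API):

package v1

// NetworkSpec is a hypothetical example type; only the comment style matters.
type NetworkSpec struct {
	// networkType is the type of network to install.
	// If empty, this is subject to the platform chosen default.
	// The current default is OVNKubernetes.
	// +optional
	NetworkType string `json:"networkType,omitempty"`
}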

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

While mirroring to the filesystem, if a 429 error is received from the registry, the layer is incorrectly flagged as having been mirrored and is therefore not picked up by subsequent mirror re-runs. This gives the impression that the second mirror-to-filesystem attempt succeeded, but it causes failures when mirroring from the filesystem to the target registry, due to the missing files.

Version-Release number of selected component (if applicable):

oc version
Client Version: 4.8.42
Server Version: 4.8.14
Kubernetes Version: v1.21.1+a620f50

How reproducible:

Whenever a 429 occurs while mirroring to the filesystem

Steps to Reproduce:

1. Run the mirror-to-filesystem command: oc image mirror -f mirror-to-filesystem.txt --filter-by-os '.*' -a $REGISTRY_AUTH_FILE --insecure --skip-multiple-scopes --max-per-registry=1 --continue-on-error=true --dir "$LOCAL_DIR_PATH"  

Output: 
info: Mirroring completed in 2h19m24.14s (25.75MB/s)
error: one or more errors occurred 
E.g
error: unable to push <registry>/namespace/<image-name>: failed to retrieve blob <image-digest>: error parsing HTTP 429 response body: unexpected end of JSON input: ""


2. Re-run the mirror-to-filesystem command: oc image mirror -f mirror-to-filesystem.txt --filter-by-os '.*' -a $REGISTRY_AUTH_FILE --insecure --skip-multiple-scopes --max-per-registry=1 --continue-on-error=true --dir "$LOCAL_DIR_PATH"

Output:
info: Mirroring completed in 480ms (0B/s)


3. Run the mirror-from-filesystem command: oc image mirror -f mirror-from-filesystem.txt -a $REGISTRY_AUTH_FILE --from-dir "$LOCAL_DIR_PATH" --filter-by-os '.*' --insecure --skip-multiple-scopes --max-per-registry=1 --continue-on-error=true

Output: 
info: Mirroring completed in 53m5.21s (67.61MB/s)
error: one or more errors occurred
E.g
error: unable to push file://local/namespace/<image-name>: failed to retrieve blob <image-digest>: open /root/local/namespace/<image-name>/blobs/<image-digest>: no such file or directory

 

Actual results:

1) mirror to filesystem first attempt: 

info: Mirroring completed in 2h19m24.14s (25.75MB/s) 
error: one or more errors occurred 
E.g
error: unable to push <registry>/namespace/<image-name>: failed to retrieve blob <image-digest>: error parsing HTTP 429 response body: unexpected end of JSON input: ""

2) mirror to filesystem second attempt: 

info: Mirroring completed in 480ms (0B/s)

 
3) mirror from filesystem to target registry:  

info: Mirroring completed in 53m5.21s (67.61MB/s) 
error: one or more errors occurred 
E.g 
error: unable to push file://local/namespace/<image-name>: failed to retrieve blob <image-digest>: open /root/local/namespace/<image-name>/blobs/<image-digest>: no such file or directory

Expected results:

Both steps, mirroring the source images to the filesystem and mirroring from the filesystem to the target registry, should complete successfully.
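A sketch of the kind of guard that would avoid the incorrect flagging, assuming a simple digest-to-done map; every name here is invented for illustration and none of it comes from the oc codebase:

package main

import (
	"errors"
	"fmt"
	"net/http"
)

// httpError is a stand-in for the transport error seen on a 429 response.
type httpError struct{ status int }

func (e *httpError) Error() string { return fmt.Sprintf("HTTP %d", e.status) }

// markMirrored records a layer as complete only when the copy actually
// succeeded; a retryable 429 leaves it unmarked so a re-run picks it up.
func markMirrored(done map[string]bool, digest string, err error) {
	var he *httpError
	if errors.As(err, &he) && he.status == http.StatusTooManyRequests {
		// Rate limited: the blob was NOT written. The reported bug behaves
		// as if execution fell through and flagged the layer anyway.
		return
	}
	if err != nil {
		return // other failures also leave the layer unmarked
	}
	done[digest] = true
}

func main() {
	done := map[string]bool{}
	markMirrored(done, "sha256:example", &httpError{status: 429})
	fmt.Println(done["sha256:example"]) // false: the next re-run retries this layer
}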

Additional info:

 

Description of problem:

We merged a change into origin to modify a test so that `/readyz` would be used as the health check path. It turns out this makes things worse, because we want to use kube-proxy's health probe endpoint to monitor node health, and kube-proxy only exposes `/healthz`, which is the default path anyway.

We should remove the annotation added to change the path and go back to the defaults.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Description of problem:

Enabling IPSec doesn't result in IPsec tunnels being created

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Deploy & Enable IPSec

Steps to Reproduce:

1.
2.
3.

Actual results:

000 Total IPsec connections: loaded 0, active 0
000  
000 State Information: DDoS cookies not required, Accepting new IKE connections
000 IKE SAs: total(0), half-open(0), open(0), authenticated(0), anonymous(0)
000 IPsec SAs: total(0), authenticated(0), anonymous(0)

Expected results:

Active connections > 0

Additional info:

$ oc -n openshift-ovn-kubernetes -c nbdb rsh ovnkube-master-qw4zv ovn-nbctl --no-leader-only get nb_global . ipsec
true

Extend multus resource collection so that we gather all resources on a per namespace basis with oc adm inspect.
This way, users can create a combined must-gather with all resources in one place.

We might have to revisit this once the reconciler and other changes land in more recent versions of multus, but for the time being I think this is a good change to make that we can also backport to older versions.

Description of problem:

The certificates synced by MCO in 4.13 onwards are more comprehensive and correct, and out of sync issues will surface much faster.

See https://issues.redhat.com/browse/MCO-499 for details

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1. Install 4.13 and pause the MCPs
2.
3.

Actual results:

Within ~24 hours the cluster will fire critical clusterdown alerts

Expected results:

No alerts fire

Additional info:

 

Description of problem:

The default PipelineRun template name was updated in the backend in Pipelines operator 1.10, so we need to update the name in the UI code as well.

 

https://github.com/openshift/console/blob/master/frontend/packages/pipelines-plugin/src/components/pac/const.ts#L9

 

Some unit tests are flaky because they check that timestamps have changed.

When creation and the test run happen very quickly, the timestamps can appear unchanged.

https://redhat-internal.slack.com/archives/C014N2VLTQE/p1681827276489839

 

We can fix this by simulating that host creation happened in the past, as sketched below.
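A minimal sketch of that approach, assuming a placeholder host struct with creation and update timestamps (all names here are illustrative, not the real test code):

package host_test

import (
	"testing"
	"time"
)

// host is a stand-in for the object whose timestamps the flaky tests compare.
type host struct {
	createdAt time.Time
	updatedAt time.Time
}

// touch records an update; the test asserts the timestamp moved forward.
func (h *host) touch() { h.updatedAt = time.Now() }

func TestTimestampAdvances(t *testing.T) {
	// Backdate creation by an hour so that even a sub-microsecond test run
	// cannot produce identical creation and update timestamps.
	h := &host{createdAt: time.Now().Add(-time.Hour)}
	h.updatedAt = h.createdAt

	h.touch()

	if !h.updatedAt.After(h.createdAt) {
		t.Fatalf("updatedAt %v is not after createdAt %v", h.updatedAt, h.createdAt)
	}
}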

Description of problem:

This issue is triggered by the absence of the file "/etc/kubernetes/kubeconfig" on the node, but what I found interesting is the ugly runtime error that follows:

2023-01-04T10:56:50.807982171Z I0104 10:56:50.807918   18013 start.go:112] Version: v4.11.0-202212070335.p0.g60746a8.assembly.stream-dirty (60746a843e7ef8855ae00f2ffcb655c53e0e8296)
2023-01-04T10:56:50.810326376Z I0104 10:56:50.810190   18013 start.go:125] Calling chroot("/rootfs")
2023-01-04T10:56:50.810326376Z I0104 10:56:50.810274   18013 update.go:1972] Running: systemctl start rpm-ostreed
2023-01-04T10:56:50.855151883Z I0104 10:56:50.854666   18013 rpm-ostree.go:353] Running captured: rpm-ostree status --json
2023-01-04T10:56:50.899635929Z I0104 10:56:50.899574   18013 rpm-ostree.go:353] Running captured: rpm-ostree status --json
2023-01-04T10:56:50.941236704Z I0104 10:56:50.941179   18013 daemon.go:236] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:318187717bd19ef265000570d5580ea680dfbe99c3bece6dd180537a6f268f
e1 (410.84.202210061459-0)
2023-01-04T10:56:50.973206073Z I0104 10:56:50.973131   18013 start.go:101] Copied self to /run/bin/machine-config-daemon on host
2023-01-04T10:56:50.973259966Z E0104 10:56:50.973196   18013 start.go:177] failed to load kubelet kubeconfig: open /etc/kubernetes/kubeconfig: no such file or directory
2023-01-04T10:56:50.975399571Z panic: runtime error: invalid memory address or nil pointer dereference
2023-01-04T10:56:50.975399571Z [signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x173d84f]
2023-01-04T10:56:50.975399571Z
2023-01-04T10:56:50.975399571Z goroutine 1 [running]:
main.runStartCmd(0x2c3da80?, {0x1bc0b3b?, 0x0?, 0x0?})
	/go/src/github.com/openshift/machine-config-operator/cmd/machine-config-daemon/start.go:179 +0x70f
github.com/spf13/cobra.(*Command).execute(0x2c3da80, {0x2c89310, 0x0, 0x0})
	/go/src/github.com/openshift/machine-config-operator/vendor/github.com/spf13/cobra/command.go:860 +0x663
github.com/spf13/cobra.(*Command).ExecuteC(0x2c3d580)
	/go/src/github.com/openshift/machine-config-operator/vendor/github.com/spf13/cobra/command.go:974 +0x3b4
github.com/spf13/cobra.(*Command).Execute(...)
	/go/src/github.com/openshift/machine-config-operator/vendor/github.com/spf13/cobra/command.go:902
k8s.io/component-base/cli.Run(0x2c3d580)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/component-base/cli/run.go:105 +0x385
main.main()
	/go/src/github.com/openshift/machine-config-operator/cmd/machine-config-daemon/main.go:28 +0x25

Version-Release number of selected component (if applicable):

4.11.20

How reproducible:

Always

Steps to Reproduce:

1. Remove / change the name of the file "/etc/kubernetes/kubeconfig"
2. Delete machine-config-daemon pod
3. 

Actual results:

2023-01-04T10:56:50.973259966Z E0104 10:56:50.973196   18013 start.go:177] failed to load kubelet kubeconfig: open /etc/kubernetes/kubeconfig: no such file or directory
2023-01-04T10:56:50.975399571Z panic: runtime error: invalid memory address or nil pointer dereference

Expected results:

A fatal error:

failed to load kubelet kubeconfig: open /etc/kubernetes/kubeconfig: no such file or directory

but no runtime panic

Additional info:

https://github.com/openshift/machine-config-operator/blob/92012a837e2ed0ed3c9e61c715579ac82ad0a464/cmd/machine-config-daemon/start.go#L179
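A hedged sketch of the expected behavior at the linked call site: fail fatally on the load error instead of continuing with a nil kubeconfig. The loader and types below are simplified placeholders, not the real MCO code:

package main

import (
	"fmt"
	"os"
)

type kubeletConfig struct{ host string }

// loadKubeletKubeconfig stands in for the real loader used in start.go.
func loadKubeletKubeconfig(path string) (*kubeletConfig, error) {
	if _, err := os.Stat(path); err != nil {
		return nil, fmt.Errorf("open %s: %w", path, err)
	}
	return &kubeletConfig{host: "https://api.example.com:6443"}, nil
}

func main() {
	cfg, err := loadKubeletKubeconfig("/etc/kubernetes/kubeconfig")
	if err != nil {
		// Exit fatally with a clear message instead of continuing with a
		// nil cfg, which is what currently produces the SIGSEGV panic.
		fmt.Fprintf(os.Stderr, "failed to load kubelet kubeconfig: %v\n", err)
		os.Exit(1)
	}
	fmt.Println("kubeconfig host:", cfg.host)
}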

It is caused by the power off routine, which initialises last_error to None. The field is later restored, but BMO manages to observe and record the wrong value.

This issue is not trivial to reproduce in the product. You need OCPBUGS-2471 to land first, then you need to trigger the cleaning failure several times. I used direct access to Ironic via CLI to abort cleaning (`baremetal node abort <node name>`) during deprovisioning. After a few attempts you can observe the following in the BMH's status:

status:
  errorCount: 2
  errorMessage: 'Cleaning failed: '
  errorType: provisioning error

The empty message after the colon is a sign of this bug.

Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver/pull/61

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of the problem:

There is no size limitation on the Additional certificates UI field.

 

How reproducible:

100%

 

Steps to reproduce:

1. create a cluster  

2. On the add host step, select 'Configure cluster-wide trusted certificates'

3. On Additional certificates, paste a big string 

4. Generate Discovery ISO

 

Actual results:

The UI sends it to the backend.

 

Expected results:

There should be a size limit on the certificate field.

In many cases, the /dev/disk/by-path symlink is the only way to stably identify a disk without having prior knowledge of the hardware from some external source (e.g. a spreadsheet of disk serial numbers). It should be possible to specify this path in the root device hints.
This is fixed by the first commit in the upstream Metal³ PR https://github.com/metal3-io/baremetal-operator/pull/1264
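For illustration, a sketch of how such a hint could be carried once supported; RootDeviceHints here is a local placeholder that only mirrors the shape of the Metal³ field, and the path value is made up:

package main

import "fmt"

// RootDeviceHints mirrors the shape of the Metal³ BareMetalHost field.
type RootDeviceHints struct {
	// DeviceName would accept a stable /dev/disk/by-path symlink
	// once the linked fix lands.
	DeviceName string
}

func main() {
	hints := RootDeviceHints{
		DeviceName: "/dev/disk/by-path/pci-0000:00:1f.2-ata-1", // illustrative
	}
	fmt.Printf("root device hint: %+v\n", hints)
}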

Please review the following PR: https://github.com/openshift/aws-ebs-csi-driver/pull/220

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

When a (recommended/conditional) release image is provided with --to-image, the specified image name is not preserved in the ClusterVersion object.

Version-Release number of selected component (if applicable):

 

How reproducible:

100% with oc >4.9

Steps to Reproduce:

$ oc version
Client Version: 4.12.2
Kustomize Version: v4.5.7
Server Version: 4.12.2
Kubernetes Version: v1.25.4+a34b9e9

$ oc get clusterversion/version -o jsonpath='{.status.desired}'|jq
{
  "channels": [
    "candidate-4.12",
    "candidate-4.13",
    "eus-4.12",
    "fast-4.12",
    "stable-4.12"
  ],
  "image": "quay.io/openshift-release-dev/ocp-release@sha256:31c7741fc7bb73ff752ba43f5acf014b8fadd69196fc522241302de918066cb1",
  "url": "https://access.redhat.com/errata/RHSA-2023:0569",
  "version": "4.12.2"
}
$ oc adm release info 4.12.3 -o jsonpath='{.image}'
quay.io/openshift-release-dev/ocp-release@sha256:382f271581b9b907484d552bd145e9a5678e9366330059d31b007f4445d99e36
$ skopeo copy docker://quay.io/openshift-release-dev/ocp-release@sha256:382f271581b9b907484d552bd145e9a5678e9366330059d31b007f4445d99e36 docker://quay.example.com/playground/release-images
Getting image source signatures
Copying blob 64096b96a7b0 done  
Copying blob 0e0550faf8e0 done  
Copying blob 97da74cc6d8f skipped: already exists  
Copying blob d8190195889e skipped: already exists  
Copying blob 17997438bedb done  
Copying blob fdbb043b48dc done  
Copying config b49bc8b603 done  
Writing manifest to image destination
Storing signatures
$ skopeo inspect docker://quay.example.com/playground/release-images@sha256:382f271581b9b907484d552bd145e9a5678e9366330059d31b007f4445d99e36|jq '.Name,.Digest'
"quay.example.com/playground/release-images"
"sha256:382f271581b9b907484d552bd145e9a5678e9366330059d31b007f4445d99e36"
$ oc adm upgrade --to-image=quay.example.com/playground/release-images@sha256:382f271581b9b907484d552bd145e9a5678e9366330059d31b007f4445d99e36
Requesting update to 4.12.3
 

Actual results:

$ oc get clusterversion/version -o jsonpath='{.status.desired}'|jq
{
  "channels": [
    "candidate-4.12",
    "candidate-4.13",
    "eus-4.12",
    "fast-4.12",
    "stable-4.12"
  ],
  "image": "quay.io/openshift-release-dev/ocp-release@sha256:382f271581b9b907484d552bd145e9a5678e9366330059d31b007f4445d99e36",    <--- not quay.example.com
  "url": "https://access.redhat.com/errata/RHSA-2023:0728",
  "version": "4.12.3"
}

$ oc get clusterversion/version -o jsonpath='{.status.history}'|jq
[
  {
    "completionTime": null,
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:382f271581b9b907484d552bd145e9a5678e9366330059d31b007f4445d99e36",         <--- not quay.example.com
    "startedTime": "2023-04-28T07:39:11Z",
    "state": "Partial",
    "verified": true,
    "version": "4.12.3"
  },
  {
    "completionTime": "2023-04-27T14:48:06Z",
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:31c7741fc7bb73ff752ba43f5acf014b8fadd69196fc522241302de918066cb1",
    "startedTime": "2023-04-27T14:24:29Z",
    "state": "Completed",
    "verified": false,
    "version": "4.12.2"
  }
]

Expected results:

$ oc get clusterversion/version -o jsonpath='{.status.desired}'|jq
{
  "channels": [
    "candidate-4.12",
    "candidate-4.13",
    "eus-4.12",
    "fast-4.12",
    "stable-4.12"
  ],
  "image": "quay.example.com/playground/release-images@sha256:382f271581b9b907484d552bd145e9a5678e9366330059d31b007f4445d99e36 ",
  "url": "https://access.redhat.com/errata/RHSA-2023:0728",
  "version": "4.12.3"
}
$ oc get clusterversion/version -o jsonpath='{.status.history}'|jq
[
  {
    "completionTime": null,
    "image": "quay.example.com/playground/release-images@sha256:382f271581b9b907484d552bd145e9a5678e9366330059d31b007f4445d99e36 ",
    "startedTime": "2023-04-28T07:39:11Z",
    "state": "Partial",
    "verified": true,
    "version": "4.12.3"
  },
  {
    "completionTime": "2023-04-27T14:48:06Z",
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:31c7741fc7bb73ff752ba43f5acf014b8fadd69196fc522241302de918066cb1",
    "startedTime": "2023-04-27T14:24:29Z",
    "state": "Completed",
    "verified": false,
    "version": "4.12.2"
  }
]

Additional info:

While in earlier versions (<4.10) we used to preserve the specified image [1], we now (as of 4.10) store the public image as the desired version [2].
[1] https://github.com/openshift/oc/blob/88cfeb4aa2d74ee5f5598c571661622c0034081b/pkg/cli/admin/upgrade/upgrade.go#L278
[2] https://github.com/openshift/oc/blob/5711859fac135177edf07161615bdabe3527e659/pkg/cli/admin/upgrade/upgrade.go#L278 
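A hedged sketch of the behavioral difference between [1] and [2], using a placeholder update struct (digests truncated for readability):

package main

import "fmt"

type update struct {
	version string
	image   string
}

// desiredUpdate reflects the pre-4.10 behavior: when the user passes
// --to-image, that exact pullspec lands in the desired update, even if
// the release metadata also advertises a public pullspec.
func desiredUpdate(toImage string, resolved update) update {
	if toImage != "" {
		resolved.image = toImage // preserve the user-specified location
	}
	return resolved
}

func main() {
	resolved := update{version: "4.12.3", image: "quay.io/openshift-release-dev/ocp-release@sha256:382f..."}
	u := desiredUpdate("quay.example.com/playground/release-images@sha256:382f...", resolved)
	fmt.Println(u.image) // the mirror pullspec, not the public one
}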

Description of problem:

In a freshly installed cluster, we can see the CVO hot-looping on the Service openshift-monitoring/cluster-monitoring-operator and the CronJob openshift-operator-lifecycle-manager/collect-profiles.

  1. grep -o 'Updating .*due to diff' cvo2.log | sort | uniq -c
    18 Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff
    12 Updating Service openshift-monitoring/cluster-monitoring-operator due to diff

Looking at the CronJob hot-looping

# grep -A60 'Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff' cvo2.log | tail -n61
I0110 06:32:44.489277       1 generic.go:109] Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff:   &unstructured.Unstructured{
  	Object: map[string]interface{}{
  		"apiVersion": string("batch/v1"),
  		"kind":       string("CronJob"),
  		"metadata":   map[string]interface{}{"annotations": map[string]interface{}{"include.release.openshift.io/ibm-cloud-managed": string("true"), "include.release.openshift.io/self-managed-high-availability": string("true")}, "creationTimestamp": string("2022-01-10T04:35:19Z"), "generation": int64(1), "managedFields": []interface{}{map[string]interface{}{"apiVersion": string("batch/v1"), "fieldsType": string("FieldsV1"), "fieldsV1": map[string]interface{}{"f:metadata": map[string]interface{}{"f:annotations": map[string]interface{}{".": map[string]interface{}{}, "f:include.release.openshift.io/ibm-cloud-managed": map[string]interface{}{}, "f:include.release.openshift.io/self-managed-high-availability": map[string]interface{}{}}, "f:ownerReferences": map[string]interface{}{".": map[string]interface{}{}, `k:{"uid":"334d6c04-126d-4271-96ec-d303e93b7d1c"}`: map[string]interface{}{}}}, "f:spec": map[string]interface{}{"f:concurrencyPolicy": map[string]interface{}{}, "f:failedJobsHistoryLimit": map[string]interface{}{}, "f:jobTemplate": map[string]interface{}{"f:spec": map[string]interface{}{"f:template": map[string]interface{}{"f:spec": map[string]interface{}{"f:containers": map[string]interface{}{`k:{"name":"collect-profiles"}`: map[string]interface{}{".": map[string]interface{}{}, "f:args": map[string]interface{}{}, "f:command": map[string]interface{}{}, "f:image": map[string]interface{}{}, ...}}, "f:dnsPolicy": map[string]interface{}{}, "f:priorityClassName": map[string]interface{}{}, "f:restartPolicy": map[string]interface{}{}, ...}}}}, "f:schedule": map[string]interface{}{}, ...}}, "manager": string("cluster-version-operator"), ...}, map[string]interface{}{"apiVersion": string("batch/v1"), "fieldsType": string("FieldsV1"), "fieldsV1": map[string]interface{}{"f:status": map[string]interface{}{"f:lastScheduleTime": map[string]interface{}{}, "f:lastSuccessfulTime": map[string]interface{}{}}}, "manager": string("kube-controller-manager"), ...}}, ...},
  		"spec": map[string]interface{}{
+ 			"concurrencyPolicy":      string("Allow"),
+ 			"failedJobsHistoryLimit": int64(1),
  			"jobTemplate": map[string]interface{}{
+ 				"metadata": map[string]interface{}{"creationTimestamp": nil},
  				"spec": map[string]interface{}{
  					"template": map[string]interface{}{
+ 						"metadata": map[string]interface{}{"creationTimestamp": nil},
  						"spec": map[string]interface{}{
  							"containers": []interface{}{
  								map[string]interface{}{
  									... // 4 identical entries
  									"name":                     string("collect-profiles"),
  									"resources":                map[string]interface{}{"requests": map[string]interface{}{"cpu": string("10m"), "memory": string("80Mi")}},
+ 									"terminationMessagePath":   string("/dev/termination-log"),
+ 									"terminationMessagePolicy": string("File"),
  									"volumeMounts":             []interface{}{map[string]interface{}{"mountPath": string("/etc/config"), "name": string("config-volume")}, map[string]interface{}{"mountPath": string("/var/run/secrets/serving-cert"), "name": string("secret-volume")}},
  								},
  							},
+ 							"dnsPolicy":                     string("ClusterFirst"),
  							"priorityClassName":             string("openshift-user-critical"),
  							"restartPolicy":                 string("Never"),
+ 							"schedulerName":                 string("default-scheduler"),
+ 							"securityContext":               map[string]interface{}{},
+ 							"serviceAccount":                string("collect-profiles"),
  							"serviceAccountName":            string("collect-profiles"),
+ 							"terminationGracePeriodSeconds": int64(30),
  							"volumes": []interface{}{
  								map[string]interface{}{
  									"configMap": map[string]interface{}{
+ 										"defaultMode": int64(420),
  										"name":        string("collect-profiles-config"),
  									},
  									"name": string("config-volume"),
  								},
  								map[string]interface{}{
  									"name": string("secret-volume"),
  									"secret": map[string]interface{}{
+ 										"defaultMode": int64(420),
  										"secretName":  string("pprof-cert"),
  									},
  								},
  							},
  						},
  					},
  				},
  			},
  			"schedule":                   string("*/15 * * * *"),
+ 			"successfulJobsHistoryLimit": int64(3),
+ 			"suspend":                    bool(false),
  		},
  		"status": map[string]interface{}{"lastScheduleTime": string("2022-01-10T06:30:00Z"), "lastSuccessfulTime": string("2022-01-10T06:30:11Z")},
  	},
  }
I0110 06:32:44.499764       1 sync_worker.go:771] Done syncing for cronjob "openshift-operator-lifecycle-manager/collect-profiles" (574 of 765)
I0110 06:32:44.499814       1 sync_worker.go:759] Running sync for deployment "openshift-operator-lifecycle-manager/olm-operator" (575 of 765)

Extract the manifest:

# cat 0000_50_olm_07-collect-profiles.cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
  name: collect-profiles
  namespace: openshift-operator-lifecycle-manager
spec:
  schedule: "*/15 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: collect-profiles
          priorityClassName: openshift-user-critical
          containers:
            - name: collect-profiles
              image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2a8d116943a7c1eb32cd161a0de5cb173713724ff419a03abe0382a2d5d9c9a7
              imagePullPolicy: IfNotPresent
              command:
                - bin/collect-profiles
              args:
                - -n
                - openshift-operator-lifecycle-manager
                - --config-mount-path
                - /etc/config
                - --cert-mount-path
                - /var/run/secrets/serving-cert
                - olm-operator-heap-:https://olm-operator-metrics:8443/debug/pprof/heap
                - catalog-operator-heap-:https://catalog-operator-metrics:8443/debug/pprof/heap
              volumeMounts:
                - mountPath: /etc/config
                  name: config-volume
                - mountPath: /var/run/secrets/serving-cert
                  name: secret-volume
              resources:
                requests:
                  cpu: 10m
                  memory: 80Mi
          volumes:
            - name: config-volume
              configMap:
                name: collect-profiles-config
            - name: secret-volume
              secret:
                secretName: pprof-cert
          restartPolicy: Never

Looking at the in-cluster object:

# oc get cronjob.batch/collect-profiles -oyaml -n openshift-operator-lifecycle-manager
apiVersion: batch/v1
kind: CronJob
metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
  creationTimestamp: "2022-01-10T04:35:19Z"
  generation: 1
  name: collect-profiles
  namespace: openshift-operator-lifecycle-manager
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: 334d6c04-126d-4271-96ec-d303e93b7d1c
  resourceVersion: "450801"
  uid: d0b92cd3-3213-466c-921c-d4c4c77f7a6b
spec:
  concurrencyPolicy: Allow
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      creationTimestamp: null
    spec:
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - args:
            - -n
            - openshift-operator-lifecycle-manager
            - --config-mount-path
            - /etc/config
            - --cert-mount-path
            - /var/run/secrets/serving-cert
            - olm-operator-heap-:https://olm-operator-metrics:8443/debug/pprof/heap
            - catalog-operator-heap-:https://catalog-operator-metrics:8443/debug/pprof/heap
            command:
            - bin/collect-profiles
            image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2a8d116943a7c1eb32cd161a0de5cb173713724ff419a03abe0382a2d5d9c9a7
            imagePullPolicy: IfNotPresent
            name: collect-profiles
            resources:
              requests:
                cpu: 10m
                memory: 80Mi
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /etc/config
              name: config-volume
            - mountPath: /var/run/secrets/serving-cert
              name: secret-volume
          dnsPolicy: ClusterFirst
          priorityClassName: openshift-user-critical
          restartPolicy: Never
          schedulerName: default-scheduler
          securityContext: {}
          serviceAccount: collect-profiles
          serviceAccountName: collect-profiles
          terminationGracePeriodSeconds: 30
          volumes:
          - configMap:
              defaultMode: 420
              name: collect-profiles-config
            name: config-volume
          - name: secret-volume
            secret:
              defaultMode: 420
              secretName: pprof-cert
  schedule: '*/15 * * * *'
  successfulJobsHistoryLimit: 3
  suspend: false
status:
  lastScheduleTime: "2022-01-11T03:00:00Z"
  lastSuccessfulTime: "2022-01-11T03:00:07Z"

Version-Release number of the following components:
4.10.0-0.nightly-2022-01-09-195852

How reproducible:
1/1

Steps to Reproduce:
1. Install a 4.10 cluster
2. Grep 'Updating .*due to diff' in the CVO log to check for hot-looping
3.

Actual results:
CVO hotloops on CronJob openshift-operator-lifecycle-manager/collect-profiles

Expected results:
CVO should not hotloop on it in a fresh installed cluster

Additional info:
attachment 1850058 CVO log file
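One way to avoid this class of loop is to compare only the fields the manifest actually specifies, so that server-defaulted fields (concurrencyPolicy, suspend, defaultMode, and so on) never register as drift. A minimal illustrative helper, not the actual CVO code:

package main

import "fmt"

// driftFromManifest reports whether any key the manifest specifies differs
// in the live object; keys present only in the live object (server-side
// defaults) are ignored. Nested maps are compared recursively.
func driftFromManifest(manifest, live map[string]interface{}) bool {
	for k, mv := range manifest {
		lv, ok := live[k]
		if !ok {
			return true // a specified field is missing from the live object
		}
		mm, mok := mv.(map[string]interface{})
		lm, lok := lv.(map[string]interface{})
		if mok && lok {
			if driftFromManifest(mm, lm) {
				return true
			}
			continue
		}
		if fmt.Sprint(mv) != fmt.Sprint(lv) {
			return true
		}
	}
	return false
}

func main() {
	manifest := map[string]interface{}{"spec": map[string]interface{}{"schedule": "*/15 * * * *"}}
	live := map[string]interface{}{"spec": map[string]interface{}{
		"schedule": "*/15 * * * *",
		"suspend":  false, // server default, absent from the manifest
	}}
	fmt.Println(driftFromManifest(manifest, live)) // false: no update needed
}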

Due to the removal of the in-tree AWS provider (https://github.com/kubernetes/kubernetes/pull/115838), we need to ensure that KCM sets the --external-cloud-volume-plugin flag accordingly, especially since CSI migration was GA-ed in 4.12/1.25.

Description of problem:

When trying to delete an unmanaged BMH object, Metal3 cannot delete it. The BMH object is unmanaged because it provides no BMC information (neither address nor credentials).

In this case Metal3 tries to delete the host but fails and never finalizes, so the BMH deletion gets stuck.
This is the log from Metal3:

{"level":"info","ts":1676531586.4898946,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/worker-0.el8k-ztp-1.hpecloud.org"}                                                                                          
{"level":"info","ts":1676531586.4980938,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/master-1.el8k-ztp-1.hpecloud.org"}                                                                                          
{"level":"info","ts":1676531586.5050912,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/master-2.el8k-ztp-1.hpecloud.org"}                                                                                          
{"level":"info","ts":1676531586.5105371,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/worker-0.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600}                                
{"level":"info","ts":1676531586.51569,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/master-0.el8k-ztp-1.hpecloud.org"}                                                                                            
{"level":"info","ts":1676531586.5191178,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/master-1.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600}                                
{"level":"info","ts":1676531586.525755,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/master-2.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600}                                 
{"level":"info","ts":1676531586.5356712,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/master-0.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600}                                
{"level":"info","ts":1676532186.5117555,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/worker-0.el8k-ztp-1.hpecloud.org"}                                                                                          
{"level":"info","ts":1676532186.5195107,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/master-1.el8k-ztp-1.hpecloud.org"}                                                                                          
{"level":"info","ts":1676532186.526355,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/master-2.el8k-ztp-1.hpecloud.org"}                                                                                           
{"level":"info","ts":1676532186.5317476,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/worker-0.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600}
{"level":"info","ts":1676532186.5361836,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/master-0.el8k-ztp-1.hpecloud.org"}                                                                                          
{"level":"info","ts":1676532186.5404322,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/master-1.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600}
{"level":"info","ts":1676532186.5482726,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/master-2.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600}
{"level":"info","ts":1676532186.555394,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/master-0.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600}
{"level":"info","ts":1676532532.3448665,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/worker-1.el8k-ztp-1.hpecloud.org"}                                                                                          
{"level":"info","ts":1676532532.344922,"logger":"controllers.BareMetalHost","msg":"hardwareData is ready to be deleted","baremetalhost":"openshift-machine-api/worker-1.el8k-ztp-1.hpecloud.org"}
{"level":"info","ts":1676532532.3656478,"logger":"controllers.BareMetalHost","msg":"Initiating host deletion","baremetalhost":"openshift-machine-api/worker-1.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged"}
{"level":"error","ts":1676532532.3656952,"msg":"Reconciler error","controller":"baremetalhost","controllerGroup":"metal3.io","controllerKind":"BareMetalHost","bareMetalHost":{"name":"worker-1.el8k-ztp-1.hpecloud.org","namespace":"openshift-machine-api"},
"namespace":"openshift-machine-api","name":"worker-1.el8k-ztp-1.hpecloud.org","reconcileID":"525a5b7d-077d-4d1e-a618-33d6041feb33","error":"action \"unmanaged\" failed: failed to determine current provisioner capacity: failed to parse BMC address informa
tion: missing BMC address","errorVerbose":"missing BMC address\ngithub.com/metal3-io/baremetal-operator/pkg/hardwareutils/bmc.NewAccessDetails\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/github.com/metal3-io/baremetal-operator/pkg/hardwareu
tils/bmc/access.go:145\ngithub.com/metal3-io/baremetal-operator/pkg/provisioner/ironic.(*ironicProvisioner).bmcAccess\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/ironic/ironic.go:112\ngithub.com/metal3-io/baremetal-operator/pkg/pro
visioner/ironic.(*ironicProvisioner).HasCapacity\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/ironic/ironic.go:1922\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ensureCapacity\n\t/go/src/githu
b.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:83\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).updateHostStateFrom\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/meta
l3.io/host_state_machine.go:106\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:175\ngithub.com/metal
3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:186\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareM
etalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:226\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremet
al-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/contr
oller-runtime/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/contro
ller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234\nruntime.goexit\
n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594\nfailed to parse BMC address information\ngithub.com/metal3-io/baremetal-operator/pkg/provisioner/ironic.(*ironicProvisioner).bmcAccess\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/iro
nic/ironic.go:114\ngithub.com/metal3-io/baremetal-operator/pkg/provisioner/ironic.(*ironicProvisioner).HasCapacity\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/ironic/ironic.go:1922\ngithub.com/metal3-io/baremetal-operator/controlle
rs/metal3%2eio.(*hostStateMachine).ensureCapacity\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:83\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).updateHostStateFrom\n
\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:106\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState.func1\n\t/go/src/github.com/metal3-io/baremetal-operator
/controllers/metal3.io/host_state_machine.go:175\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:186\ngithu
b.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:226\nsigs.k8s.io/controller-runtime/pkg/internal/controll
er.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/sr
c/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-
operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-
runtime/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594\nfailed to determine current provisioner capacity\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ensur
eCapacity\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:85\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).updateHostStateFrom\n\t/go/src/github.com/metal3-io/baremetal
-operator/controllers/metal3.io/host_state_machine.go:106\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machin
e.go:175\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:186\ngithub.com/metal3-io/baremetal-operator/contr
ollers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:226\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/gi
thub.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operato
r/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-r
untime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controll
er.go:234\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594\naction \"unmanaged\" failed\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operato
r/controllers/metal3.io/baremetalhost_controller.go:230\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/contr
oller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller
-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.
(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594","stacktrace":"sigs.k8s.io/cont
roller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/contr
oller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234"}

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Provide a BMH object with no BMC credentials. The BMH is set unmanaged.

Steps to Reproduce:

1. Delete the object.
2. The deletion gets stuck.
3.

Actual results:

The deletion gets stuck.

Expected results:

Metal3 should detect that the BMH is unmanaged and not attempt deprovisioning, as sketched below.
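A hedged sketch of that expected delete path, using placeholder types rather than the actual baremetal-operator state machine:

package main

import "fmt"

type bareMetalHost struct {
	state      string // e.g. "unmanaged", "provisioned", ...
	bmcAddress string
	finalizers []string
}

// handleDelete sketches the expected flow: an unmanaged host has no BMC
// address, so Ironic can never act on it; skip deprovisioning entirely and
// drop the finalizer so the API server can remove the object.
func handleDelete(h *bareMetalHost) {
	if h.state == "unmanaged" || h.bmcAddress == "" {
		h.finalizers = nil
		return
	}
	// ... normal deprovisioning flow for managed hosts ...
}

func main() {
	h := &bareMetalHost{state: "unmanaged", finalizers: []string{"baremetalhost.metal3.io"}}
	handleDelete(h)
	fmt.Println("finalizers:", h.finalizers) // []: deletion can now complete
}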

Additional info:

 

Description of problem: 

The "pipelines-as-code-pipelinerun-go" ConfigMap is not used for a Go repository when creating a Pipeline Repository; the "pipelines-as-code-pipelinerun-generic" ConfigMap is used instead.

Prerequisites (if any, like setup, operators/versions):

Install the Red Hat OpenShift Pipelines operator

Steps to Reproduce

  1. Navigate to Create Repository form 
  2. Enter the Git URL `https://github.com/vikram-raj/hello-func-go`
  3. Click on Add

Actual results:

The `pipelines-as-code-pipelinerun-generic` PipelineRun template is shown on the overview page 

Expected results:

The `pipelines-as-code-pipelinerun-go` PipelineRun template should be shown on the overview page

Reproducibility (Always/Intermittent/Only Once):

Build Details:

4.13

Workaround:

Additional info:

Description of problem:

The APIServer endpoint isn't healthy after a PublicAndPrivate cluster is created. The cluster's PROGRESS is Completed and PROGRESSING is False, the nodes are Ready, and the cluster operators on the guest cluster are Available; the only issue is that the Available condition is False because the APIServer endpoint is not healthy.

jiezhao-mac:hypershift jiezhao$ oc get hostedcluster -n clusters
NAME   VERSION               KUBECONFIG         PROGRESS  AVAILABLE  PROGRESSING  MESSAGE
jz-test  4.14.0-0.nightly-2023-04-30-235516  jz-test-admin-kubeconfig  Completed  False    False     APIServer endpoint a23663b1e738a4d6783f6256da73fe76-2649b36a23f49ed7.elb.us-east-2.amazonaws.com is not healthy

jiezhao-mac:hypershift jiezhao$ oc get hostedcluster/jz-test -n clusters -ojsonpath='{.spec.platform.aws.endpointAccess}{"\n"}'
PublicAndPrivate

jiezhao-mac:hypershift jiezhao$ oc get pods -n clusters-jz-test
NAME                                                  READY   STATUS    RESTARTS   AGE
aws-cloud-controller-manager-666559d4f-rdsw4          2/2     Running   0          149m
aws-ebs-csi-driver-controller-79fdfb6c76-vb7wr        7/7     Running   0          148m
aws-ebs-csi-driver-operator-7dbd789984-mb9rp          1/1     Running   0          148m
capi-provider-5b7847db9-nlrvz                         2/2     Running   0          151m
catalog-operator-7ccb468d86-7c5j6                     2/2     Running   0          149m
certified-operators-catalog-895787778-5rjb6           1/1     Running   0          149m
cloud-network-config-controller-86698fd7dd-kgzhv      3/3     Running   0          148m
cluster-api-6fd4f86878-hjw59                          1/1     Running   0          151m
cluster-autoscaler-bdd688949-f9xmk                    1/1     Running   0          150m
cluster-image-registry-operator-6f5cb67d88-8svd6      3/3     Running   0          149m
cluster-network-operator-7bc69f75f4-npjfs             1/1     Running   0          149m
cluster-node-tuning-operator-5855b6576b-rckhh         1/1     Running   0          149m
cluster-policy-controller-56d4d6b57c-glx4w            1/1     Running   0          149m
cluster-storage-operator-7cc56c68bb-jd4d2             1/1     Running   0          149m
cluster-version-operator-bd969b677-bh4w4              1/1     Running   0          149m
community-operators-catalog-5c545484d7-hbzb4          1/1     Running   0          149m
control-plane-operator-fc49dcbb4-5ncvf                2/2     Running   0          151m
csi-snapshot-controller-85f7cc9945-n5vgq              1/1     Running   0          149m
csi-snapshot-controller-operator-6597b45897-hqf5p     1/1     Running   0          149m
csi-snapshot-webhook-644d765546-lk9hj                 1/1     Running   0          149m
dns-operator-5b5577d6c7-8dh8d                         1/1     Running   0          149m
etcd-0                                                2/2     Running   0          150m
hosted-cluster-config-operator-5b75ccf55d-6rzch       1/1     Running   0          149m
ignition-server-596fc9d9fb-sb94h                      1/1     Running   0          150m
ingress-operator-6497d476bc-whssz                     3/3     Running   0          149m
konnectivity-agent-6656d8dfd6-h5tcs                   1/1     Running   0          150m
konnectivity-server-5ff9d4b47-stb2m                   1/1     Running   0          150m
kube-apiserver-596fc4bb8b-7kfd8                       3/3     Running   0          150m
kube-controller-manager-6f86bb7fbd-4wtxk              1/1     Running   0          138m
kube-scheduler-bf5876b4b-flk96                        1/1     Running   0          149m
machine-approver-574585d8dd-h5ffh                     1/1     Running   0          150m
multus-admission-controller-67b6f85fbf-bfg4x          2/2     Running   0          148m
oauth-openshift-6b6bfd55fb-8sdq7                      2/2     Running   0          148m
olm-operator-5d97fb977c-sbf6w                         2/2     Running   0          149m
openshift-apiserver-5bb9f99974-2lfp4                  3/3     Running   0          138m
openshift-controller-manager-65666bdf79-g8cf5         1/1     Running   0          149m
openshift-oauth-apiserver-56c8565bb6-6b5cv            2/2     Running   0          149m
openshift-route-controller-manager-775f844dfc-jj2ft   1/1     Running   0          149m
ovnkube-master-0                                      7/7     Running   0          148m
packageserver-6587d9674b-6jwpv                        2/2     Running   0          149m
redhat-marketplace-catalog-5f6d45b457-hdn77           1/1     Running   0          149m
redhat-operators-catalog-7958c4449b-l4hbx             1/1     Running   0          12m
router-5b7899cc97-chs6t                               1/1     Running   0          150m

jiezhao-mac:hypershift jiezhao$ oc get node --kubeconfig=hostedcluster.kubeconfig 
NAME                                        STATUS   ROLES    AGE    VERSION
ip-10-0-137-99.us-east-2.compute.internal   Ready    worker   131m   v1.26.2+d2e245f
ip-10-0-140-85.us-east-2.compute.internal   Ready    worker   132m   v1.26.2+d2e245f
ip-10-0-141-46.us-east-2.compute.internal   Ready    worker   131m   v1.26.2+d2e245f
jiezhao-mac:hypershift jiezhao$ 
jiezhao-mac:hypershift jiezhao$ oc get co --kubeconfig=hostedcluster.kubeconfig 
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
console                                    4.14.0-0.nightly-2023-04-30-235516   True        False         False      126m    
csi-snapshot-controller                    4.14.0-0.nightly-2023-04-30-235516   True        False         False      140m    
dns                                        4.14.0-0.nightly-2023-04-30-235516   True        False         False      129m    
image-registry                             4.14.0-0.nightly-2023-04-30-235516   True        False         False      128m    
ingress                                    4.14.0-0.nightly-2023-04-30-235516   True        False         False      129m    
insights                                   4.14.0-0.nightly-2023-04-30-235516   True        False         False      130m    
kube-apiserver                             4.14.0-0.nightly-2023-04-30-235516   True        False         False      140m    
kube-controller-manager                    4.14.0-0.nightly-2023-04-30-235516   True        False         False      140m    
kube-scheduler                             4.14.0-0.nightly-2023-04-30-235516   True        False         False      140m    
kube-storage-version-migrator              4.14.0-0.nightly-2023-04-30-235516   True        False         False      129m    
monitoring                                 4.14.0-0.nightly-2023-04-30-235516   True        False         False      129m    
network                                    4.14.0-0.nightly-2023-04-30-235516   True        False         False      140m    
node-tuning                                4.14.0-0.nightly-2023-04-30-235516   True        False         False      131m    
openshift-apiserver                        4.14.0-0.nightly-2023-04-30-235516   True        False         False      140m    
openshift-controller-manager               4.14.0-0.nightly-2023-04-30-235516   True        False         False      140m    
openshift-samples                          4.14.0-0.nightly-2023-04-30-235516   True        False         False      129m    
operator-lifecycle-manager                 4.14.0-0.nightly-2023-04-30-235516   True        False         False      140m    
operator-lifecycle-manager-catalog         4.14.0-0.nightly-2023-04-30-235516   True        False         False      140m    
operator-lifecycle-manager-packageserver   4.14.0-0.nightly-2023-04-30-235516   True        False         False      140m    
service-ca                                 4.14.0-0.nightly-2023-04-30-235516   True        False         False      130m    
storage                                    4.14.0-0.nightly-2023-04-30-235516   True        False         False      131m    
jiezhao-mac:hypershift jiezhao$ 

HC conditions:
==============
  status:
    conditions:
    - lastTransitionTime: "2023-05-01T19:45:49Z"
      message: All is well
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: ValidAWSIdentityProvider
    - lastTransitionTime: "2023-05-01T20:00:18Z"
      message: Cluster version is 4.14.0-0.nightly-2023-04-30-235516
      observedGeneration: 3
      reason: FromClusterVersion
      status: "False"
      type: ClusterVersionProgressing
    - lastTransitionTime: "2023-05-01T19:46:22Z"
      message: Payload loaded version="4.14.0-0.nightly-2023-04-30-235516" image="registry.ci.openshift.org/ocp/release:4.14.0-0.nightly-2023-04-30-235516"
        architecture="amd64"
      observedGeneration: 3
      reason: PayloadLoaded
      status: "True"
      type: ClusterVersionReleaseAccepted
    - lastTransitionTime: "2023-05-01T20:03:14Z"
      message: Condition not found in the CVO.
      observedGeneration: 3
      reason: StatusUnknown
      status: Unknown
      type: ClusterVersionUpgradeable
    - lastTransitionTime: "2023-05-01T20:00:18Z"
      message: Done applying 4.14.0-0.nightly-2023-04-30-235516
      observedGeneration: 3
      reason: FromClusterVersion
      status: "True"
      type: ClusterVersionAvailable
    - lastTransitionTime: "2023-05-01T20:00:18Z"
      message: ""
      observedGeneration: 3
      reason: FromClusterVersion
      status: "True"
      type: ClusterVersionSucceeding
    - lastTransitionTime: "2023-05-01T19:47:51Z"
      message: The hosted cluster is not degraded
      observedGeneration: 3
      reason: AsExpected
      status: "False"
      type: Degraded
    - lastTransitionTime: "2023-05-01T19:45:01Z"
      message: ""
      observedGeneration: 3
      reason: QuorumAvailable
      status: "True"
      type: EtcdAvailable
    - lastTransitionTime: "2023-05-01T19:45:38Z"
      message: Kube APIServer deployment is available
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: KubeAPIServerAvailable
    - lastTransitionTime: "2023-05-01T19:44:27Z"
      message: All is well
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: InfrastructureReady
    - lastTransitionTime: "2023-05-01T19:44:11Z"
      message: External DNS is not configured
      observedGeneration: 3
      reason: StatusUnknown
      status: Unknown
      type: ExternalDNSReachable
    - lastTransitionTime: "2023-05-01T19:44:19Z"
      message: Configuration passes validation
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: ValidHostedControlPlaneConfiguration
    - lastTransitionTime: "2023-05-01T19:44:11Z"
      message: AWS KMS is not configured
      observedGeneration: 3
      reason: StatusUnknown
      status: Unknown
      type: ValidAWSKMSConfig
    - lastTransitionTime: "2023-05-01T19:44:37Z"
      message: All is well
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: ValidReleaseInfo
    - lastTransitionTime: "2023-05-01T19:44:11Z"
      message: APIServer endpoint a23663b1e738a4d6783f6256da73fe76-2649b36a23f49ed7.elb.us-east-2.amazonaws.com
        is not healthy
      observedGeneration: 3
      reason: waitingForAvailable
      status: "False"
      type: Available
    - lastTransitionTime: "2023-05-01T19:47:18Z"
      message: All is well
      reason: AWSSuccess
      status: "True"
      type: AWSEndpointAvailable
    - lastTransitionTime: "2023-05-01T19:47:18Z"
      message: All is well
      reason: AWSSuccess
      status: "True"
      type: AWSEndpointServiceAvailable
    - lastTransitionTime: "2023-05-01T19:44:11Z"
      message: Configuration passes validation
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: ValidConfiguration
    - lastTransitionTime: "2023-05-01T19:44:11Z"
      message: HostedCluster is supported by operator configuration
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: SupportedHostedCluster
    - lastTransitionTime: "2023-05-01T19:45:39Z"
      message: Ignition server deployment is available
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: IgnitionEndpointAvailable
    - lastTransitionTime: "2023-05-01T19:44:11Z"
      message: Reconciliation active on resource
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: ReconciliationActive
    - lastTransitionTime: "2023-05-01T19:44:12Z"
      message: Release image is valid
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: ValidReleaseImage
    - lastTransitionTime: "2023-05-01T19:44:12Z"
      message: HostedCluster is at expected version
      observedGeneration: 3
      reason: AsExpected
      status: "False"
      type: Progressing
    - lastTransitionTime: "2023-05-01T19:44:13Z"
      message: OIDC configuration is valid
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: ValidOIDCConfiguration
    - lastTransitionTime: "2023-05-01T19:44:13Z"
      message: Reconciliation completed succesfully
      observedGeneration: 3
      reason: ReconciliatonSucceeded
      status: "True"
      type: ReconciliationSucceeded
    - lastTransitionTime: "2023-05-01T19:45:52Z"
      message: All is well
      observedGeneration: 3
      reason: AsExpected
      status: "True"
      type: AWSDefaultSecurityGroupCreated

kube-apiserver log:
==================
E0501 19:45:07.024278       7 memcache.go:238] couldn't get current server API group list: Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_03_authorization-openshift_01_rolebindingrestriction.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_03_config-operator_01_proxy.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_03_quota-openshift_01_clusterresourcequota.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_03_security-openshift_01_scc.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_03_securityinternal-openshift_02_rangeallocation.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_apiserver-Default.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_authentication.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_build.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_console.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_dns.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_featuregate.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_image.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_imagecontentpolicy.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_imagecontentsourcepolicy.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_imagedigestmirrorset.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_imagetagmirrorset.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_infrastructure-Default.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_ingress.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_network.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_node.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_oauth.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_project.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
unable to recognize "/work/0000_10_config-operator_01_scheduler.crd.yaml": Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused

Version-Release number of selected component (if applicable):

 

How reproducible:

always

Steps to Reproduce:

1. Create a PublicAndPrivate cluster

Actual results:

APIServer endpoint is not healthy, and HC condition Type 'Available' is False

Expected results:

APIServer endpoint should be healthy, and Type 'Available' should be True

Additional info:
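As a quick sanity check of the reported symptom outside the operator, the endpoint named in the Available condition can be probed directly. A minimal Go sketch (a hypothetical helper, not HyperShift code; a plain TCP dial only proves the load balancer accepts connections, not that the API server is healthy):

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Endpoint copied from the Available condition message above.
	endpoint := "a23663b1e738a4d6783f6256da73fe76-2649b36a23f49ed7.elb.us-east-2.amazonaws.com:6443"

	// Distinguishes "connection refused"/timeout from a listening endpoint.
	conn, err := net.DialTimeout("tcp", endpoint, 5*time.Second)
	if err != nil {
		fmt.Println("endpoint not reachable:", err)
		return
	}
	defer conn.Close()
	fmt.Println("endpoint reachable")
}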

 

Description of problem:

While updating a cluster to 4.12.11, which contains the bug fix for OCPBUGS-7999 (https://issues.redhat.com/browse/OCPBUGS-7999, the 4.12.z backport of OCPBUGS-2783, https://issues.redhat.com/browse/OCPBUGS-2783), the older {Custom|Default}RouteSync{Degraded|Progressing} conditions are not cleaned up as they should be per the OCPBUGS-2783 resolution, while the newer ones are added.

Because of this, an upgrade to 4.12.11 (or higher, until this bug is fixed) can hit a problem very similar to the one that led to OCPBUGS-2783 in the first place, but while upgrading to 4.12.11.

So we need to do a proper cleanup of the older conditions.

Version-Release number of selected component (if applicable):

4.12.11 and higher

How reproducible:

Always, as far as the wrong conditions are concerned. It only leads to visible issues if one of the wrong conditions was in an unhealthy state.

Steps to Reproduce:

1. Upgrade

Actual results:

Both new (and correct) conditions plus older (and wrong) conditions.

Expected results:

Both new (and correct) conditions only.

Additional info:

The problem seems to be that the stale-conditions controller is created[1] with a list containing CustomRouteSync and DefaultRouteSync, while that list should be CustomRouteSyncDegraded, CustomRouteSyncProgressing, DefaultRouteSyncDegraded and DefaultRouteSyncProgressing. Reading the controller's source, it does not match prefixes; it performs a literal comparison of condition types.

[1] - https://github.com/openshift/console-operator/blob/0b54727/pkg/console/starter/starter.go#L403-L404
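To illustrate why the two prefixes never match anything, here is a minimal Go sketch of the literal comparison (isStale is a hypothetical helper, not the controller's actual code):

package main

import "fmt"

// isStale reports whether condType appears verbatim in staleList,
// mirroring the literal comparison described above.
func isStale(condType string, staleList []string) bool {
	for _, s := range staleList {
		if condType == s { // literal match only; prefixes never match
			return true
		}
	}
	return false
}

func main() {
	wrongList := []string{"CustomRouteSync", "DefaultRouteSync"}
	fixedList := []string{
		"CustomRouteSyncDegraded", "CustomRouteSyncProgressing",
		"DefaultRouteSyncDegraded", "DefaultRouteSyncProgressing",
	}
	fmt.Println(isStale("CustomRouteSyncDegraded", wrongList)) // false: never cleaned up
	fmt.Println(isStale("CustomRouteSyncDegraded", fixedList)) // true: removed as stale
}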

Description of problem:

The NS autolabeler should adjust the PSS namespace labels such that a previously permitted workload (based on the SCCs it has access to) can still run.

The autolabeler requires the RoleBinding's .subjects[].namespace to be set when .subjects[].kind is ServiceAccount, even though the RBAC system does not require it to successfully bind the SA to a Role.

Version-Release number of selected component (if applicable):

$ oc version
Client Version: 4.7.0-0.ci-2021-05-21-142747
Server Version: 4.12.0-0.nightly-2022-08-15-150248
Kubernetes Version: v1.24.0+da80cd0

How reproducible: 100%

Steps to Reproduce:

---
apiVersion: v1
kind: Namespace
metadata:
  name: test

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mysa
  namespace: test

---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: myrole
  namespace: test
rules:
- apiGroups:
  - security.openshift.io
  resourceNames:
  - privileged
  resources:
  - securitycontextconstraints
  verbs:
  - use

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: myrb
  namespace: test
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: myrole
subjects:
- kind: ServiceAccount
  name: mysa
  #namespace: test  # This is required for the autolabeler

---
kind: Job
apiVersion: batch/v1
metadata:
  name: myjob
  namespace: test
spec:
  template:
    spec:
      containers:
        - name: ubi
          image: registry.access.redhat.com/ubi8
          command: ["/bin/bash", "-c"]
          args: ["whoami; sleep infinity"]
      restartPolicy: Never
      securityContext:
        runAsUser: 0
      serviceAccount: mysa
      terminationGracePeriodSeconds: 2

Actual results:

Applying the manifests above, the Job's pod will not start:

$ kubectl -n test describe job/myjob
...
Events:
  Type     Reason        Age   From            Message
  ----     ------        ----  ----            -------
  Warning  FailedCreate  20s   job-controller  Error creating: pods "myjob-zxcvv" is forbidden: violates PodSecurity "restricted:v1.24": allowPrivilegeEscalation != false (container "ubi" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "ubi" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "ubi" must set securityContext.runAsNonRoot=true), runAsUser=0 (pod must not set runAsUser=0), seccompProfile (pod or container "ubi" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
  Warning  FailedCreate  20s   job-controller  Error creating: pods "myjob-fkb9x" is forbidden: violates PodSecurity "restricted:v1.24": allowPrivilegeEscalation != false (container "ubi" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "ubi" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "ubi" must set securityContext.runAsNonRoot=true), runAsUser=0 (pod must not set runAsUser=0), seccompProfile (pod or container "ubi" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
  Warning  FailedCreate  10s   job-controller  Error creating: pods "myjob-5klpc" is forbidden: violates PodSecurity "restricted:v1.24": allowPrivilegeEscalation != false (container "ubi" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "ubi" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "ubi" must set securityContext.runAsNonRoot=true), runAsUser=0 (pod must not set runAsUser=0), seccompProfile (pod or container "ubi" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")

Uncommenting the "namespace" field in the RoleBinding allows the pod to start, because the autolabeler then adjusts the Namespace labels.
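One way the autolabeler could tolerate the omitted field is to fall back to the RoleBinding's own namespace when a ServiceAccount subject does not set one, matching the binding behavior described above. A minimal Go sketch (subjectNamespace is a hypothetical helper, not the autolabeler's actual code):

package main

import "fmt"

// subjectNamespace resolves the namespace of a RoleBinding subject of
// kind ServiceAccount, falling back to the binding's own namespace when
// the subject omits it.
func subjectNamespace(subjectNS, bindingNS string) string {
	if subjectNS == "" {
		return bindingNS
	}
	return subjectNS
}

func main() {
	// Subject from the RoleBinding above: namespace omitted.
	fmt.Println(subjectNamespace("", "test")) // "test"
}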

However, the namespace field isn't actually required by the RBAC system. Instead of relying on the autolabeler, the pod can be allowed to run (without uncommenting the field) by:

$ kubectl label ns/test security.openshift.io/scc.podSecurityLabelSync=false
namespace/test labeled
$ kubectl label ns/test pod-security.kubernetes.io/enforce=privileged --overwrite
namespace/test labeled

 

We now see that the pod is running as root and has access to the privileged SCC:

$ kubectl -n test get po -oyaml
apiVersion: v1
items:
- apiVersion: v1
  kind: Pod
  metadata:
    annotations:
      k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["10.129.2.18/23"],"mac_address":"0a:58:0a:81:02:12","gateway_ips":["10.129.2.1"],"ip_address":"10.129.2.18/23","gateway_ip":"10.129.2.1"}}'
      k8s.v1.cni.cncf.io/network-status: |-
        [{
            "name": "ovn-kubernetes",
            "interface": "eth0",
            "ips": [
                "10.129.2.18"
            ],
            "mac": "0a:58:0a:81:02:12",
            "default": true,
            "dns": {}
        }]
      k8s.v1.cni.cncf.io/networks-status: |-
        [{
            "name": "ovn-kubernetes",
            "interface": "eth0",
            "ips": [
                "10.129.2.18"
            ],
            "mac": "0a:58:0a:81:02:12",
            "default": true,
            "dns": {}
        }]
      openshift.io/scc: privileged
    creationTimestamp: "2022-08-16T13:08:24Z"
    generateName: myjob-
    labels:
      controller-uid: 1867dbe6-73b2-44ea-a324-45c9273107b8
      job-name: myjob
    name: myjob-rwjmv
    namespace: test
    ownerReferences:
    - apiVersion: batch/v1
      blockOwnerDeletion: true
      controller: true
      kind: Job
      name: myjob
      uid: 1867dbe6-73b2-44ea-a324-45c9273107b8
    resourceVersion: "36418"
    uid: 39f18dea-31d4-4783-85b5-8ae6a8bec1f4
  spec:
    containers:
    - args:
      - whoami; sleep infinity
      command:
      - /bin/bash
      - -c
      image: registry.access.redhat.com/ubi8
      imagePullPolicy: Always
      name: ubi
      resources: {}
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: kube-api-access-6f2h6
        readOnly: true
    dnsPolicy: ClusterFirst
    enableServiceLinks: true
    imagePullSecrets:
    - name: mysa-dockercfg-mvmtn
    nodeName: ip-10-0-140-172.ec2.internal
    preemptionPolicy: PreemptLowerPriority
    priority: 0
    restartPolicy: Never
    schedulerName: default-scheduler
    securityContext:
      runAsUser: 0
    serviceAccount: mysa
    serviceAccountName: mysa
    terminationGracePeriodSeconds: 2
    tolerations:
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
    volumes:
    - name: kube-api-access-6f2h6
      projected:
        defaultMode: 420
        sources:
        - serviceAccountToken:
            expirationSeconds: 3607
            path: token
        - configMap:
            items:
            - key: ca.crt
              path: ca.crt
            name: kube-root-ca.crt
        - downwardAPI:
            items:
            - fieldRef:
                apiVersion: v1
                fieldPath: metadata.namespace
              path: namespace
        - configMap:
            items:
            - key: service-ca.crt
              path: service-ca.crt
            name: openshift-service-ca.crt
  status:
    conditions:
    - lastProbeTime: null
      lastTransitionTime: "2022-08-16T13:08:24Z"
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: "2022-08-16T13:08:28Z"
      status: "True"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: "2022-08-16T13:08:28Z"
      status: "True"
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: "2022-08-16T13:08:24Z"
      status: "True"
      type: PodScheduled
    containerStatuses:
    - containerID: cri-o://8fd1c3a5ee565a1089e4e6032bd04bceabb5ab3946c34a2bb55d3ee696baa007
      image: registry.access.redhat.com/ubi8:latest
      imageID: registry.access.redhat.com/ubi8@sha256:08e221b041a95e6840b208c618ae56c27e3429c3dad637ece01c9b471cc8fac6
      lastState: {}
      name: ubi
      ready: true
      restartCount: 0
      started: true
      state:
        running:
          startedAt: "2022-08-16T13:08:28Z"
    hostIP: 10.0.140.172
    phase: Running
    podIP: 10.129.2.18
    podIPs:
    - ip: 10.129.2.18
    qosClass: BestEffort
    startTime: "2022-08-16T13:08:24Z"
kind: List
metadata:
  resourceVersion: ""

 

$ kubectl -n test logs job/myjob
root

 

Expected results:

The autolabeler should properly follow the RoleBinding back to the SCC, even when the subject's namespace is omitted.

 

Additional info:

Version:

$ openshift-install version

./openshift-install 4.9.11
built from commit 4ee186bb88bf6aeef8ccffd0b5d4e98e9ddd895f
release image quay.io/openshift-release-dev/ocp-release@sha256:0f72e150329db15279a1aeda1286c9495258a4892bc5bf1bf5bb89942cd432de
release architecture amd64

Platform: OpenStack

install type: IPI

What happened?

Image streams use the Swift container to store their images. After creating many image streams, the Swift container holds a huge number of objects, and destroying the cluster at that point takes a very long time, proportional to the size of the container.

What did you expect to happen?

The destroy script should clean up the resources in a reasonable time.

How to reproduce it (as minimally and precisely as possible)?

Deploy OCP, run a workload that creates a lot of image streams, then destroy the cluster; the destroy command takes a long time to complete.

Anything else we need to know?

Here is the output of the swift stat command and the time it took to complete the destroy job:

$ swift stat vlan609-26jxm-image-registry-nseyclolgfgxoaiysrlejlhvoklcawbxt
Account: AUTH_2b4d979a2a9e4cf88b2509e9c5e0e232
Container: vlan609-26jxm-image-registry-nseyclolgfgxoaiysrlejlhvoklcawbxt
Objects: 723756
Bytes: 652448740473
Read ACL:
Write ACL:
Sync To:
Sync Key:
Meta Name: vlan609-26jxm-image-registry-nseyclolgfgxoaiysrlejlhvoklcawbxt
Meta Openshiftclusterid: vlan609-26jxm
Content-Type: application/json; charset=utf-8
X-Timestamp: 1640248399.77606
Last-Modified: Thu, 23 Dec 2021 08:34:48 GMT
Accept-Ranges: bytes
X-Storage-Policy: Policy-0
X-Trans-Id: txb0717d5198e344a5a095d-0061c93b70
X-Openstack-Request-Id: txb0717d5198e344a5a095d-0061c93b70

Time taken to complete the destroy: 6455.42s
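For reference, deleting this many objects one by one is what dominates the destroy time; draining the container with a bounded worker pool can shorten it considerably. A rough Go sketch, assuming the gophercloud v1 object-storage bindings (exact names and signatures may differ):

package swiftcleanup

import (
	"fmt"
	"sync"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/objectstorage/v1/objects"
	"github.com/gophercloud/gophercloud/pagination"
)

// deleteContainerObjects drains a Swift container with a bounded worker
// pool; with ~700k objects, serial deletes dominate the destroy time.
func deleteContainerObjects(client *gophercloud.ServiceClient, container string, workers int) error {
	names := make(chan string, workers)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for name := range names {
				if res := objects.Delete(client, container, name, nil); res.Err != nil {
					fmt.Printf("delete %s: %v\n", name, res.Err)
				}
			}
		}()
	}
	// List object names page by page and feed them to the workers.
	err := objects.List(client, container, nil).EachPage(func(page pagination.Page) (bool, error) {
		batch, err := objects.ExtractNames(page)
		if err != nil {
			return false, err
		}
		for _, n := range batch {
			names <- n
		}
		return true, nil
	})
	close(names)
	wg.Wait()
	return err
}

A modest worker count (for example, a few dozen) is usually a reasonable trade-off against Swift-side rate limits.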

Please review the following PR: https://github.com/openshift/prom-label-proxy/pull/355

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

I get a synchronization error in a fully disconnected environment when I synchronize twice with the target mirror and there is no change/diff between the first synchronization and the second. The first synchronization works; the second one fails with an error and exit code -1.

 

This case occurs when you synchronize your disconnected registry regularly and there is no change between two synchronizations.

This case is presented hereafter:
https://d