Back to index

4.12.0-0.okd-2023-03-05-022504

Jump to: Complete Features | Incomplete Features | Complete Epics | Incomplete Epics | Other Complete | Other Incomplete |

Changes from 4.11.0-0.okd-2023-12-20-003021

Note: this page shows the Feature-Based Change Log for a release

Complete Features

These features were completed when this image was assembled

1. Proposed title of this feature request
Add runbook_url to alerts in the OCP UI

2. What is the nature and description of the request?
If an alert includes a runbook_url label, then it should appear in the UI for the alert as a link.

3. Why does the customer need this? (List the business requirements here)
Customer can easily reach the alert runbook and be able to address their issues.

4. List any affected packages or components.

Epic Goal

  • Make it possible to disable the console operator at install time, while still having a supported+upgradeable cluster.

Why is this important?

  • It's possible to disable console itself using spec.managementState in the console operator config. There is no way to remove the console operator, though. For clusters where an admin wants to completely remove console, we should give the option to disable the console operator as well.

Scenarios

  1. I'm an administrator who wants to minimize my OpenShift cluster footprint and who does not want the console installed on my cluster

Acceptance Criteria

  • It is possible at install time to opt-out of having the console operator installed. Once the cluster comes up, the console operator is not running.

Dependencies (internal and external)

  1. Composable cluster installation

Previous Work (Optional):

  1. https://docs.google.com/document/d/1srswUYYHIbKT5PAC5ZuVos9T2rBnf7k0F1WV2zKUTrA/edit#heading=h.mduog8qznwz
  2. https://docs.google.com/presentation/d/1U2zYAyrNGBooGBuyQME8Xn905RvOPbVv3XFw3stddZw/edit#slide=id.g10555cc0639_0_7

Open questions::

  1. The console operator manages the downloads deployment as well. Do we disable the downloads deployment? Long term we want to move to CLI manager: https://github.com/openshift/enhancements/blob/6ae78842d4a87593c63274e02ac7a33cc7f296c3/enhancements/oc/cli-manager.md

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

In the console-operator repo we need to add `capability.openshift.io/console` annotation to all the manifests that the operator either contains creates on the fly.

 

Manifests are currently present in /bindata and /manifest directories.

 

Here is example of the insights-operator change.

Here is the overall enhancement doc.

 

Feature Overview
Provide CSI drivers to replace all the intree cloud provider drivers we currently have. These drivers will probably be released as tech preview versions first before being promoted to GA.

Goals

  • Framework for rapid creation of CSI drivers for our cloud providers
  • CSI driver for AWS EBS
  • CSI driver for AWS EFS
  • CSI driver for GCP
  • CSI driver for Azure
  • CSI driver for VMware vSphere
  • CSI Driver for Azure Stack
  • CSI Driver for Alicloud
  • CSI Driver for IBM Cloud

Requirements

Requirement Notes isMvp?
Framework for CSI driver  TBD Yes
Drivers should be available to install both in disconnected and connected mode   Yes
Drivers should upgrade from release to release without any impact   Yes
Drivers should be installable via CVO (when in-tree plugin exists)    

Out of Scope

This work will only cover the drivers themselves, it will not include

  • enhancements to the CSI API framework
  • the migration to said drivers from the the intree drivers
  • work for non-cloud provider storage drivers (FC-SAN, iSCSI) being converted to CSI drivers

Background, and strategic fit
In a future Kubernetes release (currently 1.21) intree cloud provider drivers will be deprecated and replaced with CSI equivalents, we need the drivers created so that we continue to support the ecosystems in an appropriate way.

Assumptions

  • Storage SIG won't move out the changeover to a later Kubernetes release

Customer Considerations
Customers will need to be able to use the storage they want.

Documentation Considerations

  • Target audience: cluster admins
  • Updated content: update storage docs to show how to use these drivers (also better expose the capabilities)

This Epic is to track the GA of this feature

Goal

  • Make available the Google Cloud File Service via a CSI driver, it is desirable that this implementation has dynamic provisioning
  • Without GCP filestore support, we are limited to block / RWO only (GCP PD 4.8 GA)
  • Align with what we support on other major public cloud providers.

Why is this important?

  • There is a know storage gap with google cloud where only block is supported
  • More customers deploying on GCE and asking for file / RWX storage.

Scenarios

  1. Install the CSI driver
  2. Remove the CSI Driver
  3. Dynamically provision a CSI Google File PV*
  4. Utilise a Google File PV
  5. Assess optional features such as resize & snapshot

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Customers::

  • Telefonica Spain
  • Deutsche Bank

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

As an OCP user, I want images for GCP Filestore CSI Driver and Operator, so that I can install them on my cluster and utilize GCP Filestore shares.

Epic Goal

  • Enable the migration from a storage intree driver to a CSI based driver with minimal impact to the end user, applications and cluster
  • These migrations would include, but are not limited to:
    • CSI driver for AWS EBS
    • CSI driver for GCP
    • CSI driver for Azure (file and disk)
    • CSI driver for VMware vSphere

Why is this important?

  • OpenShift needs to maintain it's ability to enable PVCs and PVs of the main storage types
  • CSI Migration is getting close to GA, we need to have the feature fully tested and enabled in OpenShift
  • Upstream intree drivers are being deprecated to make way for the CSI drivers prior to intree driver removal

Scenarios

  1. User initiated move to from intree to CSI driver
  2. Upgrade initiated move from intree to CSI driver
  3. Upgrade from EUS to EUS

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

On new installations, we should make the StorageClass created by the CSI operator the default one. 

However, we shouldn't do that on an upgrade scenario. The main reason is that users might have set  a different quota on the CSI driver Storage Class.

Exit criteria:

  • New clusters get the CSI Storage Class as the default one.
  • Existing clusters don't get their default Storage Classes changed.

This Epic tracks the GA of this feature

Epic Goal

Why is this important?

  • OpenShift needs to maintain it's ability to enable PVCs and PVs of the main storage types
  • CSI Migration is getting close to GA, we need to have the feature fully tested and enabled in OpenShift
  • Upstream intree drivers are being deprecated to make way for the CSI drivers prior to intree driver removal

Scenarios

  1. User initiated move to from intree to CSI driver
  2. Upgrade initiated move from intree to CSI driver
  3. Upgrade from EUS to EUS

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

On new installations, we should make the StorageClass created by the CSI operator the default one. 

However, we shouldn't do that on an upgrade scenario. The main reason is that users might have set  a different quota on the CSI driver Storage Class.

Exit criteria:

  • New clusters get the CSI Storage Class as the default one.
  • Existing clusters don't get their default Storage Classes changed.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Rebase OpenShift components to k8s v1.24

Why is this important?

  • Rebasing ensures components work with the upcoming release of Kubernetes
  • Address tech debt related to upstream deprecations and removals.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. k8s 1.24 release

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview

  • As an infrastructure owner, I want a repeatable method to quickly deploy the initial OpenShift cluster.
  • As an infrastructure owner, I want to install the first (management, hub, “cluster 0”) cluster to manage other (standalone, hub, spoke, hub of hubs) clusters.

Goals

  • Enable customers and partners to successfully deploy a single “first” cluster in disconnected, on-premises settings

Requirements

4.11 MVP Requirements

  • Customers and partners needs to be able to download the installer
  • Enable customers and partners to deploy a single “first” cluster (cluster 0) using single node, compact, or highly available topologies in disconnected, on-premises settings
  • Installer must support advanced network settings such as static IP assignments, VLANs and NIC bonding for on-premises metal use cases, as well as DHCP and PXE provisioning environments.
  • Installer needs to support automation, including integration with third-party deployment tools, as well as user-driven deployments.
  • In the MVP automation has higher priority than interactive, user-driven deployments.
  • For bare metal deployments, we cannot assume that users will provide us the credentials to manage hosts via their BMCs.
  • Installer should prioritize support for platforms None, baremetal, and VMware.
  • The installer will focus on a single version of OpenShift, and a different build artifact will be produced for each different version.
  • The installer must not depend on a connected registry; however, the installer can optionally use a previously mirrored registry within the disconnected environment.

Use Cases

  • As a Telco partner engineer (Site Engineer, Specialist, Field Engineer), I want to deploy an OpenShift cluster in production with limited or no additional hardware and don’t intend to deploy more OpenShift clusters [Isolated edge experience].
  • As a Enterprise infrastructure owner, I want to manage the lifecycle of multiple clusters in 1 or more sites by first installing the first  (management, hub, “cluster 0”) cluster to manage other (standalone, hub, spoke, hub of hubs) clusters [Cluster before your cluster].
  • As a Partner, I want to package OpenShift for large scale and/or distributed topology with my own software and/or hardware solution.
  • As a large enterprise customer or Service Provider, I want to install a “HyperShift Tugboat” OpenShift cluster in order to offer a hosted OpenShift control plane at scale to my consumers (DevOps Engineers, tenants) that allows for fleet-level provisioning for low CAPEX and OPEX, much like AKS or GKE [Hypershift].
  • As a new, novice to intermediate user (Enterprise Admin/Consumer, Telco Partner integrator, RH Solution Architect), I want to quickly deploy a small OpenShift cluster for Poc/Demo/Research purposes.

Questions to answer…

  •  

Out of Scope

Out of scope use cases (that are part of the Kubeframe/factory project):

  • As a Partner (OEMs, ISVs), I want to install and pre-configure OpenShift with my hardware/software in my disconnected factory, while allowing further (minimal) reconfiguration of a subset of capabilities later at a different site by different set of users (end customer) [Embedded OpenShift].
  • As an Infrastructure Admin at an Enterprise customer with multiple remote sites, I want to pre-provision OpenShift centrally prior to shipping and activating the clusters in remote sites.

Background, and strategic fit

  • This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

  1. The user has only access to the target nodes that will form the cluster and will boot them with the image presented locally via a USB stick. This scenario is common in sites with restricted access such as government infra where only users with security clearance can interact with the installation, where software is allowed to enter in the premises (in a USB, DVD, SD card, etc.) but never allowed to come back out. Users can't enter supporting devices such as laptops or phones.
  2. The user has access to the target nodes remotely to their BMCs (e.g. iDrac, iLo) and can map an image as virtual media from their computer. This scenario is common in data centers where the customer provides network access to the BMCs of the target nodes.
  3. We cannot assume that we will have access to a computer to run an installer or installer helper software.

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

 

References

 

 

Epic Goal

As an OpenShift infrastructure owner, I want to deploy a cluster zero with RHACM or MCE and have the required components installed when the installation is completed

Why is this important?

BILLI makes it easier to deploy a cluster zero. BILLI users know at installation time what the purpose of their cluster is when they plan the installation. Day-2 steps are necessary to install operators and users, especially when automating installations, want to finish the installation flow when their required components are installed.

Acceptance Criteria

  • A user can provide MCE manifests and have it installed without additional manual steps after the installation is completed
  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a customer, I want to be able to:

  • Install MCE with the agent-installer

so that I can achieve

  • create an MCE hub with my openshift install

Acceptance Criteria:

Description of criteria:

  • Upstream documentation including examples of the extra manifests needed
  • Unit tests that include MCE extra manifests
  • Ability to install MCE using agent-installer is tested
  • Point 3

(optional) Out of Scope:

We are only allowing the user to provide extra manifests to install MCE at this time. We are not adding an option to "install mce" on the command line (or UI)

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

User Story:

As a customer, I want to be able to:

  • Install MCE with the agent-installer

so that I can achieve

  • create an MCE hub with my openshift install

Acceptance Criteria:

Description of criteria:

  • Upstream documentation including examples of the extra manifests needed
  • Unit tests that include MCE extra manifests
  • Ability to install MCE using agent-installer is tested
  • Point 3

(optional) Out of Scope:

We are only allowing the user to provide extra manifests to install MCE at this time. We are not adding an option to "install mce" on the command line (or UI)

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Epic Goal

As a OpenShift infrastructure owner, I want to deploy OpenShift clusters with dual-stack IPv4/IPv6

As a OpenShift infrastructure owner, I want to deploy OpenShift clusters with single-stack IPv6

Why is this important?

IPv6 and dual-stack clusters are requested often by customers, especially from Telco customers. Working with dual-stack clusters is a requirement for many but also a transition into a single-stack IPv6 clusters, which for some of our users is the final destination.

Acceptance Criteria

  • Agent-based installer can deploy IPv6 clusters
  • Agent-based installer can deploy dual-stack clusters
  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Previous Work

Karim's work proving how agent-based can deploy IPv6: IPv6 deploy with agent based installer]

Done Checklist * CI - CI is running, tests are automated and merged.

  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>|

For dual-stack installations the agent-cluster-install.yaml must have both an IPv4 and IPv6 subnet in the networkking.MachineNetwork or assisted-service will throw an error. This field is in InstallConfig but it must be added to agent-cluster-install in its Generate().

For IPv4 and IPv6 installs, setting up the MachineNetwork is not needed but it also does not cause problems if its set, so it should be fine to set it all times.

Set the ClusterDeployment CRD to deploy OpenShift in FIPS mode and make sure that after deployment the cluster is set in that mode

In order to install FIPS compliant clusters, we need to make sure that installconfig + agentoconfig based deployments take into account the FIPS config in installconfig.

This task is about passing the config to agentclusterinstall so it makes it into the iso. Once there, AGENT-374 will give it to assisted service

Epic Goal

  • Rebase cluster autoscaler on top of Kubernetes 1.25

Why is this important?

  • Need to pick up latest upstream changes

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story

As a user I would like to see all the events that the autoscaler creates, even duplicates. Having the CAO set this flag will allow me to continue to see these events.

Background

We have carried a patch for the autoscaler that would enable the duplication of events. This patch can now be dropped because the upstream added a flag for this behavior in https://github.com/kubernetes/autoscaler/pull/4921

Steps

  • add the --record-duplicated-events flag to all autoscaler deployments from the CAO

Stakeholders

  • openshift eng

Definition of Done

  • autoscaler continues to work as expected and produces events for everything
  • Docs
  • this does not require documentation as it preserves existing behavior and provides no interface for user interaction
  • Testing
  • current tests should continue to pass

Feature Overview

Add GA support for deploying OpenShift to IBM Public Cloud

Goals

Complete the existing gaps to make OpenShift on IBM Cloud VPC (Next Gen2) General Available

Requirements

Optional requirements

  • OpenShift can be deployed using Mint mode and STS for cloud provider credentials (future release, tbd)
  • OpenShift can be deployed in disconnected mode https://issues.redhat.com/browse/SPLAT-737)
  • OpenShift on IBM Cloud supports User Provisioned Infrastructure (UPI) deployment method (future release, 4.14?)

Epic Goal

  • Enable installation of private clusters on IBM Cloud. This epic will track associated work.

Why is this important?

  • This is required MVP functionality to achieve GA.

Scenarios

  1. Install a private cluster on IBM Cloud.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Background and Goal

Currently in OpenShift we do not support distributing hotfix packages to cluster nodes. In time-sensitive situations, a RHEL hotfix package can be the quickest route to resolving an issue. 

Acceptance Criteria

  1. Under guidance from Red Hat CEE, customers can deploy RHEL hotfix packages to MachineConfigPools.
  2. Customers can easily remove the hotfix when the underlying RHCOS image incorporates the fix.

Before we ship OCP CoreOS layering in https://issues.redhat.com/browse/MCO-165 we need to switch the format of what is currently `machine-os-content` to be the new base image.

The overall plan is:

  • Publish the new base image as `rhel-coreos-8` in the release image
  • Also publish the new extensions container (https://github.com/openshift/os/pull/763) as `rhel-coreos-8-extensions`
  • Teach the MCO to use this without also involving layering/build controller
  • Delete old `machine-os-content`

As a OCP CoreOS layering developer, having telemetry data about number of cluster using osImageURL will help understand how broadly this feature is getting used and improve accordingly.

Acceptance Criteria:

  • Cluster using Custom osImageURL is available via telemetry

After https://github.com/openshift/os/pull/763 is in the release image, teach the MCO how to use it. This is basically:

  • Schedule the extensions container as a kubernetes service (just serves a yum repo via http)
  • Change the MCD to write a file into `/etc/yum.repos.d/machine-config-extensions.repo` that consumes it instead of what it does now in pulling RPMs from the mounted container filesystem

Goal: Control plane nodes in the cluster can be scaled up or down, lost and recovered, with no more importance or special procedure than that of a data plane node.

Problem: There is a lengthy special procedure to recover from a failed control plane node (or majority of nodes) and to add new control plane nodes.

Why is this important: Increased operational simplicity and scale flexibility of the cluster’s control plane deployment.

 

See slack working group: #wg-ctrl-plane-resize

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Background

We recently finished working on the CPMS main controller. We now need a way to generate the CPMS object for a cluster based on its current state.

We want a new controller that based on the cluster state generates a CPMS object and creates it against the API of the cluster. The CPMS object that will be created will have the `.spec.state` field set to `Inactive`, meaning that the main CPMS controller won't do any modification to the cluster based on that configuration until the user has reviewed/modified it and activated it by setting it to `Active`.

To catch changes in the cluster state the CPMS generator controller will keep watching the state and perform CPMS object regenerations checking if the object needs updating. If that's the case the generator will delete and recreate the CPMS object. This will constantly happen while the CPMS state is set to `Inactive`. As soon as the user activates it (sets `.spec.state` to `Active`). The generator controller will stop processing and its logic will become a no-op.

For now the generator controller will only be able to generate config for AWS clusters.

We should make sure to include envtest based testing to give us confidence when making changes to the code in the future.

Steps

  • Add the CPMS generator controller logic
  • Write up a suite of tests to validate the behaviour of the controller

Stakeholders

  • Cluster Infra
  • Service Delivery

Definition of Done

  • Test coverage for the tool is >80%
  • Docs
  • N/A
  • Testing
  • N/A

 

Why?

  • Decouple control and data plane. 
    • Customers do not pay Red Hat more to run HyperShift control planes and supporting infrastructure than Standalone control planes and supporting infrastructure.
  • Improve security
    • Shift credentials out of cluster that support the operation of core platform vs workload
  • Improve cost
    • Allow a user to toggle what they don’t need.
    • Ensure a smooth path to scale to 0 workers and upgrade with 0 workers.

 

Assumption

  • A customer will be able to associate a cluster as “Infrastructure only”
  • E.g. one option: management cluster has role=master, and role=infra nodes only, control planes are packed on role=infra nodes
  • OR the entire cluster is labeled infrastructure , and node roles are ignored.
  • Anything that runs on a master node by default in Standalone that is present in HyperShift MUST be hosted and not run on a customer worker node.

 

 

Doc: https://docs.google.com/document/d/1sXCaRt3PE0iFmq7ei0Yb1svqzY9bygR5IprjgioRkjc/edit 

Overview 

Customers do not pay Red Hat more to run HyperShift control planes and supporting infrastructure than Standalone control planes and supporting infrastructure.

Assumption

  • A customer will be able to associate a cluster as “Infrastructure only”
  • E.g. one option: management cluster has role=master, and role=infra nodes only, control planes are packed on role=infra nodes
  • OR the entire cluster is labeled infrastructure, and node roles are ignored.
  • Anything that runs on a master node by default in Standalone that is present in HyperShift MUST be hosted and not run on a customer worker node.

DoD 

cluster-snapshot-controller-operator is running on the CP. 

More information here: https://docs.google.com/document/d/1sXCaRt3PE0iFmq7ei0Yb1svqzY9bygR5IprjgioRkjc/edit 

As HyperShift Cluster Instance Admin, I want to run cluster-csi-snapshot-controller-operator in the management cluster, so the guest cluster runs just my applications.

  • Add a new cmdline option for the guest cluster kubeconfig file location
  • Parse both kubeconfigs:
    • One from projected service account, which leads to the management cluster.
    • Second from the new cmdline option introduced above. This one leads to the guest cluster.
  • Move creation of manifests/08_webhook_service.yaml from CVO to the operator - it needs to be created in the management cluster.
  • Tag manifests of objects that should not be deployed by CVO in HyperShift by
  • Only on HyperShift:
    • When interacting with Kubernetes API, carefully choose the right kubeconfig to watch / create / update objects in the right cluster.
    • Replace namespaces in all Deployments and other objects that are created in the management cluster. They must be created in the same namespace as the operator.
    • Don’t create operand’s PodDisruptionBudget?
    • Update ValidationWebhookConfiguration to point directly to URL exposed by manifests/08_webhook_service.yaml instead of a Service. The Service is not available in the guest cluster.
    • Pass only the guest kubeconfig to the operands (both the webhook and csi-snapshot-controller).
    • Update unit tests to handle two kube clients.

Exit criteria:

  • cluster-csi-snapshot-controller-operator runs in the management cluster in HyperShift
  • csi-snapshot-controller runs in the management cluster in HyperShift
  • It is possible to take & restore volume snapshot in the guest cluster.
  • No regressions in standalone OCP.

As OpenShift developer I want cluster-csi-snapshot-controller-operator to use existing controllers in library-go, so I don’t need to maintain yet another code that does the same thing as library-go.

  • Check and remove manifests/03_configmap.yaml, it does not seem to be useful.
  • Check and remove manifests/03_service.yaml, it does not seem to be useful (at least now).
  • Use DeploymentController from library-go to sync Deployments.
  • Get rid of common/ package? It does not seem to be useful.
  • Use StaticResourceController for static content, including the snapshot CRDs.

Note: if this refactoring introduces any new conditions, we must make sure that 4.11 snapshot controller clears them to support downgrade! This will need 4.11 BZ + z-stream update!

Similarly, if some conditions become obsolete / not managed by any controller, they must be cleared by 4.12 operator.

Exit criteria:

  • The operator code is smaller.
  • No regressions in standalone OCP.
  • Upgrade/downgrade from/to standalone OCP 4.11 works.

Epic Goal

  • To improve debug-ability of ovn-k in hypershift
  • To verify the stability of of ovn-k in hypershift
  • To introduce a EgressIP reach-ability check that will work in hypershift

Why is this important?

  • ovn-k is supposed to be GA in 4.12. We need to make sure it is stable, we know the limitations and we are able to debug it similar to the self hosted cluster.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated

Dependencies (internal and external)

  1. This will need consultation with the people working on HyperShift

Previous Work (Optional):

  1. https://issues.redhat.com/browse/SDN-2589

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Overview 

Customers do not pay Red Hat more to run HyperShift control planes and supporting infrastructure than Standalone control planes and supporting infrastructure.

Assumption

  • A customer will be able to associate a cluster as “Infrastructure only”
  • E.g. one option: management cluster has role=master, and role=infra nodes only, control planes are packed on role=infra nodes
  • OR the entire cluster is labeled infrastructure, and node roles are ignored.
  • Anything that runs on a master node by default in Standalone that is present in HyperShift MUST be hosted and not run on a customer worker node.

DoD 

Run cluster-storage-operator (CSO) + AWS EBS CSI driver operator + AWS EBS CSI driver control-plane Pods in the management cluster, run the driver DaemonSet in the hosted cluster.

More information here: https://docs.google.com/document/d/1sXCaRt3PE0iFmq7ei0Yb1svqzY9bygR5IprjgioRkjc/edit 

 

As HyperShift Cluster Instance Admin, I want to run AWS EBS CSI driver operator + control plane of the CSI driver in the management cluster, so the guest cluster runs just my applications.

  • Add a new cmdline option for the guest cluster kubeconfig file location
  • Parse both kubeconfigs:
    • One from projected service account, which leads to the management cluster.
    • Second from the new cmdline option introduced above. This one leads to the guest cluster.
  • Only on HyperShift:
    • When interacting with Kubernetes API, carefully choose the right kubeconfig to watch / create / update objects in the right cluster.
    • Replace namespaces in all Deployments and other objects that are created in the management cluster. They must be created in the same namespace as the operator.
  •  
  •  
    • Pass only the guest kubeconfig to the operand (control-plane Deployment of the CSI driver).

Exit criteria:

  • Control plane Deployment of AWS EBS CSI driver runs in the management cluster in HyperShift.
  • Storage works in the guest cluster.
  • No regressions in standalone OCP.

As OCP support engineer I want the same guest cluster storage-related objects in output of "hypershift dump cluster --dump-guest-cluster" as in "oc adm must-gather ", so I can debug storage issues easily.

 

must-gather collects: storageclasses persistentvolumes volumeattachments csidrivers csinodes volumesnapshotclasses volumesnapshotcontents

hypershift collects none of this, the relevant code is here: https://github.com/openshift/hypershift/blob/bcfade6676f3c344b48144de9e7a36f9b40d3330/cmd/cluster/core/dump.go#L276

 

Exit criteria:

  • verify that hypershift dump cluster --dump-guest-cluster has storage objects from the guest cluster.

As HyperShift Cluster Instance Admin, I want to run cluster-storage-operator (CSO) in the management cluster, so the guest cluster runs just my applications.

  • Add a new cmdline option for the guest cluster kubeconfig file location
  • Parse both kubeconfigs:
    • One from projected service account, which leads to the management cluster.
    • Second from the new cmdline option introduced above. This one leads to the guest cluster.
  • Tag manifests of objects that should not be deployed by CVO in HyperShift
  • Only on HyperShift:
    • When interacting with Kubernetes API, carefully choose the right kubeconfig to watch / create / update objects in the right cluster.
    • Replace namespaces in all Deployments and other objects that are created in the management cluster. They must be created in the same namespace as the operator.
    • Pass only the guest kubeconfig to the operands (AWS EBS CSI driver operator).

Exit criteria:

  • CSO and AWS EBS CSI driver operator runs in the management cluster in HyperShift
  • Storage works in the guest cluster.
  • No regressions in standalone OCP.

Feature Overview

  • Support deploying OCP in “GCP Service Project” while networks are defined in “GCP Host Project”. 
  • Enable OpenShift IPI Installer to deploy OCP in “GCP Service Project” while networks are defined in “GCP Host Project”
  • “GCP Service Project” is from where the OpenShift installer is fired. 
  • “GCP host project” is the target project where the deployment of the OCP machines are done. 
  • Customer using shared VPC and have a distributed network spanning across the projects. 

Goals

  • As a user, I want to be able to deploy OpenShift on Google Cloud using XPN, where networks and other resources are deployed in a shared "Host Project" while the user bootstrap the installation from a "Sevice Project" so that I can follow Google's architecture best practices 

Requirements

  • This Section:* A list of specific needs or objectives that a Feature must deliver to satisfy the Feature.. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.
Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

 

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

Epic Goal

  • Enable OpenShift IPI Installer to deploy OCP to a shared VPC in GCP.
  • The host project is where the VPC and subnets are defined. Those networks are shared to one or more service projects.
  • Objects created by the installer are created in the service project where possible. Firewall rules may be the only exception.
  • Documentation outlines the needed minimal IAM for both the host and service project.

Why is this important?

  • Shared VPC's are a feature of GCP to enable granular separation of duties for organizations that centrally manage networking but delegate other functions and separation of billing. This is used more often in larger organizations where separate teams manage subsets of the cloud infrastructure. Enterprises that use this model would also like to create IPI clusters so that they can leverage the features of IPI. Currently organizations that use Shared VPC's must use UPI and implement the features of IPI themselves. This is repetative engineering of little value to the customer and an increased risk of drift from upstream IPI over time. As new features are built into IPI, organizations must become aware of those changes and implement them themselves instead of getting them "for free" during upgrades.

Scenarios

  1. Deploy cluster(s) into service project(s) on network(s) shared from a host project.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a developer, I want to be able to:

  • specify a project for the public and private DNS managedZones

so that I can achieve

  • enable DNS zones in alternate projects, such as the GCP XPN Host Project

Acceptance Criteria:

Description of criteria:

  • cluster-ingress-operator can parse the project and zone name from the following format
    • projects/project-id/managedZones/zoneid
  • cluster-ingress-operator continues to accept names that are not relative resource names
    • zoneid

(optional) Out of Scope:

All modifications to the openshift-installer is handled in other cards in the epic.

Engineering Details:

Feature Overview

  • Azure is sunsetting the Azure Active Directory Graph API on June 2022. The OpenShift installer and the in-cluster cloud-credential-operator (CCO) make use of this API. The replacement api is the Microsoft Graph API. Microsoft has not committed to providing a production-ready Golang SDK for the new Microsoft Graph API before June 2022.

Goals

  • Replace the existing AD Graph API for Azure we use for the Installer and Cluster components with the new Microsoft Authentication Library and Microsoft Graph API

Requirements

  • This Section:* A list of specific needs or objectives that a Feature must deliver to satisfy the Feature.. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.
Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

 

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

This description is based on the Google Doc by Rafael Fonseca dos Santos : https://docs.google.com/document/d/1yQt8sbknSmF_hriHyMAKPiztSoRIvntSX9i1wtObSYs

 

Microsoft is deprecating two APIs. The AD Graph API used by Installer destroy code and also used by the CCO to mint credentials. ADAL is also going EOL. ADAL is used by the installer and all cluster components that authenticate to Azure:

Azure Active Directory Authentication Library (ADAL) Retirement **  

ADAL end-of-life is December 31, 2022. While ADAL apps may continue to work, no support or security fixes will be provided past end-of-life. In addition, there are no planned ADAL releases planned prior to end-of-life for features or planned support for new platform versions. We recommend prioritizing migration to Microsoft Authentication Library (MSAL). 

Azure AD Graph API  

Azure AD Graph will continue to function until June 30, 2023. This will be three years after the initial deprecation[ announcement.|https://techcommunity.microsoft.com/t5/microsoft-entra-azure-ad-blog/update-your-applications-to-use-microsoft-authentication-library/ba-p/1257363] Based on Azure deprecation[ guidelines|https://docs.microsoft.com/en-us/lifecycle/], we reserve the right to retire Azure AD Graph at any time after June 30, 2023, without advance notice. Though we reserve the right to turn it off after June 30, 2023, we want to ensure all customers migrate off and discourage applications from taking production dependencies on Azure AD Graph. Investments in new features and functionalities will only be made in[ Microsoft Graph|https://docs.microsoft.com/en-us/graph/overview]. Going forward, we will continue to support Azure AD Graph with security-related fixes. We recommend prioritizing migration to Microsoft Graph.

https://techcommunity.microsoft.com/t5/microsoft-entra-azure-ad-blog/microsoft-entra-change-announcements-september-2022-train/ba-p/2967454

https://learn.microsoft.com/en-us/answers/questions/768833/when-is-adal-and-azure-ad-graph-reaching-end-of-li.html

Takeaways / considerations

  • The new Microsoft Authentication Library (MSAL) that we will migrate to requires a new API permission: Graph API ReadWrite.OwnedBy (relevant [slack thread|https://coreos.slack.com/archives/C68TNFWA2/p1644009342019649?thread_ts=1644008944.461989&cid=C68TNFWA2)]. The old ReadWrite.OwnedBy API permissions could be removed to test as well.
  • Mint mode was discontinued in Azure, but clusters may exist that have cluster-created service principals from before the retirement. In that case, the service principals will either need to be deleted manually or with a newer version of the installer that has support for MSAL.
  • Migration to the new API (see Migration Guide below) entails using the azidentity package. The azidentity package is intended for use with V2 versions of the azure sdk for go, an adapter is required if the SDK packages have not been upgraded to V2, which is the case for our codebase. Only recently have V2 packages become stable. See references below.
  • Furthermore, azidentity is tied to Go 1.18, which affects our ability to backport prior to 4.11 or earlier versions.
  • Another consideration for backporting is that ADAL is used by the in-tree Azure cloud provider. These legacy cloud providers are generally closed for development, so an upstream patch seems unlikely, as does carrying a patch.
  • A path forward for the Azure cloud provider must be determined. Due to the legacy cloud providers freeze mentioned prior to this, it seems that the best path forward is for the out-of-tree provider and CCM, scheduled for 4.14: OCPCLOUD-1128, but even the upstream out-of-tree provider has not migrated yet: https://github.com/kubernetes-sigs/cloud-provider-azure/issues/430
  • AD FS (Active Directory Federation Services) are not yet supported in the Azure SDK for Go: https://github.com/AzureAD/microsoft-authentication-library-for-go/issues/31. There is a very limited user base for AD FS, but exactly how many users is unknown at this moment. Switching to the new API would break these users, so the best approach known at this moment would be to advise this extremely limited number of users to maintain the last supported version of OpenShift that uses ADAL until Microsoft introduces AD FS support. We do not document support for AD FS.

 

References:

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Incomplete Features

When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release

1. Proposed title of this feature request
Show node-role.kubernetes.io in Motoring Dashboard to easily identify nodes by role

2. What is the nature and description of the request?
In Monitoring Dashboards such as /monitoring/dashboards/grafana-dashboard-k8s-resources-node, /monitoring/dashboards/grafana-dashboard-node-cluster-rsrc-use and /monitoring/dashboards/grafana-dashboard-node-rsrc-use it would helpful to add node-role.kubernetes.io to the OpenShift - Node selection and view to easily understand what role the OpenShift - Node has.

Especially in Cloud environments, it's hard to keep OpenShift - Node(s) separated and understand what the respective OpenShift - Node is doing. Showing the node-role.kubernetes.io could help here as it would allow to separate between important OpenShift - Nodes, such as Control-Plane Node, Infra or simple worker Node.

Including in addition for https://kubernetes.io/docs/reference/labels-annotations-taints/#failure-domainbetakubernetesiozone to select the different available Availability Zones (if available) would be great too, to potentially look at the data for a specific node role in a specific availability zone.

3. Why does the customer need this? (List the business requirements here)
For troubleshooting purpose and observability reason, it's often required to switch between CLI and console to understand what node-role and availability zone Administrator want to review. Having this all integrated in the Web-UI will improve observability and allow better filtering when looking into specific nodes, availability zones, etc.

4. List any affected packages or components.
OpenShift - Management Console

Epic Goal

Our dashboards should allow filtering data by node attributes.

We can facilitate this by extracting node labels from promql metrics and add an option to filter by these labels.

 

Why is this important?

From reporter: "In Monitoring Dashboards such as /monitoring/dashboards/grafana-dashboard-k8s-resources-node, /monitoring/dashboards/grafana-dashboard-node-cluster-rsrc-use and /monitoring/dashboards/grafana-dashboard-node-rsrc-use it would helpful to add node-role.kubernetes.io to the OpenShift - Node selection and view to easily understand what role the OpenShift - Node has."

> "For troubleshooting purpose and observability reason, it's often required to switch between CLI and console to understand what node-role and availability zone Administrator want to review. Having this all integrated in the Web-UI will improve observability and allow better filtering when looking into specific nodes, availability zones, etc."

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Previous Work (Optional):

  1. https://issues.redhat.com/browse/OU-91

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

 Monitoring Dashboards such as /monitoring/dashboards/grafana-dashboard-k8s-resources-node, /monitoring/dashboards/grafana-dashboard-node-cluster-rsrc-use and /monitoring/dashboards/grafana-dashboard-node-rsrc-use should have a node role filter.

This role can be extracted from a label using label_replace

I have not made any changes to the Node Exporter / USE Method / Cluster dashboard, as discussed in the parent epic: https://issues.redhat.com/browse/MON-2845?focusedCommentId=21681945&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-21681945.

OLM would have to support a mechanism like podAffinity which allows multiple architecture values to be specified which enables it to pin operators to the matching architecture worker nodes

Ref: https://github.com/openshift/enhancements/pull/1014

 

Cut a new release of the OLM API and update OLM API dependency version (go.mod) in OLM package; then
Bring the upstream changes from OLM-2674 to the downstream olm repo.

A/C:

 - New OLM API version release
 - OLM API dependency updated in OLM Project
 - OLM Subscription API changes  downstreamed
 - OLM Controller changes  downstreamed
 - Changes manually tested on Cluster Bot

Epic Goal

  • Enabling integration of single hub cluster to install both ARM and x86 spoke clusters
  • Enabling support for heterogeneous OCP clusters
  • document requirements deployment flows
  • support in disconnected environment

Why is this important?

  • clients request

Scenarios

  1. Users manage both ARM and x86 machines, we should not require to have two different hub clusters
  2. Users manage a mixed architecture clusters without requirement of all the nodes to be of the same architecture

Acceptance Criteria

  • Process is well documented
  • we are able to install in a disconnected environment

We have a set of images

  • quay.io/edge-infrastructure/assisted-installer-agent:latest
  • quay.io/edge-infrastructure/assisted-installer-controller:latest
  • quay.io/edge-infrastructure/assisted-installer:latest

that should become multiarch images. This should be done both in upstream and downstream.

As a reference, we have built internally those images as multiarch and made them available as

  • registry.redhat.io/rhai-tech-preview/assisted-installer-agent-rhel8:latest
  • registry.redhat.io/rhai-tech-preview/assisted-installer-reporter-rhel8:latest
  • registry.redhat.io/rhai-tech-preview/assisted-installer-rhel8:latest

They can be consumed by the Assisted Serivce pod via the following env

    - name: AGENT_DOCKER_IMAGE
      value: registry.redhat.io/rhai-tech-preview/assisted-installer-agent-rhel8:latest
    - name: CONTROLLER_IMAGE
      value: registry.redhat.io/rhai-tech-preview/assisted-installer-reporter-rhel8:latest
    - name: INSTALLER_IMAGE
      value: registry.redhat.io/rhai-tech-preview/assisted-installer-rhel8:latest

Feature Overview

We drive OpenShift cross-market customer success and new customer adoption with constant improvements and feature additions to the existing capabilities of our OpenShift Core Networking (SDN and Network Edge). This feature captures that natural progression of the product.

Goals

  • Feature enhancements (performance, scale, configuration, UX, ...)
  • Modernization (incorporation and productization of new technologies)

Requirements

  • Core Networking Stability
  • Core Networking Performance and Scale
  • Core Neworking Extensibility (Multus CNIs)
  • Core Networking UX (Observability)
  • Core Networking Security and Compliance

In Scope

  • Network Edge (ingress, DNS, LB)
  • SDN (CNI plugins, openshift-sdn, OVN, network policy, egressIP, egress Router, ...)
  • Networking Observability

Out of Scope

There are definitely grey areas, but in general:

  • CNV
  • Service Mesh
  • CNF

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

Goal: Provide queryable metrics and telemetry for cluster routes and sharding in an OpenShift cluster.

Problem: Today we test OpenShift performance and scale with best-guess or anecdotal evidence for the number of routes that our customers use. Best practices for a large number of routes in a cluster is to shard, however we have no visibility with regard to if and how customers are using sharding.

Why is this important? These metrics will inform our performance and scale testing, documented cluster limits, and how customers are using sharding for best practice deployments.

Dependencies (internal and external):

Prioritized epics + deliverables (in scope / not in scope):

Not in scope:

Estimate (XS, S, M, L, XL, XXL):

Previous Work:

Open questions:

Acceptance criteria:

Epic Done Checklist:

  • CI - CI Job & Automated tests: <link to CI Job & automated tests>
  • Release Enablement: <link to Feature Enablement Presentation> 
  • DEV - Upstream code and tests merged: <link to meaningful PR orf GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>
  • Notes for Done Checklist
    • Adding links to the above checklist with multiple teams contributing; select a meaningful reference for this Epic.
    • Checklist added to each Epic in the description, to be filled out as phases are completed - tracking progress towards “Done” for the Epic.

Description:

As described in the Design Doc, the following information is needed to be exported from Cluster Ingress Operator:

  • Number of routes/shard

Design 2 will be implemented as part of this story.

 

Acceptance Criteria:

  • Support for exporting the above mentioned metrics by Cluster Ingress Operator

Description:

As described in the Metrics to be sent via telemetry section of the Design Doc, the following metrics is needed to be sent from OpenShift cluster to Red Hat premises:

  • Minimum Routes per Shard
    • Recording Rule – cluster:route_metrics_controller_routes_per_shard:min  : min(route_metrics_controller_routes_per_shard)
    • Gives the minimum value of Routes per Shard.
  • Maximum Routes per Shard
    • Recording Rule – cluster:route_metrics_controller_routes_per_shard:max  : max(route_metrics_controller_routes_per_shard)
    • Gives the maximum value of Routes per Shard.
  • Average Routes per Shard
    • Recording Rule – cluster:route_metrics_controller_routes_per_shard:avg  : avg(route_metrics_controller_routes_per_shard)
    • Gives the average value of Routes per Shard.
  • Median Routes per Shard
    • Recording Rule – cluster:route_metrics_controller_routes_per_shard:median  : quantile(0.5, route_metrics_controller_routes_per_shard)
    • Gives the median value of Routes per Shard.
  • Number of Routes summed by TLS Termination type
    • Recording Rule – cluster:openshift_route_info:tls_termination:sum : sum (openshift_route_info) by (tls_termination)
    • Gives the number of Routes for each tls_termination value. The possible values for tls_termination are edge, passthrough and reencrypt. 

The metrics should be allowlisted on the cluster side.

The steps described in Sending metrics via telemetry are needed to be followed. Specifically step 5.

Depends on CFE-478.

Acceptance Criteria:

  • Support for sending the above mentioned metrics from OpenShift clusters to the Red Hat premises by allowlisting metrics on the cluster side

This is a epic bucket for all activities surrounding the creation of declarative approach to release and maintain OLM catalogs.

Epic Goal

  • Allow Operator Authors to easily change the layout of the update graph in a single location so they can version/maintain/release it via git and have more approachable controls about graph vertices than today's replaces, skips and/or skipRange taxonomy
  • Allow Operators authors to have control over channel and bundle channel membership

Why is this important?

  • The imperative catalog maintenance approach so far with opm is being moved to a declarative format (OLM-2127 and OLM-1780) moving away from bundle-level controls but the update graph properties are still attached to a bundle
  • We've received feedback from the RHT internal developer community that maintaining and reasoning about the graph in the context of a single channel is still too hard, even with visualization tools
  • making the update graph easily changeable is important to deliver on some of the promises of declarative index configuration
  • The current interface for declarative index configuration still relies on skips, skipRange and replaces to shape the graph on a per-bundle level - this is too complex at a certain point with a lot of bundles in channels, we need to something at the package level

Scenarios

  1. An Operator author wants to release a new version replacing the latest version published previously
  2. After additional post-GA testing an Operator author wants to establish a new update path to an existing released version from an older, released version
  3. After finding a bug post-GA an Operator author wants to temporarily remove a known to be problematic update path
  4. An automated system wants to push a bundle inbetween an existing update path as a result of an Operator (base) image rebuild (Freshmaker use case)
  5. A user wants to take a declarative graph definition and turn it into a graphical image for visually ensuring the graph looks like they want
  6. An Operator author wants to promote a certain bundle to an additional / different channel to indicate progress in maturity of the operator.

Acceptance Criteria

  • The declarative format has to be user readable and terse enough to make quick modifications
  • The declarative format should be machine writeable (Freshmaker)
  • The update graph is declared and modified in a text based format aligned with the declarative config
  • it has to be possible to add / removes edges at the leave of the graph (releasing/unpublishing a new version)
  • it has to be possible to add/remove new vertices between existing edges (releasing/retracting a new update path)
  • it has to be possible to add/remove new edges in between existing vertices (releasing/unpublishing a version inbetween, freshmaker user case)
  • it has to be possible to change the channel member ship of a bundle after it's published (channel promotion)
  • CI - MUST be running successfully with tests automated
  • it has to be possible to add additional metadata later to implement OLM-2087 and OLM-259 if required

Dependencies (internal and external)

  1. Declarative Index Config (OLM-2127)

Previous Work:

  1. Declarative Index Config (OLM-1780)

Related work

Open questions:

  1. What other manipulation scenarios are required?
    1. Answer: deprecation of content in the spirit of OLM-2087
    2. Answer: cross-channel update hints as described in OLM-2059 if that implementation requires it

 

When working on this Epic, it's important to keep in mind this other potentially related Epic: https://issues.redhat.com/browse/OLM-2276

 

enhance the veneer rendering to be able to read the input veneer data from stdin, via a pipe, in a manner similar to https://dev.to/napicella/linux-pipes-in-golang-2e8j

then the command could be used in a manner similar to many k8s examples like

```shell
opm alpha render-veneer semver -o yaml < infile > outfile
```

Upstream issue link: https://github.com/operator-framework/operator-registry/issues/1011

Jira Description

As an OPM maintainer, I want to downstream the PR for (OCP 4.12 ) and backport it to OCP 4.11 so that IIB will NOT be impacted by the changes when it upgrades the OPM version to use the next/future opm upstream release (v1.25.0).

Summary / Background

IIB(the downstream service that manages the indexes) uses the upstream version and if they bump the OPM version to the next/future (v1.25.0) release with this change before having the downstream images updated then: the process to manage the indexes downstream will face issues and it will impact the distributions. 

Acceptance Criteria

  • The changes in the PR are available for the releases which uses FBC -> OCP 4.11, 4.12

Definition of Ready

  • PRs merged into downstream OCP repos branches 4.11/4.12

Definition of Done

  • We checked that the downstream images are with the changes applied (i.e.: we can try to verify in the same way that we checked if the changes were in the downstream for the fix OLM-2639 )

We need to continue to maintain specific areas within storage, this is to capture that effort and track it across releases.

Goals

  • To allow OCP users and cluster admins to detect problems early and with as little interaction with Red Hat as possible.
  • When Red Hat is involved, make sure we have all the information we need from the customer, i.e. in metrics / telemetry / must-gather.
  • Reduce storage test flakiness so we can spot real bugs in our CI.

Requirements

Requirement Notes isMvp?
Telemetry   No
Certification   No
API metrics   No
     

Out of Scope

n/a

Background, and strategic fit
With the expected scale of our customer base, we want to keep load of customer tickets / BZs low

Assumptions

Customer Considerations

Documentation Considerations

  • Target audience: internal
  • Updated content: none at this time.

Notes

In progress:

  • CI flakes:
    • Configurable timeouts for e2e tests
      • Azure is slow and times out often
      • Cinder times out formatting volumes
      • AWS resize test times out

 

High prio:

  • Env. check tool for VMware - users often mis-configure permissions there and blame OpenShift. If we had a tool they could run, it might report better errors.
    • Should it be part of the installer?
    • Spike exists
  • Add / use cloud API call metrics
    • Helps customers to understand why things are slow
    • Helps build cop to understand a flake
      • With a post-install step that filters data from Prometheus that’s still running in the CI job.
    • Ideas:
      • Cloud is throttling X% of API calls longer than Y seconds
      • Attach / detach / provisioning / deletion / mount / unmount / resize takes longer than X seconds?
    • Capture metrics of operations that are stuck and won’t finish.
      • Sweep operation map from executioner???
      • Report operation metric into the highest bucket after the bucket threshold (i.e. if 10minutes is the last bucket, report an operation into this bucket after 10 minutes and don’t wait for its completion)?
      • Ask the monitoring team?
    • Include in CSI drivers too.
      • With alerts too

Unsorted

  • As the number of storage operators grows, it would be grafana board for storage operators
    • CSI driver metrics (from CSI sidecars + the driver itself  + its operator?)
    • CSI migration?
  • Get aggregated logs in cluster
    • They're rotated too soon
    • No logs from dead / restarted pods
    • No tools to combine logs from multiple pods (e.g. 3 controller managers)
  • What storage issues customers have? it was 22% of all issues.
    • Insufficient docs?
    • Probably garbage
  • Document basic storage troubleshooting for our supports
    • What logs are useful when, what log level to use
    • This has been discussed during the GSS weekly team meeting; however, it would be beneficial to have this documented.
  • Common vSphere errors, their debugging and fixing. 
  • Document sig-storage flake handling - not all failed [sig-storage] tests are ours

Epic Goal

  • Update all images that we ship with OpenShift to the latest upstream releases and libraries.
  • Exact content of what needs to be updated will be determined as new images are released upstream, which is not known at the beginning of OCP development work. We don't know what new features will be included and should be tested and documented. Especially new CSI drivers releases may bring new, currently unknown features. We expect that the amount of work will be roughly the same as in the previous releases. Of course, QE or docs can reject an update if it's too close to deadline and/or looks too big.

Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF). Trying no-feature-freeze in 4.12. We will try to do as much as we can before FF, but we're quite sure something will slip past FF as usual.

Why is this important?

  • We want to ship the latest software that contains new features and bugfixes.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

There is a new driver release 5.0.0 since the last rebase that includes snapshot support:

https://github.com/kubernetes-sigs/ibm-vpc-block-csi-driver/releases/tag/v5.0.0

Rebase the driver on v5.0.0 and update the deployments in ibm-vpc-block-csi-driver-operator.
There are no corresponding changes in ibm-vpc-node-label-updater since the last rebase.

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

This includes ibm-vpc-node-label-updater!

(Using separate cards for each driver because these updates can be more complicated)

Update all OCP and kubernetes libraries in storage operators to the appropriate version for OCP release.

This includes (but is not limited to):

  • Kubernetes:
    • client-go
    • controller-runtime
  • OCP:
    • library-go
    • openshift/api
    • openshift/client-go
    • operator-sdk

Operators:

  • aws-ebs-csi-driver-operator 
  • aws-efs-csi-driver-operator
  • azure-disk-csi-driver-operator
  • azure-file-csi-driver-operator
  • openstack-cinder-csi-driver-operator
  • gcp-pd-csi-driver-operator
  • gcp-filestore-csi-driver-operator
  • manila-csi-driver-operator
  • ovirt-csi-driver-operator
  • vmware-vsphere-csi-driver-operator
  • alibaba-disk-csi-driver-operator
  • ibm-vpc-block-csi-driver-operator
  • csi-driver-shared-resource-operator

 

  • cluster-storage-operator
  • csi-snapshot-controller-operator
  • local-storage-operator
  • vsphere-problem-detector

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The End of General support for vSphere 6.7 will be on October 15, 2022. So, vSphere 6.7 will be deprecated for 4.11.

We want to encourage vSphere customers to upgrade to vSphere 7 in OCP 4.11 since VMware is EOLing (general support) for vSphere 6.7 in Oct 2022.

We want the cluster Upgradeable=false + have a strong alert pointing to our docs / requirements.

related slack: https://coreos.slack.com/archives/CH06KMDRV/p1647541493096729

tldr: three basic claims, the rest is explanation and one example

  1. We cannot improve long term maintainability solely by fixing bugs.
  2. Teams should be asked to produce designs for improving maintainability/debugability.
  3. Specific maintenance items (or investigation of maintenance items), should be placed into planning as peer to PM requests and explicitly prioritized against them.

While bugs are an important metric, fixing bugs is different than investing in maintainability and debugability. Investing in fixing bugs will help alleviate immediate problems, but doesn't improve the ability to address future problems. You (may) get a code base with fewer bugs, but when you add a new feature, it will still be hard to debug problems and interactions. This pushes a code base towards stagnation where it gets harder and harder to add features.

One alternative is to ask teams to produce ideas for how they would improve future maintainability and debugability instead of focusing on immediate bugs. This would produce designs that make problem determination, bug resolution, and future feature additions faster over time.

I have a concrete example of one such outcome of focusing on bugs vs quality. We have resolved many bugs about communication failures with ingress by finding problems with point-to-point network communication. We have fixed the individual bugs, but have not improved the code for future debugging. In so doing, we chase many hard to diagnose problem across the stack. The alternative is to create a point-to-point network connectivity capability. this would immediately improve bug resolution and stability (detection) for kuryr, ovs, legacy sdn, network-edge, kube-apiserver, openshift-apiserver, authentication, and console. Bug fixing does not produce the same impact.

We need more investment in our future selves. Saying, "teams should reserve this" doesn't seem to be universally effective. Perhaps an approach that directly asks for designs and impacts and then follows up by placing the items directly in planning and prioritizing against PM feature requests would give teams the confidence to invest in these areas and give broad exposure to systemic problems.


Relevant links:

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • To refactor various unit test in cluster-ingress-operator to align with desire unit test standards. The unit tests are in need of various clean up to meet the standards of the network edge such as:
    • Using t.run in all unit tests for sub-test capabilities
    • Removing extraneous test cases
    • Fixing incorrect error messages

Why is this important?

  • Maintaining standards in unit tests is important for the debug-ability of our code

Scenarios

  1. ...

Acceptance Criteria

  • Unit tests generally meet our software standards

Dependencies (internal and external)

  1.  

Previous Work (Optional):

  1. For shift week, Miciah provided a handful commits https://github.com/Miciah/cluster-ingress-operator/commits/gateway-api that was the motivation to create this epic. 

Open questions::

  1. N/A

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Refactor Test_desiredLoadBalancerService to match our unit test standards, remove extraneous test cases, and make it more readable/maintainable.

Test_desiredHttpErrorCodeConfigMap contains a section that has dead code when checking for expect == nil || actual == ||. Clean this up.

Also replace Ruby-style #{} syntax for string interpolation with Go string formats.

Unit tests names should be formatted with Test_Function name, so that the scope of the function (private or Public) can be preserved.

Go 1.16 added the new embed directive to go. This embed directive lets you natively (and trivially) compile your binary with static asset files.

The current go-bindata dependency that's used in both the Ingress and DNS operator's for yaml asset compilation could be dropped in exchange for the new go embed functionality. This would reduce our dependency count, remove the need for `bindata.go` (which is version controlled and constantly updated), and make our code easier to read. This switch would also reduce the overall lines of code in our repos.

Note that this may be applicable to OCP 4.8 if and when images are built with go 1.16.

OCP/Telco Definition of Done

Epic Template descriptions and documentation.

Epic Goal

Why is this important?

  • This regression is a major performance and stability issue and it has happened once before.

Drawbacks

  • The E2E test may be complex due to trying to determine what DNS pods are responding to DNS requests. This is straightforward using the chaos plugin.

Scenarios

  • CI Testing

Acceptance Criteria

  • CI - MUST be running successfully with tests automated

Dependencies (internal and external)

  1. SDN Team

Previous Work (Optional):

  1. N/A

Open questions::

  1. Where do these E2E test go? SDN Repo? DNS Repo?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub
    Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub
    Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Enable the chaos plugin https://coredns.io/plugins/chaos/ in our CoreDNS configuration so that we can use a DNS query to easily identify what DNS pods are responding to our requests.

Epic Goal

  • Change the default value for the spec.tuningOptions.maxConnections field in the IngressController API, which configures the HAProxy maxconn setting, to 50000 (fifty thousand).

Why is this important?

  • The maxconn setting constrains the number of simultaneous connections that HAProxy accepts. Beyond this limit, the kernel queues incoming connections. 
  • Increasing maxconn enables HAProxy to queue incoming connections intelligently.  In particular, this enables HAProxy to respond to health probes promptly while queueing other connections as needed.
  • The default setting of 20000 has been in place since OpenShift 3.5 was released in April 2017 (see BZ#1405440, commit, RHBA-2017:0884). 
  • Hardware capabilities have increased over time, and the current default is too low for typical modern machine sizes. 
  • Increasing the default setting improves HAProxy's performance at an acceptable cost in the common case. 

Scenarios

  1. As a cluster administrator who is installing OpenShift on typical hardware, I want OpenShift router to be tuned appropriately to take advantage of my hardware's capabilities.

Acceptance Criteria

  • CI is passing. 
  • The new default setting is clearly documented. 
  • A release note informs cluster administrators of the change to the default setting. 

Dependencies (internal and external)

  1. None.

Previous Work (Optional):

  1. The  haproxy-max-connections-tuning enhancement made maxconn configurable without changing the default.  The enhancement document details the tradeoffs in terms of memory for various settings of nbthreads and maxconn with various numbers of routes. 

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

 

Feature Overview

  • This Section:* High-Level description of the feature ie: Executive Summary
  • Note: A Feature is a capability or a well defined set of functionality that delivers business value. Features can include additions or changes to existing functionality. Features can easily span multiple teams, and multiple releases.

 

Goals

  • This Section:* Provide high-level goal statement, providing user context and expected user outcome(s) for this feature

 

Requirements

  • This Section:* A list of specific needs or objectives that a Feature must deliver to satisfy the Feature.. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.

 

Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

 

(Optional) Use Cases

This Section: 

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

 

Questions to answer…

  • ...

 

Out of Scope

 

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

 

Assumptions

  • ...

 

Customer Considerations

  • ...

 

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?  
  • New Content, Updates to existing content,  Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

As a console user I want to have option to:

  • Restart Deployment
  • Retry latest DeploymentConfig if it failed

 

For Deployments we will add the 'Restart rollout' action button. This action will PATCH the Deployment object's 'spec.template.metadata.annotations' block, by adding 'openshift.io/restartedAt: <actual-timestamp>' annotation. This will restart the deployment, by creating a new ReplicaSet.

  • action is disabled if:
    • Deployment is paused

 

For DeploymentConfig we will add 'Retry rollout' action button.  This action will PATCH the latest revision of ReplicationController object's 'metadata.annotations' block by setting 'openshift.io/deployment/phase: "New"' and removing openshift.io/deployment.cancelled and openshift.io/deployment.status-reason.

  • action is enabled if:
    • latest revision of the ReplicationController resource is in Failed phase
  • action is disabled if:
    • latest revision of the ReplicationController resource is in Complete phase
    • DeploymentConfig does not have any rollouts
    • DeploymentConfigs is paused

 

Acceptance Criteria:

  • Add the 'Restart rollout' action button for the Deployment resource to both action menu and kebab menu
  • Add the 'Retry rollout' action button for the DeploymentConfig resource to both action menu and kebab menu

 

BACKGROUND:

OpenShift console will be updated to allow rollout restart deployment from the console itself.

Currently, from the OpenShift console, for the resource “deploymentconfigs” we can only start and pause the rollout, and for the resource “deployment” we can only resume the rollout. None of the resources (deployment & deployment config) has this option to restart the rollout. So, that is the reason why the customer wants this functionality to perform the same action from the CLI as well as the OpenShift console.

The customer wants developers who are not fluent with the oc tool and terminal utilities, can use the console instead of the terminal to restart deployment, just like we use to do it through CLI using the command “oc rollout restart deploy/<deployment-name>“.
Usually when developers change the config map that deployment uses they have to restart pods. Currently, the developers have to use the oc rollout restart deployment command. The customer wants the functionality to get this button/menu to perform the same action from the console as well.

Design
Doc: https://docs.google.com/document/d/1i-jGtQGaA0OI4CYh8DH5BBIVbocIu_dxNt3vwWmPZdw/edit

When OCP is performing cluster upgrade user should be notified about this fact.

There are two possibilities how to surface the cluster upgrade to the users:

  • Display a console notification throughout OCP web UI saying that the cluster is currently under upgrade.
  • Global notification throughout OCP web UI saying that the cluster is currently under upgrade.
  • Have an alert firing for all the users of OCP stating the cluster is undergoing an upgrade. 

 

AC:

  • Console-operator will create a ConsoleNotification CR when the cluster is being upgraded. Once the upgrade is done console-operator will remote that CR. These are the three statuses based on which we are determining if the cluster is being upgraded.
  • Add unit tests

 

Note: We need to decide if we want to distinguish this particular notification by a different color? ccing Ali Mobrem 

 

Created from: https://issues.redhat.com/browse/RFE-3024

As a developer, I want to make status.HostIP for Pods visible in the Pod details page of the OCP Web Console. Currently there is no way to view the node IP for a Pod in the OpenShift Web Console.  When viewing a Pod in the console, the field status.HostIP is not visible.

 

Acceptance criteria:

  • Make pod's HostIP field visible in the pod details page, similarly to PodIP field

Feature Overview

This feature is about reducing the complexity of the CAPI install system architecture which is needed for using the upstream Cluster API (CAPI) in place of the current implementation of the Machine API for standalone Openshift
prerequisite work Goals
Complete the design of the Cluster API (CAPI) architecture and build the core operator logic needed for Phase-1, incorporating the assets from different repositories to simplify asset management.

Background, and strategic fit

  • Initially CAPI did not meet the requirements for cluster/machine management that OCP had the project has moved on, and CAPI is a better fit now and also has better community involvement.
  • CAPI has much better community interaction than MAPI.
  • Other projects are considering using CAPI and it would be cleaner to have one solution
  • Long term it will allow us to add new features more easily in one place vs. doing this in multiple places.

Acceptance Criteria

There must be no negative effect to customers/users of the MAPI, this API must continue to be accessible to them though how it is implemented "under the covers" and if that implementation leverages CAPI is open

Epic Goal

  • Rework the current flow for the installation of Cluster API components in OpenShift by addressing some of the criticalities of the current implementation

Why is this important?

  • We need to reduce complexity of the CAPI install system architecture
  • We need to improve the development, stability and maintainability of Standalone Cluster API on OpenShift
  • We need to make Cluster 

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  •  

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1.  

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story

As an OpenShift engineer I want the CAPI Providers repositories to use the new generator tool so that they can independently generate CAPI Provider transport ConfigMaps

Background

Once the new CAPI manifests generator tool is ready, we want to make use of that directly from the CAPI Providers repositories so we can avoid storing the generated configuration centrally and independently apply that based on the running platform.

Steps

  • Install new CAPI manifest generator as a go `tool` to all the CAPI provider repositories
  • Setup a make target under the `/openshift/Makefile` to invoke the generator. Make it output the manifests under `/openshift/manifests`
  • Make sure `/openshift/manifests` is mapped to `/manifests` in the openshift/Dockerfile, so that the files are later picked up by CVO
  • Make sure the manifest generation works by triggering a manual generation
  • Check in the newly generated transport ConfigMap + Credential Requests (to let them be applied by CVO)

Stakeholders

  • <Who is interested in this/where did they request this>

Definition of Done

  • CAPI manifest generator tool is installed 
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

User Story

As an OpenShift engineer I want to be able to install the new manifest generation tool as a standalone tool in my CAPI Infra Provider repo to generate the CAPI Provider transport ConfigMap(s)

Background

Renaming of the CAPI Asset/Manifest generator from assets (generator) to manifest-gen, as it won't need to generate go embeddable assets anymore, but only manifests that will be referenced and applied by CVO

Steps

  • Removal of the `/assets` folder - we are moving away from embedded assets in favour of transport ConfigMaps
  • Renaming of the CAPI Asset/Manifest generator from assets (generator) to manifest-gen, as it won't need to generate go embeddable assets anymore, but only manifests that will be referenced and applied by CVO
  • Removal of the cluster-api-operator specific code from the assets generator - we are moving away from using the cluster-api-operator
  • Remove the assets generator specific references from the Makefiles/hack scripts - they won't be needed anymore as the tool will be referenced only from other repositories 
  • Adapting the new generator tool to be a standalone go module that can be installed as a tool in other repositories to generate manifests
  • Make sure to add CRDs and Conversion,Validation (also Mutation?) Webhooks to the generated transport ConfigMaps

Stakeholders

  • Cluster Infrastructure Team
  • ShiftStack Team (CAPO)

Definition of Done

  • Working and standalone installable generation tool

We are deprecating DeploymentConfig with Deployment in OpenShift because Deployment is the recommended way to deploy applications. Deployment is a more flexible and powerful resource that allows you to control the deployment of your applications more precisely. DeploymentConfig is a legacy resource that is no longer necessary. We will continue to support DeploymentConfig for a period of time, but we encourage you to migrate to Deployment as soon as possible.

Here are some of the benefits of using Deployment over DeploymentConfig:

  • Deployment is more flexible. You can specify the number of replicas to deploy, the image to deploy, and the environment variables to use.
  • Deployment is more powerful. You can use Deployment to roll out changes to your applications in a controlled manner.
  • Deployment is the recommended way to deploy applications. OpenShift will continue to improve Deployment and make it the best way to deploy applications.

We hope that you will migrate to Deployment as soon as possible. If you have any questions, please contact us.

Epic Goal

  • Make it possible to disable the DeploymentConfig and BuildConfig APIs, and associated controller logic.

 

Given the nature of this component (embedded into a shared api server and controller manager), this will likely require adding logic within those shared components to not enable specific bits of function when the build or DeploymentConfig capability is disabled, and watching the enabled capability set so that the components enable the functionality when necessary.

I would not expect us to split the components out of their existing location as part of this, though that is theoretically an option.

 

Why is this important?

  • Reduces resource footprint and bug surface area for clusters that do not need to utilize the DeploymentConfig or BuildConfig functionality, such as SNO and OKE.

Acceptance Criteria (Mandatory)

  • CI - MUST be running successfully with tests automated (we have an existing CI job that runs a cluster with all optional capabilities disabled.  Passing that job will require disabling certain deploymentconfig tests when the cap is disabled)
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. Cluster install capabilities

Previous Work (Optional):

  1. The optional cap architecture and guidance for adding a new capability is described here: https://github.com/openshift/enhancements/blob/master/enhancements/installer/component-selection.md

Open questions::

None

Done Checklist

  • Acceptance criteria are met
  • Non-functional properties of the Feature have been validated (such as performance, resource, UX, security or privacy aspects)
  • User Journey automation is delivered
  • Support and SRE teams are provided with enough skills to support the feature in production environment

Make the list of enabled/disable controllers in OAS reflect enabled/disabled capabilities.

Acceptance criteria:

  • OAS allows to specify a list of enabled/disabled APIs (e.g. watches, caches, ...)
  • OASO watches capabilities and generates the right configuration for OAS with enabled/disabled list of APIs
  • Documentation is properly updated

QE:

  • enabled/disable capabilities and validate a given API (DC, Builds, ...) is/is not managed by a cluster:
  • checking the OAS logs do/do not log entries about affected API(s)
  • DC/Builds objects are created/fail to be created

Epic Goal*

Provide a long term solution to SELinux context labeling in OCP.

 
Why is this important? (mandatory)

As of today when selinux is enabled, the PV's files are relabeled when attaching the PV to the pod, this can cause timeout when the PVs contains lot of files as well as overloading the storage backend.

https://access.redhat.com/solutions/6221251 provides few workarounds until the proper fix is implemented. Unfortunately these workaround are not perfect and we need a long term seamless optimised solution.

This feature tracks the long term solution where the PV FS will be mounted with the right selinux context thus avoiding to relabel every file.

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1. Apply new context when there is none
  2. Change context of all files/folders when changing context
  3. RWO & RWX PVs
    1. ReadWriteOncePod PVs first
    2. RWX PV in a second phase

As we are relying on mount context there should not be any relabeling (chcon) because all files / folders will inherit the context from the mount context

More on design & scenarios in the KEP  and related epic STOR-1173

Dependencies (internal and external) (mandatory)

None for the core feature

However the driver will have to set SELinuxMountSupported to true in the CSIDriverSpec to enable this feature. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - STOR
  • Documentation - STOR
  • QE - STOR
  • PX - 
  • Others -

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing -  Basic e2e automationTests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Epic Goal

Support upstream feature "SELinux relabeling using mount options (CSIDriver API change)"" in OCP as Beta, i.e. test it and have docs for it (unless it's Alpha upstream).

Summary: If Pod has defined SELinux context (e.g. it uses "resticted" SCC) and it uses ReadWriteOncePod PVC and CSI driver responsible for the volume supports this feature, kubelet + the CSI driver will mount the volume directly with the correct SELinux labels. Therefore CRI-O does not need to recursive relabel the volume and pod startup can be significantly faster. We will need a thorough documentation for this.

This upstream epic actually will be implemented by us!

Why is this important?

  • We get this upstream feature through Kubernetes rebase. We should ensure it works well in OCP and we have docs for it.

Upstream links

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. External: the feature is currently scheduled for Beta in Kubernetes 1.27, i.e. OCP 4.14, but it may change before Kubernetes 1.27 GA.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Add metrics described in the upstream KEP to telemetry, so we know how many clusters / Pod would be affected when we expose SELinux mount to all volume types.

We want:

  • volume_manager_selinux_container_errors_total + volume_manager_selinux_container_warnings_total + volume_manager_selinux_pod_context_mismatch_errors_total + volume_manager_selinux_pod_context_mismatch_warnings_total - these show user configuration error.
  • volume_manager_selinux_volume_context_mismatch_errors_total + volume_manager_selinux_volume_context_mismatch_warnings_total - these show Pods that use the same volume with different SELinux context. This will not be supported when we extend mounting with SELinux to all volume types.

Feature Overview

  • As a Cluster Administrator, I want to opt-out of certain operators at deployment time using any of the supported installation methods (UPI, IPI, Assisted Installer, Agent-based Installer) from UI (e.g. OCP Console, OCM, Assisted Installer), CLI (e.g. oc, rosa), and API.
  • As a Cluster Administrator, I want to opt-in to previously-disabled operators (at deployment time) from UI (e.g. OCP Console, OCM, Assisted Installer), CLI (e.g. oc, rosa), and API.
  • As a ROSA service administrator, I want to exclude/disable Cluster Monitoring when I deploy OpenShift with HyperShift — using any of the supported installation methods including the ROSA wizard in OCM and rosa cli — since I get cluster metrics from the control plane.  This configuration should be persisted through not only through initial deployment but also through cluster lifecycle operations like upgrades.
  • As a ROSA service administrator, I want to exclude/disable Ingress Operator when I deploy OpenShift with HyperShift — using any of the supported installation methods including the ROSA wizard in OCM and rosa cli — as I want to use my preferred load balancer (i.e. AWS load balancer).  This configuration should be persisted through not only through initial deployment but also through cluster lifecycle operations like upgrades.

Goals

  • Make it possible for customers and Red Hat teams producing OCP distributions/topologies/experiences to enable/disable some CVO components while still keeping their cluster supported.

Scenarios

  1. This feature must consider the different deployment footprints including self-managed and managed OpenShift, connected vs. disconnected (restricted and air-gapped), supported topologies (standard HA, compact cluster, SNO), etc.
  2. Enabled/disabled configuration must persist throughout cluster lifecycle including upgrades.
  3. If there's any risk/impact of data loss or service unavailability (for Day 2 operations), the System must provide guidance on what the risks are and let user decide if risk worth undertaking.

Requirements

  • This Section:* A list of specific needs or objectives that a Feature must deliver to satisfy the Feature.. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.
Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

Questions to answer…

  • ...

Out of Scope

Background, and strategic fit

This part of the overall multiple release Composable OpenShift (OCPPLAN-9638 effort), which is being delivered in multiple phases:

Phase 1 (OpenShift 4.11): OCPPLAN-7589 Provide a way with CVO to allow disabling and enabling of operators

  • CORS-1873 Installer to allow users to select OpenShift components to be included/excluded
  • OTA-555 Provide a way with CVO to allow disabling and enabling of operators
  • OLM-2415 Make the marketplace operator optional
  • SO-11 Make samples operator optional
  • METAL-162 Make cluster baremetal operator optional
  • OCPPLAN-8286 CI Job for disabled optional capabilities

Phase 2 (OpenShift 4.12): OCPPLAN-7589 Provide a way with CVO to allow disabling and enabling of operators

Phase 3 (OpenShift 4.13): OCPBU-117

  • OTA-554 Make oc aware of cluster capabilities
  • PSAP-741 Make Node Tuning Operator (including PAO controllers) optional

Phase 4 (OpenShift 4.14): OCPSTRAT-36 (formerly OCPBU-236)

  • OCPBU-352 Make Ingress Operator optional
  • CCO-186 ccoctl support for credentialing optional capabilities
  • MCO-499 MCD should manage certificates via a separate, non-MC path (formerly IR-230 Make node-ca managed by CVO)
  • CNF-5642 Make cluster autoscaler optional
  • CNF-5643 - Make machine-api operator optional
  • WRKLDS-695 - Make DeploymentConfig API + controller optional
  • CNV-16274 OpenShift Virtualization on the Red Hat Application Cloud (not applicable)

Phase 4 (OpenShift 4.14): OCPSTRAT-36 (formerly OCPBU-236)

  • CCO-186 ccoctl support for credentialing optional capabilities
  • MCO-499 MCD should manage certificates via a separate, non-MC path (formerly IR-230 Make node-ca managed by CVO)
  • CNF-5642 Make cluster autoscaler optional
  • CNF-5643 - Make machine-api operator optional
  • WRKLDS-695 - Make DeploymentConfig API + controller optional
  • CNV-16274 OpenShift Virtualization on the Red Hat Application Cloud (not applicable)
  • CNF-9115 - Leverage Composable OpenShift feature to make control-plane-machine-set optional
  • BUILD-565 - Make Build v1 API + controller optional
  • CNF-5647 Leverage Composable OpenShift feature to make image-registry optional (replaces IR-351 - Make Image Registry Operator optional)

Phase 5 (OpenShift 4.15): OCPSTRAT-421 (formerly OCPBU-519)

  • OCPVE-634 - Leverage Composable OpenShift feature to make olm optional
  • CCO-419 (OCPVE-629) - Leverage Composable OpenShift feature to make cloud-credential  optional

Phase 6 (OpenShift 4.16): OCPSTRAT-731

Phase 7 (OpenShift 4.17): OCPSTRAT-1308

  • IR-400 - Remove node-ca from CIRO*
  • CCO-493 Make Cloud Credential Operator optional for remaining providers and topologies (non-SNO topologies)
  • CNF-9116 Leverage Composable OpenShift feature to machine-auto-approver optional

References

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

 

 

 

Epic Goal

  • Remove node-ca from Cluster Image Registry Operator - its functionality is provided by the MCO

Why is this important?

  • To avoid potential issues, a single component should handle certificate distribution in OCP clusters

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • The registry should continue to work on Hypershift

Dependencies (internal and external)

  1.   HOSTEDCP-1160

Previous Work (Optional):

  1. IR-351

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Once the MCO team is done moving the node-ca functionality to the MCO (MCO-499), we need to remove the node-ca from CIRO.

ACCEPTANCE CRITERIA

  • New clusters provisioned with the registry installed come without the node-ca daemon deployed
  • Existing clusters after upgrade will have the node-ca daemon removed
  • Works on Hypershift

Feature

As an Infrastructure Administrator, I want to deploy OpenShift on vSphere with supervisor (aka Masters) and worker nodes (from a MachineSet) across multiple vSphere data centers and multiple vSphere clusters using full stack automation (IPI) and user provided infrastructure (UPI).

 

MVP

Install OpenShift on vSphere using IPI / UPI in multiple vSphere data centers (regions) and multiple vSphere clusters in 1 vCenter, all in the same IPv4 subnet (in the same physical location).

  • Kubernetes Region contains vSphere datacenter and (single) vCenter name
  • Kubernetes Zone contains vSphere cluster, resource pool, datastore, network (port group)

Out of scope

  • There are no support the conversion of a non-zonal configuration (i.e. an existing OpenShift installation without 1+ zones) to a zonal configuration (1+ zones), but zonal UPI installation by the Infrastructure Administrator is permitted.

Scenarios for consideration:

  • OpenShift in vSphere across different zones to avoid single points of failure, whereby each node is in different ESX clusters within the same vSphere datacenter, but in different networks.
  • OpenShift in vSphere across multiple vSphere datacenter, while ensuring workers and masters are spread across 2 different datacenter in different subnets. (RFE-845, RFE-459).

Acceptance criteria:

  • Ensure vSphere IPI can successfully be deployed with ODF across the 3 zones (vSphere clusters) within the same vCenter [like we do with AWS, GCP & Azure].
  • Ensure zonal configuration in vSphere using UPI is documented and tested.

References: 

Epic Goal*

We need SPLAT-594 to be reflected in our CSI driver operator to support vSphere topology of storage GA.
 
Why is this important? (mandatory)

See SPLAT-320.
 
Scenarios (mandatory) 

As user, I want to edit Infrastructure object after OCP installation (or upgrade) to update cluster topology, so all newly provisioned PVs will get the new topology labels.

(With vSphere topology GA, we expect that users will be allowed to edit Infrastructure and change the cluster topology after cluster installation.)
 
Dependencies (internal and external) (mandatory)

  • SPLAT: [vsphere] Support Multiple datacenters and clusters GA.

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

It's possible that Infrastructure will remain read-only. No code on Storage side is expected then.

Done - Checklist (mandatory)

  • CI Testing -  Basic e2e automationTests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

When STOR-1145 is merged, make sure that these new metrics are reported via telemetry to us.

 

Guide: https://rhobs-handbook.netlify.app/products/openshiftmonitoring/telemetry.md/#sending-metrics-via-telemetry-step-by-step

Exit criteria:

  • verify that metrics are reported in telemetry? I am not sure we have capabilities to test that, all code will be in monitoring repos.

Feature Overview

Create a GCP cloud specific spec.resourceTags entry in the infrastructure CRD. This should create and update tags (or labels in GCP) on any openshift cloud resource that we create and manage. The behaviour should also tag existing resources that do not have the tags yet and once the tags in the infrastructure CRD are changed all the resources should be updated accordingly.

Tag deletes continue to be out of scope, as the customer can still have custom tags applied to the resources that we do not want to delete.

Due to the ongoing intree/out of tree split on the cloud and CSI providers, this should not apply to clusters with intree providers (!= "external").

Once confident we have all components updated, we should introduce an end2end test that makes sure we never create resources that are untagged.

 
Goals

  • Functionality on GCP Tech Preview
  • inclusion in the cluster backups
  • flexibility of changing tags during cluster lifetime, without recreating the whole cluster

Requirements

  • This Section:* A list of specific needs or objectives that a Feature must deliver to satisfy the Feature.. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.
Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

List any affected packages or components.

  • Installer
  • Cluster Infrastructure
  • Storage
  • Node
  • NetworkEdge
  • Internal Registry
  • CCO

This epic covers the work to apply user defined labels GCP resources created for openshift cluster available as tech preview.

The user should be able to define GCP labels to be applied on the resources created during cluster creation by the installer and other operators which manages the specific resources. The user will be able to define the required tags/labels in the install-config.yaml while preparing with the user inputs for cluster creation, which will then be made available in the status sub-resource of Infrastructure custom resource which cannot be edited but will be available for user reference and will be used by the in-cluster operators for labeling when the resources are created.

Updating/deleting of labels added during cluster creation or adding new labels as Day-2 operation is out of scope of this epic.

List any affected packages or components.

  • Installer
  • Cluster Infrastructure
  • Storage
  • Node
  • NetworkEdge
  • Internal Registry
  • CCO

Reference - https://issues.redhat.com/browse/RFE-2017

Enhancement proposed for GCP tags support in OCP, requires cluster-image-registry-operator to add gcp userTags available in the status sub resource of infrastructure CR, to the gcp storage resource created.

cluster-image-registry-operator uses the method createStorageAccount() to create storage resource which should be updated to add tags after resource creation.

Acceptance Criteria

  • Code linting, validation and best practices adhered to
  • UTs and e2e are added/updated

Enhancement proposed for GCP labels support in OCP, requires cluster-image-registry-operator to add gcp userLabels available in the status sub resource of infrastructure CR, to the gcp storage resource created.

cluster-image-registry-operator uses the method createStorageAccount() to create storage resource which should be updated to add labels.

Acceptance Criteria

  • Code linting, validation and best practices adhered to
  • UTs and e2e are added/updated

OC mirror is GA product as of Openshift 4.11 .

The goal of this feature is to solve any future customer request for new features or capabilities in OC mirror 

Epic Goal

  • Mirror to mirror operations and custom mirroring flows required by IBM CloudPak catalog management

Why is this important?

  • IBM needs additional customization around the actual mirroring of images to enable CloudPaks to fully adopt OLM-style operator packaging and catalog management
  • IBM CloudPaks introduce additional compute architectures, increasing the download volume by 2/3rds to day, we need the ability to effectively filter non-required image versions of OLM operator catalogs during filtering for other customers that only require a single or a subset of the available image architectures
  • IBM CloudPaks regularly run on older OCP versions like 4.8 which require additional work to be able to read the mirrored catalog produced by oc mirror

Scenarios

  1. Customers can use the oc utility and delegate the actual image mirror step to another tool
  2. Customers can mirror between disconnected registries using the oc utility
  3. The oc utility supports filtering manifest lists in the context of multi-arch images according to the sparse manifest list proposal in the distribution spec

Acceptance Criteria

  • Customers can use the oc utility to mirror between two different air-gapped environments
  • Customers can specify the desired computer architectures and oc mirror will create sparse manifest lists in the target registry as a result

Dependencies (internal and external)

Previous Work:

  1. WRKLDS-369
  2. Disconnected Mirroring Improvement Proposal

Related Work:

  1. https://github.com/opencontainers/distribution-spec/pull/310
  2. https://github.com/distribution/distribution/pull/3536
  3. https://docs.google.com/document/d/10ozLoV7sVPLB8msLx4LYamooQDSW-CAnLiNiJ9SER2k/edit?usp=sharing

Feature Overview (aka. Goal Summary)  

Description of problem:

Even though in 4.11 we introduced LegacyServiceAccountTokenNoAutoGeneration to be compatible with upstream K8s to not generate secrets with tokens when service accounts are created, today OpenShift still creates secrets and tokens that are used for legacy usage of openshift-controller as well as the image-pull secrets. 

 

Customer issues:

Customers see auto-generated secrets for service accounts which is flagged as a security risk. 

 

This Feature is to track the implementation for removing legacy usage and image-pull secret generation as well so that NO secrets are auto-generated when a Service Account is created on OpenShift cluster. 

 

Goals (aka. expected user outcomes)

NO Secrets to be auto-generated when creating service accounts 

Requirements (aka. Acceptance Criteria):

Following *secrets need to NOT be generated automatically with every Serivce account creation:*  

  1. ImagePullSecrets : This is needed for Kubelet to fetch registry credentials directly. Implementation needed for the following upstream feature.
    https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2133-kubelet-credential-providers/README.md
  2. Dockerconfig secrets: The openshift-controller-manager relies on the old token secrets and it creates them so that it's able to generate registry credentials for the SAs. There is a PR that was created to remove this https://github.com/openshift/openshift-controller-manager/pull/223.

 

 

 Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

 

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

Concerns/Risks: Replacing functionality of one of the openshift-controller used for controllers that's been in the code for a long time may impact behaviors that w

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

 

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

 

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

 

Documentation Considerations

Existing documentation needs to be clear on where we are today and why we are providing the above 2 credentials. Related Tracker: https://issues.redhat.com/browse/OCPBUGS-13226 

 

Interoperability Considerations

Which other projects and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

BU Priority Overview

Create custom roles for GCP with minimal set of required permissions.

Goals

Enable customers to better scope credential permissions and create custom roles on GCP that only include the minimum subset of what is needed for OpenShift.

State of the Business

Some of the service accounts that CCO creates, e.g. service account with role  roles/iam.serviceAccountUser provides elevated permissions that are not required/used by the requesting OpenShift components. This is because we use predefined roles for GCP that come with bunch of additional permissions. The goal is to create custom roles with only the required permissions. 

Execution Plans

TBD

 

These are phase 2 items from CCO-188

Moving items from other teams that need to be committed to for 4.13 this work to complete

Epic Goal

  • Request to build list of specific permissions to run openshift on GCP - Components grant roles, but we need more granularity - Custom roles now allow ability to do this compared to when permissions capabilities were originally written for GCP

Why is this important?

  • Some of the service accounts that CCO creates, e.g. service account with role  roles/iam.serviceAccountUser provides elevated permissions that are not required/used by the requesting OpenShift components. This is because we use predefined roles for GCP that come with bunch of additional permissions. The goal is to create custom roles with only the required permissions. 

Evaluate if any of the GCP predefined roles in the credentials request manifest of Cluster Ingress Operator give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.

The new GCP provider spec for credentials request CR is as follows:

type GCPProviderSpec struct {
   metav1.TypeMeta `json:",inline"`
   // PredefinedRoles is the list of GCP pre-defined roles
   // that the CredentialsRequest requires.
   PredefinedRoles []string `json:"predefinedRoles"`
   // Permissions is the list of GCP permissions required to
   // create a more fine-grained custom role to satisfy the
   // CredentialsRequest.
   // When both Permissions and PredefinedRoles are specified
   // service account will have union of permissions from
   // both the fields
   Permissions []string `json:"permissions"`
   // SkipServiceCheck can be set to true to skip the check whether the requested roles or permissions
   // have the necessary services enabled
   // +optional
   SkipServiceCheck bool `json:"skipServiceCheck,omitempty"`
} 

we can use the following command to check permissions associated with a GCP predefined role

gcloud iam roles describe <role_name>

 

The sample output for role roleViewer is as follows. The  permission are listed in "includedPermissions" field.

[akhilrane@localhost cloud-credential-operator]$ gcloud iam roles describe roles/iam.roleViewer
description: Read access to all custom roles in the project.
etag: AA==
includedPermissions:
- iam.roles.get
- iam.roles.list
- resourcemanager.projects.get
- resourcemanager.projects.getIamPolicy
name: roles/iam.roleViewer
stage: GA
title: Role Viewer

 

Epic Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Evaluate if any of the GCP predefined roles in the credentials request manifest of Cluster Image Registry Operator give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.

The new GCP provider spec for credentials request CR is as follows:

type GCPProviderSpec struct {
   metav1.TypeMeta `json:",inline"`
   // PredefinedRoles is the list of GCP pre-defined roles
   // that the CredentialsRequest requires.
   PredefinedRoles []string `json:"predefinedRoles"`
   // Permissions is the list of GCP permissions required to
   // create a more fine-grained custom role to satisfy the
   // CredentialsRequest.
   // When both Permissions and PredefinedRoles are specified
   // service account will have union of permissions from
   // both the fields
   Permissions []string `json:"permissions"`
   // SkipServiceCheck can be set to true to skip the check whether the requested roles or permissions
   // have the necessary services enabled
   // +optional
   SkipServiceCheck bool `json:"skipServiceCheck,omitempty"`
} 

we can use the following command to check permissions associated with a GCP predefined role

gcloud iam roles describe <role_name>

 

The sample output for role roleViewer is as follows. The  permission are listed in "includedPermissions" field.

[akhilrane@localhost cloud-credential-operator]$ gcloud iam roles describe roles/iam.roleViewer
description: Read access to all custom roles in the project.
etag: AA==
includedPermissions:
- iam.roles.get
- iam.roles.list
- resourcemanager.projects.get
- resourcemanager.projects.getIamPolicy
name: roles/iam.roleViewer
stage: GA

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

Epic Goal

  • The goal of this epic to begin the process of expanding support of OpenShift on ppc64le hardware to include IPI deployments against the IBM Power Virtual Server (PowerVS) APIs.

Why is this important?

The goal of this initiative to help boost adoption of OpenShift on ppc64le. This can be further broken down into several key objectives.

  • For IBM, furthering adopt of OpenShift will continue to drive adoption on their power hardware. In parallel, this can be used for existing customers to migrate their old power on-prem workloads to a cloud environment.
  • For the Multi-Arch team, this represents our first opportunity to develop an IPI offering on one of the IBM platforms. Right now, we depend on IPI on libvirt to cover our CI needs; however, this is not a supported platform for customers. PowerVS would address this caveat for ppc64le.
  • By bringing in PowerVS, we can provide customers with the easiest possible experience to deploy and test workloads on IBM architectures.
  • Customers already have UPI methods to solve their OpenShift on prem needs for ppc64le. This gives them an opportunity for a cloud based option, further our hybrid-cloud story.

Scenarios

  • As a user with a valid PowerVS account, I would like to provide those credentials to the OpenShift installer and get a full cluster up on IPI.

Technical Specifications

Some of the technical specifications have been laid out in MULTIARCH-75.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

  1. Images are built in the RHCOS pipeline and pushed in the OVA format to the IBM Cloud.
  2. Installer is extended to support PowerVS as a new platform.
  3. Machine and cluster APIs are updated to support PowerVS.
  4. A terraform provider is developed against the PowerVS APIs.
  5. A load balancing strategy is determined and made available.
  6. Networking details are sorted out.

Open questions::

  1. Load balancing implementation?
  2. Networking strategy given the lack of virtual network APIs in PowerVS.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Recently, the image registry team decided with this change[1] that major cloud platforms cannot have `emptyDir` as the storage backend. IBMCloud uses ibmcos, which we would ideally need to do. There have been few issues identified with using ibmcos as is in the cluster image registry operator and some solutions identified here[2]. Basically, we would need the PowerVS platform to be supported for ibmcos and an API related to change to add resourceGroup in the infra API. This only affects 4.13 and is not an issue for 4.12.

 

[1] https://github.com/openshift/cluster-image-registry-operator/pull/820

[2] https://coreos.slack.com/archives/CFFJUNP6C/p1672910113386879?thread_ts=1672762737.174679&cid=CFFJUNP6C

Feature Overview

Allow to configure compute and control plane nodes on across multiple subnets for on-premise IPI deployments. With separating nodes in subnets, also allow using an external load balancer, instead of the built-in (keepalived/haproxy) that the IPI workflow installs, so that the customer can configure their own load balancer with the ingress and API VIPs pointing to nodes in the separate subnets.

Goals

I want to install OpenShift with IPI on an on-premise platform (high priority for bare metal and vSphere) and I need to distribute my control plane and compute nodes across multiple subnets.

I want to use IPI automation but I will configure an external load balancer for the API and Ingress VIPs, instead of using the built-in keepalived/haproxy-based load balancer that come with the on-prem platforms.

Background, and strategic fit

Customers require using multiple logical availability zones to define their architecture and topology for their datacenter. OpenShift clusters are expected to fit in this architecture for the high availability and disaster recovery plans of their datacenters.

Customers want the benefits of IPI and automated installations (and avoid UPI) and at the same time when they expect high traffic in their workloads they will design their clusters with external load balancers that will have the VIPs of the OpenShift clusters.

Load balancers can distribute incoming traffic across multiple subnets, which is something our built-in load balancers aren't able to do and which represents a big limitation for the topologies customers are designing.

While this is possible with IPI AWS, this isn't available with on-premise platforms installed with IPI (for the control plane nodes specifically), and customers see this as a gap in OpenShift for on-premise platforms.

Functionalities per Epic

 

Epic Control Plane with Multiple Subnets  Compute with Multiple Subnets Doesn't need external LB Built-in LB
NE-1069 (all-platforms)
NE-905 (all-platforms)
NE-1086 (vSphere)
NE-1087 (Bare Metal)
OSASINFRA-2999 (OSP)  
SPLAT-860 (vSphere)
NE-905 (all platforms)
OPNET-133 (vSphere/Bare Metal for AI/ZTP)
OSASINFRA-2087 (OSP)
KNIDEPLOY-4421 (Bare Metal workaround)
SPLAT-409 (vSphere)

Previous Work

Workers on separate subnets with IPI documentation

We can already deploy compute nodes on separate subnets by preventing the built-in LBs from running on the compute nodes. This is documented for bare metal only for the Remote Worker Nodes use case: https://docs.openshift.com/container-platform/4.11/installing/installing_bare_metal_ipi/ipi-install-installation-workflow.html#configure-network-components-to-run-on-the-control-plane_ipi-install-installation-workflow

This procedure works on vSphere too, albeit no QE CI and not documented.

External load balancer with IPI documentation

  1. Bare Metal: https://docs.openshift.com/container-platform/4.11/installing/installing_bare_metal_ipi/ipi-install-post-installation-configuration.html#nw-osp-configuring-external-load-balancer_ipi-install-post-installation-configuration
  2. vSphere: https://docs.openshift.com/container-platform/4.11/installing/installing_vsphere/installing-vsphere-installer-provisioned.html#nw-osp-configuring-external-load-balancer_installing-vsphere-installer-provisioned

Scenarios

  1. vSphere: I can define 3 or more networks in vSphere and distribute my masters and workers across them. I can configure an external load balancer for the VIPs.
  2. Bare metal: I can configure the IPI installer and the agent-based installer to place my control plane nodes and compute nodes on 3 or more subnets at installation time. I can configure an external load balancer for the VIPs.

Acceptance Criteria

  • Can place compute nodes on multiple subnets with IPI installations
  • Can place control plane nodes on multiple subnets with IPI installations
  • Can configure external load balancers for clusters deployed with IPI with control plane and compute nodes on multiple subnets
  • Can configure VIPs to in external load balancer routed to nodes on separate subnets and VLANs
  • Documentation exists for all the above cases

 

Epic Goal

As an OpenShift installation admin I want to use the Assisted Installer, ZTP and IPI installation workflows to deploy a cluster that has remote worker nodes in subnets different from the local subnet, while my VIPs with the built-in load balancing services (haproxy/keepalived).

While this request is most common with OpenShift on bare metal, any platform using the ingress operator will benefit from this enhancement.

Customers using platform none run external load balancers and they won't need this, this is specific for platforms deployed via AI, ZTP and IPI.

Why is this important?

Customers and partners want to install remote worker nodes on day1. Due to the built-in network services we provide with Assisted Installer, ZTP and IPI that manage the VIP for ingress, we need to ensure that they remain in the local subnet where the VIPs are configured.

Previous Work

The bare metal IPI tam added a workflow that allows to place the VIPs in the masters. While this isn't an ideal solution, this is the only option documented:

Configuring network components to run on the control plane

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Make it possible to entirely disable the Ingress Operator by leveraging the OCPPLAN-9638 Composable OpenShift capability.

Why is this important?

  • For Managed OpenShift on AWS (ROSA), we use the AWS load balancer and don't need the Ingress operator.  Disabling the Ingress Operator will reduce our resource consumption on infra nodes for running OpenShift on AWS.
  • Customers want to be able to disable the Ingress Operator and use their own component.

Scenarios

  1. This feature must consider the different deployment footprints including self-managed and managed OpenShift, connected vs. disconnected (restricted and air-gapped), implications for auto scaling, chargeback/showback use scenarios, etc.
  2. Disabled configuration must persist throughout cluster lifecycle including upgrades

 

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

Links:

RFE: https://issues.redhat.com/browse/RFE-3395

Enhancement PR: https://github.com/openshift/enhancements/pull/1415

API PR: https://github.com/openshift/api/pull/1516

Ingress  Operator PR: https://github.com/openshift/cluster-ingress-operator/pull/950

Background

Feature Goal: Make it possible to entirely disable the Ingress Operator by leveraging the Composable OpenShift capability.

Epic Goal

Implement the ingress capability focusing on the HyperShift users.

Non-Goals

  • Fully implement the ingress capability on the standalone OpenShift.

Design

As described in the EP PR.

Why is this important?

  • For Managed OpenShift on AWS (ROSA), we use the AWS load balancer and don't need the Ingress operator. Disabling the Ingress Operator will reduce our resource consumption on infra nodes for running OpenShift on AWS.
  • Customers want to be able to disable the Ingress Operator and use their own component.

Scenarios

 # ...

Acceptance Criteria

 * Release Technical Enablement - Provide necessary release enablement details and documents.
 * Ingress Operator can be disabled on HyperShift.

  • Dependent operators and OpenShift components can tolerate the disabled ingress operator on HyperShift.

Dependencies (internal and external)

 # The install-config and ClusterVersion API have been updated with the capability feature.
 # The console operator.

Previous Work (Optional):

Open questions:

 #  

Done Checklist

 * CI - CI is running, tests are automated and merged.
 * Release Enablement <link to Feature Enablement Presentation>
 * DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
 * DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
 * DEV - Downstream build attached to advisory: <link to errata>
 * QE - Test plans in Polarion: <link or reference to Polarion>
 * QE - Automated tests merged: <link or reference to automated tests>
 * DOC - Downstream documentation merged: <link to meaningful PR>

Goal
The goal of this user story is to add the new (ingress) capability to the cluster operator's payload (manifests: CRDs, RBACs, deployment, etc.).

Out of scope

  • CRDs and operands created at runtime (assets: gateway CRDs, controllers, subscription. etc.)

Acceptance criteria

  • The ingress capability is known to the openshift installer.
  • The new capability does not introduce any new regression to the e2e tests.

Links

Feature Overview

  • As a Cluster Administrator, I want to opt-out of certain operators at deployment time using any of the supported installation methods (UPI, IPI, Assisted Installer, Agent-based Installer) from UI (e.g. OCP Console, OCM, Assisted Installer), CLI (e.g. oc, rosa), and API.
  • As a Cluster Administrator, I want to opt-in to previously-disabled operators (at deployment time) from UI (e.g. OCP Console, OCM, Assisted Installer), CLI (e.g. oc, rosa), and API.
  • As a ROSA service administrator, I want to exclude/disable Cluster Monitoring when I deploy OpenShift with HyperShift — using any of the supported installation methods including the ROSA wizard in OCM and rosa cli — since I get cluster metrics from the control plane.  This configuration should be persisted through not only through initial deployment but also through cluster lifecycle operations like upgrades.
  • As a ROSA service administrator, I want to exclude/disable Ingress Operator when I deploy OpenShift with HyperShift — using any of the supported installation methods including the ROSA wizard in OCM and rosa cli — as I want to use my preferred load balancer (i.e. AWS load balancer).  This configuration should be persisted through not only through initial deployment but also through cluster lifecycle operations like upgrades.

Goals

  • Make it possible for customers and Red Hat teams producing OCP distributions/topologies/experiences to enable/disable some CVO components while still keeping their cluster supported.

Scenarios

  1. This feature must consider the different deployment footprints including self-managed and managed OpenShift, connected vs. disconnected (restricted and air-gapped), supported topologies (standard HA, compact cluster, SNO), etc.
  2. Enabled/disabled configuration must persist throughout cluster lifecycle including upgrades.
  3. If there's any risk/impact of data loss or service unavailability (for Day 2 operations), the System must provide guidance on what the risks are and let user decide if risk worth undertaking.

Requirements

  • This Section:* A list of specific needs or objectives that a Feature must deliver to satisfy the Feature.. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.
Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

Questions to answer…

  • ...

Out of Scope

Background, and strategic fit

This part of the overall multiple release Composable OpenShift (OCPPLAN-9638 effort), which is being delivered in multiple phases:

Phase 1 (OpenShift 4.11): OCPPLAN-7589 Provide a way with CVO to allow disabling and enabling of operators

  • CORS-1873 Installer to allow users to select OpenShift components to be included/excluded
  • OTA-555 Provide a way with CVO to allow disabling and enabling of operators
  • OLM-2415 Make the marketplace operator optional
  • SO-11 Make samples operator optional
  • METAL-162 Make cluster baremetal operator optional
  • OCPPLAN-8286 CI Job for disabled optional capabilities

Phase 2 (OpenShift 4.12): OCPPLAN-7589 Provide a way with CVO to allow disabling and enabling of operators

Phase 3 (OpenShift 4.13): OCPBU-117

  • OTA-554 Make oc aware of cluster capabilities
  • PSAP-741 Make Node Tuning Operator (including PAO controllers) optional

Phase 4 (OpenShift 4.14): OCPSTRAT-36 (formerly OCPBU-236)

  • CCO-186 ccoctl support for credentialing optional capabilities
  • MCO-499 MCD should manage certificates via a separate, non-MC path (formerly IR-230 Make node-ca managed by CVO)
  • CNF-5642 Make cluster autoscaler optional
  • CNF-5643 - Make machine-api operator optional
  • WRKLDS-695 - Make DeploymentConfig API + controller optional
  • CNV-16274 OpenShift Virtualization on the Red Hat Application Cloud (not applicable)
  • CNF-9115 - Leverage Composable OpenShift feature to make control-plane-machine-set optional

Phase 5 (OpenShift 4.15): OCPSTRAT-421 (formerly) OCPBU-519

  • OCPBU-352 Make Ingress Operator optional
  • BUILD-565 - Make Build v1 API + controller optional
  • OBSDA-242 Make Cluster Monitoring Operator optional
  • OCPVE-630 (formerly CNF-5647) Leverage Composable OpenShift feature to make image-registry optional (replaces IR-351 - Make Image Registry Operator optional)
  • CNF-9114 - Leverage Composable OpenShift feature to make olm optional
  • CNF-9118 - Leverage Composable OpenShift feature to make cloud-credential  optional
  • CNF-9119 - Leverage Composable OpenShift feature to make cloud-controller-manager optional

Phase 6 (OpenShift 4.16): OCPSTRAT-731

  • TBD

References

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

 

 

 

Epic Goal

  • Add an optional capability that allows disabling the image registry operator entirely

Why is this important?

It is already possibly to run a cluster with no instantiated image registry, but the image registry operator itself always runs.  This is an unnecessary use of resources for clusters that don't need/want a registry.  Making it possible to disable this will reduce the resource footprint as well as bug risks for clusters that don't need it, such as SNO and OKE.

 

Acceptance Criteria

  • CI - MUST be running successfully with tests automated (we have an existing CI job that runs a cluster with all optional capabilities disabled.  Passing that job will require disabling certain image registry tests when the cap is disabled)
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1.  MCO-499 must be completed first because we still need the CA management logic running even if the image registry operator is not running.

Previous Work (Optional):

  1. The optional cap architecture and guidance for adding a new capability is described here: https://github.com/openshift/enhancements/blob/master/enhancements/installer/component-selection.md

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

To enable the MCO to replace the node-ca, the registry operator needs to provide its own CAs in isolation.

Currently, the registry provides its own CAs via the "image-registry-certificates" configmap. This configmap is a merge of the service ca, storage ca, and additionalTrustedCA (from images.config.openshift.io/cluster).

Because the MCO already has access to additionalTrustedCA, the new secret does not need to contain it.

 

ACCEPTANCE CRITERIA

TBD

  1. Proposed title of this feature request:

Update ETCD datastore encryption to use AES-GCM instead of AES-CBC

2. What is the nature and description of the request?

The current ETCD datastore encryption solution uses the aes-cbc cipher. This cipher is now considered "weak" and is susceptible to padding oracle attack.  Upstream recommends using the AES-GCM cipher. AES-GCM will require automation to rotate secrets for every 200k writes.

The cipher used is hard coded. 

3. Why is this needed? (List the business requirements here).

Security conscious customers will not accept the presence and use of weak ciphers in an OpenShift cluster. Continuing to use the AES-CBC cipher will create friction in sales and, for existing customers, may result in OpenShift being blocked from being deployed in production. 

4. List any affected packages or components.

Epic Goal*

What is our purpose in implementing this?  What new capability will be available to customers?

The Kube APIserver is used to set the encryption of data stored in etcd. See https://docs.openshift.com/container-platform/4.11/security/encrypting-etcd.html

 

Today with OpenShift 4.11 or earlier, only aescbc is allowed as the encryption field type. 

 

RFE-3095 is asking that aesgcm (which is an updated and more recent type) be supported. Furthermore RFE-3338 is asking for more customizability which brings us to how we have implemented cipher customzation with tlsSecurityProfile. See https://docs.openshift.com/container-platform/4.11/security/tls-security-profiles.html

 

 
Why is this important? (mandatory)

AES-CBC is considered as a weak cipher

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing -  Basic e2e automationTests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Feature Overview (aka. Goal Summary)  

Support OpenShift installation in AWS Shared VPC [1] scenario where AWS infrastructure resources (at least the Private Hosted Zone) belong to an account separate from the cluster installation target account.

Goals (aka. expected user outcomes)

As a user I need to use a Shared VPC [1] when installing OpenShift on AWS into an existing VPC. Which will at least require the use of a preexisting Route53 hosted zone where I am not allowed the user "participant" of the shared VPC to automatically create Route53 private zones.

Requirements (aka. Acceptance Criteria):

The Installer is able to successfully deploy OpenShift on AWS with a Shared VPC [1], and the cluster is able to successfully pass osde2e testing. This will include at least the scenario when private hostedZone belongs to different account (Account A) than cluster resources (Account B)

[1] https://docs.aws.amazon.com/vpc/latest/userguide/vpc-sharing.html

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic —

Links:

Enhancement PR: https://github.com/openshift/enhancements/pull/1397 

API PR: https://github.com/openshift/api/pull/1460 

Ingress  Operator PR: https://github.com/openshift/cluster-ingress-operator/pull/928 

Background

Feature Goal: Support OpenShift installation in AWS Shared VPC scenario where AWS infrastructure resources (at least the Private Hosted Zone) belong to an account separate from the cluster installation target account.

The ingress operator is responsible for creating DNS records in AWS Route53 for cluster ingress. Prior to the implementation of this epic, the ingress operator doesn't have the capability to add DNS records into an existing Route 53 hosted zone in the shared VPC.

Epic Goal

  • Add support to the ingress operator for creating DNS records in preexisting Route53 private hosted zones for Shared VPC clusters

Non-Goals

  • Ingress operator support for day-2 operations (i.e. changes to the AWS IAM Role value after installation)  
  • E2E testing (will be handled by the Installer Team) 

Design

As described in the WIP PR https://github.com/openshift/cluster-ingress-operator/pull/928, the ingress operator will consume a new API field that contains the IAM Role ARN for configuring DNS records in the private hosted zone. If this field is present, then the ingress operator will use this account to create all private hosted zone records. The API fields will be described in the Enhancement PR.

The ingress operator code will accomplish this by defining a new provider implementation that wraps two other DNS providers, using one of them to publish records to the public zone and the other to publish records to the private zone.

External DNS Operator Impact

See NE-1299

AWS Load Balancer Operator (ALBO) Impact

See NE-1299

Why is this important?

  • Without this ingress operator support, OpenShift users are unable to create DNS records in a preexisting Route53 private hosted zone which means OpenShift users can't share the Route53 component with a Shared VPC
  • Shared VPCs are considers AWS best practice

Scenarios

  1. ...

Acceptance Criteria

  • Unit tests must be written and automatically run in CI (E2E tests will be handled by the Installer Team)
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Ingress Operator creates DNS Records in preexisting Route53 private hosted zones for shared VPC Clusters
  • Network Edge Team has reviewed all of the related enhancements and code changes for Route53 in Shared VPC Clusters

Dependencies (internal and external)

  1. Installer Team is adding the new API fields required for enabling sharing Route53 with in Shared VPCs in https://issues.redhat.com/browse/CORS-2613
  2. Testing this epic requires having access to two AWS account

Previous Work (Optional):

  1. Significant discussion was done in this thread: https://redhat-internal.slack.com/archives/C68TNFWA2/p1681997102492889?thread_ts=1681837202.378159&cid=C68TNFWA2
  1. Slack channel #tmp-xcmbu-114

Open questions:

  1.  

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

 

Goal: Control plane nodes in the cluster can be scaled up or down, lost and recovered, with no more importance or special procedure than that of a data plane node.

Problem: There is a lengthy special procedure to recover from a failed control plane node (or majority of nodes) and to add new control plane nodes.

Why is this important: Increased operational simplicity and scale flexibility of the cluster’s control plane deployment.

 

See slack working group: #wg-ctrl-plane-resize

Epic Goal

  • Resolve the outstanding technical debt from the ControlPlaneMachineSet project

Why is this important?

  • We need to make sure the project is tested, documented and maintained going forward

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. OCPCLOUD-1372

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Background

Not every platform needs to support failure domains, for example baremetal.

In order to support generic platforms, we could introduce a generic providerConfig abstraction that can handle the case when the failureDomains is empty.

In this case, we never manipulate the providerSpec anyway so there's no need to have a platform specific providerConfig abstraction.

Steps

  • Create a generic providerConfig abstraction
  • Write tests for the generic providerConfig abstraction
  • Integrate this so that we use the generic abstraction when there are no failure domains
  • Test on a platform that doesn't support failure domains, eg vSphere

Stakeholders

  • Cluster Infra
  • Platforms who aren't AWS, GCP, Azure

Definition of Done

  • We can support generic platforms
  • Docs
  • Announce single zone support for any platform
  • Testing
  • Manual testing against single zone and other platforms such as vSphere

Background

Now that we have a CPMS generator controller, we need to document how the user will interact with the CPMS to activate it.

Steps

 

  • Document how to review the generated CPMS object
  • Document how to fix the most common cluster state inconsistencies to make the generated CPMS object a no-op on the cluster
  • Document how to Activate the CPMS object and what that means
  • Update the CPMS enhancement regarding the installation.

Stakeholders

  • Cluster Infra
  • Service Delivery

Definition of Done

  • Docs are merged
  • Docs
  • N/A
  • Testing
  • N/A

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

 

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Complete during New status.

 

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

 

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

 

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

 

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

 

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

 

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

 

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  Initial completion during Refinement status.

 

Interoperability Considerations

Which other projects and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

Epic Goal

  • The goal of this epic is to enable monitoring on the Windows nodes created by Windows Machine Config Operator(WMCO) in an OpenShift cluster.

Why is this important?

  • Monitoring is critical to identify issues with nodes, pods and containers running on the nodes. Monitoring enables users to make informed decisions, optimize performance, and strategically manage infrastructure costs. Implementation of this epic will ensure that we have a consistent user experience for monitoring across Linux and Windows.

Scenarios

Display console graphs for the following:

  1. Windows node metrics.
  2. Windows pod metrics.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated to test that the console graphs are present.
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

  1. None

Previous Work :

  1. Enhancement: Monitoring-windows-nodes

Open questions::

      1. Specifics on testing framework- adding tests to https://github.com/openshift/console- pending story creation.

      2. Feasibility of https://issues.redhat.com/browse/WINC-530

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description

When inspecting a pod in the OCP console, the metrics tab shows a number of graphs. The networking graphs do not show data for Windows pods.

Engineering Details

Using windows exporter 0.24.0 I am getting `No datapoints found` for the query made by the console:
(sum(irate(container_network_receive_bytes_total{pod='win-webserver-685cd6c5cc-8298l'}[5m])) by (pod, namespace, interface)) + on(namespace,pod,interface) group_left(network_name) (pod_network_name_info)

There is data returned from the query `irate(container_network_receive_bytes_total{pod='windows-machine-config-operator-7c8bcc7b64-sjqxw'}[5m])`
Which makes me believe the error is due to pod_network_name_info not having data for the Windows pods I am looking at.

I'm confirming that by checking in the namespace the workloads are deployed to via the query: pod_network_name_info{namespace="openshift-windows-machine-config-operator"}
I only see metrics for the Linux pods in the namespace.

Looking into this it seems like these metrics are coming from https://github.com/openshift/network-metrics-daemon which runs on each Linux node, and creates a metric for applicable pods running on the node.

Acceptance Criteria

  • The pod network graphs shown in the OCP console are populated for Windows pods.

Pre-Work Objectives

Since some of our requirements from the ACM team will not be available for the 4.12 timeframe, the team should work on anything we can get done in the scope of the console repo so that when the required items are available in 4.13, we can be more nimble in delivering GA content for the Unified Console Epic.

Overall GA Key Objective
Providing our customers with a single simplified User Experience(Hybrid Cloud Console)that is extensible, can run locally or in the cloud, and is capable of managing the fleet to deep diving into a single cluster. 
Why customers want this?

  1. Single interface to accomplish their tasks
  2. Consistent UX and patterns
  3. Easily accessible: One URL, one set of credentials

Why we want this?

  • Shared code -  improve the velocity of both teams and most importantly ensure consistency of the experience at the code level
  • Pre-built PF4 components
  • Accessibility & i18n
  • Remove barriers for enabling ACM

Phase 2 Goal: Productization of the united Console 

  1. Enable user to quickly change context from fleet view to single cluster view
    1. Add Cluster selector with “All Cluster” Option. “All Cluster” = ACM
    2. Shared SSO across the fleet
    3. Hub OCP Console can connect to remote clusters API
    4. When ACM Installed the user starts from the fleet overview aka “All Clusters”
  2. Share UX between views
    1. ACM Search —> resource list across fleet -> resource details that are consistent with single cluster details view
    2. Add Cluster List to OCP —> Create Cluster

As a developer I would like to disable clusters like *KS that we can't support for multi-cluster (for instance because we can't authenticate). The ManagedCluster resource has a vendor label that we can use to know if the cluster is supported.

cc Ali Mobrem Sho Weimer Jakub Hadvig 

UPDATE: 9/20/22 : we want an allow-list with OpenShift, ROSA, ARO, ROKS, and  OpenShiftDedicated

Acceptance criteria:

  • Investigate if console-operator should pass info about which cluster are supported and unsupported to the frontend
  • Unsupported clusters should not appear in the cluster dropdown
  • Unsupported clusters based off
    • defined vendor label
    • non 4.x ocp clusters

Feature Goal: Unify the management of cluster ingress with a common, open, expressive, and extensible API.

Why is this Important? Gateway API is the evolution of upstream Kubernetes Ingress APIs. The upstream project is part of Kubernetes, working under SIG-NETWORK. OpenShift is contributing to the development, building a leadership position, and preparing OpenShift to support Gateway API, with Istio as our supported implementation.

The plug-able nature of the implementation of Gateway API enables support for additional and optional 3rd-party Ingress technologies.

Functional Requirements

  • Add support for Istio as a Gateway API implementation.
    • NE-1105 Management by an operator (possibly cluster-ingress-operator, OSSM operator, or a new operator)
    • Feature parity with OpenShift Router, where appropriate.
      • NE-1096    Provide a solution to support re-encrypt in Gateway API
      • NE-1097    Provide a solution to support passthrough in Gateway API
      • NE-1098    Research and select OSSM Istio image that provides enough features
    • Performance parity evaluation of Envoy and HAProxy.
    • NE-1102    Add oc command line support for Gateway API objects
    • NE-1103    Evaluate idling support for Gateway API
  • Avoid conflict with partner solutions (such as F5). 
    • Provide a solution that partners could integrate with (reduce dependencies on Istio by assuming plugins)
  • Avoid conflict with integrations (such as GKE) for hybrid cloud use cases.
  • NE-1106 Advanced routing capabilities currently unavailable in OCP.
    • More powerful path-based routing.
    • Header-based routing
    • Traffic mirroring
    • Traffic splitting (single and multi cluster)
    • Other features, based on time constraints
      • NE-1000 Understand Gateway API listener collapsing and how Istio Gateway implements
      • NE-1016 Investigate and document External DNS integration with Gateway API
      • Non-HTTP types of traffic (arbitrary TCP/UDP).
         
         
  • Add Gateway API support with OSSM service mesh.
    • Avoid conflict between Istio for ingress use-cases and Istio for mesh use-cases.
    • NE-1074 and NE-1095 Enable a unified control plane for ingress and mesh. 
    • NE-1035 Determine what OSSM release (based on what Istio release)...
  • Add Gateway API support for serverless.

Non-Functional Requirements:

  • NE-1034 Installation
  • NE-1110 Documentation
  • Release technical enablement
  • OCP CI integration
  • Continued upstream development to mature Gateway API and Istio support for the same.

Open Questions:

  • Integration with HAProxy?
  • Gateway is more than Ingress 2.0, how do we align with other platform components such as serverless and service mesh to ensure we're providing a complete solution?

Documentation Considerations:

  • Explain the resource model
  • Explain roles and how they align to Gateway API resources
  • Explain the extension points and provide extension point examples.
  • Xref upstream docs.

User Story: As a cluster admin, I want to create a gatewayclass and a gateway, and OpenShift should configure Istio/Envoy with an LB and DNS, so that traffic can reach httproutes attached to the gateway.

The operator will be one of these (or some combination):

  • cluster-ingress-operator
  • OSSM operator
  • a new operator

Functionality includes DNS (NE-1107), LoadBalancer (NE-1108), , and other operations formerly performed by the cluster-ingress-operator for routers.

  • configures GWAPI subcomponents
    • Installs GWAPI Gateway CRD
  • installs Istio (if needed) when Gateway and GatewayClasses are created

Requires design document or enhancement proposal, breakdown into more specific stories.

(probably needs to be an Epic, will move things around later to accomodate that).

 

Out of scope for enhanced dev preview:

  • Unified Control Plane operations (NE-1095)
  • Installs RBAC that restricts who can configure Gateway and GatewayClasses 

Upstream K8s deprecated PodSecurityPolicy and replaced it with a new built-in admission controller that enforces the Pod Security Standards (See here for the motivations for deprecation).] There is an OpenShift-specific dedicated pod admission system called Security Context Constraints. Our aim is to keep the Security Context Constraints pod admission system while also allowing users to have access to the Kubernetes Pod Security Admission. 

With OpenShift 4.11, we are turned on the Pod Security Admission with global "privileged" enforcement. Additionally we set the "restricted" profile for warnings and audit. This configuration made it possible for users to opt-in their namespaces to Pod Security Admission with the per-namespace labels. We also introduced a new mechanism that automatically synchronizes the Pod Security Admission "warn" and "audit" labels.

With OpenShift 4.15, we intend to move the global configuration to enforce the "restricted" pod security profile globally. With this change, the label synchronization mechanism will also switch into a mode where it synchronizes the "enforce" Pod Security Admission label rather than the "audit" and "warn". 

Epic Goal

Get Pod Security admission to be run in "restricted" mode globally by default alongside with SCC admission.

When creating a custom SCC, it is possible to assign a priority that is higher than existing SCCs. This means that any SA with access to all SCCs might use the higher priority custom SCC, and this might mutate a workload in an unexpected/unintended way.

To protect platform workloads from such an effect (which, combined with PSa, might result in rejecting the workload once we start enforcing the "restricted" profile) we must pin the required SCC to all workloads in platform namespaces (openshift-, kube-, default).

Each workload should pin the SCC with the least-privilege, except workloads in runlevel 0 namespaces that should pin the "privileged" SCC (SCC admission is not enabled on these namespaces, but we should pin an SCC for tracking purposes).

The following table tracks progress:

namespace in review merged
openshift-apiserver-operator PR  
openshift-authentication PR  
openshift-authentication-operator PR  
openshift-cloud-controller-manager    
openshift-cloud-controller-manager-operator    
openshift-cloud-credential-operator PR
openshift-cloud-network-config-controller PR  
openshift-cluster-csi-drivers PR1, PR2
openshift-cluster-machine-approver    
openshift-cluster-node-tuning-operator PR  
openshift-cluster-samples-operator PR
openshift-cluster-storage-operator PR1, PR2  
openshift-cluster-version PR
openshift-config-operator PR  
openshift-console PR  
openshift-console-operator PR  
openshift-controller-manager PR  
openshift-controller-manager-operator PR  
openshift-dns    
openshift-dns-operator    
openshift-etcd    
openshift-etcd-operator    
openshift-image-registry PR
openshift-ingress PR  
openshift-ingress-canary PR  
openshift-ingress-operator PR  
openshift-insights PR
openshift-kube-apiserver    
openshift-kube-apiserver-operator    
openshift-kube-controller-manager    
openshift-kube-controller-manager-operator    
openshift-kube-scheduler    
openshift-kube-scheduler-operator    
openshift-kube-storage-version-migrator PR  
openshift-kube-storage-version-migrator-operator PR  
openshift-machine-api PR1, PR2, PR3, PR4, PR5, PR6  
openshift-machine-config-operator PR  
openshift-marketplace    
openshift-monitoring    
openshift-multus    
openshift-network-diagnostics PR  
openshift-network-node-identity PR  
openshift-network-operator    
openshift-oauth-apiserver PR  
openshift-operator-lifecycle-manager PR
openshift-ovn-kubernetes    
openshift-route-controller-manager PR  
openshift-service-ca PR
openshift-service-ca-operator PR

Feature Overview

  • Customers want to create and manage OpenShift clusters using managed identities for Azure resources for authentication.

Goals

  • A customer using ARO wants to spin up an OpenShift cluster with "az aro create" without needing additional input, i.e. without the need for an AD account or service principal credentials, and the identity used is never visible to the customer and cannot appear in the cluster.
  • As an administrator, I want to deploy OpenShift 4 and run Operators on Azure using access controls (IAM roles) with temporary, limited privilege credentials.

Requirements

  • Azure managed identities must work for installation with all install methods including IPI and UPI, work with upgrades, and day-to-day cluster lifecycle operations.
  • Support HyperShift and non-HyperShift clusters.
  • Support use of Operators with Azure managed identities.
  • Support in all Azure regions where Azure managed identity is available. Note: Federated credentials is associated with Azure Managed Identity, and federated credentials is not available in all Azure regions.

More details at ARO managed identity scope and impact.

 

This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature.. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.

Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

Questions to answer…

  • ...

Out of Scope

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

References

Epic Goal

  • CIRO can consume azure workload identity tokens
  • CIRO's Azure credential request uses new API field for requesting permissions

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

This effort is dependent on the completion of work for CCO-187, and effort in dependent modules is planned to be worked on by the CCO team unless individual repo owners can help. Operators owners/teams will be expected to review merge requests and complete appropriate QE effort for an openshift release.

  • azure-sdk-for-go module dependency updated to support workload identity federation.
  • Mount the OIDC token in the operator pod. This needs to go in the deployment. See example from addition to the cluster-image-registry-operator here

 

ACCEPTANCE CRITERIA

  • image-registry uses latest openshift/docker-distribution
  • CIRO can detect when the creds it gets from CCO are for federated workload identity (the credentials secret will contain a "azure_federated_token_file")
  • when using federated workload identity, CIRO adds the "AZURE_FEDERATED_TOKEN_FILE" env var to the image-registry deployment
  • when using federated workload identity, CIRO does not add the "REGISTRY_STORAGE_AZURE_ACCOUNTKEY" env var to the image-registry deployment
  • the image-registry operates normally when using federated workload identity

This effort is dependent on the completion of work for CCO-187, and effort in dependent modules is planned to be worked on by the CCO team unless individual repo owners can help. Operators owners/teams will be expected to review merge requests and complete appropriate QE effort for an openshift release.

  • azure-sdk-for-go module dependency updated to support workload identity federation.
  • Mount the OIDC token in the operator pod. This needs to go in the deployment. See example from addition to the cluster-image-registry-operator here

 

ACCEPTANCE CRITERIA

  • CIRO should retrieve the "azure_resourcegroup" from the cluster Infrastructure object instead of the CCO created secret (this key will not be present when workload identity is in use)
  • CIRO's CredentialsRequest specifies the service account names (see the: cluster-storage-operator for an example)
  • CIRO is able to create storage accounts and containers when configured with azure workload identity.

Epic Goal

  • Build list of specific permissions to run Openshift on Azure - Components grant roles, but we need more granularity.
  • Determine and document the Azure roles and required permissions for Azure managed identity.

Why is this important?

  • Many of our customers have security policies in their organization that restrict credentials to only minimal permissions that conflict with the documented list of permissions needed for OpenShift. Customers need to know the explicit list of permissions minimally needed for deploying and running OpenShift and what they're used for so they can request the right permissions. Without this information, it can/will block adoption of OpenShift 4 in many cases.

Scenarios

  1. ...

Acceptance Criteria

  • Document explicit list of required credential permissions for installing (Day 1) OpenShift on Azure using the IPI and UPI deployment workflows and what each of the permissions are used for.
  • Document explicit list of required role and credential permissions for the operation (Day 2) of an OpenShift cluster on Azure and what each of the permissions are used for
  • Verify minimum list of permissions for Azure with IPI and UPI installation workflows
  • (Day 2) operations of OpenShift on Azure - MUST complete successfully with automated tests
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. Installer [both UPI & IPI Workflows]
  2. Control Plane
    • Kube Controller Manager
  3. Compute [Managed Identity]
  4. Cloud API enabled components
    • Cloud Credential Operator
    • Machine API
    • Internal Registry
    • Ingress
  5. ?
  6.  

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

 

 

Epic Overview

  • Enable customers to create and manage OpenShift clusters using managed identities for Azure resources for authentication.
  • A customer using ARO wants to spin up an OpenShift cluster with "az aro create" without needing additional input, i.e. without the need for an AD account or service principal credentials, and the identity used is never visible to the customer and cannot appear in the cluster.

Epic Goal

  • A customer creates an OpenShift cluster ("az aro create") using Azure managed identity.
  • Azure managed identities must work for installation with all install methods including IPI and UPI, work with upgrades, and day-to-day cluster lifecycle operations.
  • After Azure failed to implement workable golang API changes after deprecation of their old API, we have removed mint mode and work entirely in passthrough mode. Azure has plans to implement pod/workload identity similar to how they have been implemented in AWS and GCP, and when this feature is available, we should implement permissions similar to AWS/GCP
  • This work cannot start until Azure have implemented this feature - as such, this Epic is a placeholder to track the effort when available.

Why is this important?

  • Microsoft and the customer would prefer that we use Managed Identities vs. Service Principal (which requires putting the Service Principal and principal password in clear text within the azure.conf file).

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

 

 

This effort is dependent on the completion of work for CCO-187, and effort in dependent modules is planned to be worked on by the CCO team unless individual repo owners can help. Operators owners/teams will be expected to review merge requests and complete appropriate QE effort for an openshift release.

  • azure-sdk-for-go module dependency updated to support workload identity federation.
  • Mount the OIDC token in the operator pod. This needs to go in the deployment. See example from addition to the cluster-image-registry-operator here

Feature Overview

RHEL CoreOS should be updated to RHEL 9.2 sources to take advantage of newer features, hardware support, and performance improvements.

 

Requirements

  • RHEL 9.x sources for RHCOS builds starting with OCP 4.13 and RHEL 9.2.

 

Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

(Optional) Use Cases

  • 9.2 Preview via Layering No longer necessary assuming we stay the course of going all in on 9.2

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

PROBLEM

We would like to improve our signal for RHEL9 readiness by increasing internal engineering engagement and external partner engagement on our community OpehShift offering, OKD.

PROPOSAL

Adding OKD to run on SCOS (a CentOS stream for CoreOS) brings the community offering closer to what a partner or an internal engineering team might expect on OCP.

ACCEPTANCE CRITERIA

Image has been switched/included: 

DEPENDENCIES

The SCOS build payload.

RELATED RESOURCES

OKD+SCOS proposal: https://docs.google.com/presentation/d/1_Xa9Z4tSqB7U2No7WA0KXb3lDIngNaQpS504ZLrCmg8/edit#slide=id.p

OKD+SCOS work draft: https://docs.google.com/document/d/1cuWOXhATexNLWGKLjaOcVF4V95JJjP1E3UmQ2kDVzsA/edit

 

Acceptance Criteria

A stable OKD on SCOS is built and available to the community sprintly.

 

This comes up when installing ipi-on-aws on arm64 with the custom payload build at quay.io/aleskandrox/okd-release:4.12.0-0.okd-centos9-full-rebuild-arm64 that is using scos as machine-content-os image

 

```

[root@ip-10-0-135-176 core]# crictl logs c483c92e118d8
2022-08-11T12:19:39+00:00 [cnibincopy] FATAL ERROR: Unsupported OS ID=scos
```

 

The probable fix has to land on https://github.com/openshift/cluster-network-operator/blob/master/bindata/network/multus/multus.yaml#L41-L53

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Create a new platform type, working name "External", that will signify when a cluster is deployed on a partner infrastructure where core cluster components have been replaced by the partner. “External” is different from our current platform types in that it will signal that the infrastructure is specifically not “None” or any of the known providers (eg AWS, GCP, etc). This will allow infrastructure partners to clearly designate when their OpenShift deployments contain components that replace the core Red Hat components.

This work will require updates to the core OpenShift API repository to add the new platform type, and then a distribution of this change to all components that use the platform type information. For components that partners might replace, per-component action will need to be taken, with the project team's guidance, to ensure that the component properly handles the "External" platform. These changes will look slightly different for each component.

To integrate these changes more easily into OpenShift, it is possible to take a multi-phase approach which could be spread over a release boundary (eg phase 1 is done in 4.X, phase 2 is done in 4.X+1).

Phase 1

  • Write platform “External” enhancement.
  • Evaluate changes to cluster capability annotations to ensure coverage for all replaceable components.
  • Meet with component teams to plan specific changes that will allow for supplement or replacement under platform "External".

Phase 2

  • Update OpenShift API with new platform and ensure all components have updated dependencies.
  • Update capabilities API to include coverage for all replaceable components.
  • Ensure all Red Hat operators tolerate the "External" platform and treat it the same as "None" platform.

Phase 3

  • Update components based on identified changes from phase 1
    • Update Machine API operator to run core controllers in platform "External" mode.

Why is this important?

  • As partners begin to supplement OpenShift's core functionality with their own platform specific components, having a way to recognize clusters that are in this state helps Red Hat created components to know when they should expect their functionality to be replaced or supplemented. Adding a new platform type is a significant data point that will allow Red Hat components to understand the cluster configuration and make any specific adjustments to their operation while a partner's component may be performing a similar duty.
  • The new platform type also helps with support to give a clear signal that a cluster has modifications to its core components that might require additional interaction with the partner instead of Red Hat. When combined with the cluster capabilities configuration, the platform "External" can be used to positively identify when a cluster is being supplemented by a partner, and which components are being supplemented or replaced.

Scenarios

  1. A partner wishes to replace the Machine controller with a custom version that they have written for their infrastructure. Setting the platform to "External" and advertising the Machine API capability gives a clear signal to the Red Hat created Machine API components that they should start the infrastructure generic controllers but not start a Machine controller.
  2. A partner wishes to add their own Cloud Controller Manager (CCM) written for their infrastructure. Setting the platform to "External" and advertising the CCM capability gives a clear to the Red Hat created CCM operator that the cluster should be configured for an external CCM that will be managed outside the operator. Although the Red Hat operator will not provide this functionality, it will configure the cluster to expect a CCM.

Acceptance Criteria

Phase 1

  • Partners can read "External" platform enhancement and plan for their platform integrations.
  • Teams can view jira cards for component changes and capability updates and plan their work as appropriate.

Phase 2

  • Components running in cluster can detect the “External” platform through the Infrastructure config API
  • Components running in cluster react to “External” platform as if it is “None” platform
  • Partners can disable any of the platform specific components through the capabilities API

Phase 3

  • Components running in cluster react to the “External” platform based on their function.
    • for example, the Machine API Operator needs to run a set of controllers that are platform agnostic when running in platform “External” mode.
    • the specific component reactions are difficult to predict currently, this criteria could change based on the output of phase 1.

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. Identifying OpenShift Components for Install Flexibility

Open questions::

  1. Phase 1 requires talking with several component teams, the specific action that will be needed will depend on the needs of the specific component. At the least the components need to treat platform "External" as "None", but there could be more changes depending on the component (eg Machine API Operator running non-platform specific controllers).

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal

As defined in the  External platform enhancement , a new platform is being added to OpenShift. To accommodate the phase 2 work, the CIO should be updated, if necessary, to react to the "External" platform in the same manner as it would for platform "None".

Please see the  enhancement and the parent plan OCPBU-5 for more details about this process.

Why is this important?

In phase 2 (planned for 4.13 release) of the external platform enhancement, the new platform type will be added to the openshift/api packages. As part of staging the release of this new platform we will need to ensure that all operators react in a neutral way to the platform, as if it were a "None" platform to ensure the continued normal operation of OpenShift.

Scenarios

  1. As a user I would like to enable the External platform so that I can supplement OpenShift with my own container network options. To ensure proper operation of OpenShift, the cluster ingress operator should not react to the new platform or prevent my installation of the custom driver so that I can create clusters with my own topology.

Acceptance Criteria

We are working to create an External platform test which will exercise this mechanism, see OCPCLOUD-1782

Dependencies (internal and external)

  1. This will require OCPCLOUD-1777

Previous Work (Optional):

Open questions::

Done Checklist

  • CI Testing - we will perform manual test while waiting for OCPCLOUD-1782
  • Documentation - only developer docs need to be updated at this time
  • QE - test scenario should be covered by a cluster-wide install with the new platform type
  • Technical Enablement - n/a
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 
  • ** - Downstream documentation merged: <link to meaningful PR>

as described in the epic, the CIO should be updated to react to the new "External" platform as it would for a "None" platform.

Epic Goal

As defined in the  External platform enhancement , a new platform is being added to OpenShift. To accommodate the phase 2 work, the CIRO should be updated, if necessary, to react to the "External" platform in the same manner as it would for platform "None".

Please see the  enhancement and the parent plan OCPBU-5 for more details about this process.

Why is this important?

In phase 2 (planned for 4.13 release) of the external platform enhancement, the new platform type will be added to the openshift/api packages. As part of staging the release of this new platform we will need to ensure that all operators react in a neutral way to the platform, as if it were a "None" platform to ensure the continued normal operation of OpenShift.

Scenarios

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1. As a user I would like to enable the External platform so that I can supplement OpenShift with my own Image Registry options. To ensure proper operation of OpenShift, the cluster image registry operator should not react to the new platform or prevent my installation of the custom driver so that I can create clusters with my own topology.

Acceptance Criteria

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

We are working to create an External platform test which will exercise this mechanism, see OCPCLOUD-1782

Dependencies (internal and external)

This will require OCPCLOUD-1777

Previous Work (Optional):

Open questions::

Done Checklist

  • CI Testing - we will perform manual test while waiting for OCPCLOUD-1782
  • Documentation - only developer docs need to be updated at this time
  • QE - test scenario should be covered by a cluster-wide install with the new platform type
  • Technical Enablement - n/a
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

as described in the epic, the CIRO should be updated to react to the new "External" platform as it would for a "None" platform.

 

 
Goal:
API and implementation work to provide the cluster admin with an option in the IngressController API to use PROXY protocol with IBM Cloud load-balancers. 

Description:
This epic extends the IngressController API essentially by copying the option we added in NE-330.  In that epic, we added a configuration option to use PROXY protocol when configuring an IngresssController to use a NodePort service or host networking.  With this epic (NE-1090), the same configuration option is added to use PROXY protocol when configuring an IngressController to use a LoadBalancer service on IBM Cloud. 

 
This epic tracks the API and implementation work to provide the cluster admin with an option in the IngressController API to use PROXY protocol with IBM Cloud load-balancers. 

This epic extends the IngressController API essentially by copying the option we added in NE-330.  In that epic, we added a configuration option to use PROXY protocol when configuring an IngresssController to use a NodePort service or host networking.  With this epic (NE-1090), the same configuration option is added to use PROXY protocol when configuring an IngressController to use a LoadBalancer service on IBM Cloud. 

Feature Overview

Create a Azure cloud specific spec.resourceTags entry in the infrastructure CRD. This should create and update tags (or labels in Azure) on any openshift cloud resource that we create and manage. The behaviour should also tag existing resources that do not have the tags yet and once the tags in the infrastructure CRD are changed all the resources should be updated accordingly.

Tag deletes continue to be out of scope, as the customer can still have custom tags applied to the resources that we do not want to delete.

Due to the ongoing intree/out of tree split on the cloud and CSI providers, this should not apply to clusters with intree providers (!= "external").

Once confident we have all components updated, we should introduce an end2end test that makes sure we never create resources that are untagged.

 
Goals

  • Functionality on Azure Tech Preview
  • inclusion in the cluster backups
  • flexibility of changing tags during cluster lifetime, without recreating the whole cluster

Requirements

  • This Section:* A list of specific needs or objectives that a Feature must deliver to satisfy the Feature.. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.
Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

List any affected packages or components.

  • Installer
  • Cluster Infrastructure
  • Storage
  • Node
  • NetworkEdge
  • Internal Registry
  • CCO

This epic covers the work to apply user defined tags to Azure created for openshift cluster available as tech preview.

The user should be able to define the azure tags to be applied on the resources created during cluster creation by the installer and other operators which manages the specific resources. The user will be able to define the required tags in the install-config.yaml while preparing with the user inputs for cluster creation, which will then be made available in the status sub-resource of Infrastructure custom resource which cannot be edited but will be available for user reference and will be used by the in-cluster operators for tagging when the resources are created.

Updating/deleting of tags added during cluster creation or adding new tags as Day-2 operation is out of scope of this epic.

List any affected packages or components.

  • Installer
  • Cluster Infrastructure
  • Storage
  • Node
  • NetworkEdge
  • Internal Registry
  • CCO

Reference - https://issues.redhat.com/browse/RFE-2017

Enhancement proposed for Azure tags support in OCP, requires cluster-ingress-operator to add azure userTags available in the status sub resource of infrastructure CR, to the azure DNS resource created.

cluster-ingress-operator should add Tags to the DNS records created.

Note: dnsrecords.ingress.operator.openshift.io and openshift-ingress CRD, usage to be identified.

Acceptance Criteria

  • Code linting, validation and best practices adhered to
  • UTs and e2e are added/updated

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Cluster Infrastructure owned components should be running on Kubernetes 1.27
  • This includes
    • The cluster autoscaler (+operator)
    • Machine API operator
      • Machine API controllers for:
        • AWS
        • Azure
        • GCP
        • vSphere
        • OpenStack
        • IBM
        • Nutanix
    • Cloud Controller Manager Operator
      • Cloud controller managers for:
        • AWS
        • Azure
        • GCP
        • vSphere
        • OpenStack
        • IBM
        • Nutanix
    • Cluster Machine Approver
    • Cluster API Actuator Package
    • Control Plane Machine Set Operator

Why is this important?

  • ...

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

To align with the 4.14 release, dependencies need to be updated to 1.27. This should be done by rebasing/updating as appropriate for the repository

Overview 

HyperShift came to life to serve multiple goals, some are main near-term, some are secondary that serve well long-term. 

Main Goals for hosted control planes (HyperShift)

  • Optimize OpenShift for Cost/footprint/ which improves our competitive stance against the *KSes
  • Establish separation of concerns which makes it more resilient for SRE to manage their workload clusters (be it security, configuration management, etc).
  • Simplify and enhance multi-cluster management experience especially since multi-cluster is becoming an industry need nowadays. 

Secondary Goals

HyperShift opens up doors to penetrate the market. HyperShift enables true hybrid (CP and Workers decoupled, mixed IaaS, mixed Arch,...). An architecture that opens up more options to target new opportunities in the cloud space. For more details on this one check: Hosted Control Planes (aka HyperShift) Strategy [Live Document]

 

Hosted Control Planes (HyperShift) Map 

To bring hosted control planes to our customers, we need the means to ship it. Today MCE is how HyperShift shipped, and installed so that customers can use it. There are two main customers for hosted-control-planes: 

 

  • Self-managed: In that case, Red Hat would provide hosted control planes as a service that is managed and SREed by the customer for their tenants (hence “self”-managed). In this management model, our external customers are the direct consumers of the multi-cluster control plane as a servie. Once MCE is installed, they can start to self-service dedicated control planes. 

 

  • Managed: This is OpenShift as a managed service, today we only “manage” the CP, and share the responsibility for other system components, more info here. To reduce management costs incurred by service delivery organizations which translates to operating profit (by reducing variable costs per control-plane), as well as to improve user experience, lower platform overhead (allow customers to focus mostly on writing applications and not concern themselves with infrastructure artifacts), and improve the cluster provisioning experience. HyperShift is shipped via MCE, and delivered to Red Hat managed SREs (same consumption route). However, for managed services, additional tooling needs to be refactored to support the new provisioning path. Furthermore, unlike self-managed where customers are free to bring their own observability stack, Red Hat managed SREs need to observe the managed fleet to ensure compliance with SLOs/SLIs/…

 

If you have noticed, MCE is the delivery mechanism for both management models. The difference between managed and self-managed is the consumer persona. For self-managed, it's the customer SRE for managed its the RH SRE

High-level Requirements

For us to ship HyperShift in the product (as hosted control planes) in either management model, there is a necessary readiness checklist that we need to satisfy. Below are the high-level requirements needed before GA: 

 

  • Hosted control planes fits well with our multi-cluster story (with MCE)
  • Hosted control planes APIs are stable for consumption  
  • Customers are not paying for control planes/infra components.  
  • Hosted control planes has an HA and a DR story
  • Hosted control planes is in parity with top-level add-on operators 
  • Hosted control planes reports metrics on usage/adoption
  • Hosted control planes is observable  
  • HyperShift as a backend to managed services is fully unblocked.

 

Please also have a look at our What are we missing in Core HyperShift for GA Readiness? doc. 

Hosted control planes fits well with our multi-cluster story

Multi-cluster is becoming an industry need today not because this is where trend is going but because it’s the only viable path today to solve for many of our customer’s use-cases. Below is some reasoning why multi-cluster is a NEED:

 

 

As a result, multi-cluster management is a defining category in the market where Red Hat plays a key role. Today Red Hat solves for multi-cluster via RHACM and MCE. The goal is to simplify fleet management complexity by providing a single pane of glass to observe, secure, police, govern, configure a fleet. I.e., the operand is no longer one cluster but a set, a fleet of clusters. 

HyperShift logically centralized architecture, as well as native separation of concerns and superior cluster lifecyle management experience, makes it a great fit as the foundation of our multi-cluster management story. 

Thus the following stories are important for HyperShift: 

  • When lifecycling OpenShift clusters (for any OpenShift form factor) on any of the supported providers from MCE/ACM/OCM/CLI as a Cluster Service Consumer  (RH managed SRE, or self-manage SRE/admin):
  • I want to be able to use a consistent UI so I can manage and operate (observe, govern,...) a fleet of clusters.
  • I want to specify HA constraints (e.g., deploy my clusters in different regions) while ensuring acceptable QoS (e.g., latency boundaries) to ensure/reduce any potential downtime for my workloads. 
  • When operating OpenShift clusters (for any OpenShift form factor) on any of the supported provider from MCE/ACM/OCM/CLI as a Cluster Service Consumer  (RH managed SRE, or self-manage SRE/admin):
  • I want to be able to backup any critical data so I am able to restore them in case of hosting service cluster (management cluster) failure. 

Refs:

Hosted control planes APIs are stable for consumption.

 

HyperShift is the core engine that will be used to provide hosted control-planes for consumption in managed and self-managed. 

 

Main user story:  When life cycling clusters as a cluster service consumer via HyperShift core APIs, I want to use a stable/backward compatible API that is less susceptible to future changes so I can provide availability guarantees. 

 

Ref: What are we missing in Core HyperShift for GA Readiness?

Customers are not paying for control planes/infra components. 

 

Customers do not pay Red Hat more to run HyperShift control planes and supporting infrastructure than Standalone control planes and supporting infrastructure.

Assumptions

  • A customer will be able to associate a cluster as “Infrastructure only”
  • E.g. one option: management cluster has role=master, and role=infra nodes only, control planes are packed on role=infra nodes
  • OR the entire cluster is labeled infrastructure , and node roles are ignored.
  • Anything that runs on a master node by default in Standalone that is present in HyperShift MUST be hosted and not run on a customer worker node.

HyperShift - proposed cuts from data plane

HyperShift has an HA and a DR story

When operating OpenShift clusters (for any OpenShift form factor) from MCE/ACM/OCM/CLI as a Cluster Service Consumer  (RH managed SRE, or self-manage SRE/admin) I want to be able to migrate CPs from one hosting service cluster to another:

  • as means for disaster recovery in the case of total failure
  • so that scaling pressures on a management cluster can be mitigated or a management cluster can be decommissioned.

More information: 

 

Hosted control planes reports metrics on usage/adoption

To understand usage patterns and inform our decision making for the product. We need to be able to measure adoption and assess usage.

See Hosted Control Planes (aka HyperShift) Strategy [Live Document]

Hosted control plane is observable  

Whether it's managed or self-managed, it’s pertinent to report health metrics to be able to create meaningful Service Level Objectives (SLOs), alert of failure to meet our availability guarantees. This is especially important for our managed services path. 

HyperShift is in parity with top-level add-on operators

https://issues.redhat.com/browse/OCPPLAN-8901 

Unblock HyperShift as a backend to managed services

HyperShift for managed services is a strategic company goal as it improves usability, feature, and cost competitiveness against other managed solutions, and because managed services/consumption-based cloud services is where we see the market growing (customers are looking to delegate platform overhead). 

 

We should make sure our SD milestones are unblocked by the core team. 

 

Note 

This feature reflects HyperShift core readiness to be consumed. When all related EPICs and stories in this EPIC are complete HyperShift can be considered ready to be consumed in GA form. This does not describe a date but rather the readiness of core HyperShift to be consumed in GA form NOT the GA itself.

- GA date for self-managed will be factoring in other inputs such as adoption, customer interest/commitment, and other factors. 
- GA dates for ROSA-HyperShift are on track, tracked in milestones M1-7 (have a look at https://issues.redhat.com/browse/OCPPLAN-5771

Epic Goal*

The goal is to split client certificate trust chains from the global Hypershift root CA.

 
Why is this important? (mandatory)

This is important to:

  • assure a workload can be run on any kind of OCP flavor
  • reduce the blast radius in case of a sensitive material leak
  • separate trust to allow more granular control over client certificate authentication

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1. I would like to be able to run my workloads on any OpenShift-like platform.
    My workloads allow components to authenticate using client certificates based
    on a trust bundle that I am able to retrieve from the cluster.
  1. I don't want my users to have access to any CA bundle that would allow them
    to trust a random certificate from the cluster for client certificate authentication.

 
Dependencies (internal and external) (mandatory)

Hypershift team needs to provide us with code reviews and merge the changes we are to deliver

Contributing Teams(and contacts) (mandatory) 

  • Development - OpenShift Auth, Hypershift
  • Documentation -OpenShift Auth Docs team
  • QE - OpenShift Auth QE
  • PX - I have no idea what PX is
  • Others - others

Acceptance Criteria (optional)

The serviceaccount CA bundle automatically injected to all pods cannot be used to authenticate any client certificate generated by the control-plane.

Drawbacks or Risk (optional)

Risk: there is a throbbing time pressure as this should be delivered before first stable Hypershift release

Done - Checklist (mandatory)

  • CI Testing -  Basic e2e automationTests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

With the enablement of OpenShift clusters with mixed architecture capable compute nodes it is necessary to have support for manifest listed images so the correct images/binaries can be pulled onto the relevant nodes.

Included in this should be the ability to

  • use oc debug successfully on all node types
  • support manifest listed images in the internal registry
  • have the ability to import manifest listed images

Epic Goal

  • Complete manifest lists support on image streams. The work was initiated on epic IR-192

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

ACCEPTANCE CRITERIA

  • Pushing a manifest list to the image registry should result in an image stream created for the manifest list
  • Image objects should be created for the manifest list, as well as all of its sub-manifests
  • The Image object for the manifest list should contain references to all of its sub-manifests under the dockerImageManifests field
  • Pulling a sub-manifest of a manifest list by digest should work when the user has access to the image stream

 

Notes:

  • When a manifest is pushed by sha an Image object should be created
  • You can use `skopeo copy` with the `--all` flag to push a manifest list and all its sub-manifests to the registry
  • Authorization needs to work the same they do for images created via imagestreamimport

Feature Overview (aka. Goal Summary)  

Overarching Goal
Move to using the upstream Cluster API (CAPI) in place of the current implementation of the Machine API for standalone Openshift.
Phase 1 & 2 covers implementing base functionality for CAPI.
Phase 2 also covers migrating MAPI resources to CAPI.

 

Phase 2 Goal:  

  • Complete the design of the Cluster API (CAPI) architecture and build the core operator logic
  • attach and detach of load balancers for internal and external load balancers for control plane machines on AWS, Azure, GCP and other relevant platforms
  • manage the lifecycle of Cluster API components within OpenShift standalone clusters
  • E2E tests

for Phase-1, incorporating the assets from different repositories to simplify asset management.

 

Background, and strategic fit

  • Initially CAPI did not meet the requirements for cluster/machine management that OCP had the project has moved on, and CAPI is a better fit now and also has better community involvement.
  • CAPI has much better community interaction than MAPI.
  • Other projects are considering using CAPI and it would be cleaner to have one solution
  • Long term it will allow us to add new features more easily in one place vs. doing this in multiple places.

Acceptance Criteria

There must be no negative effect to customers/users of the MAPI, this API must continue to be accessible to them though how it is implemented "under the covers" and if that implementation leverages CAPI is open

Epic Goal

  • To create an operator to manage the lifecycle of Cluster API components within OpenShift standalone clusters

Why is this important?

  • We need to be able to install and lifecycle the Cluster API ecosystem within standalone OpenShift
  • We need to make sure that we can update the components via an operator
  • We need to make sure that we can lifecycle the APIs via an operator

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

In the cluster-capi-operator repository are present several CAPI E2E tests for specific providers.

We run these tests on every PR that lands on that repository.

In order to test rebases for the cluster-api providers we want to run these tests also there to prove rebase PRs are not breaking CAPI functionality.

DoD:

  • Set techpreview openshift E2E jobs for the cluster-api providers to make sure the build doesn't break the TP payload
  • Set techpreview CAPI E2E jobs for cluster-api providers repositories for providers where an corresponding e2e test is present in the cluster-capi-operator
    • for now only AWS, GCP and IMBCloud have E2Es
  • Add a target/script in the cluster-api providers repositories for running the E2E CAPI tests, where it applies.

Feature Overview (aka. Goal Summary)  

Currently, SCCs are part of the OpenShift API and are subject to modifications by customers. This leads to a constant stream of issues:

  • Modifications of out-of-the-box SCCs cause core workloads to malfunction
  • Addition of new higher priority SCCs may overrule existing pinned out-of-the-box SCCs during SCC admission and cause core workloads to malfunction

Goals (aka. expected user outcomes)

  • Create a way to prevent SCC preemption and modifications of out-of-the-box SCCs  
  •  

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

 

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

 

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

 

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

 

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

 

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

 

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  Initial completion during Refinement status.

 

Interoperability Considerations

Which other projects and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

Summary (PM+lead)

Currently, SCCs are part of the OpenShift API and are subject to modifications by customers. This leads to a constant stream of issues:

  • Modifications of out-of-the-box SCCs may cause core workloads to malfunction
  • Addition of new higher priority SCCs may overrule existing pinned out-of-the-box SCCs during SCC admission and cause core workloads to malfunction

We need to find and implement schemes to protect core workloads while retaining the API guarantee for modifications of SCCs (unfortunately).

Motivation (PM+lead)

Goals (lead)

Non-Goals (lead)

Deliverables

Proposal (lead)

User Stories (PM)

Dependencies (internal and external, lead)

Previous Work (lead)

Open questions (lead)

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Cluster Infrastructure owned CAPI components should be running on Kubernetes 1.27
  • target is 4.15 since CAPI is always a release behind upstream

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

To align with the 4.15 release, dependencies need to be updated to 1.27. This should be done by rebasing/updating as appropriate for the repository

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Cluster Infrastructure owned components should be running on Kubernetes 1.27
  • This includes
    • The cluster autoscaler (+operator)
    • Machine API operator
      • Machine API controllers for:
        • AWS
        • Azure
        • GCP
        • vSphere
        • OpenStack
        • IBM
        • Nutanix
    • Cloud Controller Manager Operator
      • Cloud controller managers for:
        • AWS
        • Azure
        • GCP
        • vSphere
        • OpenStack
        • IBM
        • Nutanix
    • Cluster Machine Approver
    • Cluster API Actuator Package
    • Control Plane Machine Set Operator

Why is this important?

  • ...

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

To align with the 4.15 release, dependencies need to be updated to 1.28. This should be done by rebasing/updating as appropriate for the repository

Feature Overview

Extend OpenShift on IBM Cloud integration with additional features to pair the capabilities offered for this provider integration to the ones available in other cloud platforms.

Goals

Extend the existing features while deploying OpenShift on IBM Cloud.

Background, and strategic fit

This top level feature is going to be used as a placeholder for the IBM team who is working on new features for this integration in an effort to keep in sync their existing internal backlog with the corresponding Features/Epics in Red Hat's Jira.

 

Epic Goal

  • Enable installation of disconnected clusters on IBM Cloud. This epic will track associated work.

Why is this important?

Scenarios

  1. Install a disconnected cluster on IBM Cloud.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

User Story:

A user currently is not able to create a Disconnected cluster, using IPI, on IBM Cloud. 
Currently, support for BYON and Private clusters does exist on IBM Cloud, but support to override IBM Cloud Service endpoints does not exist, which is required to allow for Disconnected support to function (reach IBM Cloud private endpoints).

Description:

IBM dependent components of OCP will need to add support to use a set of endpoint override values in order to reach IBM Cloud Services in Disconnected environments.

The Image Registry components will need to be able to allow all API calls to IBM Cloud Services, be directed to these endpoint values, in order to communicate in environments where the Public or default IBM Cloud Service endpoint is not available.

The endpoint overrides are available via the infrastructure/cluster (.status.platformStatus.ibmcloud.serviceEndpoints) resource, which is how a majority of components are consuming cluster specific configurations (Ingress, MAPI, etc.). It will be structured as such

apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2023-10-04T22:02:15Z"
  generation: 1
  name: cluster
  resourceVersion: "430"
  uid: b923c3de-81fc-4a0e-9fdb-8c4c337fba08
spec:
  cloudConfig:
    key: config
    name: cloud-provider-config
  platformSpec:
    type: IBMCloud
status:
  apiServerInternalURI: https://api-int.us-east-disconnect-21.ipi-cjschaef-dns.com:6443
  apiServerURL: https://api.us-east-disconnect-21.ipi-cjschaef-dns.com:6443
  controlPlaneTopology: HighlyAvailable
  cpuPartitioning: None
  etcdDiscoveryDomain: ""
  infrastructureName: us-east-disconnect-21-gtbwd
  infrastructureTopology: HighlyAvailable
  platform: IBMCloud
  platformStatus:
    ibmcloud:
      dnsInstanceCRN: 'crn:v1:bluemix:public:dns-svcs:global:a/fa4fd9fa0695c007d1fdcb69a982868c:f00ac00e-75c2-4774-a5da-44b2183e31f7::'
      location: us-east
      providerType: VPC
      resourceGroupName: us-east-disconnect-21-gtbwd
      serviceEndpoints:
      - name: iam
        url: https://private.us-east.iam.cloud.ibm.com
      - name: vpc
        url: https://us-east.private.iaas.cloud.ibm.com/v1
      - name: resourcecontroller
        url: https://private.us-east.resource-controller.cloud.ibm.com
      - name: resourcemanager
        url: https://private.us-east.resource-controller.cloud.ibm.com
      - name: cis
        url: https://api.private.cis.cloud.ibm.com
      - name: dnsservices
        url: https://api.private.dns-svcs.cloud.ibm.com/v1
      - name: cis
        url: https://s3.direct.us-east.cloud-object-storage.appdomain.cloud
    type: IBMCloud

The CCM is currently relying on updates to the openshift-cloud-controller-manager/cloud-conf configmap, in order to override its required IBM Cloud Service endpoints, such as:

data:
  config: |+
    [global]
    version = 1.1.0
    [kubernetes]
    config-file = ""
    [provider]
    accountID = ...
    clusterID = temp-disconnect-7m6rw
    cluster-default-provider = g2
    region = eu-de
    g2Credentials = /etc/vpc/ibmcloud_api_key
    g2ResourceGroupName = temp-disconnect-7m6rw
    g2VpcName = temp-disconnect-7m6rw-vpc
    g2workerServiceAccountID = ...
    g2VpcSubnetNames = temp-disconnect-7m6rw-subnet-compute-eu-de-1,temp-disconnect-7m6rw-subnet-compute-eu-de-2,temp-disconnect-7m6rw-subnet-compute-eu-de-3,temp-disconnect-7m6rw-subnet-control-plane-eu-de-1,temp-disconnect-7m6rw-subnet-control-plane-eu-de-2,temp-disconnect-7m6rw-subnet-control-plane-eu-de-3
    iamEndpointOverride = https://private.iam.cloud.ibm.com
    g2EndpointOverride = https://eu-de.private.iaas.cloud.ibm.com
    rmEndpointOverride = https://private.resource-controller.cloud.ibm.com

Acceptance Criteria:

Installer validates and injects user provided endpoint overrides into cluster deployment process and the Image Registry components use specified endpoints and start up properly.

User Story:

A user currently is not able to create a Disconnected cluster, using IPI, on IBM Cloud. 
Currently, support for BYON and Private clusters does exist on IBM Cloud, but support to override IBM Cloud Service endpoints does not exist, which is required to allow for Disconnected support to function (reach IBM Cloud private endpoints).

Description:

IBM dependent components of OCP will need to add support to use a set of endpoint override values in order to reach IBM Cloud Services in Disconnected environments.

The Ingress Operator components will need to be able to allow all API calls to IBM Cloud Services, be directed to these endpoint values, in order to communicate in environments where the Public or default IBM Cloud Service endpoint is not available.

The endpoint overrides are available via the infrastructure/cluster (.status.platformStatus.ibmcloud.serviceEndpoints) resource, which is how a majority of components are consuming cluster specific configurations (Ingress, MAPI, etc.). It will be structured as such

apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2023-10-04T22:02:15Z"
  generation: 1
  name: cluster
  resourceVersion: "430"
  uid: b923c3de-81fc-4a0e-9fdb-8c4c337fba08
spec:
  cloudConfig:
    key: config
    name: cloud-provider-config
  platformSpec:
    type: IBMCloud
status:
  apiServerInternalURI: https://api-int.us-east-disconnect-21.ipi-cjschaef-dns.com:6443
  apiServerURL: https://api.us-east-disconnect-21.ipi-cjschaef-dns.com:6443
  controlPlaneTopology: HighlyAvailable
  cpuPartitioning: None
  etcdDiscoveryDomain: ""
  infrastructureName: us-east-disconnect-21-gtbwd
  infrastructureTopology: HighlyAvailable
  platform: IBMCloud
  platformStatus:
    ibmcloud:
      dnsInstanceCRN: 'crn:v1:bluemix:public:dns-svcs:global:a/fa4fd9fa0695c007d1fdcb69a982868c:f00ac00e-75c2-4774-a5da-44b2183e31f7::'
      location: us-east
      providerType: VPC
      resourceGroupName: us-east-disconnect-21-gtbwd
      serviceEndpoints:
      - name: iam
        url: https://private.us-east.iam.cloud.ibm.com
      - name: vpc
        url: https://us-east.private.iaas.cloud.ibm.com/v1
      - name: resourcecontroller
        url: https://private.us-east.resource-controller.cloud.ibm.com
      - name: resourcemanager
        url: https://private.us-east.resource-controller.cloud.ibm.com
      - name: cis
        url: https://api.private.cis.cloud.ibm.com
      - name: dnsservices
        url: https://api.private.dns-svcs.cloud.ibm.com/v1
      - name: cis
        url: https://s3.direct.us-east.cloud-object-storage.appdomain.cloud
    type: IBMCloud

The CCM is currently relying on updates to the openshift-cloud-controller-manager/cloud-conf configmap, in order to override its required IBM Cloud Service endpoints, such as:

data:
  config: |+
    [global]
    version = 1.1.0
    [kubernetes]
    config-file = ""
    [provider]
    accountID = ...
    clusterID = temp-disconnect-7m6rw
    cluster-default-provider = g2
    region = eu-de
    g2Credentials = /etc/vpc/ibmcloud_api_key
    g2ResourceGroupName = temp-disconnect-7m6rw
    g2VpcName = temp-disconnect-7m6rw-vpc
    g2workerServiceAccountID = ...
    g2VpcSubnetNames = temp-disconnect-7m6rw-subnet-compute-eu-de-1,temp-disconnect-7m6rw-subnet-compute-eu-de-2,temp-disconnect-7m6rw-subnet-compute-eu-de-3,temp-disconnect-7m6rw-subnet-control-plane-eu-de-1,temp-disconnect-7m6rw-subnet-control-plane-eu-de-2,temp-disconnect-7m6rw-subnet-control-plane-eu-de-3
    iamEndpointOverride = https://private.iam.cloud.ibm.com
    g2EndpointOverride = https://eu-de.private.iaas.cloud.ibm.com
    rmEndpointOverride = https://private.resource-controller.cloud.ibm.com

Acceptance Criteria:

Installer validates and injects user provided endpoint overrides into cluster deployment process and the Ingress Operator components use specified endpoints and start up properly.

Feature Overview (aka. Goal Summary)  

As an openshift admin ,who wants to make my OCP more secure and stable . I want to prevent anyone to schedule their workload in master node so that master node only run OCP management related workload  .

 

Goals (aka. expected user outcomes)

secure OCP master node by preventing scheduling of customer workload in master node

 

 

 

 

 

Anyone applying toleration(s) in a pod spec can unintentionally tolerate master taints which protect master nodes from receiving application workload when master nodes are configured to repel application workload. An admission plugin needs to be configured to protect master nodes from this scenario. Besides the taint/toleration, users can also set spec.NodeName directly, which this plugin should also consider protecting master nodes against.

Feature Overview (aka. Goal Summary)  

The MCO should properly report its state in a way that's consistent and able to be understood by customers, troubleshooters, and maintainers alike. 

Some customer cases have revealed scenarios where the MCO state reporting is misleading and therefore could be unreliable to base decisions and automation on.

In addition to correcting some incorrect states, the MCO will be enhanced for a more granular view of update rollouts across machines.

The MCO should properly report its state in a way that's consistent and able to be understood by customers, troubleshooters, and maintainers alike. 

For this epic, "state" means "what is the MCO doing?" – so the goal here is to try to make sure that it's always known what the MCO is doing. 

This includes: 

  • Conditions
  • Some Logging 
  • Possibly Some Events 

While this probably crosses a little bit into the "status" portion of certain MCO objects, as some state is definitely recorded there, this probably shouldn't turn into a "better status reporting" epic.  I'm interpreting "status" to mean "how is it going" so status is maybe a "detail attached to a state". 

 

Exploration here: https://docs.google.com/document/d/1j6Qea98aVP12kzmPbR_3Y-3-meJQBf0_K6HxZOkzbNk/edit?usp=sharing

 

https://docs.google.com/document/d/17qYml7CETIaDmcEO-6OGQGNO0d7HtfyU7W4OMA6kTeM/edit?usp=sharing

 

The current property description is:

configuration represents the current MachineConfig object for the machine config pool.

But in a 4.12.0-ec.4 cluster, the actual semantics seem to be something closer to "the most recent rendered config that we completely leveled on". We should at least update the godocs to be more specific about the intended semantics. And perhaps consider adjusting the semantics?

Customer has escalated the following issues where ports don't have TLS support. This Feature request lists all the components port raised by the customer.

Details here https://docs.google.com/document/d/1zB9vUGB83xlQnoM-ToLUEBtEGszQrC7u-hmhCnrhuXM/edit

https://access.redhat.com/solutions/5437491

Currently, we are serving the metrics as http on 9191, and via TLS on 9192.

We need to make sure the metrics are only available on 9192 via TLS.

 

Related to https://issues.redhat.com/browse/RFE-4665

Background

CMA currently exposes metrics on two ports via the 0.0.0.0 all hosts binding. We need to make sure that only the TLS port is accessible from outside localhost.

Steps

  • Move the binding for the local metrics server to localhost only
  • Ensure kube-rbac-proxy is still proxying the requests over TLS

Stakeholders

  • Cluster Infra
  • Subin M

Definition of Done

  • Metrics from CMA are only exposed over TLS
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Cluster Infrastructure owned CAPI components should be running on Kubernetes 1.28
  • target is 4.16 since CAPI is always a release behind upstream

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

To align with the 4.16 release, dependencies need to be updated to 1.28. This should be done by rebasing/updating as appropriate for the repository

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Cluster Infrastructure owned components should be running on Kubernetes 1.29
  • This includes
    • The cluster autoscaler (+operator)
    • Machine API operator
      • Machine API controllers for:
        • AWS
        • Azure
        • GCP
        • vSphere
        • OpenStack
        • IBM
        • Nutanix
    • Cloud Controller Manager Operator
      • Cloud controller managers for:
        • AWS
        • Azure
        • GCP
        • vSphere
        • OpenStack
        • IBM
        • Nutanix
    • Cluster Machine Approver
    • Cluster API Actuator Package
    • Control Plane Machine Set Operator

Why is this important?

  • ...

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository

Goal:

As an administrator, I would like to use my own managed DNS solution instead of only specific openshift-install supported DNS services (such as AWS Route53, Google Cloud DNS, etc...) for my OpenShift deployment.

 

Problem:

While cloud-based DNS services provide convenient hostname management, there's a number of regulatory (ITAR) and operational constraints customers face prohibiting the use of those DNS hosting services on public cloud providers.

 

Why is this important:

  • Provides customers with the flexibility to leverage their own custom managed ingress DNS solutions already in use within their organizations.
  • Required for regions like AWS GovCloud in which many customers may not be able to use the Route53 service (only for commercial customers) for both internal or ingress DNS.
  • OpenShift managed internal DNS solution ensures cluster operation and nothing breaks during updates.

 

Dependencies (internal and external):

 

Prioritized epics + deliverables (in scope / not in scope):

  • Ability to bootstrap cluster without an OpenShift managed internal DNS service running yet
  • Scalable, cluster (internal) DNS solution that's not dependent on the operation of the control plane (in case it goes down)
  • Ability to automatically propagate DNS record updates to all nodes running the DNS service within the cluster
  • Option for connecting cluster to customers ingress DNS solution already in place within their organization

 

Estimate (XS, S, M, L, XL, XXL):

 

Previous Work:

 

Open questions:

 

Link to Epic: https://docs.google.com/document/d/1OBrfC4x81PHhpPrC5SEjixzg4eBnnxCZDr-5h3yF2QI/edit?usp=sharing

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • At this point in the feature, we would have a working in-cluster CoreDNS pod capable of resolving API and API-Int URLs.

This Epic details that work required to augment this CoreDNS pod to also resolve the *.apps URL. In addition, it will include changes to prevent Ingress Operator from configuring the cloud DNS after the ingress LBs have been created.

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a (user persona), I want to be able to:

  • Capability 1
  • Capability 2
  • Capability 3

so that I can achieve

  • Outcome 1
  • Outcome 2
  • Outcome 3

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

  • https://github.com/openshift/api/pull/1685 introduced updates that allows the  LB IPs to be added to GCPPlatformStatus along with the state of DNS for the cluster.
  • Update cluster-ingress-operator to add the Ingress LB IPs when DNSType is `ClusterHosted`
  • In this state, Within https://github.com/openshift/api/blob/master/operatoringress/v1/types.go set the DNSManagementPolicy to Unmanaged within the DNSRecordSpec when the DNS manifest has customer Managed DNS enabled. 
  • With the DNSManagementPolicy set to Unmanaged, the IngressController should not try to configure DNS records.

This requires/does not require a design proposal.
This requires/does not require a feature gate.

 

Feature Overview (aka. Goal Summary)

Support ARO and customers aspiring to follow Microsoft Azure security recommendations by allowing the Azure storage account hosting the object storage bucket for the integrated registry to be configured as "private" vs. the default public.

Goals (aka. expected user outcomes)

OpenShift installations on Azure can be configured so that they don't trigger the Azure's Security Advisor anymore with regard to the use of a public endpoint for the Azure storage account created by the integrated registry operator.

Requirements (aka. Acceptance Criteria):

  • Manual private storage account configuration via the Integrated Registry operator's CR
  • Automatic discovery of vnet and subnet resources requiring the end user to only set a single flag of the Integrated Registry operator CR to "internal"

Background

Several users noticed warnings in Azure Security advisor reporting the potentially dangerous exposure of the storage endpoint used by the integrated registry configured by its operator. There is no real security threat here because despite the endpoint being public, access to it is strictly locked down to a single set of credentials used by the internal registry only.

Still customers need to be able to deploy cluster that out of the box do not violate Microsoft security recommendations.

This feature sets the foundation for OCPSTRAT-997 to be delivered.

Customer Considerations

Customers updating to the version of OpenShift that delivers this feature shall not have their integrated registry configuration updated automatically.

Documentation Considerations

We require documentation in the section for the integrated registry operator that describes how to manually configure the vnet and subnet that shall be used for the private storage endpoint in case the customer wants to leverage an network resource group account different from the cluster.

We also require documentation that describes the single tunable for the integrated registry operator that is required to be set to "internal" to automate the detection of existing vnet and subnets in the network resource group of the cluster as opposed to manual specification of a user-defined vnet/subnet pair.

Epic Goal

  • Enable customers to create an OpenShift IPI deployment on Azure that does not violate Microsoft Azure Security recommendations

Why is this important?

  • In the current implementation the OpenShift internal registry operator creates a storage account for the backing image blob store of the registry that has a public endpoint
  • the public endpoint is considered insecure by Azure's Security advisor and prevents customers from completing security-relevant certifications
  • while manual remediation is possible it is not acceptable for customers who want to leverage OpenShift IPI capabilities to deploy many clusters as automated and with as little post-install infrastructure customisation as possible

Scenarios

  1. The openshift-installer allows the user to specify VNet and Subnet names to be used for a private endpoint of the storage account, this setting requires the user to specify that they want to create private storage account instead of a public one
  2. If the user configured to creation of a private storage account, the integrated registry operators uses the user-specified VNet and Subnet names as input and creates a private storage account in Azure using a private endpoint with the supplied VNet and Subnet
  3. The current implementation of creating a storage account with a public endpoint remains the default if nothing else is configured

Acceptance Criteria

  • A user can enable the creation of a private storage account in Azure for the internal registry via the openshift installer
  • A user must supply a VNet and Subnet name if they opted into private storage accounts
  • The internal registry creates an Azure storage account with a private endpoint if the user opted into private storage accounts using the supplied VNet and Subnet names
  • Private storage accounts are not the default
  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

  1. Installer allows to specify VNet and Subnet names for a private storage account on Azure

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Story: As an image registry developer, I want to be able to programmatically create a private endpoint in Azure without having to worry about explicitly creating the supporting objects, so that I can easily enable support for private storage accounts in Azure.

ACCEPTANCE CRITERIA

When the registry is in private mode:

  • Private endpoint is created in Azure
  • Private DNS zone is created in Azure
  • Record Set (type A) is created in Azure
  • Private DNS Zone Group is created in Azure
  • Virtual Network Link is created between Private Endpoint and VNet
  • Configuring the registry to use a private endpoint, then setting it to "Removed" should delete the private endpoint from Azure.

DOCUMENTATION

  • No documentation is needed at this point

 

Story: As a user, I want to be able to configure the registry operator to use Azure Private Endpoints so that I can deploy the registry on Azure without a public facing endpoint.

ACCEPTANCE CRITERIA

  • There is an option to configure the registry operator to deploy the registry privately on Azure (this should not be available to other cloud providers)
  • When configuring the operator to deploy the registry privately, the user is also required to provide names for the cluster's VNet and Subnet for the operator to configure the private endpoint in
  • Configuring the operator to make the registry private also disables public access network in the storage account
  • Setting the registry back to public deletes the private endpoint and enables public access again
  • The operand's conditions reflects any errors that might happen during this procedure
  • When the registry is configured with private endpoints, pulling images from the registry outside of OCP will only work by first setting "disableRedirect: true" (assuming a route is configured)

DOCUMENTATION

  • Update post-installation docs for private clusters on Azure
    • Placement of this docs needs further investigation, as the post-install for private clusters does not seem cloud provider specific and we need this one to be just for Azure.
  • Installer documentation for private clusters on Azure should not be updated (there is no supported way to enable this feature through installer-config at this point)
  • The procedure to configure the registry to private should also mention that pulling images from the registry using the default route (provided by setting `defaultRoute: true`) will no longer work UNLESS customers set `disableRedirect: true` in the operator configuration.

Feature Overview

Telecommunications providers continue to deploy OpenShift at the Far Edge. The acceleration of this adoption and the nature of existing Telecommunication infrastructure and processes drive the need to improve OpenShift provisioning speed at the Far Edge site and the simplicity of preparation and deployment of Far Edge clusters, at scale.

Goals

  • Simplicity The folks preparing and installing OpenShift clusters (typically SNO) at the Far Edge range in technical expertise from technician to barista. The preparation and installation phases need to be reduced to a human-readable script that can be utilized by a variety of non-technical operators. There should be as few steps as possible in both the preparation and installation phases.
  • Minimize Deployment Time A telecommunications provider technician or brick-and-mortar employee who is installing an OpenShift cluster, at the Far Edge site, needs to be able to do it quickly. The technician has to wait for the node to become in-service (CaaS and CNF provisioned and running) before they can move on to installing another cluster at a different site. The brick-and-mortar employee has other job functions to fulfill and can't stare at the server for 2 hours. The install time at the far edge site should be in the order of minutes, ideally less than 20m.
  • Utilize Telco Facilities Telecommunication providers have existing Service Depots where they currently prepare SW/HW prior to shipping servers to Far Edge sites. They have asked RH to provide a simple method to pre-install OCP onto servers in these facilities. They want to do parallelized batch installation to a set of servers so that they can put these servers into a pool from which any server can be shipped to any site. They also would like to validate and update servers in these pre-installed server pools, as needed.
  • Validation before Shipment Telecommunications Providers incur a large cost if forced to manage software failures at the Far Edge due to the scale and physical disparate nature of the use case. They want to be able to validate the OCP and CNF software before taking the server to the Far Edge site as a last minute sanity check before shipping the platform to the Far Edge site.
  • IPSec Support at Cluster Boot Some far edge deployments occur on an insecure network and for that reason access to the host’s BMC is not allowed, additionally an IPSec tunnel must be established before any traffic leaves the cluster once its at the Far Edge site. It is not possible to enable IPSec on the BMC NIC and therefore even OpenShift has booted the BMC is still not accessible.

Requirements

  • Factory Depot: Install OCP with minimal steps
    • Telecommunications Providers don't want an installation experience, just pick a version and hit enter to install
    • Configuration w/ DU Profile (PTP, SR-IOV, see telco engineering for details) as well as customer-specific addons (Ignition Overrides, MachineConfig, and other operators: ODF, FEC SR-IOV, for example)
    • The installation cannot increase in-service OCP compute budget (don't install anything other that what is needed for DU)
    • Provide ability to validate previously installed OCP nodes
    • Provide ability to update previously installed OCP nodes
    • 100 parallel installations at Service Depot
  • Far Edge: Deploy OCP with minimal steps
    • Provide site specific information via usb/file mount or simple interface
    • Minimize time spent at far edge site by technician/barista/installer
    • Register with desired RHACM Hub cluster for ongoing LCM
  • Minimal ongoing maintenance of solution
    • Some, but not all telco operators, do not want to install and maintain an OCP / ACM cluster at Service Depot
  • The current IPSec solution requires a libreswan container to run on the host so that all N/S OCP traffic is encrypted. With the current IPSec solution this feature would need to support provisioning host-based containers.

 

A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts.  If a non MVP requirement slips, it does not shift the feature.

requirement Notes isMvp?
     
     
     

 

Describe Use Cases (if needed)

Telecommunications Service Provider Technicians will be rolling out OCP w/ a vDU configuration to new Far Edge sites, at scale. They will be working from a service depot where they will pre-install/pre-image a set of Far Edge servers to be deployed at a later date. When ready for deployment, a technician will take one of these generic-OCP servers to a Far Edge site, enter the site specific information, wait for confirmation that the vDU is in-service/online, and then move on to deploy another server to a different Far Edge site.

 

Retail employees in brick-and-mortar stores will install SNO servers and it needs to be as simple as possible. The servers will likely be shipped to the retail store, cabled and powered by a retail employee and the site-specific information needs to be provided to the system in the simplest way possible, ideally without any action from the retail employee.

 

Out of Scope

Q: how challenging will it be to support multi-node clusters with this feature?

Background, and strategic fit

< What does the person writing code, testing, documenting need to know? >

Assumptions

< Are there assumptions being made regarding prerequisites and dependencies?>

< Are there assumptions about hardware, software or people resources?>

Customer Considerations

< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>

< Are there Upgrade considerations that customers need to account for or that the feature should address on behalf of the customer?>

<Does the Feature introduce data that could be gathered and used for Insights purposes?>

Documentation Considerations

< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >

< What does success look like?>

< Does this feature have doc impact?  Possible values are: New Content, Updates to existing content,  Release Note, or No Doc Impact>

< If unsure and no Technical Writer is available, please contact Content Strategy. If yes, complete the following.>

  • <What concepts do customers need to understand to be successful in [action]?>
  • <How do we expect customers will use the feature? For what purpose(s)?>
  • <What reference material might a customer want/need to complete [action]?>
  • <Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available. >
  • <What is the doc impact (New Content, Updates to existing content, or Release Note)?>

Interoperability Considerations

< Which other products and versions in our portfolio does this feature impact?>

< What interoperability test scenarios should be factored by the layered product(s)?>

Questions

Question Outcome
   

 

 

Epic Goal

  • Install SNO within 10 minutes

Why is this important?

  • SNO installation takes around 40+ minutes.
  • This makes SNO less appealing when compared to k3s/microshift.
  • We should analyze the  SNO installation, figure our why it takes so long and come up with ways to optimize it

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

  1. https://docs.google.com/document/d/1ULmKBzfT7MibbTS6Sy3cNtjqDX1o7Q0Rek3tAe1LSGA/edit?usp=sharing

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description of problem:

As part of the efforts of improving the installation time of single node openshift, we've noticed the monitoring operator takes a long* time to finish installation.

It's hard for me to tell what exactly the monitoring operator is waiting for, but it becoming happy (as far as clusteroperator conditions are concerned) always seems to coincide with the operator finally realizing and reconciling** the 2 additional certificates inside the extension-apiserver-authentication that are being added by the apiserver operator. 

Usually this "realization" happens minutes after the two certs are being added, and ideally we'd like to cut back on that time, because sometimes those minutes lead to the monitoring operator being the last to roll out.

*Long time on the order of just a few minutes, which are not a lot but they add up. This ticket is one in a series of ticket we're opening for many other components

**The "marker" I use to know when this happened is when the monitoring operator, among other things, replaces the old prometheus-adapter-<hash_x> secret containing just the original certs of extension-apiserver-authentication with a new prometheus-adapter-<hash_y> which also contains the 2 new certs

Version-Release number of selected component (if applicable):

nightly 4.13 OCP

How reproducible:

100%

Steps to Reproduce:

1. Install single-node-openshift

Actual results:

Monitoring operator long delay reconciling extension-apiserver-authentication

Expected results:

Monitoring operator immediate reconciliation of extension-apiserver-authentication

Additional info:

Originally I suspected this might be due to api server downtime (which is a property of SNO), but this issue doesn't seem to correlate with API downtime

Feature Overview

Reduce the OpenShift platform and associated RH provided components to a single physical core on Intel Sapphire Rapids platform for vDU deployments on SingleNode OpenShift.

Goals

  • Reduce CaaS platform compute needs so that it can fit within a single physical core with Hyperthreading enabled. (i.e. 2 CPUs)
  • Ensure existing DU Profile components fit within reduced compute budget.
  • Ensure existing ZTP, TALM, Observability and ACM functionality is not affected.
  • Ensure largest partner vDU can run on Single Core OCP.

Requirements

Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES
 
Provide a mechanism to tune the platform to use only one physical core. 
Users need to be able to tune different platforms.  YES 
Allow for full zero touch provisioning of a node with the minimal core budget configuration.   Node provisioned with SNO Far Edge provisioning method - i.e. ZTP via RHACM, using DU Profile. YES 
Platform meets all MVP KPIs   YES

(Optional) Use Cases

  • Main success scenario: A telecommunications provider uses ZTP to provision a vDU workload on Single Node OpenShift instance running on an Intel Sapphire Rapids platform. The SNO is managed by an ACM instance and it's lifecycle is managed by TALM.

Questions to answer...

  • N/A

Out of Scope

  • Core budget reduction on the Remote Worker Node deployment model.

Background, and strategic fit

Assumptions

  • The more compute power available for RAN workloads directly translates to the volume of cell coverage that a Far Edge node can support.
  • Telecommunications providers want to maximize the cell coverage on Far Edge nodes.
  • To provide as much compute power as possible the OpenShift platform must use as little compute power as possible.
  • As newer generations of servers are deployed at the Far Edge and the core count increases, no additional cores will be given to the platform for basic operation, all resources will be given to the workloads.

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
    • Administrators must know how to tune their Far Edge nodes to make them as computationally efficient as possible.
  • Does this feature have doc impact?
    • Possibly, there should be documentation describing how to tune the Far Edge node such that the platform uses as little compute power as possible.
  • New Content, Updates to existing content, Release Note, or No Doc Impact
    • Probably updates to existing content
  • If unsure and no Technical Writer is available, please contact Content Strategy. What concepts do customers need to understand to be successful in [action]?
    • Performance Addon Operator, tuned, MCO, Performance Profile Creator
  • How do we expect customers will use the feature? For what purpose(s)?
    • Customers will use the Performance Profile Creator to tune their Far Edge nodes. They will use RHACM (ZTP) to provision a Far Edge Single-Node OpenShift deployment with the appropriate Performance Profile.
  • What reference material might a customer want/need to complete [action]?
    • Performance Addon Operator, Performance Profile Creator
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
    • N/A
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?
    • Likely updates to existing content / unsure
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Description of problem:

After running tests on an SNO with Telco DU profile for a couple of hours kubernetes.io/kubelet-serving CSRs in Pending state start showing up and accumulating in time.

Version-Release number of selected component (if applicable):

4.14.0-ec.3

How reproducible:

So far on 2 different environments

Steps to Reproduce:

1. Deploy SNO with Telco DU profile
2. Run system tests
3. Check CSRs status

Actual results:

oc get csr | grep Pending | wc -l
34

Expected results:

No Pending CSRs

Additional info:

This issue blocks retrieving pod logs.

Attaching must-gather and sosreport after manually approving CSRs.
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Proposed title of this feature request

node-exporter collector customizations

What is the nature and description of the request?

We've had several requests and efforts in the past to change, add or drop certain collectors in our node-exporter statefulset. This feature requests intends to subsume the individual requests and channel these efforts into a cohesive feature that will serve the original requests and ideally address future needs.

We are going to add a section for Node Exporter into the CMO config. Each customizable collector has its dedicated subsection containing its own parameters under the NodeExporter section. The config map looks like the example below.

apiVersion: v1
kind: ConfigMap
metadata: 
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data: 

  config.yaml: |

    nodeExporter: 
      collectors: 
        # disable a collector which is enabled by default
        netclass: 
          enabled: false
        # enable a collector which isn't enabled by default but supported by CMO.
        hwmon: 
          enabled: true
        # tweak the configuration of a collector
        netdev: 
          enabled: true
          ignoredDevices: 
            - br-.+
            - veth.*
          fooBar: xxx # unrecognized parameter will be ignored.
        # trying to disable unsupported collectors or turning off mandatory collectors will be ineffective too
        # as the keys wouldn't exist in the node exporter config struct.
        drbd: {enabled: true} # unsupported collectors will be ignored and logged as a warning
        cpu: {enabled: false} # necessary collectors will not turn off and this invalid config is logged as a warning.

No web UI changes are required, all settings are saved in CMO configmap.

Users should be informed of consequences of deactivating certain collectors, such as missing metrics for alerts and dashboards.

Why does the customer need this? (List the business requirements)

See linked issues for customer requests. Ultimately this will serve to add flexibility to our stack and serve customer needs better.

List any affected packages or components.

node-exporter, CMO config

Node Exporter has a new implementation of netclass collector. The new implementation uses netlink instead of sysfs to collect network device metrics, providing improved performance.
In order to activate it, we add a boolean parameter `netlinkImpl` for netclass collector of Node Exporter in CMO config. The default value is true, activating the netlink implementation to improve performance.
Here is an example of this new CMO config:

apiVersion: v1
kind: ConfigMap
metadata: 
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data: 

  config.yaml: |

    nodeExporter: 
      collectors: 
        netclass: 
          enabled: true
          # enable netlink implementation of netclass collector.
          useNetlink: true
  

This config will add the argument `--collector.netclass.netlink` to the node exporter argument list.

We will add a section for "netdev" Collector in "nodeExporter.collectors" section in CMO configmap. 

It has a boolean field "enabled", the default value is true.

apiVersion: v1
kind: ConfigMap
metadata: 
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data: 

  config.yaml: |

    nodeExporter: 
      collectors: 
        # disable a collector which is enabled by default
        netdev: 
          enabled: false

Before implementing this feature, check the metrics from this collector is not necessary for alerts and dashboards.

We will add a section for "netclass" Collector in "nodeExporter.collectors" section in CMO configmap. 

It has a boolean field "enabled", the default value is true.

apiVersion: v1
kind: ConfigMap
metadata: 
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data: 

  config.yaml: |

    nodeExporter: 
      collectors: 
        # disable a collector which is enabled by default
        netclass: 
          enabled: false

Before implementing this feature, check the metrics from this collector is not necessary for alerts and dashboards.

We will add a section for "buddyinfo" Collector in "nodeExporter.collectors" section in CMO configmap. 

It has a boolean field "enabled", the default value is false.

apiVersion: v1
kind: ConfigMap
metadata: 
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data: 

  config.yaml: |

    nodeExporter: 
      collectors: 
        # enable a collector which is disabled by default
        buddyinfo: 
          enabled: true

refer to: https://issues.redhat.com/browse/OBSDA-44

 

Complete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled

Epic Goal

  • Update OpenShift components that are owned by the Builds + Jenkins Team to use Kubernetes 1.25

Why is this important?

  • Our components need to be updated to ensure that they are using the latest bug/CVE fixes, features, and that they are API compatible with other OpenShift components.

Acceptance Criteria

  • Existing CI/CD tests must be passing

This is epic tracks "business as usual" requirements / enhancements / bug fixing of Insights Operator.

Today the links point at a rule-scoped page, but that page lacks information about recommended resolution.  You can click through by cluster ID to your specific cluster and get that recommendation advice, but it would be more convenient and less confusing for customers if we linked directly to the cluster-scoped recommendation page.

We can implement by updating the template here to be:

fmt.Sprintf("https://console.redhat.com/openshift/insights/advisor/clusters/%s?first=%s%%7C%s", clusterID, ruleIDStr, rec.ErrorKey)

or something like that.

 

unknowns

request is clear, solution/implementation to be further clarified

This epic contains all the Dynamic Plugins related stories for OCP release-4.11 

Epic Goal

  • Track all the stories under a single epic

Acceptance Criteria

  •  

This story only covers API components. We will create a separate story for other utility functions.

Today we are generating documentation for Console's Dynamic Plugin SDK in
frontend/packages/dynamic-plugin-sdk. We are missing ts-doc for a set of hooks and components.

We are generating the markdown from the dynamic-plugin-sdk using

yarn generate-doc

Here is the list of the API that the dynamic-plugin-sdk is exposing:

https://gist.github.com/spadgett/0ddefd7ab575940334429200f4f7219a

Acceptance Criteria:

  • Add missing jsdocs for the API that dynamic-plugin-sdk exposes

Out of Scope:

  • This does not include work for integrating the API docs into the OpenShift docs
  • This does not cover other public utilities, only components.

This epic contains all the Dynamic Plugins related stories for OCP release-4.12

Epic Goal

  • Track all the stories under a single epic

Acceptance Criteria

Following https://coreos.slack.com/archives/C011BL0FEKZ/p1650640804532309, it would be useful for us (network observability team) to have access to ResourceIcon in dynamic-plugin-sdk.

Currently ResourceLink is exported but not ResourceIcon

 

AC:

  • Require the ResourceIcon  from public to dynamic-plugin-sdk
  • Add the component to the dynamic-demo-plugin
  • Add a CI test to check for the ResourceIcon component

 

Move `frontend/public/components/nav` to `packages/console-app/src/components/nav` and address any issues resulting from the move.

There will be some expected lint errors relating to cyclical imports. These will require some refactoring to address.

We should have a global notification or the `Console plugins` page (e.g., k8s/cluster/operator.openshift.io~v1~Console/cluster/console-plugins) should alert users when console operator `spec.managementState` is `Unmanaged` as changes to `enabled` for plugins will have no effect.