Back to index

4.14.0-0.okd-2024-01-06-084517

Jump to: Complete Features | Incomplete Features | Complete Epics | Incomplete Epics | Other Complete | Other Incomplete |

Changes from 4.13.0-0.okd-scos-2024-04-09-152021

Note: this page shows the Feature-Based Change Log for a release

Complete Features

These features were completed when this image was assembled

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Based on the perf and scale team's results, enabling multiqueue when jumbo frames are in use (MTU >= 9000) can greatly improve throughput, as seen by comparing slides 8 and 10 in this slide deck: https://docs.google.com/presentation/d/1cIm4EcAswVDpuDp-eHVmbB7VodZqQzTYCnx4HCfI9n4/edit#slide=id.g2563dda6aa5_1_68
However, enabling multiqueue with a small MTU causes throughput to crater.

This task involves adding an API option to the KubeVirt platform within the NodePool API, as well as adding a CLI option for enabling multiqueue in the hcp CLI (the new productized CLI).
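
A minimal sketch of what this could look like on the NodePool, assuming a multiqueue field under the KubeVirt platform section (the field name shown here is an assumption; the final name is defined by the NodePool API change):

apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
metadata:
  name: example-nodepool
spec:
  platform:
    type: KubeVirt
    kubevirt:
      # assumed field name, for illustration only
      networkInterfaceMultiqueue: Enable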

The HyperShift KubeVirt platform only supports guest clusters running 4.14 or greater (because the KubeVirt RHCOS image is only delivered in 4.14), and it also only supports OCP 4.14 and CNV 4.14 for the infra cluster.

 

Add backend validation on the HostedCluster that validates the parameters are correct before processing the hosted cluster. If these conditions are not met, report the error back as a condition on the HostedCluster CR.
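
For illustration, such a validation failure could surface as a standard Kubernetes condition in the HostedCluster status, roughly like the sketch below (the condition type, reason, and message are assumptions):

status:
  conditions:
  - type: ValidConfiguration
    status: "False"
    reason: InvalidKubeVirtNodePoolOptions
    message: multiqueue should only be enabled when the NodePool MTU is >= 9000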

Goal

Improve the kubevirt-csi storage plugin features and integration as we make progress towards the GA of a KubeVirt provider for HyperShift.

User Stories

  • "As a hypershift user,
    I want infra cluster StorageClasses made available to guest clusters,
    so that guest clusters can have persistent storage available."

Infra storage classes made available to guest clusters must support:

  • RWX AccessMode
  • Filesystem and Block VolumeModes
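
To illustrate the requirement, a guest-cluster PVC backed by one of these infra storage classes must be able to request RWX access and Block volume mode, for example (the StorageClass name is a placeholder):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-pvc
spec:
  storageClassName: kubevirt-csi-infra-default   # placeholder name
  accessModes:
  - ReadWriteMany
  volumeMode: Block
  resources:
    requests:
      storage: 10Gi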

Non-Requirements

  • VolumeSnapshots
  • CSI Clone
  • Volume Expansion

Notes

  • Any additional details or decisions made/needed

Done Checklist

Who | What | Reference
DEV | Upstream roadmap issue (or individual upstream PRs) | <link to GitHub Issue>
DEV | Upstream documentation merged | <link to meaningful PR>
DEV | gap doc updated | <name sheet and cell>
DEV | Upgrade consideration | <link to upgrade-related test or design doc>
DEV | CEE/PX summary presentation | label epic with cee-training and add a <link to your support-facing preso>
QE | Test plans in Polarion | <link or reference to Polarion>
QE | Automated tests merged | <link or reference to automated tests>
DOC | Downstream documentation merged | <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

 

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Complete during New status.

 

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

 

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

 

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

 

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

 

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

 

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

 

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  Initial completion during Refinement status.

 

Interoperability Considerations

Which other projects and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

Design Doc:

https://docs.google.com/document/d/1m6OYdz696vg1v8591v0Ao0_r_iqgsWjjM2UjcR_tIrM/

Problem:

Goal

As a developer, I want to be able to test my serverless function after it's been deployed.

Why is it important?

Use cases:

  1. As a developer, I want to test my serverless function 

Acceptance criteria:

  1. This feature needs to work in ACM (multi-cluster environments where the console is run on the Hub cluster)

Dependencies (External/Internal):

Please add a spike to see if there are dependencies.

Design Artifacts:

Exploration:

Developers can use the kn func invoke CLI to accomplish this. According to Naina, there is an API, but it's in Go.

Note:

Description

As a user, I want to invoke a Serverless function from the developer console. This action should be available as a page and as a modal.

This story is to evaluate a good UI for this and check this with our PM (Serena) and the Serverless team (Naina and Lance).

Acceptance Criteria

  1. Add a new page with title "Invoke Serverless function {function-name}" and should be available via a new URL (/serverless/ns/:ns/invoke-function/:function-name/).
  2. Implement a form with Formik to "invoke" (console.log for now) Serverless functions, without implementing the network call yet. Focus on the UI to get feedback as early as possible. Use reusable, well-named components anyway.
  3. The page should be also available as a modal. Add a new action to all Serverless Services with the label (tbd) to open this modal from the Topology graph or from the Serverless Service list view.
  4. The page should have two tabs or two panes for the request and response. Each of these tabs/panes should again have two tabs, "similar" to the browser network inspector. See below for what we know currently.
  5. Get confirmation from Christoph, Serena, Naina, and Lance.
  6. Disable the action until we implement the network communication in ODC-7275 with the serverless function.
  7. No e2e tests are needed for this story.

Additional Details:

Information the form should show:

  1. Request tab shows "Body" and "Options" tab
    1. Body is just a full size editor. We should reuse our code editor.
    2. Options contains:
      1. Auto complete text field “Content type” with placeholder “application/json”, that will be used when nothing is entered
      2. Dropdown “Format” with values “cloudevent” (default) and “http”
      3. Text field “Type” with placeholder text “boson.fn”, that will be used when nothing is entered
      4. Text field “Source” with placeholder “/boson/fn”, that will be used when nothing is entered
  2. Response tab shows Body and Info tab
    1. Body is a full size editor that shows the response. We should format a JSON string with JSON.stringify(data, null, 2)
    2. Info contains:
      1. Id (id)
      2. Type (type)
      3. Source (source)
      4. Time (time) (formatted)
      5. Content-Type: (datacontenttype)

Description

The current YAMLEditor also supports other languages like JSON, so the component needs to be renamed.

Acceptance Criteria

  1. Rename all instances of YAMLEditor to CodeEditor

Additional Details:

Description

As a user, I want to invoke a Serverless function from the developer console. This action should be available as a page and as a modal.

Acceptance Criteria

  1. A backend proxy to invoke a serverless function (or a k8s service in general) from the frontend without a public route.
  2. The API endpoint should be only accessible to logged-in users.
  3. Should also work when the bridge is running off-cluster (as developers mostly run it that way for local development)

Additional Details:

This will be similar to the web terminal proxy, except that no auth headers will be passed to the underlying service.

We need something similar to:

POST /proxy/in-cluster

{
  endpoint: string
  // Or just service: string ?? tbd.

  headers: Record<string, string | string[]>
  body: string
  timeout: number
}

Description

As a user, I want to invoke a Serverless function from the developer console. This action should be available as a page and as a modal.

This story depends on ODC-7273, ODC-7274, and ODC-7288. This story should bring the backend proxy, and the frontend together and finalize the work.

Acceptance Criteria

  1. Write proper types if they are missing
  2. Connect the form and invoke a serverless function, consume and show the response
  3. Unit tests
  4. E2E tests

Additional Details:

< High-Level description of the feature ie: Executive Summary >

Goals

Cluster administrators need an in-product experience to discover and install new Red Hat offerings that can add high value to developer workflows.

Requirements

Requirement | Notes | Is MVP
Discover new offerings in Home Dashboard | | Y
Access details outlining value of offerings | | Y
Access step-by-step guide to install offering | | N
Allow developers to easily find and use newly installed offerings | | Y
Support air-gapped clusters | | Y
    • (Optional) Use Cases

< What are we making, for who, and why/what problem are we solving?>

Out of scope

Discovering solutions that are not available for installation on cluster

Dependencies

No known dependencies

Background, and strategic fit

 

Assumptions

None

 

Customer Considerations

 

Documentation Considerations

Quick Starts 

What does success look like?

 

QE Contact

 

Impact

 

Related Architecture/Technical Documents

 

Done Checklist

  • Acceptance criteria are met
  • Non-functional properties of the Feature have been validated (such as performance, resource, UX, security or privacy aspects)
  • User Journey automation is delivered
  • Support and SRE teams are provided with enough skills to support the feature in production environment

Problem:

Developers using Dev Console need to be made aware of the RH developer tooling available to them.

Goal:

Provide awareness to developers using Dev Console of the RH developer tooling that is available to them, including:

Consider enhancing the +Add page and/or the Guided tour

Provide a Quick Start for installing the Cryostat Operator

Why is it important?

To increase usage of our RH portfolio

Acceptance criteria:

  1. Quick Start - Installing Cryostat Operator
  2.  Quick Start - Get started with JBoss EAP using a Helm Chart
  3. Discoverability of the IDE extensions from Create Serverless form
  4. Update Terminal step of the Guided Tour to indicate that odo CLI is accessible (link to https://developers.redhat.com/products/odo/overview)

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Description

Update the Terminal step of the Guided Tour to indicate that the odo CLI is accessible: https://developers.redhat.com/products/odo/overview

Acceptance Criteria

  1. Update the Guided Tour of the Web Terminal to add an odo CLI link
  2. On click of the link, the user should be redirected to the respective page

Additional Details:

Description 

Add OpenShift Quickstart for JBoss EAP 7

Acceptance Criteria

  1. Add OpenShift Quickstart for JBoss EAP 7

Additional Details:

Description

This story is to add new Quick Start for installing the Cryostat Operator

Acceptance Criteria

  1. Create new Quick Start for installing the Cryostat Operator

Additional Details:

Description

Add the below IDE extensions to the create Serverless form,

Acceptance Criteria

  1. In the create Serverless form, add the above IDE extensions
  2. On click of the link, the user should be taken to the respective page
  3. Add e2e tests for that

Additional Details:

We are deprecating DeploymentConfig in favor of Deployment in OpenShift because Deployment is the recommended way to deploy applications. Deployment is a more flexible and powerful resource that allows you to control the deployment of your applications more precisely. DeploymentConfig is a legacy resource that is no longer necessary. We will continue to support DeploymentConfig for a period of time, but we encourage you to migrate to Deployment as soon as possible.

Here are some of the benefits of using Deployment over DeploymentConfig:

  • Deployment is more flexible. You can specify the number of replicas to deploy, the image to deploy, and the environment variables to use.
  • Deployment is more powerful. You can use Deployment to roll out changes to your applications in a controlled manner.
  • Deployment is the recommended way to deploy applications. OpenShift will continue to improve Deployment and make it the best way to deploy applications.

We hope that you will migrate to Deployment as soon as possible. If you have any questions, please contact us.
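
For teams migrating, a minimal Deployment equivalent of a simple DeploymentConfig looks roughly like the sketch below (names and image are placeholders). Note that DeploymentConfig-specific features such as triggers and lifecycle hooks have no direct one-to-one mapping and need to be reworked separately.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
      - name: example-app
        # placeholder image reference
        image: image-registry.openshift-image-registry.svc:5000/myproject/example-app:latest
        env:
        - name: EXAMPLE_ENV
          value: "example"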

Epic Goal

  • Make it possible to disable the DeploymentConfig and BuildConfig APIs, and associated controller logic.

 

Given the nature of this component (embedded into a shared API server and controller manager), this will likely require adding logic within those shared components so that specific bits of functionality are not enabled when the Build or DeploymentConfig capability is disabled, as well as watching the enabled capability set so that the components enable the functionality when necessary.

I would not expect us to split the components out of their existing location as part of this, though that is theoretically an option.
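
For context, a cluster could be installed without these capabilities via the install-config capabilities stanza, roughly like the sketch below; the Build and DeploymentConfig capability names are assumed to match what this epic introduces, and the other names are examples:

capabilities:
  baselineCapabilitySet: None
  # Build and DeploymentConfig are omitted here, so they remain disabled
  additionalEnabledCapabilities:
  - marketplace
  - openshift-samples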

 

Why is this important?

  • Reduces resource footprint and bug surface area for clusters that do not need to utilize the DeploymentConfig or BuildConfig functionality, such as SNO and OKE.

Acceptance Criteria (Mandatory)

  • CI - MUST be running successfully with tests automated (we have an existing CI job that runs a cluster with all optional capabilities disabled.  Passing that job will require disabling certain deploymentconfig tests when the cap is disabled)
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. Cluster install capabilities

Previous Work (Optional):

  1. The optional cap architecture and guidance for adding a new capability is described here: https://github.com/openshift/enhancements/blob/master/enhancements/installer/component-selection.md

Open questions::

None

Done Checklist

  • Acceptance criteria are met
  • Non-functional properties of the Feature have been validated (such as performance, resource, UX, security or privacy aspects)
  • User Journey automation is delivered
  • Support and SRE teams are provided with enough skills to support the feature in production environment

Make the list of enabled/disabled controllers in OAS reflect enabled/disabled capabilities.

Acceptance criteria:

  • OAS allows specifying a list of enabled/disabled APIs (e.g. watches, caches, ...)
  • OASO watches capabilities and generates the right configuration for OAS with enabled/disabled list of APIs
  • Documentation is properly updated

QE:

  • Enable/disable capabilities and validate that a given API (DC, Builds, ...) is/is not managed by the cluster (see the sketch below):
  • Check that the OAS logs do/do not contain entries about the affected API(s)
  • DC/Builds objects are created/fail to be created
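
One way to check which capabilities are active is to inspect the ClusterVersion status; a rough sketch of the relevant stanza is below (the capability names shown are examples):

status:
  capabilities:
    enabledCapabilities:
    - marketplace
    - openshift-samples
    knownCapabilities:
    - Build
    - DeploymentConfig
    - marketplace
    - openshift-samples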

Feature Overview

At the moment, HyperShift relies on an older etcd operator (i.e., the CoreOS etcd operator). However, this operator is basic and does not support HA as required.

Goals

Introduce a reliable component to operate etcd that:

  • Is backed by a stable operator
  • Supports images referenced by hash
  • Supports backups
  • Local persistent volumes for persistent data?
  • Encryption
  • HA and scalability

 

Following on from https://issues.redhat.com/browse/HOSTEDCP-444, we need to add the steps to enable migration of the Node/CAPI resources so that workloads can continue running during control plane migration.

This will be a manual process during which control plane downtime will occur.

 

This must satisfy a successful migration criteria:

  • All HC conditions are positive.
  • All NodePool conditions are positive.
  • All service endpoints kas/oauth/ignition server... are reachable.
  • Ability to create/scale NodePools remains operational.

We need to validate and document this manually for starters.

Eventually this should be automated in the upcoming e2e test.

We could even have a job running conformance tests over a migrated cluster.

Epic Goal

As an OpenShift on vSphere administrator, I want to specify static IP assignments to my VMs.

As an OpenShift on vSphere administrator, I want to completely avoid using a DHCP server for the VMs of my OpenShift cluster.

Why is this important?

Customers want the convenience of IPI deployments for vSphere without having to use DHCP. As in bare metal, where METAL-1 added this capability, some of the reasons are the security implications of DHCP (customers report that, for example, depending on configuration it allows any device to get onto the network). At the same time, IPI deployments only require our OpenShift installation software, while with UPI customers would need automation software that, in secure environments, they would have to certify along with OpenShift.

Acceptance Criteria

  • I can specify static IPs for node VMs at install time with IPI

Previous Work

Bare metal related work:

CoreOS Afterburn:

https://github.com/coreos/afterburn/blob/main/src/providers/vmware/amd64.rs#L28

https://github.com/openshift/installer/blob/master/upi/vsphere/vm/main.tf#L34

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

USER STORY:

As an OpenShift administrator, I want to apply an IP configuration so that I can adhere to my organizations security guidelines.

DESCRIPTION:

The vSphere machine controller needs to be modified to convert nmstate to `guestinfo.afterburn.initrd.network-kargs` upon cloning the template for a new machine.  An example of this is here: https://github.com/openshift/machine-api-operator/pull/1079
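
For reference, a minimal nmstate snippet describing a static IP assignment, which the controller would translate into the afterburn network kargs, could look like this (interface name and addresses are placeholders):

interfaces:
- name: ens192            # placeholder NIC name
  type: ethernet
  state: up
  ipv4:
    enabled: true
    dhcp: false
    address:
    - ip: 192.168.10.20
      prefix-length: 24
dns-resolver:
  config:
    server:
    - 192.168.10.1
routes:
  config:
  - destination: 0.0.0.0/0
    next-hop-address: 192.168.10.1
    next-hop-interface: ens192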

Required:

Nice to have:

ACCEPTANCE CRITERIA:

ENGINEERING DETAILS:

https://github.com/openshift/enhancements/pull/1267

Feature Overview

With this feature MCE will be an additional operator ready to be enabled with the creation of clusters for both the AI SaaS and disconnected installations with Agent.

Currently 4 operators have been enabled for the Assisted Service SaaS create cluster flow: Local Storage Operator (LSO), OpenShift Virtualization (CNV), OpenShift Data Foundation (ODF), Logical Volume Manager (LVM)

The Agent-based installer doesn't leverage this framework yet.

Goals

When a user performs the creation of a new OpenShift cluster with the Assisted Installer (SaaS) or with the Agent-based installer (disconnected), provide the option to enable the multicluster engine (MCE) operator.

The cluster deployed can add itself to be managed by MCE.

Background, and strategic fit

Deploying an on-prem cluster 0 easily is a key operation for the rest of the OpenShift infrastructure.

While MCE/ACM are strategic in the lifecycle management of OpenShift, including the provisioning of all the clusters, the first cluster where MCE/ACM are hosted, along with other tools supporting the rest of the clusters (GitOps, Quay, log centralisation, monitoring...), must be easy to deploy and have a high success rate.

The Assisted Installer and the Agent-based installer cover this gap and must present the option to enable MCE to keep making progress in this direction.

Assumptions

MCE engineering is responsible for adding the appropriate definition as an olm-operator plugin.

See https://github.com/openshift/assisted-service/blob/master/docs/dev/olm-operator-plugins.md for more details

Epic Goal

  • When an Assisted Service SaaS user performs the creation of a new OpenShift cluster, provide the option to enable the multicluster engine (MCE) operator.

Why is this important?

  • Expose users in the Assisted Service SaaS to the value of the MCE
  • Customers/users want to leverage the cluster lifecycle capabilities within MCE inside of their on premises environment.
  • The 'cluster0' can be initiated from Assisted Service SaaS and include MCE hub for cluster deployment within the customer datacenter.

Automated storage configuration

  • The Infrastructure Operator, a dependency of MCE to deploy bare metal, vSphere and Nutanix clusters, requires storage. There are 3 scenarios to automate storage:
  • User selects to install ODF and MCE:
    • ODF is the ideal storage for clusters but requires an additional subscription.
    • When selected along with MCE, it will be configured as the storage required by the Infrastructure Operator, and the Infrastructure Operator will be deployed along with MCE.
  • User deploys an SNO cluster, which supports LVMS as its storage and is available to all OpenShift users:
    • If the user also chooses ODF, then ODF is used for the Infrastructure Operator.
    • If ODF isn't configured, then LVMS is enabled and the Infrastructure Operator will use it.
  • User doesn't install ODF or an SNO cluster:
    • They have to choose their storage and then install the Infrastructure Operator on day 2.

Scenarios

  1. When a RH cloud user logs into the console.redhat.com SaaS, they can leverage the Assisted Service SaaS flow to create a new cluster
  2. During the Assisted Service SaaS create flow, a RH cloud user can see a list of available operators that they want to install at the same time as the cluster is created
  3. An option is offered to check a box next to "multicluster engine for Kubernetes (MCE)"
  4. The RH cloud user can read a tool-tip or info-box with a short description of MCE and click a link for more details to review the MCE documentation

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Ensure MCE release channel can automatically deploy the latest x.y.z without needing any DevOps/SRE intervention
  • Ensure MCE release channel can be updated quickly (if not automatically) to ensure the later release x.y can be offered to the cloud user.

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. for example, CNV operator: https://github.com/openshift/assisted-service/blob/master/internal/operators/cnv/manifest.go#L165

Open questions:

  1. Is there any automation that will pick up the next stable-x.y MCE, or do we need to do it manually with each release? For example, when MCE 2.2 comes out, do we need to update the SaaS plugin code, or does it automatically move to the next channel? Note for example how the OLM subscription looks (a hedged sketch follows this list): stable-2.2 will appear once MCE 2.2 comes out.
  2. How challenging is this to maintain as new OCP releases come out and QE must be performed?
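
A sketch of the OLM Subscription for MCE, assuming the usual package name, namespace, and catalog source (all of these, and the channel, are assumptions for illustration):

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: multicluster-engine
  namespace: multicluster-engine
spec:
  # the channel is what needs to track new MCE minor versions (e.g. stable-2.2)
  channel: stable-2.2
  name: multicluster-engine
  source: redhat-operators
  sourceNamespace: openshift-marketplace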

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

The authentication operator ignores noProxy settings defined in the cluster-wide proxy.

Expected outcome: when noProxy is set, the authentication operator should initialize connections through ingress instead of the cluster-wide proxy.

Background and Goal

Currently in OpenShift we do not support adding 3rd party agents and other software to cluster nodes. While rpm-ostree supports adding packages, we have no way today to do that in a sane, scalable way across machineconfigpools and clusters. Some customers may not be able to meet their IT policies due to this.

In addition to third party content, some customers may want to use the layering process as a point to inject configuration. The build process allows for simple copying of config files and the ability to run arbitrary scripts to set user config files (e.g. through an Ansible playbook). This should be a supported use case, except where it conflicts with OpenShift (for example, the MCO must continue to manage Cri-O and Kubelet configs).

Example Use Cases

  • Bare metal firmware update software that is packaged as an RPM
  • Host security monitors
  • Forensic tools
  • SIEM logging agents
  • SSH Key management
  • Device Drivers from OEM/ODM partners

Acceptance Criteria

  1. Administrators can deploy 3rd party repositories and packages to MachineConfigPools.
  2. Administrators can easily remove added packages and repository files.
  3. Administrators can manage system configuration files by copying files into the RHCOS build. [Note: if the same file is managed by the MCO, the MachineConfig version of the file is expected to "win" over the OS image version.]
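
As a sketch of how a layered image containing such content could be rolled out to a MachineConfigPool once built (the image pullspec is a placeholder, and the exact rollout mechanism is defined by the CoreOS layering work):

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: os-layer-custom
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  # placeholder: layered image built on top of the cluster's RHCOS base image
  osImageURL: quay.io/example/custom-rhcos:latest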

Background

As part of enabling OCP CoreOS Layering for third party components, we will need to allow for package installation to /opt. Many OEMs and ISVs install to /opt and it would be difficult for them to make the change only for RHCOS. Meanwhile changing their RHEL target to a different target would also be problematic as their customers are expecting these tools to install in a certain way. Not having to worry about this path will provide the best ecosystem partner and customer experience.

Requirements

  • Document how 3rd party vendors can be compatible with our current offering.
  • Provide a mechanism for 3rd party vendors or their customers to provide information for exceptions that require an RPM to install binaries to /opt as an install target path.

Feature Overview (aka. Goal Summary)  

Add support for custom security groups to be attached to control plane and compute nodes at installation time.

Goals (aka. expected user outcomes)

Allow the user to provide existing security groups to be attached to the control plane and compute node instances at installation time.

Requirements (aka. Acceptance Criteria):

The user will be able to provide a list of existing security groups to the install config manifest that will be used as additional custom security groups to be attached to the control plane and compute node instances at installation time.

Out of Scope

The installer won't be responsible for creating any custom security groups; these must be created by the user before the installation starts.

Background

We have users/customers with specific requirements to add additional network rules to every instance created in AWS. For OpenShift, these additional rules need to be added manually on day 2, as the installer doesn't provide the ability to attach custom security groups to any instance at install time.

MachineSets already support adding a list of existing custom security groups, so this could already be automated at install time by manually editing each MachineSet manifest before starting the installation, but even for these cases the installer doesn't allow the user to provide this information to add the list of these security groups to the MachineSet manifests.

Documentation Considerations

Documentation will be required to explain how this information needs to be provided to the install config manifest as any other supported field.
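
A hedged sketch of how this could be expressed in install-config.yaml, using the additionalSecurityGroupIDs field referenced later in this feature (the security group IDs are placeholders):

controlPlane:
  name: master
  platform:
    aws:
      additionalSecurityGroupIDs:
      - sg-0123456789abcdef0   # placeholder
compute:
- name: worker
  platform:
    aws:
      additionalSecurityGroupIDs:
      - sg-0fedcba9876543210   # placeholder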

Epic Goal

  • Allow the user to provide existing security groups to be attached to the control plane and compute node instances at installation time.

Why is this important?

  • We have users/customers with specific requirements to add additional network rules to every instance created in AWS. For OpenShift, these additional rules need to be added manually on day 2, as the installer doesn't provide the ability to attach custom security groups to any instance at install time.

    MachineSets already support adding a list of existing custom security groups, so this could already be automated at install time by manually editing each MachineSet manifest before starting the installation, but even for these cases the installer doesn't allow the user to provide this information to add the list of these security groups to the MachineSet manifests.

Scenarios

  1. The user will be able to provide a list of existing security groups to the install config that will be used as additional custom security groups to be attached to the control plane and compute node instances at installation time.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Previous Work (Optional):

  1. Compute Nodes managed by MAPI already support this feature

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a (user persona), I want to be able to:

  • Add custom security groups for compute nodes
  • Add custom security groups for control plane nodes

so that I can achieve

  • Control plane and compute nodes can support operation-specific security rules; for instance, specific traffic may be required for compute vs. control plane nodes.

Acceptance Criteria:

Description of criteria:

  • The control plane and compute machine sections of the install config accept user input as additionalSecurityGroupIDs (when using the aws platform).

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

    additionalSecurityGroupIDs:
      description: AdditionalSecurityGroupIDs contains IDs of
        additional security groups for machines, where each ID
        is presented in the format sg-xxxx.
      items:
        type: string
      type: array

 

This requires/does not require a design proposal.

Feature Overview (aka. Goal Summary)  

Scaling of pods in OpenShift depends highly on the customer workload and their hardware setup. Some workloads on certain hardware might not scale beyond 100 pods, and others might scale to 1000 pods.

As an OpenShift admin I want to monitor metrics that will indicate why I am not able to scale my pods. Think of a pressure gauge that tells the customer when it's green (can scale) and when it's red (cannot scale).

As the OpenShift support team, if a customer calls in with a complaint about pod scaling, I should be able to check some metrics and tell them why they are not able to scale.

Goals (aka. expected user outcomes)

Metrics, alerts, and a dashboard

 

Requirements (aka. Acceptance Criteria):

Able to integrate these metrics and alerts into a monitoring dashboard

 

 


Epic Goal

  • To come up with a set of metrics that indicate optimal node resource usage.

Why is this important?

  • These metrics will help customers understand the capacity they have instead of restricting themselves to a hard-coded max pod limit.

Scenarios

  1. As the owner of an extremely high-capacity machine, I want to be able to deploy as many pods as my machine can handle.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. None

Previous Work (Optional):

  1. https://issues.redhat.com/browse/OCPNODE-1125

Open questions::

  1. The challenging part is coming up with a set of metrics that accurately indicate system resource usage.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

We need to have an operator to inject dashboard jsonnet. E.g. the etcd team injects their dashboard jsonnet using their operator, in the form of a config map.

https://redhat-internal.slack.com/archives/C027U68LP/p1683574004805639?thread_ts=1683573783.216759&cid=C027U68LP

We will need a similar approach for the node dashboard.
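
A rough sketch of the config-map injection pattern, assuming the usual console dashboard namespace and label (treat both as assumptions here), with the dashboard JSON rendered from the jsonnet source:

apiVersion: v1
kind: ConfigMap
metadata:
  name: node-resource-dashboard
  namespace: openshift-config-managed
  labels:
    console.openshift.io/dashboard: "true"
data:
  # rendered dashboard JSON generated from the jsonnet source
  node-resource-dashboard.json: |
    { "title": "Node resource usage", "panels": [] }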

This feature follows up on OCPBU-186 Image mirroring by tags.

OCPBU-186 implemented the new ImageDigestMirrorSet and ImageTagMirrorSet APIs and their rollout through the MCO.

This feature will update the components using ImageContentSourcePolicy to use ImageDigestMirrorSet.

The list of the components: https://docs.google.com/document/d/11FJPpIYAQLj5EcYiJtbi_bNkAcJa2hCLV63WvoDsrcQ/edit?usp=sharing.
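
For reference, an ImageDigestMirrorSet that replaces an equivalent ImageContentSourcePolicy looks roughly like this (registry hostnames are placeholders):

apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  name: example-idms
spec:
  imageDigestMirrors:
  - source: registry.redhat.io/openshift4
    mirrors:
    - mirror.example.com/openshift4   # placeholder mirror registry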

 

Migrate OpenShift Components to use the new Image Digest Mirror Set (IDMS)

This doc lists the OpenShift components that currently use ICSP: https://docs.google.com/document/d/11FJPpIYAQLj5EcYiJtbi_bNkAcJa2hCLV63WvoDsrcQ/edit?usp=sharing

Plan for ImageDigestMirrorSet Rollout
Epic: https://issues.redhat.com/browse/OCPNODE-521

4.13: Enable ImageDigestMirrorSet, both ICSP and ImageDigestMirrorSet objects are functional

  • Document that ICSP is being deprecated and will be unsupported by 4.17 (to allow for EUS to EUS upgrades)
  • Reject write to both ICSP and ImageDigestMirrorSet on the same cluster

4.14: Update OpenShift components to use IDMS

4.17: Remove support for ICSP within MCO

  • Error out if an old ICSP object is used

As an OpenShift developer trying to mirror images for a disconnected environment using the oc command, I want the output to give an example ImageDigestMirrorSet manifest, because ImageContentSourcePolicy will be replaced by the CRD implemented in OCPBU-186 Image mirroring by tags.

The ImageContentSourcePolicy manifest snippet from the command output will be updated to an ImageDigestMirrorSet manifest.

Workloads that use the `oc adm release mirror` command will be impacted.

 

 

As an OpenShift developer, I want an --idms-file flag so that I can fetch image info from an alternative mirror when --icsp-file is deprecated.

Feature Overview

Create a GCP cloud specific spec.resourceTags entry in the infrastructure CRD. This should create and update tags (or labels in GCP) on any OpenShift cloud resource that we create and manage. The behaviour should also tag existing resources that do not have the tags yet, and once the tags in the infrastructure CRD are changed, all the resources should be updated accordingly.

Tag deletes continue to be out of scope, as the customer can still have custom tags applied to the resources that we do not want to delete.

Due to the ongoing intree/out of tree split on the cloud and CSI providers, this should not apply to clusters with intree providers (!= "external").

Once confident we have all components updated, we should introduce an end2end test that makes sure we never create resources that are untagged.

 
Goals

  • Functionality on GCP Tech Preview
  • inclusion in the cluster backups
  • flexibility of changing tags during cluster lifetime, without recreating the whole cluster

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

List any affected packages or components.

  • Installer
  • Cluster Infrastructure
  • Storage
  • Node
  • NetworkEdge
  • Internal Registry
  • CCO

This epic covers the work to apply user defined labels GCP resources created for openshift cluster available as tech preview.

The user should be able to define GCP labels to be applied to the resources created during cluster creation by the installer and by the other operators which manage those specific resources. The user will be able to define the required tags/labels in install-config.yaml while preparing the inputs for cluster creation. These will then be made available in the status sub-resource of the Infrastructure custom resource, which cannot be edited but will be available for user reference and will be used by the in-cluster operators for labeling when the resources are created.

Updating/deleting of labels added during cluster creation or adding new labels as Day-2 operation is out of scope of this epic.
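
Based on the install-config CRD snippet shown later in this epic (which models userLabels as a string map), the user input would look roughly like this; the shipped schema may differ, and project, region, and label keys/values are placeholders:

platform:
  gcp:
    projectID: example-project      # placeholder
    region: us-central1             # placeholder
    userLabels:
      cost-center: "1234"
      environment: dev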

List any affected packages or components.

  • Installer
  • Cluster Infrastructure
  • Storage
  • Node
  • NetworkEdge
  • Internal Registry
  • CCO

Reference - https://issues.redhat.com/browse/RFE-2017

Enhancement proposed for GCP tags support in OCP, requires cluster-image-registry-operator to add gcp userTags available in the status sub resource of infrastructure CR, to the gcp storage resource created.

cluster-image-registry-operator uses the method createStorageAccount() to create storage resource which should be updated to add tags after resource creation.

Acceptance Criteria

  • Code linting, validation and best practices adhered to
  • UTs and e2e are added/updated

cluster-config-operator makes the Infrastructure CRD available for the installer. The CRD is included in its container image from the openshift/api package, which requires the package to be updated to have the latest CRD.

The installer generates the Infrastructure CR in the manifests creation step of the cluster creation process, based on the user provided input recorded in install-config.yaml. While generating the Infrastructure CR, platformStatus.gcp.resourceLabels should be updated with the user provided labels (installconfig.platform.gcp.userLabels).
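
The resulting Infrastructure CR would then carry the labels in its status, roughly like the sketch below (the exact field shape is defined by the openshift/api change; this is an assumption for illustration):

apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
status:
  platformStatus:
    type: GCP
    gcp:
      resourceLabels:
      - key: cost-center      # mirrors installconfig.platform.gcp.userLabels
        value: "1234"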

Acceptance Criteria

  • Code linting, validation and best practices adhered to
  • Infrastructure CR created by installer should have gcp user defined labels if any, in status field.

Enhancement proposed for GCP labels support in OCP requires the install-config CRD to be updated to include gcp userLabels for the user to configure, which will be referred to by the installer to apply the list of labels to each resource created by it, as well as made available in the Infrastructure CR created.

Below is the snippet of the change required in the CRD

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: installconfigs.install.openshift.io
spec:
  versions:
  - name: v1
    schema:
      openAPIV3Schema:
        properties:
          platform:
            properties:
              gcp:
                properties:
                  userLabels:
                    additionalProperties:
                      type: string
                    description: UserLabels additional keys and values that the installer
                      will add as labels to all resources that it creates. Resources
                      created by the cluster itself may not include these labels.
                    type: object

This change is required for testing the changes of the feature, and should ideally get merged first.

Acceptance Criteria

  • Code linting, validation and best practices adhered to
  • User should be able to configure gcp user defined labels in the install-config.yaml
  • Fields descriptions

Enhancement proposed for GCP labels support in OCP requires machine-api-provider-gcp to add the gcp userLabels available in the status sub-resource of the Infrastructure CR to the gcp virtual machine resources and the sub-resources created.

Acceptance Criteria

  • Code linting, validation and best practices adhered to
  • UTs and e2e are added/updated

Enhancement proposed for GCP labels and tags support in OCP requires making use of the latest APIs made available in the Terraform provider for Google, and requires an update to use the same.

Acceptance Criteria

  • Code linting, validation and best practices adhered to.

The installer creates the below list of GCP resources during the create cluster phase, and these resources should have the user defined labels and the default OCP label kubernetes-io-cluster-<cluster_id>:owned applied.

Resources List

Resource | Terraform API
VM Instance | google_compute_instance
Image | google_compute_image
Address | google_compute_address (beta)
ForwardingRule | google_compute_forwarding_rule (beta)
Zones | google_dns_managed_zone
Storage Bucket | google_storage_bucket

Acceptance Criteria:

  • Code linting, validation and best practices adhered to
  • List of gcp resources created by installer should have user defined labels and as well as the default OCP label.

Enhancement proposed for GCP labels support in OCP, requires cluster-image-registry-operator to add gcp userLabels available in the status sub resource of infrastructure CR, to the gcp storage resource created.

cluster-image-registry-operator uses the method createStorageAccount() to create storage resource which should be updated to add labels.

Acceptance Criteria

  • Code linting, validation and best practices adhered to
  • UTs and e2e are added/updated

Feature Overview  

Much like core OpenShift operators, a standardized flow exists for OLM-managed operators to interact with the cluster in a specific way to leverage AWS STS authorization when using AWS APIs, as opposed to insecure static, long-lived credentials. OLM-managed operators can implement integration with the CloudCredentialOperator in a well-defined way to support this flow.

Goals:

Enable customers to easily leverage OpenShift's capabilities around AWS STS with layered products, for an increased security posture. Enable OLM-managed operators to implement support for this in a well-defined pattern.

Requirements:

  • CCO gets a new mode in which it can reconcile STS credential request for OLM-managed operators
  • A standardized flow is leveraged to guide users in discovering and preparing their AWS IAM policies and roles with permissions that are required for OLM-managed operators 
  • A standardized flow is defined in which users can configure OLM-managed operators to leverage AWS STS
  • An example operator is used to demonstrate the end2end functionality
  • Clear instructions and documentation for operator development teams to implement the required interaction with the CloudCredentialOperator to support this flow

Use Cases:

See Operators & STS slide deck.

 

Out of Scope:

  • handling OLM-managed operator updates in which AWS IAM permission requirements might change from one version to another (which requires user awareness and intervention)

 

Background:

The CloudCredentialOperator already provides a powerful API for OpenShift's core cluster operators to request credentials and acquire them via short-lived tokens. This capability should be expanded to OLM-managed operators, specifically to Red Hat layered products that interact with AWS APIs. The process today ranges from cumbersome to non-existent depending on the operator in question, and is seen as an adoption blocker for OpenShift on AWS.
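
For context, in the STS flow the credentials handed to an operator typically take the form of a Secret containing an AWS shared-credentials file that points at a pre-created IAM role and a projected service account token, roughly like the sketch below (names, namespace, and ARN are placeholders):

apiVersion: v1
kind: Secret
metadata:
  name: example-operator-aws-creds   # placeholder
  namespace: example-operator        # placeholder
stringData:
  credentials: |
    [default]
    role_arn = arn:aws:iam::123456789012:role/example-operator-role
    web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token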

 

Customer Considerations

This is particularly important for ROSA customers. Customers are expected to be asked to pre-create the required IAM roles outside of OpenShift, which is deemed acceptable.

Documentation Considerations

  • Internal documentation needs to exists to guide Red Hat operator developer teams on the requirements and proposed implementation of integration with CCO and the proposed flow
  • External documentation needs to exist to guide users on:
    • how to become aware that the cluster is in STS mode
    • how to become aware of operators that support STS and the proposed CCO flow
    • how to become aware of the IAM permissions requirements of these operators
    • how to configure an operator in the proposed flow to interact with CCO

Interoperability Considerations

  • this needs to work with ROSA
  • this needs to work with self-managed OCP on AWS

Market Problem

This Section: High-Level description of the Market Problem ie: Executive Summary

  • As a customer of OpenShift layered products, I need to be able to fluidly, reliably and consistently install and use OpenShift layered product Kubernetes Operators into my ROSA STS clusters, while keeping a STS workflow throughout.
  •  
  • As a customer of OpenShift on the big cloud providers, overall I expect OpenShift as a platform to function equally well with tokenized cloud auth as it does with "mint-mode" IAM credentials. I expect the same from the Kubernetes Operators under the Red Hat brand (that need to reach cloud APIs) in that tokenized workflows are equally integrated and workable as with "mint-mode" IAM credentials.
  •  
  • As the managed services, including Hypershift teams, offering a downstream opinionated, supported and managed lifecycle of OpenShift (in the forms of ROSA, ARO, OSD on GCP, Hypershift, etc), the OpenShift platform should have as close as possible, native integration with core platform operators when clusters use tokenized cloud auth, driving the use of layered products.
  • .
  • As the Hypershift team, where the only credential mode for clusters/customers is STS (on AWS) , the Red Hat branded Operators that must reach the AWS API, should be enabled to work with STS credentials in a consistent, and automated fashion that allows customer to use those operators as easily as possible, driving the use of layered products.

Why it Matters

  • Adding consistent, automated layered product integrations to OpenShift would provide great added value to OpenShift as a platform, and its downstream offerings in Managed Cloud Services and related offerings.
  • Enabling Kubernetes Operators (at first, Red Hat ones) on OpenShift for the "big3" cloud providers is a key differentiation and security requirement that our customers have been and continue to demand.
  • HyperShift is an STS-only architecture, which means that if our layered offerings via Operators cannot easily work with STS, then it would be blocking us from our broad product adoption goals.

Illustrative User Stories or Scenarios

  1. Main success scenario - high-level user story
    1. customer creates a ROSA STS or Hypershift cluster (AWS)
    2. customer wants basic (table-stakes) features such as AWS EFS or RHODS or Logging
    3. customer sees necessary tasks for preparing for the operator in OperatorHub from their cluster
    4. customer prepares AWS IAM/STS roles/policies in anticipation of the Operator they want, using what they get from OperatorHub
    5. customer's provides a very minimal set of parameters (AWS ARN of role(s) with policy) to the Operator's OperatorHub page
    6. The cluster can automatically setup the Operator, using the provided tokenized credentials and the Operator functions as expected
    7. Cluster and Operator upgrades are taken into account and automated
    8. The above steps 1-7 should apply similarly for Google Cloud and Microsoft Azure Cloud, with their respective token-based workload identity systems.
  2. Alternate flow/scenarios - high-level user stories
    1. The same as above, but the ROSA CLI would assist with AWS role/policy management
    2. The same as above, but the oc CLI would assist with cloud role/policy management (per respective cloud provider for the cluster)
  3. ...

Expected Outcomes

This Section: Articulates and defines the value proposition from a users point of view

  • See SDE-1868 as an example of what is needed, including design proposed, for current-day ROSA STS and by extension Hypershift.
  • Further research is required to accommodate the AWS STS equivalent systems of GCP and Azure
  • Order of priority at this time is
    • 1. AWS STS for ROSA and ROSA via HyperShift
    • 2. Microsoft Azure for ARO
    • 3. Google Cloud for OpenShift Dedicated on GCP

Effect

This Section: Effect is the expected outcome within the market. There are two dimensions of outcomes; growth or retention. This represents part of the “why” statement for a feature.

  • Growth is the acquisition of net new usage of the platform. This can be new workloads not previously able to be supported, new markets not previously considered, or new end users not previously served.
  • Retention is maintaining and expanding existing use of the platform. This can be more effective use of tools, competitive pressures, and ease of use improvements.
  • Both of growth and retention are the effect of this effort.
    • Customers have strict requirements around using only token-based cloud credential systems for workloads in their cloud accounts, which include OpenShift clusters in all forms.
      • We gain new customers from both those that have waited for token-based auth/auth from OpenShift and from those that are new to OpenShift, with strict requirements around cloud account access
      • We retain customers that are going thru both cloud-native and hybrid-cloud journeys that all inevitably see security requirements driving them towards token-based auth/auth.
      •  

References

As an engineer I want the capability to implement CI test cases that run at different intervals (daily, weekly), so as to ensure downstream operators that depend on certain capabilities are not negatively impacted if the systems CCO interacts with change behavior.

Acceptance Criteria:

Create a stubbed out e2e test path in CCO and matching e2e calling code in release such that there exists a path to tests that verify working in an AWS STS workflow.

oc-mirror is a GA product as of OpenShift 4.11.

The goal of this feature is to address any future customer requests for new features or capabilities in oc-mirror.

In the 4.12 release, a new feature was introduced to oc-mirror allowing it to use OCI FBC catalogs as a starting point for mirroring operators.

Overview

As an oc-mirror user, I would like the OCI FBC feature to be stable
so that I can use it in a production-ready environment
and to make the new feature and all existing features of oc-mirror seamless

Current Status

This feature is ring-fenced in the oc-mirror repository; it uses the following flags to achieve this so as not to cause any breaking changes in the current oc-mirror functionality (a hedged usage sketch follows the list).

  • --use-oci-feature
  • --oci-feature-action (copy or mirror)
  • --oci-registries-config
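
A hedged usage sketch, assuming an ImageSetConfiguration that points at a local OCI FBC catalog (the catalog path, package, and mirror registry are placeholders; the exact flag semantics are described in the design doc linked below):

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
mirror:
  operators:
  - catalog: oci:///home/user/redhat-operator-index   # placeholder local OCI FBC catalog
    packages:
    - name: example-operator                          # placeholder operator package

This would then be combined with the flags above, e.g. --use-oci-feature together with --oci-feature-action and --oci-registries-config, when invoking oc-mirror against the target mirror registry.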

The OCI FBC (file-based catalog) format has been delivered for Tech Preview in 4.12.

Tech Enablement slides can be found here https://docs.google.com/presentation/d/1jossypQureBHGUyD-dezHM4JQoTWPYwiVCM3NlANxn0/edit#slide=id.g175a240206d_0_7

Design doc is in https://docs.google.com/document/d/1-TESqErOjxxWVPCbhQUfnT3XezG2898fEREuhGena5Q/edit#heading=h.r57m6kfc2cwt (also contains latest design discussions around the stories of this epic)

Link to previous working epic https://issues.redhat.com/browse/CFE-538

Contacts for the OCI FBC feature

 

Feature Overview (aka. Goal Summary)  

The OpenShift Assisted Installer is a user-friendly OpenShift installation solution for the various platforms, but focused on bare metal. This very useful functionality should be made available for the IBM zSystem platform.

 

Goals (aka. expected user outcomes)

Use of the OpenShift Assisted Installer to install OpenShift on an IBM zSystem

 

Requirements (aka. Acceptance Criteria):

Using the OpenShift Assisted Installer to install OpenShift on an IBM zSystem 

 

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

 

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

 

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

 

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

 

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

 

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  Initial completion during Refinement status.

 

Interoperability Considerations

Which other projects and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

As a multi-arch development engineer, I would like to ensure that the Assisted Installer workflow is fully functional and supported for z/VM deployments.

Acceptance Criteria

  • Feature is implemented, tested, QE, documented, and technically enabled.
  • Stories closed.

Discovered a regression on staging where the default is set to minimal ISO, preventing installation of OCP 4.13 for the s390x architecture.

See the following older bugs addressing the same issue, I guess:

  1. MGMT-14298

 

Description of the problem:

Using FCP (multipath) devices for a zVM node with the following parmline:

rd.neednet=1 console=ttysclp0 coreos.live.rootfs_url=http://172.23.236.156:8080/assisted-installer/rootfs.img ip=10.14.6.8::10.14.6.1:255.255.255.0:master-0:encbdd0:none nameserver=10.14.6.1 ip=[fd00::8]::[fd00::1]:64::encbdd0:none nameserver=[fd00::1] zfcp.allow_lun_scan=0 rd.znet=qeth,0.0.bdd0,0.0.bdd1,0.0.bdd2,layer2=1 rd.zfcp=0.0.8007,0x500507630400d1e3,0x4000401e00000000 rd.zfcp=0.0.8107,0x50050763040851e3,0x4000401e00000000 random.trust_cpu=on rd.luks.options=discard ignition.firstboot ignition.platform.id=metal console=tty1 console=ttyS1,115200n8

shows a disk limitation error in the UI.

<see attached image>

How reproducible:

Attach two FCP devices to a zVM node, create a cluster, and boot the zVM node into the discovery service. The Host discovery panel shows an error for the discovered host.

Steps to reproduce:

1. Attach two FCP devices to the zVM.

2. Create a new cluster using the AI UI and configure the discovery image.

3. Boot the zVM node.

4. Wait until the node shows up on the Host discovery panel.

5. The FCP devices are not recognized as a valid option.

Actual results:

FCP devices can't be used as an installable disk.

Expected results:
FCP devices can be used for installation (multipath must be activated after installation; see
https://docs.openshift.com/container-platform/4.13/post_installation_configuration/ibmz-post-install.html#enabling-multipathing-fcp-luns_post-install-configure-additional-devices-ibmz)

Description of the problem:

DASD devices are not recognized correctly when attached to and used for a zVM node.
<see attached screenshot>

How reproducible:

Attach two DASD devices to a zVM node, create a cluster, and boot the zVM node into the discovery service. The Host discovery panel shows an error for the discovered host.

Steps to reproduce:

1. Attach two DASD devices to the zVM.

2. Create a new cluster using the AI UI and configure the discovery image.

3. Boot the zVM node.

4. Wait until the node shows up on the Host discovery panel.

5. The DASD devices are not recognized as a valid option.

Actual results:

DASD devices can't be used as an installable disk.

Expected results:
DASD devices can be used for installation. The user can choose which device AI will install to.

Feature Overview (aka. Goal Summary)  

Due to low customer interest in using OpenShift on Alibaba Cloud, we have decided to deprecate and then remove IPI support for Alibaba Cloud.

https://docs.google.com/document/d/1Kp-GrdSHqsymzezLCm0bKrCI71alup00S48QeWFa0q8/edit#heading=h.v75efohim75y 

Goals (aka. expected user outcomes)

4.14

Announcement 

  1. Update cloud.redhat.com with deprecation information 
  2. Update IPI installer code with a warning
  3. Update the release notes with deprecation information
  4. Update the OpenShift docs with deprecation information

4.15

Archive code 

 

Add a deprecation warning to the installer code for anyone trying to install Alibaba via IPI

USER STORY:

As a user of the installer binary, I want to be warned that Alibaba support will be deprecated in 4.15, so that I avoid creating clusters that will soon be unsupported.

DESCRIPTION:

Alibaba support will be decommissioned from both IPI and UPI starting in 4.15. We want to warn users of the 4.14 installer binary who pick 'alibabacloud' from the list of providers.

ACCEPTANCE CRITERIA:

Warning message is displayed after choosing 'alibabacloud'.

ENGINEERING DETAILS:

https://docs.google.com/document/d/1Kp-GrdSHqsymzezLCm0bKrCI71alup00S48QeWFa0q8/edit?usp=sharing_eip_m&ts=647df877

 

Epic Goal

As an OpenShift infrastructure owner I need to deploy OCP on OpenStack with the installer-provisioned infrastructure workflow and configure my own load balancers

Why is this important?

Customers want to use their own load balancers, and IPI comes with built-in LBs based on keepalived and haproxy.

Scenarios

  1. A large deployment routed across multiple failure domains without stretched L2 networks would require dynamically routing the control plane VIP traffic through load balancers capable of living in multiple L2 segments.
  2. Customers who want to use their existing LB appliances for the control plane.

Acceptance Criteria

  • Should we require the support of migration from internal to external LB?
  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • QE - must test a scenario where we disable the internal LB, set up an external LB, and the OCP deployment runs fine.
  • Documentation - we need to document all the gotchas regarding this type of deployment, even the specifics about the load-balancer itself (routing policy, dynamic routing, etc)

Dependencies (internal and external)

  1. Fixed IPs would be very interesting to support, already WIP by vsphere (need to Spike on this): https://issues.redhat.com/browse/OCPBU-179
  2. Confirm with customers that they are ok with external LB or they prefer a new internal LB that supports BGP

Previous Work:

vsphere has done the work already via https://issues.redhat.com/browse/SPLAT-409

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>


 

Notes: https://github.com/EmilienM/ansible-role-routed-lb is an example of an LB that will be used for CI and can be used by QE and customers.


Feature Overview

Goals

  • Support OpenShift to be deployed from day-0 on AWS Local Zones
  • Support an existing OpenShift cluster to deploy compute Nodes on AWS Local Zones (day-2)

AWS Local Zones support - feature delivered in phases:

  • Phase 0 (OCPPLAN-9630): Document how to create compute nodes on AWS Local Zones in day-0 (SPLAT-635)
  • Phase 1 (OCPBU-2): Create an edge compute pool to generate MachineSets for nodes with NoSchedule taints when installing a cluster in an existing VPC with AWS Local Zone subnets (SPLAT-636)
  • Phase 2 (OCPBU-351): Installer automates network resources creation on Local Zone based on the edge compute pool (SPLAT-657)

Requirements

  • A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP requirement gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.
Requirement: CI - MUST be running successfully with test automation. Notes: This is a requirement for ALL features. isMvp: YES
Requirement: Release Technical Enablement. Notes: Provide necessary release enablement details and documents. isMvp: YES

 


USER STORY:


As a [type of user], I want [an action] so that [a benefit/a value].

DESCRIPTION:


Required:

...

Nice to have:

...

ACCEPTANCE CRITERIA:


ENGINEERING DETAILS:


Feature Overview

Testing is one of the main pillars of production-grade software. It helps validate behavior and flag issues early, before the code is shipped into production landscapes. Code changes, no matter how small, might lead to bugs and outages. The best way to catch them is to write proper tests; to run those tests we need a foundation of test infrastructure; and finally, to close the circle, automating these tests and their corresponding builds helps reduce errors and save a lot of time.

Goal(s)

  • How do we get infrastructure, what infrastructure accounts are required?
  • Build e2e integration with openshift-release on AWS.
  • Define MVP CI jobs to validate (e.g., conformance). What tests are failing? Are we skipping any, and why?

Note: Sync with the Developer productivity teams might be required to understand infra requirements especially for our first HyperShift infrastructure backend, AWS.

Context:

This is a placeholder epic to capture all the e2e scenarios that we want to test in CI in the long term. Anything which is a TODO here should at minimum be validated by QE as it is developed.

DoD:

Every supported scenario is e2e CI tested.

Scenarios:

  • Hypershift deployment with services as routes.
  • Hypershift deployment with services as NodePorts.

 

DoD:

Refactor the E2E tests to follow the new pattern with one HostedCluster and targeted NodePools:

  • nodepool_upgrade_test.go

 

Goal

Productize agent-installer-utils container from https://github.com/openshift/agent-installer-utils

Feature Description

In order to ship the network reconfiguration it would be useful to move the agent-tui to its own image instead of sharing the agent-installer-node-agent one.

agent-tui is currently built and shipped using the assisted-installer-agent repo. Since it will be moved into its own repository (agent-installer-utils), it's necessary to clean up the previous code.

Currently the `agent create image` command extracts the agent-tui binary (and the required libraries) from the `assisted-installer-agent` image (shipped in the release as `agent-installer-node-agent`).
Once agent-tui becomes available from the `agent-installer-utils` image instead, the installer code will need to be updated accordingly (see https://github.com/openshift/installer/blob/56e85bee78490c18aaf33994e073cbc16181f66d/pkg/asset/agent/image/agentimage.go#L81)

Feature Overview

Allow users to interactively adjust the network configuration for a host after booting the agent ISO.

Goals

Configure network after host boots

The user has Static IPs, VLANs, and/or bonds to configure, but has no idea of the device names of the NICs. They don't enter any network config in agent-config.yaml. Instead they configure each host's network via the text console after it boots into the image.

Epic Goal

  • Allow users to interactively adjust the network configuration for a host after booting the agent ISO, before starting processes that pull container images.

Why is this important?

  • Configuring the network prior to booting a host is difficult and error-prone. Not only is the nmstate syntax fairly arcane, but the advent of 'predictable' interface names means that interfaces retain the same name across reboots but it is nearly impossible to predict what they will be. Applying configuration to the correct hosts requires correct knowledge and input of MAC addresses. All of these present opportunities for things to go wrong, and when they do the user is forced to return to the beginning of the process and generate a new ISO, then boot all of the hosts in the cluster with it again.

Scenarios

  1. The user has Static IPs, VLANs, and/or bonds to configure, but has no idea of the device names of the NICs. They don't enter any network config in agent-config.yaml. Instead they configure each host's network via the text console after it boots into the image.
  2. The user has Static IPs, VLANs, and/or bonds to configure, but makes an error entering the configuration in agent-config.yaml so that (at least) one host will not be able to pull container images from the release payload. They correct the configuration for that host via the text console before proceeding with the installation.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

The `openshift-install agent create image` command will need to fetch the agent-tui executable so that it can be embedded within the agent ISO. For this reason agent-tui must be available in the release payload, so that it can be retrieved even when the command is invoked in a disconnected environment.
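
As a hedged illustration of that requirement, one way to check whether a given release payload contains the image is with oc adm release info (assuming the payload tag is named agent-installer-utils; the release pullspec below is illustrative):

oc adm release info --image-for=agent-installer-utils \
  quay.io/openshift-release-dev/ocp-release:4.14.0-x86_64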

When the agent-tui is shown during the initial host boot, if the pull-release-image check fails then an additional checks box is shown along with a details text view.
The content of the details view is continuously updated with the details of the failed check, but the user cannot move the focus to the details box (using the arrow/tab keys) and thus cannot scroll its content (using the up/down arrow keys).

Currently the agent-tui always displays the additional checks (nslookup/ping/HTTP GET), even when the primary check (pull image) passes. This may confuse the user, because the additional checks do not prevent the agent-tui from completing successfully; they are merely informative, to allow better troubleshooting of an issue (so they are not needed in the positive case).

The additional checks should then be shown only when the primary check fails for any reason.

When the UI is active in the console, event messages that are generated will distort the interface and make it difficult for the user to view the configuration and select options. An example is shown in the attached screenshot.

Epic Goal

Full support of North-South (cluster egress-ingress) IPsec that shares an encryption back-end with the current East-West implementation, allows for IPsec offload to capable SmartNICs, can be enabled and disabled at runtime, and allows for FIPS compliance (including install-time configuration and disabling of runtime configuration).

Why is this important?

  • Customers want end-to-end default encryption with external servers and/or clients. 

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Must allow for the possibility of offloading the IPsec encryption to a SmartNIC.
  •  

Dependencies (internal and external)

  1.  

Related:

  • ITUP-44 - OpenShift support for North-South OVN IPSec
  • HATSTRAT-33 - Encrypt All Traffic to/from Cluster (aka IPSec as a Service)

Previous Work (Optional):

  1. SDN-717 - Support IPSEC on ovn-kubernetes

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

This is a clone of issue OCPBUGS-17380. The following is the description of the original issue:

Description of problem:

Enable IPSec pre/post install on OVN IC cluster

$ oc patch networks.operator.openshift.io cluster --type=merge -p '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"ipsecConfig":{ }}}}}'
network.operator.openshift.io/cluster patched


ovn-ipsec containers complaining:

ovs-monitor-ipsec | ERR | Failed to import certificate into NSS.
b'certutil:  unable to open "/etc/openvswitch/keys/ipsec-cacert.pem" for reading (-5950, 2).\n'



$ oc rsh ovn-ipsec-d7rx9
Defaulted container "ovn-ipsec" out of: ovn-ipsec, ovn-keys (init)
sh-5.1# certutil -L -d /var/lib/ipsec/nss
Certificate Nickname                                         Trust Attributes
                                                             SSL,S/MIME,JAR/XPI
ovs_certkey_db961f9a-7de4-4f1d-a2fb-a8306d4079c5             u,u,u 

sh-5.1# cat /var/log/openvswitch/libreswan.log
Aug  4 15:12:46.808394: Initializing NSS using read-write database "sql:/var/lib/ipsec/nss"
Aug  4 15:12:46.837350: FIPS Mode: NO
Aug  4 15:12:46.837370: NSS crypto library initialized
Aug  4 15:12:46.837387: FIPS mode disabled for pluto daemon
Aug  4 15:12:46.837390: FIPS HMAC integrity support [disabled]
Aug  4 15:12:46.837541: libcap-ng support [enabled]
Aug  4 15:12:46.837550: Linux audit support [enabled]
Aug  4 15:12:46.837576: Linux audit activated
Aug  4 15:12:46.837580: Starting Pluto (Libreswan Version 4.9 IKEv2 IKEv1 XFRM XFRMI esp-hw-offload FORK PTHREAD_SETSCHEDPRIO GCC_EXCEPTIONS NSS (IPsec profile) (NSS-KDF) DNSSEC SYSTEMD_WATCHDOG LABELED_IPSEC (SELINUX) SECCOMP LIBCAP_NG LINUX_AUDIT AUTH_PAM NETWORKMANAGER CURL(non-NSS) LDAP(non-NSS)) pid:147
Aug  4 15:12:46.837583: core dump dir: /run/pluto
Aug  4 15:12:46.837585: secrets file: /etc/ipsec.secrets
Aug  4 15:12:46.837587: leak-detective enabled
Aug  4 15:12:46.837589: NSS crypto [enabled]
Aug  4 15:12:46.837591: XAUTH PAM support [enabled]
Aug  4 15:12:46.837604: initializing libevent in pthreads mode: headers: 2.1.12-stable (2010c00); library: 2.1.12-stable (2010c00)
Aug  4 15:12:46.837664: NAT-Traversal support  [enabled]
Aug  4 15:12:46.837803: Encryption algorithms:
Aug  4 15:12:46.837814:   AES_CCM_16         {256,192,*128} IKEv1:     ESP     IKEv2:     ESP     FIPS              aes_ccm, aes_ccm_c
Aug  4 15:12:46.837820:   AES_CCM_12         {256,192,*128} IKEv1:     ESP     IKEv2:     ESP     FIPS              aes_ccm_b
Aug  4 15:12:46.837826:   AES_CCM_8          {256,192,*128} IKEv1:     ESP     IKEv2:     ESP     FIPS              aes_ccm_a
Aug  4 15:12:46.837831:   3DES_CBC           [*192]         IKEv1: IKE ESP     IKEv2: IKE ESP     FIPS NSS(CBC)     3des
Aug  4 15:12:46.837837:   CAMELLIA_CTR       {256,192,*128} IKEv1:     ESP     IKEv2:     ESP                      
Aug  4 15:12:46.837843:   CAMELLIA_CBC       {256,192,*128} IKEv1: IKE ESP     IKEv2: IKE ESP          NSS(CBC)     camellia
Aug  4 15:12:46.837849:   AES_GCM_16         {256,192,*128} IKEv1:     ESP     IKEv2: IKE ESP     FIPS NSS(GCM)     aes_gcm, aes_gcm_c
Aug  4 15:12:46.837855:   AES_GCM_12         {256,192,*128} IKEv1:     ESP     IKEv2: IKE ESP     FIPS NSS(GCM)     aes_gcm_b
Aug  4 15:12:46.837861:   AES_GCM_8          {256,192,*128} IKEv1:     ESP     IKEv2: IKE ESP     FIPS NSS(GCM)     aes_gcm_a
Aug  4 15:12:46.837867:   AES_CTR            {256,192,*128} IKEv1: IKE ESP     IKEv2: IKE ESP     FIPS NSS(CTR)     aesctr
Aug  4 15:12:46.837872:   AES_CBC            {256,192,*128} IKEv1: IKE ESP     IKEv2: IKE ESP     FIPS NSS(CBC)     aes
Aug  4 15:12:46.837878:   NULL_AUTH_AES_GMAC {256,192,*128} IKEv1:     ESP     IKEv2:     ESP     FIPS              aes_gmac
Aug  4 15:12:46.837883:   NULL               []             IKEv1:     ESP     IKEv2:     ESP                      
Aug  4 15:12:46.837889:   CHACHA20_POLY1305  [*256]         IKEv1:             IKEv2: IKE ESP          NSS(AEAD)    chacha20poly1305
Aug  4 15:12:46.837892: Hash algorithms:
Aug  4 15:12:46.837896:   MD5                               IKEv1: IKE         IKEv2:                  NSS         
Aug  4 15:12:46.837901:   SHA1                              IKEv1: IKE         IKEv2: IKE         FIPS NSS          sha
Aug  4 15:12:46.837906:   SHA2_256                          IKEv1: IKE         IKEv2: IKE         FIPS NSS          sha2, sha256
Aug  4 15:12:46.837910:   SHA2_384                          IKEv1: IKE         IKEv2: IKE         FIPS NSS          sha384
Aug  4 15:12:46.837915:   SHA2_512                          IKEv1: IKE         IKEv2: IKE         FIPS NSS          sha512
Aug  4 15:12:46.837919:   IDENTITY                          IKEv1:             IKEv2:             FIPS             
Aug  4 15:12:46.837922: PRF algorithms:
Aug  4 15:12:46.837927:   HMAC_MD5                          IKEv1: IKE         IKEv2: IKE              native(HMAC) md5
Aug  4 15:12:46.837931:   HMAC_SHA1                         IKEv1: IKE         IKEv2: IKE         FIPS NSS          sha, sha1
Aug  4 15:12:46.837936:   HMAC_SHA2_256                     IKEv1: IKE         IKEv2: IKE         FIPS NSS          sha2, sha256, sha2_256
Aug  4 15:12:46.837950:   HMAC_SHA2_384                     IKEv1: IKE         IKEv2: IKE         FIPS NSS          sha384, sha2_384
Aug  4 15:12:46.837955:   HMAC_SHA2_512                     IKEv1: IKE         IKEv2: IKE         FIPS NSS          sha512, sha2_512
Aug  4 15:12:46.837959:   AES_XCBC                          IKEv1:             IKEv2: IKE              native(XCBC) aes128_xcbc
Aug  4 15:12:46.837962: Integrity algorithms:
Aug  4 15:12:46.837966:   HMAC_MD5_96                       IKEv1: IKE ESP AH  IKEv2: IKE ESP AH       native(HMAC) md5, hmac_md5
Aug  4 15:12:46.837984:   HMAC_SHA1_96                      IKEv1: IKE ESP AH  IKEv2: IKE ESP AH  FIPS NSS          sha, sha1, sha1_96, hmac_sha1
Aug  4 15:12:46.837995:   HMAC_SHA2_512_256                 IKEv1: IKE ESP AH  IKEv2: IKE ESP AH  FIPS NSS          sha512, sha2_512, sha2_512_256, hmac_sha2_512
Aug  4 15:12:46.837999:   HMAC_SHA2_384_192                 IKEv1: IKE ESP AH  IKEv2: IKE ESP AH  FIPS NSS          sha384, sha2_384, sha2_384_192, hmac_sha2_384
Aug  4 15:12:46.838005:   HMAC_SHA2_256_128                 IKEv1: IKE ESP AH  IKEv2: IKE ESP AH  FIPS NSS          sha2, sha256, sha2_256, sha2_256_128, hmac_sha2_256
Aug  4 15:12:46.838008:   HMAC_SHA2_256_TRUNCBUG            IKEv1:     ESP AH  IKEv2:         AH                   
Aug  4 15:12:46.838014:   AES_XCBC_96                       IKEv1:     ESP AH  IKEv2: IKE ESP AH       native(XCBC) aes_xcbc, aes128_xcbc, aes128_xcbc_96
Aug  4 15:12:46.838018:   AES_CMAC_96                       IKEv1:     ESP AH  IKEv2:     ESP AH  FIPS              aes_cmac
Aug  4 15:12:46.838023:   NONE                              IKEv1:     ESP     IKEv2: IKE ESP     FIPS              null
Aug  4 15:12:46.838026: DH algorithms:
Aug  4 15:12:46.838031:   NONE                              IKEv1:             IKEv2: IKE ESP AH  FIPS NSS(MODP)    null, dh0
Aug  4 15:12:46.838035:   MODP1536                          IKEv1: IKE ESP AH  IKEv2: IKE ESP AH       NSS(MODP)    dh5
Aug  4 15:12:46.838039:   MODP2048                          IKEv1: IKE ESP AH  IKEv2: IKE ESP AH  FIPS NSS(MODP)    dh14
Aug  4 15:12:46.838044:   MODP3072                          IKEv1: IKE ESP AH  IKEv2: IKE ESP AH  FIPS NSS(MODP)    dh15
Aug  4 15:12:46.838048:   MODP4096                          IKEv1: IKE ESP AH  IKEv2: IKE ESP AH  FIPS NSS(MODP)    dh16
Aug  4 15:12:46.838053:   MODP6144                          IKEv1: IKE ESP AH  IKEv2: IKE ESP AH  FIPS NSS(MODP)    dh17
Aug  4 15:12:46.838057:   MODP8192                          IKEv1: IKE ESP AH  IKEv2: IKE ESP AH  FIPS NSS(MODP)    dh18
Aug  4 15:12:46.838061:   DH19                              IKEv1: IKE         IKEv2: IKE ESP AH  FIPS NSS(ECP)     ecp_256, ecp256
Aug  4 15:12:46.838066:   DH20                              IKEv1: IKE         IKEv2: IKE ESP AH  FIPS NSS(ECP)     ecp_384, ecp384
Aug  4 15:12:46.838070:   DH21                              IKEv1: IKE         IKEv2: IKE ESP AH  FIPS NSS(ECP)     ecp_521, ecp521
Aug  4 15:12:46.838074:   DH31                              IKEv1: IKE         IKEv2: IKE ESP AH       NSS(ECP)     curve25519
Aug  4 15:12:46.838077: IPCOMP algorithms:
Aug  4 15:12:46.838081:   DEFLATE                           IKEv1:     ESP AH  IKEv2:     ESP AH  FIPS             
Aug  4 15:12:46.838085:   LZS                               IKEv1:             IKEv2:     ESP AH  FIPS             
Aug  4 15:12:46.838089:   LZJH                              IKEv1:             IKEv2:     ESP AH  FIPS             
Aug  4 15:12:46.838093: testing CAMELLIA_CBC:
Aug  4 15:12:46.838096:   Camellia: 16 bytes with 128-bit key
Aug  4 15:12:46.838162:   Camellia: 16 bytes with 128-bit key
Aug  4 15:12:46.838201:   Camellia: 16 bytes with 256-bit key
Aug  4 15:12:46.838243:   Camellia: 16 bytes with 256-bit key
Aug  4 15:12:46.838280: testing AES_GCM_16:
Aug  4 15:12:46.838284:   empty string
Aug  4 15:12:46.838319:   one block
Aug  4 15:12:46.838352:   two blocks
Aug  4 15:12:46.838385:   two blocks with associated data
Aug  4 15:12:46.838424: testing AES_CTR:
Aug  4 15:12:46.838428:   Encrypting 16 octets using AES-CTR with 128-bit key
Aug  4 15:12:46.838464:   Encrypting 32 octets using AES-CTR with 128-bit key
Aug  4 15:12:46.838502:   Encrypting 36 octets using AES-CTR with 128-bit key
Aug  4 15:12:46.838541:   Encrypting 16 octets using AES-CTR with 192-bit key
Aug  4 15:12:46.838576:   Encrypting 32 octets using AES-CTR with 192-bit key
Aug  4 15:12:46.838613:   Encrypting 36 octets using AES-CTR with 192-bit key
Aug  4 15:12:46.838651:   Encrypting 16 octets using AES-CTR with 256-bit key
Aug  4 15:12:46.838687:   Encrypting 32 octets using AES-CTR with 256-bit key
Aug  4 15:12:46.838724:   Encrypting 36 octets using AES-CTR with 256-bit key
Aug  4 15:12:46.838763: testing AES_CBC:
Aug  4 15:12:46.838766:   Encrypting 16 bytes (1 block) using AES-CBC with 128-bit key
Aug  4 15:12:46.838801:   Encrypting 32 bytes (2 blocks) using AES-CBC with 128-bit key
Aug  4 15:12:46.838841:   Encrypting 48 bytes (3 blocks) using AES-CBC with 128-bit key
Aug  4 15:12:46.838881:   Encrypting 64 bytes (4 blocks) using AES-CBC with 128-bit key
Aug  4 15:12:46.838928: testing AES_XCBC:
Aug  4 15:12:46.838932:   RFC 3566 Test Case 1: AES-XCBC-MAC-96 with 0-byte input
Aug  4 15:12:46.839126:   RFC 3566 Test Case 2: AES-XCBC-MAC-96 with 3-byte input
Aug  4 15:12:46.839291:   RFC 3566 Test Case 3: AES-XCBC-MAC-96 with 16-byte input
Aug  4 15:12:46.839444:   RFC 3566 Test Case 4: AES-XCBC-MAC-96 with 20-byte input
Aug  4 15:12:46.839600:   RFC 3566 Test Case 5: AES-XCBC-MAC-96 with 32-byte input
Aug  4 15:12:46.839756:   RFC 3566 Test Case 6: AES-XCBC-MAC-96 with 34-byte input
Aug  4 15:12:46.839937:   RFC 3566 Test Case 7: AES-XCBC-MAC-96 with 1000-byte input
Aug  4 15:12:46.840373:   RFC 4434 Test Case AES-XCBC-PRF-128 with 20-byte input (key length 16)
Aug  4 15:12:46.840529:   RFC 4434 Test Case AES-XCBC-PRF-128 with 20-byte input (key length 10)
Aug  4 15:12:46.840698:   RFC 4434 Test Case AES-XCBC-PRF-128 with 20-byte input (key length 18)
Aug  4 15:12:46.840990: testing HMAC_MD5:
Aug  4 15:12:46.840997:   RFC 2104: MD5_HMAC test 1
Aug  4 15:12:46.841200:   RFC 2104: MD5_HMAC test 2
Aug  4 15:12:46.841390:   RFC 2104: MD5_HMAC test 3
Aug  4 15:12:46.841582: testing HMAC_SHA1:
Aug  4 15:12:46.841585:   CAVP: IKEv2 key derivation with HMAC-SHA1
Aug  4 15:12:46.842055: 8 CPU cores online
Aug  4 15:12:46.842062: starting up 7 helper threads
Aug  4 15:12:46.842128: started thread for helper 0
Aug  4 15:12:46.842174: helper(1) seccomp security disabled for crypto helper 1
Aug  4 15:12:46.842188: started thread for helper 1
Aug  4 15:12:46.842219: helper(2) seccomp security disabled for crypto helper 2
Aug  4 15:12:46.842236: started thread for helper 2
Aug  4 15:12:46.842258: helper(3) seccomp security disabled for crypto helper 3
Aug  4 15:12:46.842269: started thread for helper 3
Aug  4 15:12:46.842296: helper(4) seccomp security disabled for crypto helper 4
Aug  4 15:12:46.842311: started thread for helper 4
Aug  4 15:12:46.842323: helper(5) seccomp security disabled for crypto helper 5
Aug  4 15:12:46.842346: started thread for helper 5
Aug  4 15:12:46.842369: helper(6) seccomp security disabled for crypto helper 6
Aug  4 15:12:46.842376: started thread for helper 6
Aug  4 15:12:46.842390: using Linux xfrm kernel support code on #1 SMP PREEMPT_DYNAMIC Thu Jul 20 09:11:28 EDT 2023
Aug  4 15:12:46.842393: helper(7) seccomp security disabled for crypto helper 7
Aug  4 15:12:46.842707: selinux support is NOT enabled.
Aug  4 15:12:46.842728: systemd watchdog not enabled - not sending watchdog keepalives
Aug  4 15:12:46.843813: seccomp security disabled
Aug  4 15:12:46.848083: listening for IKE messages
Aug  4 15:12:46.848252: Kernel supports NIC esp-hw-offload
Aug  4 15:12:46.848534: adding UDP interface ovn-k8s-mp0 10.129.0.2:500
Aug  4 15:12:46.848624: adding UDP interface ovn-k8s-mp0 10.129.0.2:4500
Aug  4 15:12:46.848654: adding UDP interface br-ex 169.254.169.2:500
Aug  4 15:12:46.848681: adding UDP interface br-ex 169.254.169.2:4500
Aug  4 15:12:46.848713: adding UDP interface br-ex 10.0.0.8:500
Aug  4 15:12:46.848740: adding UDP interface br-ex 10.0.0.8:4500
Aug  4 15:12:46.848767: adding UDP interface lo 127.0.0.1:500
Aug  4 15:12:46.848793: adding UDP interface lo 127.0.0.1:4500
Aug  4 15:12:46.848824: adding UDP interface lo [::1]:500
Aug  4 15:12:46.848853: adding UDP interface lo [::1]:4500
Aug  4 15:12:46.851160: loading secrets from "/etc/ipsec.secrets"
Aug  4 15:12:46.851214: no secrets filename matched "/etc/ipsec.d/*.secrets"
Aug  4 15:12:47.053369: loading secrets from "/etc/ipsec.secrets"

sh-4.4# tcpdump -i any esp
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked v1), capture size 262144 bytes
^C
0 packets captured

sh-5.1# ovn-nbctl --no-leader-only get nb_global . ipsec
false
 

Version-Release number of selected component (if applicable):

openshift/cluster-network-operator#1874 

How reproducible:

Always

Steps to Reproduce:

1. Install an OVN cluster and enable IPsec at runtime
2.
3.

Actual results:

no esp packets seen across the nodes

Expected results:

esp traffic should be seen across the nodes
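
A hedged verification sketch, reusing the commands already captured above: once IPsec is working, ESP packets should be visible on the node interfaces and the OVN north-bound flag should report true.

# capture ESP traffic on any interface
tcpdump -i any -nn esp

# the OVN north-bound ipsec flag should report true
ovn-nbctl --no-leader-only get nb_global . ipsec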

Additional info:

 

oc-mirror is a GA product as of OpenShift 4.11.

The goal of this feature is to address any future customer requests for new features or capabilities in oc-mirror.

Overview

This epic is a simple tracker epic for the proposed work and analysis for 4.14 delivery

As an oc-mirror user, I would like mirrored operator catalogs to have valid caches that reflect the contents of the catalog (the configs folder), based on the filtering done in the ImageSetConfig for that catalog, so that the catalog image starts efficiently in a cluster.

Tasks:

  • white-out /tmp on all manifests (per platform)
  • Recreate the cache under /tmp/cache by:
    • extracting the whole catalog
    • using the opm binary included in the extracted catalog to run (command line)
      opm serve /configs --cache-dir /tmp/cache --cache-only
  • Create a new layer from /configs and /tmp/cache
    • /tmp is compatible with all platforms
  • oc-mirror should not change the CMD nor the ENTRYPOINT of the image
  • Rebuild the catalog image up to the index (manifest list); a rough shell sketch of these steps is shown below
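
A rough shell sketch of the steps above, assuming the filtered catalog image already exists; the image reference, extraction path, and the location of the opm binary (taken from the extracted catalog image) are illustrative:

# 1. Extract the (filtered) declarative config from the catalog image.
mkdir -p /tmp/configs
oc image extract registry.example.com/mirror/redhat-operator-index:v4.14 \
  --path /configs/:/tmp/configs

# 2. Regenerate the cache with the opm binary shipped inside the catalog.
opm serve /tmp/configs --cache-dir /tmp/cache --cache-only

# 3. Optionally verify that the cache matches the configs folder.
opm serve /tmp/configs --cache-dir /tmp/cache --cache-only --cache-enforce-integrity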

Acceptance criteria:

  • Run the catalog container with command opm serve <configDir> --cache-dir=<cacheDir> --cache-only --cache-enforce-integrity to verify the integrity of the cache
  • 4.14 catalogs mirrored with oc-mirror v4.14 run correctly in a cluster
    • when mirrored with mirrorToMirror workflow
    • when mirrored with mirrorToMirror workflow with --include-oci-local-catalogs
    • when mirrored with mirrorToDisk + diskToMirror workflow
  • 4.14 catalogs mirrored with oc-mirror v4.14 use the pre-computed cache (not sure how to test this)
  • catalogs <= 4.13 mirrored with oc-mirror v4.14 run correctly in a cluster (this is not something we publish as supported)

Proposed title of this feature request

Achieve feature parity for recently introduced functionality for all modes of operation

Nature and description of the request

Currently there are gaps in functionality within oc mirror that we would like addressed.

1. Support oci: references within mirror.operators[].catalog in an ImageSetConfiguration when running in all modes of operation with the full functionality provided by oc mirror.

Currently oci: references such as the following are allowed only in limited circumstances:

mirror:
  operators:
  - catalog: oci:///tmp/oci/ocp11840
  - catalog: icr.io/cpopen/ibm-operator-catalog
 

Currently supported scenarios

  • Mirror to Mirror

In this mode of operation the images are fetched from the oci: reference rather than being pulled from a source docker image repository. These catalogs are processed through similar (yet different) mechanisms compared to docker image references. The end result in this scenario is that the catalog is potentially modified and images (i.e. catalog, bundle, related images, etc.) are pushed to their final docker image repository. This provides the full capabilities offered by oc mirror (e.g. catalog "filtering", image pruning, metadata manipulation to keep track of what has been mirrored, etc.)

Desired scenarios
In the following scenarios we would like oci: references to be processed in a similar way to how docker references are handled (as close as possible anyway given the different APIs involved). Ultimately we want oci: catalog references to provide the full set of functionality currently available for catalogs provided as a docker image reference. In other words we want full feature parity (e.g. catalog "filtering", image pruning, metadata manipulation to keep track of what has been mirrored, etc.)

  • Mirror to Disk

In this mode of operation the images are fetched from the oci: reference rather than being pulled from a docker image repository. These catalogs are processed through similar yet different mechanisms compared to docker image references. The end result of this scenario is that all mappings and catalogs are packaged into tar archives (i.e. the "imageset").

  • Disk to Mirror

In this mode of operation the tar archives (i.e. the "imageset") are processed via the "publish mechanism" which means unpacking the tar archives, processing the metadata, pruning images, rebuilding catalogs, and pushing images to their destination. In theory if the mirror-to-disk scenario is handled properly, then this mode should "just work".
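
For reference, a hedged sketch of the two halves of this workflow with today's CLI (the archive path and target registry are illustrative):

# mirror to disk: package the images and catalogs into a local imageset archive
oc mirror --config=imageset-config.yaml file://./imageset

# disk to mirror: publish the archive to the target registry
oc mirror --from=./imageset docker://registry.example.com/mirror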

Below the line is the original RFE requesting the OCI feature; it is provided only for reference.

 

Description of problem:

The customer was able to limit the nested repository path with "oc adm catalog mirror" by using the "--max-components" argument, but there is no equivalent option in the "oc-mirror" binary, even though we recommend using "oc-mirror" for mirroring. For example:
Mirroring will work if we mirror like below
oc mirror --config=./imageset-config.yaml docker://registry.gitlab.com/xxx/yyy
Mirroring will fail with 401 unauthorized if we add one more nested path like below
oc mirror --config=./imageset-config.yaml docker://registry.gitlab.com/xxx/yyy/zzz

Version-Release number of selected component (if applicable):

 

How reproducible:

We can reproduce the issue by using a registry that does not support deep nested paths.

Steps to Reproduce:

1. Create an imageset to mirror any operator:

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig:
  local:
    path: ./oc-mirror-metadata
mirror:
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.12
    packages:
    - name: local-storage-operator
      channels:
      - name: stable

2. Mirror to a registry that does not support deep nested repository paths. Here it is GitLab, which does not support nesting beyond 3 levels deep.

oc mirror --config=./imageset-config.yaml docker://registry.gitlab.com/xxx/yyy/zzz

This mirroring will fail with a 401 unauthorized error.

3. If we try to mirror the same imageset after removing one path level, it will work without any issues, like below:

oc mirror --config=./imageset-config.yaml docker://registry.gitlab.com/xxx/yyy 

Actual results:

 

Expected results:

Need an alternative to the "--max-components" option to limit the nested path depth in "oc-mirror".

Additional info:

 

This is a followup to https://issues.redhat.com/browse/OPNET-13. In that epic we implemented limited support for dual stack on VSphere, but due to limitations in upstream Kubernetes we were not able to support all of the use cases we do on baremetal. This epic is to track our work up and downstream to finish the dual stack implementation.

Goal:
As a cluster administrator, I want OpenShift to include a recent HAProxy version, so that I have the latest available performance and security fixes.  

 Description:
We should strive to follow upstream HAProxy releases by bumping the HAProxy version that we ship in OpenShift with every 4.y release, so that OpenShift benefits from upstream performance and security fixes, and so that we avoid large version-number jumps when an urgent fix necessitates bumping to the latest HAProxy release.  This bump should happen as early as possible in the OpenShift release cycle, so as to maximize soak time.   

For OpenShift 4.13, this means bumping to 2.6.  

As a cluster administrator, 

I want OpenShift to include a recent HAProxy version, 

so that I have the latest available performance and security fixes.  

 

We should strive to follow upstream HAProxy releases by bumping the HAProxy version that we ship in OpenShift with every 4.y release, so that OpenShift benefits from upstream performance and security fixes, and so that we avoid large version-number jumps when an urgent fix necessitates bumping to the latest HAProxy release.  This bump should happen as early as possible in the OpenShift release cycle, so as to maximize soak time.   

For OpenShift 4.14, this means bumping to 2.6.  

Bump the HAProxy version in dist-git so that OCP 4.13 ships HAProxy 2.6.13, with this patch added on top: https://git.haproxy.org/?p=haproxy-2.6.git;a=commit;h=2b0aafdc92f691bc4b987300c9001a7cc3fb8d08. The patch fixes the segfault that was being tracked as OCPBUGS-13232.

This patch is in HAProxy 2.6.14, so we can stop carrying the patch once we bump to HAProxy 2.6.14 or newer in a subsequent OCP release.

Feature Overview (aka. Goal Summary)  

Tang-enforced, network-bound disk encryption has been available in OpenShift for some time, but all intended Tang-endpoints contributing unique key material to the process must be reachable during RHEL CoreOS provisioning in order to complete deployment.

If a user wants to require that 3 of 6 Tang servers be reachable, then all 6 must be reachable during the provisioning process. This might not be possible due to maintenance, an outage, or simply network policy during deployment. 

Enabling offline provisioning for first boot will help all of these scenarios.

 

Goals (aka. expected user outcomes)

The user can now provision a cluster with some or none of the Tang servers reachable on first boot. The second boot, of course, will be subject to the configured Tang requirements.
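
As a hedged illustration of the "3 of 6" scenario, a Butane sketch (openshift variant) might look roughly like this; the URLs and thumbprints are placeholders and the remaining Tang entries are omitted for brevity:

variant: openshift
version: 4.14.0
metadata:
  name: worker-tang
  labels:
    machineconfiguration.openshift.io/role: worker
boot_device:
  luks:
    threshold: 3
    tang:
      - url: http://tang1.example.com:7500
        thumbprint: PLACEHOLDER_THUMBPRINT
      - url: http://tang2.example.com:7500
        thumbprint: PLACEHOLDER_THUMBPRINT
      # ...remaining Tang servers listed the same way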

Done when:

  • Ignition spec default has been updated to 3.4
  • reconcile field (dependent on ignition 3.4)
  • consider Tang rotation? (write another epic)

This requires the messy/complex work of grepping through for prior references to Ignition and updating the Golang types that reference other versions.

Assumption that existing tests are sufficient to catch discrepancies. 

Goal

  • Allow users to set different Root volume types for each Control plane machine as a day-2 operation through CPMS
  • Allow users to set different Root volume types for each Control plane machine as install-time configuration through install-config

Why is this important?

  • In some OpenStack clouds, volume types are used to target separate OpenStack failure domains. With this feature, users can spread the Control plane root volumes across separate OpenStack failure domains using the ControlPlaneMachineSet (see the sketch below).
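
A hedged sketch of what that might look like in the ControlPlaneMachineSet failure domains; the field names follow the OpenStack failure-domain shape in openshift/api as understood here, and the zone and volume-type names are made up:

failureDomains:
  platform: OpenStack
  openstack:
  - availabilityZone: az0
    rootVolume:
      availabilityZone: cinder-az0
      volumeType: fast-az0
  - availabilityZone: az1
    rootVolume:
      availabilityZone: cinder-az1
      volumeType: fast-az1
  - availabilityZone: az2
    rootVolume:
      availabilityZone: cinder-az2
      volumeType: fast-az2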

Acceptance Criteria

  • Once the CPMS is updated with different root volume types in the failure domains, the CCPMSO spins up new master machines with their root volumes spread across those types.

Dependencies (internal and external)

  1. OpenShift-on-OpenStack integration with CPMS (OSASINFRA-3100)

Previous Work (Optional):

  1. 4.13 FailureDomains tech preview (OSASINFRA-2998)

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

FID: https://docs.google.com/document/d/1OEB7Vml1-TmpWZWbvHhf3lnrtEU5JZt2Sptcnu3Kv2I/edit#heading=h.fu58ua5viwam

  • add the JSON array controlPlane.platform.openstack.rootVolume.types (notice the "s") in install-config (this is an API addition); an illustrative install-config fragment is shown after this list
  • add validation to prevent both rootVolume.type and rootVolume.types from being set
  • add validation to ensure that when the variable fields (compute availability zones, storage availability zones, root volume types) have more than one value, they have equal lengths
  • change Machine generation to vary rootVolume.volumeType according to the machine-pool rootVolume.types
  • instrument the Terraform code to apply the variable volume types
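
An illustrative install-config fragment using the proposed types array; the sizes, zone names, and volume-type names are made up:

controlPlane:
  name: master
  replicas: 3
  platform:
    openstack:
      zones: ["az0", "az1", "az2"]
      rootVolume:
        size: 100
        types: ["fast-az0", "fast-az1", "fast-az2"]
        zones: ["cinder-az0", "cinder-az1", "cinder-az2"]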

Goal

Allow the OpenShift installer to point to an existing OVA image stored in vSphere, replacing the current method that uploads the OVA template every time an OpenShift cluster is installed.

Why is this important?

This improvement makes installation more efficient by not having to upload an OVA from wherever openshift-install is running every time a cluster is installed, saving time and bandwidth. For example, if an administrator is installing over a VPN, the OVA is currently uploaded through it to the target environment for every OpenShift cluster installed. Having a centralized OVA ready to use makes the administration process more efficient, since new clusters can be installed without uploading the OVA from where the installer is run.

Epic Goal

  • To allow the use of a pre-existing RHCOS virtual machine or template via the IPI installer.

Why is this important?

  • It is a very common workflow in vSphere to upload an OVA. In the disconnected scenario, the requirement to set up a local web server, copy an OVA to that web server, and then run the installer is a poor experience.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

Rebase openshift-etcd to latest upstream stable version 3.5.9

Goals (aka. expected user outcomes)

OpenShift openshift-etcd should benefit from the latest enhancements on version 3.5.9

 

https://github.com/etcd-io/etcd/issues/13538

We're currently on etcd 3.5.6; since then there has been at least one newer release. This epic description tracks changes that we need to pay attention to:

 

Golang 1.17 update

In 3.5.7 etcd was moved to 1.17 to address some vulnerabilities:

https://github.com/etcd-io/etcd/blob/main/CHANGELOG/CHANGELOG-3.5.md#go

We need to update our definitions in the release repo to match this and test what impact it has.

EDIT: now moving onto 1.19 directly: https://github.com/etcd-io/etcd/pull/15337

 

WAL fix carry

3.5.6 had a nasty WAL bug that was hit by some customers; it was fixed with https://github.com/etcd-io/etcd/pull/15069

Due to the Golang upgrade we carried that patch through OCPBUGS-5458

When we upgrade we need to ensure the commits are properly handled and ordered with this carry.

 

IPv6 Formatting

There were some comparison issues where the same IPv6 addresses had different formats. This was fixed in https://github.com/etcd-io/etcd/pull/15187 and we need to test what impact this has on our IPv6-based SKUs.

 

serializable memberlist 

This is a carry we have had for some time: https://github.com/openshift/etcd/commit/26d7d842f6fb968e55fa5dbbd21bd6e4ea4ace50

This is now officially fixed (in a slightly different way) with the options pattern in: https://github.com/etcd-io/etcd/pull/15261 

We need to drop the carry patch and take the upstream version when rebasing.

 

 


Feature Goal

  • Enable platform=external to support onboarding new partners, e.g. Oracle Cloud Infrastructure and VCSP partners.
  • Create a new platform type, working name "External", that will signify when a cluster is deployed on a partner infrastructure where core cluster components have been replaced by the partner. “External” is different from our current platform types in that it will signal that the infrastructure is specifically not “None” or any of the known providers (eg AWS, GCP, etc). This will allow infrastructure partners to clearly designate when their OpenShift deployments contain components that replace the core Red Hat components.

This work will require updates to the core OpenShift API repository to add the new platform type, and then a distribution of this change to all components that use the platform type information. For components that partners might replace, per-component action will need to be taken, with the project team's guidance, to ensure that the component properly handles the "External" platform. These changes will look slightly different for each component.
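
As a hedged illustration, the resulting Infrastructure object for such a cluster might look roughly like this (the platformName value is an example):

apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
spec:
  platformSpec:
    type: External
    external:
      platformName: partner-cloud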

To integrate these changes more easily into OpenShift, it is possible to take a multi-phase approach which could be spread over a release boundary (eg phase 1 is done in 4.X, phase 2 is done in 4.X+1).

OCPBU-5: Phase 1

  • Write platform “External” enhancement.
  • Evaluate changes to cluster capability annotations to ensure coverage for all replaceable components.
  • Meet with component teams to plan specific changes that will allow for supplement or replacement under platform "External".
  • Start implementing changes towards Phase 2.

OCPBU-510: Phase 2

  • Update OpenShift API with new platform and ensure all components have updated dependencies.
  • Update capabilities API to include coverage for all replaceable components.
  • Ensure all Red Hat operators tolerate the "External" platform and treat it the same as "None" platform.

OCPBU-329: Phase.Next

  • TBD

Why is this important?

  • As partners begin to supplement OpenShift's core functionality with their own platform specific components, having a way to recognize clusters that are in this state helps Red Hat created components to know when they should expect their functionality to be replaced or supplemented. Adding a new platform type is a significant data point that will allow Red Hat components to understand the cluster configuration and make any specific adjustments to their operation while a partner's component may be performing a similar duty.
  • The new platform type also helps with support to give a clear signal that a cluster has modifications to its core components that might require additional interaction with the partner instead of Red Hat. When combined with the cluster capabilities configuration, the platform "External" can be used to positively identify when a cluster is being supplemented by a partner, and which components are being supplemented or replaced.

Scenarios

  1. A partner wishes to replace the Machine controller with a custom version that they have written for their infrastructure. Setting the platform to "External" and advertising the Machine API capability gives a clear signal to the Red Hat created Machine API components that they should start the infrastructure generic controllers but not start a Machine controller.
  2. A partner wishes to add their own Cloud Controller Manager (CCM) written for their infrastructure. Setting the platform to "External" and advertising the CCM capability gives a clear signal to the Red Hat created CCM operator that the cluster should be configured for an external CCM that will be managed outside the operator. Although the Red Hat operator will not provide this functionality, it will configure the cluster to expect a CCM.

Acceptance Criteria

Phase 1

  • Partners can read "External" platform enhancement and plan for their platform integrations.
  • Teams can view jira cards for component changes and capability updates and plan their work as appropriate.

Phase 2

  • Components running in cluster can detect the “External” platform through the Infrastructure config API
  • Components running in cluster react to “External” platform as if it is “None” platform
  • Partners can disable any of the platform specific components through the capabilities API

Phase 3

  • Components running in cluster react to the “External” platform based on their function.
    • for example, the Machine API Operator needs to run a set of controllers that are platform agnostic when running in platform “External” mode.
    • the specific component reactions are difficult to predict currently, this criteria could change based on the output of phase 1.

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. Identifying OpenShift Components for Install Flexibility

Open questions::

  1. Phase 1 requires talking with several component teams, the specific action that will be needed will depend on the needs of the specific component. At the least the components need to treat platform "External" as "None", but there could be more changes depending on the component (eg Machine API Operator running non-platform specific controllers).

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal

  • Empower External platform type user to specify when they will run their own CCM

Why is this important?

  • For partners wishing to use components that require zonal awareness provided by the infrastructure (for example CSI drivers), they will need to run their own cloud controller managers. This epic is about adding the proper configuration to OpenShift to allow users of the External platform type to run their own CCMs.

Scenarios

  1. As a Red Hat partner, I would like to deploy OpenShift with my own CSI driver. To do this I need my CCM deployed as well. Having a way to instruct OpenShift to expect an external CCM deployment would allow me to do this.

Acceptance Criteria

  • CI - A new periodic test based on the External platform test would be ideal
  • Release Technical Enablement - Provide necessary release enablement details and documents.
    • Update docs.ci.openshift.org with CCM docs

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. https://github.com/openshift/enhancements/blob/master/enhancements/cloud-integration/infrastructure-external-platform-type.md#api-extensions
  2. https://github.com/openshift/api/pull/1409

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story

As a user I want to use the openshift installer to create clusters of platform type External so that I can use openshift more effectively on a partner provider platform.

Background

To fully support the External platform type for partners and users, the installer should understand when it sees the external platform type in the install-config.yaml, and then properly populate the resulting infrastructure config object with the external platform type and platform name.

As defined in https://github.com/openshift/api/blob/master/config/v1/types_infrastructure.go#L241 , the external platform type allows the user to specify a name for the platform. This card is about updating the installer so that a user can provide both the external type and a platform name that will be expressed in the infrastructure manifest.

Aside from this information, the installer should continue with a normal platform "None" installation.
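As a rough, non-authoritative sketch of what this could look like in install-config.yaml (the platform name value is purely illustrative):

```yaml
# Sketch only: selecting the External platform and a partner-supplied
# platform name in install-config.yaml. The name value is illustrative.
platform:
  external:
    platformName: my-partner-platform
```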

Steps

  • update installer to allow platform "External" to be specified in the install-config.yaml
  • update installer to allow the platform name to be specified as part of the External platform configuration

Stakeholders

  • openshift cloud infra team
  • openshift installer team
  • openshift assisted installer team

Definition of Done

  • user can specify external platform in the install-config.yaml and have a cluster with External platform type and a name for the platform.
  • cluster installs as expected for platform external (similar to none)
  • Docs
  • Testing
  • this feature should allow us to update our external platform tests to make installation easier; tests should be updated to include this methodology

User Story

As a Red Hat Partner installing OpenShift using the External platform type, I would like to install my own Cloud Controller Manager (CCM). Having a field in the Infrastructure configuration object to signal that I will install my own CCM, and that Kubernetes should be configured to expect an external CCM, will allow me to run my own CCM on new OpenShift deployments.

Background

This work has been defined in the External platform enhancement, and had previously been part of openshift/api. The CCM API pieces were removed for the 4.13 release of OpenShift to ensure that we did not ship unused portions of the API.

In addition to the API changes, library-go will need an update to the IsCloudProviderExternal function to detect if the External platform is selected and if the CCM should be enabled for external mode.

We will also need to check the ObserveCloudVolumePlugin function to ensure that it is not affected by the external changes and that it continues to use the external volume plugin.

After updating openshift/library-go, it will need to be re-vendored into the MCO, KCMO, and CCCMO (although this is not as critical as the other two).
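For orientation, here is a rough sketch of the Infrastructure shape this story targets; the field names follow the re-reverted API PR as best understood and should be treated as approximate:

```yaml
# Approximate sketch of the Infrastructure config after the CCM fields land;
# see openshift/api#1409 for the authoritative schema.
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
spec:
  platformSpec:
    type: External
status:
  platformStatus:
    type: External
    external:
      cloudControllerManager:
        state: External   # signals components (via IsCloudProviderExternal) to expect an external CCM
```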

Steps

  • update openshift/api with new CCM fields (re-revert #1409)
  • revendor api to library-go
  • update IsCloudProviderExternal in library-go to observe the new API fields
  • investigate ObserveCloudVolumePlugin to see if it requires changes
  • revendor library-go to MCO, KCMO, and CCCMO
  • update enhancement doc to reflect state

Stakeholders

  • openshift eng
  • oracle cloud install effort

Definition of Done

  • OpenShift can be installed with the External platform type, with the kubelet and related components using the external cloud provider flags.
  • Docs
  • this will need to be documented in the API and as part of OCPCLOUD-1581
  • Testing
  • this will need validation through unit tests; integration testing may be difficult as we will need a new e2e built off the external platform with a CCM

Feature Overview

In the initial delivery of CoreOS Layering, it is required that administrators provide their own build environment to customize RHCOS images. That could be a traditional RHEL environment or potentially an enterprising administrator with some knowledge of OCP Builds could set theirs up on-cluster.

The primary virtue of an on-cluster build path is to continue using the cluster to manage the cluster. No external dependency, batteries-included.

 

MVP: bring the off-cluster build environment on-cluster

    • Repo control
      • rpm-ostree needs repo management commands
    • Entitlement management

In the context of the Machine Config Operator (MCO) in Red Hat OpenShift, on-cluster builds refer to the process of building an OS image directly on the OpenShift cluster, rather than building them outside the cluster (such as on a local machine or continuous integration (CI) pipeline) and then making a configuration change so that the cluster uses them. By doing this, we enable cluster administrators to have more control over the contents and configuration of their clusters’ OS image through a familiar interface (MachineConfigs and in the future, Dockerfiles).

At the layering sync meeting on Thursday, August 10th, it was decided that for this to be considered ready for Dev / Tech Preview, cluster admins need a way to inject custom Dockerfiles into their on-cluster builds.

 

(Commentary: It was also decided 4 months ago that this was not an MVP requirement in https://docs.google.com/document/d/1QSsq0mCgOSUoKZ2TpCWjzrQpKfMUL9thUFBMaPxYSLY/edit#heading=h.jqagm7kwv0lg. And quite frankly, this requirement should have been known at that point in time as opposed to the week before tech preview.)

To speed development for on-cluster builds and avoid a lot of complex code paths, the decision was made to put all functionality related to building OS images and managing internal registries into a separate binary within the MCO.

Eventually, this binary will be responsible for running the productionized BuildController and know how to respond to Machine OS Builder API objects. However, until the productionized BuildController and opt-in portions are ready, the first pass of this binary will be much simpler: For now, it can connect to the API server and print a "Hello World".

 

Done When:

  • We have a new binary under cmd/machine-os-builder. This binary will be built alongside the current MCO components and will be baked into the MCO image.
  • The Dockerfile, Makefile, and build scripts will need some modification so that they know how to build cmd/machine-os-builder.
  • A Deployment manifest is created under manifests/ which is set up to start a single instance of the new binary, though we don't want it to start by default right now since it won't do anything useful.

This is the "consumption" side of the security – rpm-ostree needs to be able to retrieve images from the internal registry seamlessly.

This will involve setting up (or using some existing) pull secrets, and then getting them to the proper location on disk so that rpm-ostree can use them to pull images.

The second phase of the layering effort involved creating a BuildController, whose job is to start and manage builds of OS images. While it should be able to perform those functions on its own, getting the built OS image onto each of the cluster nodes involves modifying other parts of the MCO to be layering-aware. To that end, there are three pieces involved, some of which will require modification:

Render Controller

Right now, the render controller listens for incoming MachineConfig changes. It generates the rendered config, which is composed of all of the MachineConfigs for a given MachineConfigPool. Once rendered, the Render Controller updates the MachineConfigPool to point to the new config. This portion of the MCO will likely not need any modification that I'm aware of at the moment.

Node Controller

The Node Controller listens for MachineConfigPool config changes. Whenever it identifies that a change has occurred, it applies the machineconfiguration.openshift.io/desiredConfig annotation to all the nodes in the targeted MachineConfigPool which causes the Machine Config Daemon (MCD) to apply the new configs. With this new layering mechanism, we'll need to add the additional annotation of machineconfiguration.openshift.io/desiredOSimage which will contain the fully-qualified pullspec for the new OS image (referenced by the image SHA256 sum). To be clear, we will not be replacing the desiredConfig annotation with the desiredOSimage annotation; both will still be used. This will allow Config Drift Monitor to continue to function the way it does with no modification required.
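For illustration only (the rendered config name and image digest below are placeholders), a node in a layering-enabled pool would carry both annotations side by side:

```yaml
# Illustrative node annotations; all values are placeholders.
apiVersion: v1
kind: Node
metadata:
  name: worker-0
  annotations:
    machineconfiguration.openshift.io/desiredConfig: rendered-worker-abc123
    machineconfiguration.openshift.io/desiredOSimage: image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/coreos@sha256:abcdef1234567890
```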

Machine Config Daemon

Right now, the MCD listens to Node objects for changes to the machineconfiguration.openshift.io/desiredConfig annotation. With the new desiredOSimage annotation being present, the MCD will need to skip the parts of the update loop which write files and systemd units to disk. Instead, it will skip directly to the rpm-ostree application phase (after making sure the correct pull secrets are in place, etc.).

 

Done When:

  • The above modifications are made.
  • Each modification has been done with appropriate unit tests where feasible.

The first phase of the layering effort involved creating a BuildController, whose job is to start and manage builds using the OpenShift Build API. We can use the work done to create the BuildController as the basis for our MVP. However, what we need from BuildController right now is less than BuildController currently provides. With that in mind, we need to remove certain parts of BuildController to create a more streamlined and simpler implementation ideal for an MVP.

 

Done when a version of BuildController is landed which does the following things:

  • Listens for all MachineConfigPool events. If a MachineConfigPool has a specific label or annotation (e.g., machineconfiguration.openshift.io/layering-enabled), the BuildController should retrieve the latest rendered MachineConfig associated with the MachineConfigPool, generate a series of inputs to a builder backend (for now, the OpenShift Build API can be the first backend), then update the MachineConfigPool with the outcome of that action. In the case of a successful build, the MachineConfigPool should be updated with the image pullspec for the newly-built image. For now, this can come in the form of an annotation or a label (e.g., machineconfiguration.openshift.io/desired-os-image = "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/coreos@sha256:abcdef1234567890..."). But eventually, it should be a Status field on the MachineConfigPool object.
  • Reads from a ConfigMap which contains the following items (let's call it machine-os-builder-config for now; see the sketch after this list):
    • Name of the base OS image pull secret.
    • Name of the final OS image push secret.
    • Target container registry and org / repo information for where to push the final OS image (e.g., image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/coreos).
  • All functionality around managing ImageStreams and OpenShift Builds is removed or decoupled. In the case of the OpenShift Build functionality, it will be decoupled instead of completely removed. Additionally, it should not use BuildConfigs. It should instead create and manage image Build objects directly.
  • Use contexts for handling shutdowns and timeouts.
  • Unit tests are written for the major BuildController functionalities using either FakeClient or EnvTest.
  • The modified BuildController and its tests are merged into the master branch of the MCO. Note: This does not mean that it will be immediately active in the MCO's execution path. However, tests will be executed in CI.
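A minimal sketch of the proposed ConfigMap described above; the name and all key names are placeholders rather than a finalized schema:

```yaml
# Hypothetical machine-os-builder-config ConfigMap; every name below is a placeholder.
apiVersion: v1
kind: ConfigMap
metadata:
  name: machine-os-builder-config
  namespace: openshift-machine-config-operator
data:
  baseImagePullSecretName: base-os-image-pull-secret
  finalImagePushSecretName: final-os-image-push-secret
  finalImagePullspec: image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/coreos
```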

Feature Overview

Goals

  • Support OpenShift to be deployed from day-0 on AWS Local Zones
  • Support an existing OpenShift cluster to deploy compute Nodes on AWS Local Zones (day-2)

AWS Local Zones support - feature delivered in phases:

  • Phase 0 (OCPPLAN-9630): Document how to create compute nodes on AWS Local Zones in day-0 (SPLAT-635)
  • Phase 1 (OCPBU-2): Create edge compute pool to generate MachineSets for nodes with NoSchedule taints when installing a cluster in an existing VPC with AWS Local Zone subnets (SPLAT-636)
  • Phase 2 (OCPBU-351): Installer automates network resources creation on Local Zone based on the edge compute pool (SPLAT-657)

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP requirement gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.
  • CI - MUST be running successfully with test automation (this is a requirement for ALL features) - isMvp: YES
  • Release Technical Enablement - Provide necessary release enablement details and documents. - isMvp: YES

 

Epic Goal

Fully automated installation creating subnets in AWS Local Zones when the zone names are added to the edge compute pool in install-config.yaml.

  • The installer should create the subnets in the Local Zones according to the configuration of the "edge" compute pool provided in install-config.yaml.
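As a hedged example (the zone name below is illustrative), the "edge" compute pool in install-config.yaml could look roughly like this:

```yaml
# Sketch of an install-config.yaml edge compute pool naming a Local Zone;
# the zone name is an example only.
compute:
- name: edge
  platform:
    aws:
      zones:
      - us-east-1-nyc-1a
```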

Why is this important?

  • Users can extend the presence of worker nodes closer to the metropolitan regions, where the users or on-premises workloads are running, decreasing the time to deliver their workloads to their clients.

Scenarios

  • As a cluster admin, I would like to install OpenShift clusters, extending the compute nodes to the Local Zones in my day-zero operations without needing to set up the network and compute dependencies, so I can speed up the edge adoption in my organization using OCP.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated.
  • CI - custom jobs should be added to test Local Zone provisioning
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • The PR on the installer repo should be merged after being approved by the Installer team, QE, and docs
  • The product documentation has been created

Dependencies (internal and external)

  1. SPLAT-636 : install a cluster in existing VPC extending workers to Local Zones
  2. OCPBUGSM-46513 : Bug - Ingress Controller should not add Local Zones subnets to network routers/LBs (Classic/NLB)

Previous Work (Optional):

  1. Enhancement 1232
  2. SPLAT-636 : AWS Local Zones - Phase 1 IPI edge pool - Installer support to automatically create the MachineSets when installing in existing VPC

Open questions:

Done Checklist

Feature Overview

  • As a Cluster Administrator, I want to opt-out of certain operators at deployment time using any of the supported installation methods (UPI, IPI, Assisted Installer, Agent-based Installer) from UI (e.g. OCP Console, OCM, Assisted Installer), CLI (e.g. oc, rosa), and API.
  • As a Cluster Administrator, I want to opt-in to previously-disabled operators (at deployment time) from UI (e.g. OCP Console, OCM, Assisted Installer), CLI (e.g. oc, rosa), and API.
  • As a ROSA service administrator, I want to exclude/disable Cluster Monitoring when I deploy OpenShift with HyperShift — using any of the supported installation methods including the ROSA wizard in OCM and the rosa cli — since I get cluster metrics from the control plane.  This configuration should persist not only through initial deployment but also through cluster lifecycle operations like upgrades.
  • As a ROSA service administrator, I want to exclude/disable the Ingress Operator when I deploy OpenShift with HyperShift — using any of the supported installation methods including the ROSA wizard in OCM and the rosa cli — as I want to use my preferred load balancer (i.e. AWS load balancer).  This configuration should persist not only through initial deployment but also through cluster lifecycle operations like upgrades.

Goals

  • Make it possible for customers and Red Hat teams producing OCP distributions/topologies/experiences to enable/disable some CVO components while still keeping their cluster supported.

Scenarios

  1. This feature must consider the different deployment footprints including self-managed and managed OpenShift, connected vs. disconnected (restricted and air-gapped), supported topologies (standard HA, compact cluster, SNO), etc.
  2. Enabled/disabled configuration must persist throughout cluster lifecycle including upgrades.
  3. If there's any risk/impact of data loss or service unavailability (for Day 2 operations), the system must provide guidance on what the risks are and let the user decide if the risk is worth undertaking.

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP requirement gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.
  • CI - MUST be running successfully with test automation (this is a requirement for ALL features) - isMvp: YES
  • Release Technical Enablement - Provide necessary release enablement details and documents. - isMvp: YES

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

Questions to answer…

  • ...

Out of Scope

Background, and strategic fit

This is part of the overall multi-release Composable OpenShift effort (OCPPLAN-9638), which is being delivered in multiple phases:

Phase 1 (OpenShift 4.11): OCPPLAN-7589 Provide a way with CVO to allow disabling and enabling of operators

  • CORS-1873 Installer to allow users to select OpenShift components to be included/excluded
  • OTA-555 Provide a way with CVO to allow disabling and enabling of operators
  • OLM-2415 Make the marketplace operator optional
  • SO-11 Make samples operator optional
  • METAL-162 Make cluster baremetal operator optional
  • OCPPLAN-8286 CI Job for disabled optional capabilities

Phase 2 (OpenShift 4.12): OCPPLAN-7589 Provide a way with CVO to allow disabling and enabling of operators

Phase 3 (OpenShift 4.13): OCPBU-117

  • OTA-554 Make oc aware of cluster capabilities
  • PSAP-741 Make Node Tuning Operator (including PAO controllers) optional

Phase 4 (OpenShift 4.14): OCPSTRAT-36 (formerly OCPBU-236)

  • CCO-186 ccoctl support for credentialing optional capabilities
  • MCO-499 MCD should manage certificates via a separate, non-MC path (formerly IR-230 Make node-ca managed by CVO)
  • CNF-5642 Make cluster autoscaler optional
  • CNF-5643 - Make machine-api operator optional
  • WRKLDS-695 - Make DeploymentConfig API + controller optional
  • CNV-16274 OpenShift Virtualization on the Red Hat Application Cloud (not applicable)
  • CNF-9115 - Leverage Composable OpenShift feature to make control-plane-machine-set optional

Phase 5 (OpenShift 4.15): OCPSTRAT-421 (formerly) OCPBU-519

  • OCPBU-352 Make Ingress Operator optional
  • BUILD-565 - Make Build v1 API + controller optional
  • OBSDA-242 Make Cluster Monitoring Operator optional
  • OCPVE-630 (formerly CNF-5647) Leverage Composable OpenShift feature to make image-registry optional (replaces IR-351 - Make Image Registry Operator optional)
  • CNF-9114 - Leverage Composable OpenShift feature to make olm optional
  • CNF-9118 - Leverage Composable OpenShift feature to make cloud-credential  optional
  • CNF-9119 - Leverage Composable OpenShift feature to make cloud-controller-manager optional

Phase 6 (OpenShift 4.16): OCPSTRAT-731

  • TBD

References

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

 

 

 

Per https://github.com/openshift/enhancements/pull/922 we need `oc adm release new` to parse the resource manifests for `capability` annotations and generate a YAML file that lists the valid capability names, to embed in the release image.

This file can be used by the installer to error or warn when the install config lists capabilities for enable/disable that are not valid capability names.
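For reference, the capability annotation on a release manifest looks roughly like the following; the manifest and capability names here are illustrative:

```yaml
# Sketch of a release payload manifest carrying the capability annotation
# that `oc adm release new` would aggregate; names are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-samples-operator
  namespace: openshift-cluster-samples-operator
  annotations:
    capability.openshift.io/name: openshift-samples
```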

 

Note: Moved a couple of cards from OTA-554 to this epic, as these cards are relatively lower priority for the 4.13 release and we could not mark them done.

While working on OTA-559, my oc#1237 broke JSON output, and needed a follow-up fix. To avoid destabilizing folks who consume the dev-tip oc, we should grow CI presubmits to exercise critical oc adm release ... pathways, to avoid that kind of accidental breakage.

oc adm release extract --included ... or some such, that only works when no release pullspec is given, where oc connects to the cluster to ask after the current release image (as it does today when you leave off a pullspec) but also collects FeatureGates and cluster profile and all that sort of stuff so it can write only the manifests it expects the CVO to be attempting to reconcile.

This would be narrowly useful for ccoctl (see CCO-178 and CCO-186), because with this extract option, ccoctl wouldn't need to try to reproduce "which of these CredentialsRequests manifests does the cluster actually want filled?" locally.

It also seems like it would be useful for anyone trying to get a better feel for what the CVO is up to in their cluster, for the same reason that it reduces distracting manifests that don't apply.

The downside is that if we screw up the inclusion logic, we could have oc diverging from the CVO, and end up increasing confusion instead of decreasing confusion. If we move the inclusion logic to library-go, that reduces the risk a bit, but there's always the possibility that users are using an oc that is older or newer than the cluster's CVO. Some way to have oc warn when the option is used but the version differs from the current CVO version would be useful, but possibly complicated to implement, unless we take shortcuts like assuming that the currently running CVO has a version matched to the ClusterVersion's status.desired target.

Definition of done (more details in the OTA-692 spike comment):

  • Add a new --included flag to $ oc adm release extract --to <dir path> <pull-spec or version-number>. The --included flag filters extracted manifests to those that are expected to be included with the cluster. 
    • Move overrides handling here and here into library-go.

 

Here is a sketch of code which W. Trevor King suggested.

Epic Goal

  • Add an optional capability that allows disabling the image registry operator entirely

Why is this important?

It is already possible to run a cluster with no instantiated image registry, but the image registry operator itself always runs.  This is an unnecessary use of resources for clusters that don't need/want a registry.  Making it possible to disable this will reduce the resource footprint as well as bug risks for clusters that don't need it, such as SNO and OKE.
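Assuming the new capability lands as described, disabling it would follow the existing install-config.yaml capabilities mechanism, roughly (the capability names shown are illustrative and not a complete list):

```yaml
# Sketch only: a baseline of None with an explicit allow-list that omits the
# image registry capability; a real cluster typically enables more capabilities.
capabilities:
  baselineCapabilitySet: None
  additionalEnabledCapabilities:
  - marketplace
  - openshift-samples
```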

 

Acceptance Criteria

  • CI - MUST be running successfully with tests automated (we have an existing CI job that runs a cluster with all optional capabilities disabled.  Passing that job will require disabling certain image registry tests when the cap is disabled)
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1.  MCO-499 must be completed first because we still need the CA management logic running even if the image registry operator is not running.

Previous Work (Optional):

  1. The optional cap architecture and guidance for adding a new capability is described here: https://github.com/openshift/enhancements/blob/master/enhancements/installer/component-selection.md

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

To enable the MCO to replace the node-ca, the registry operator needs to provide its own CAs in isolation.

Currently, the registry provides its own CAs via the "image-registry-certificates" configmap. This configmap is a merge of the service ca, storage ca, and additionalTrustedCA (from images.config.openshift.io/cluster).

Because the MCO already has access to additionalTrustedCA, the new secret does not need to contain it.

 

ACCEPTANCE CRITERIA

TBD

  1. Proposed title of this feature request:

Update ETCD datastore encryption to use AES-GCM instead of AES-CBC

2. What is the nature and description of the request?

The current ETCD datastore encryption solution uses the aes-cbc cipher. This cipher is now considered "weak" and is susceptible to a padding oracle attack.  Upstream recommends using the AES-GCM cipher. AES-GCM will require automation to rotate secrets for every 200k writes.

The cipher used is hard-coded.

3. Why is this needed? (List the business requirements here).

Security conscious customers will not accept the presence and use of weak ciphers in an OpenShift cluster. Continuing to use the AES-CBC cipher will create friction in sales and, for existing customers, may result in OpenShift being blocked from being deployed in production. 

4. List any affected packages or components.

Epic Goal*

What is our purpose in implementing this?  What new capability will be available to customers?

The Kube APIserver is used to set the encryption of data stored in etcd. See https://docs.openshift.com/container-platform/4.11/security/encrypting-etcd.html

 

Today with OpenShift 4.11 or earlier, only aescbc is allowed as the encryption field type. 

 

RFE-3095 is asking that aesgcm (which is an updated and more recent type) be supported. Furthermore, RFE-3338 is asking for more customizability, which brings us to how we have implemented cipher customization with tlsSecurityProfile. See https://docs.openshift.com/container-platform/4.11/security/tls-security-profiles.html
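Once supported, switching the cipher would use the same encryption field that aescbc uses today, roughly:

```yaml
# Sketch: enabling aesgcm etcd encryption via the cluster APIServer config,
# mirroring the existing aescbc flow in the docs linked above.
apiVersion: config.openshift.io/v1
kind: APIServer
metadata:
  name: cluster
spec:
  encryption:
    type: aesgcm
```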

 

 
Why is this important? (mandatory)

AES-CBC is considered a weak cipher

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

AES-GCM encryption was enabled in cluster-openshift-apiserver-operator and cluster-openshift-authentication-operator, but not in the cluster-kube-apiserver-operator. When trying to enable aesgcm encryption in the apiserver config, the kas-operator will produce an error saying that the aesgcm provider is not supported.

Feature Overview

Support Platform external to allow installing with agent on OCI, with focus on https://www.oracle.com/cloud/cloud-at-customer/dedicated-region/faq/ for disconnected, on-prem.

Related / parent feature

OCPSTRAT-510 OpenShift on Oracle Cloud Infrastructure (OCI) with VMs

Feature Overview

Support Platform external to allow installing with agent on OCI, with focus on https://www.oracle.com/cloud/cloud-at-customer/dedicated-region/faq/ for disconnected, on-prem

User Story:

As a user, I want to be able to:

  • generate the minimal ISO in the installer when the platform type is set to external/oci

so that I can achieve

  • successful cluster installation
  • any custom agent features, such as the network TUI, should be available when booting from the minimal ISO

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

User Story:

As a user of the agent-based installer, I want to be able to:

  • create agent ISO as well as PXE assets by providing the install-config.yaml

so that I can achieve

  • create a cluster for external cloud provider platform type (OCI)

Acceptance Criteria:

Description of criteria:

  • install-config.yaml accepts the new platform type "external"
  • validate install-config so that platformName can only be set to `oci` when the platform is external (see the sketch after this list)
  • agent-based installer validates the supported platforms
  • agent ISO and PXE assets should be created successfully
  • necessary unit tests and integration tests are added
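A rough, non-authoritative sketch of the install-config.yaml platform stanza this validation targets; the cloudControllerManager field is an assumption for the OCI case and may differ in the final API:

```yaml
# Sketch of the external platform stanza for an OCI agent-based install;
# the cloudControllerManager field is assumed for illustration.
platform:
  external:
    platformName: oci
    cloudControllerManager: External
```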

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

User Story:

As a user of the agent-based installer, I want to be able to:

  • validate the external platform type in the agent cluster install by providing the external platform type in the install-config.yaml

so that I can achieve

  • create agent artifacts ( ISO, PXE files)

Acceptance Criteria:

Description of criteria:

  • install-config.yaml accepts the new platform type "external"
  • agent-based installer validates the supported platforms
  • agent ISO and PXE assets should be created successfully
  • Required k8s API support is added

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Feature Overview (aka. Goal Summary)  

Support OpenShift installation in AWS Shared VPC [1] scenario where AWS infrastructure resources (at least the Private Hosted Zone) belong to an account separate from the cluster installation target account.

Goals (aka. expected user outcomes)

As a user, I need to use a Shared VPC [1] when installing OpenShift on AWS into an existing VPC. This will at least require the use of a preexisting Route53 hosted zone, because as a "participant" user of the shared VPC I am not allowed to automatically create Route53 private zones.

Requirements (aka. Acceptance Criteria):

The Installer is able to successfully deploy OpenShift on AWS with a Shared VPC [1], and the cluster is able to successfully pass osde2e testing. This will include at least the scenario where the private hosted zone belongs to a different account (Account A) than the cluster resources (Account B).

[1] https://docs.aws.amazon.com/vpc/latest/userguide/vpc-sharing.html

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Enable/confirm installation in AWS shared VPC scenario where Private Hosted Zone belongs to an account separate from the cluster installation target account

Why is this important?

  • AWS best practices suggest this setup

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

I want

  • the installer to check for appropriate permissions based on whether the installation is using an existing hosted zone and whether that hosted zone is in another account

so that I can

  • be sure that my credentials have sufficient and minimal permissions before beginning install

Acceptance Criteria:

Description of criteria:

  • When specifying platform.aws.hostedZoneRole, the Route53:CreateHostedZone and Route53:DeleteHostedZone permissions are not required
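For context, a hedged sketch of the install-config.yaml fields involved; the zone ID and role ARN are placeholders:

```yaml
# Sketch of the shared-VPC related AWS fields; values are placeholders.
platform:
  aws:
    region: us-east-1
    hostedZone: Z0123456789EXAMPLE
    hostedZoneRole: arn:aws:iam::111111111111:role/shared-vpc-route53-role
```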

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Links:

Enhancement PR: https://github.com/openshift/enhancements/pull/1397 

API PR: https://github.com/openshift/api/pull/1460 

Ingress  Operator PR: https://github.com/openshift/cluster-ingress-operator/pull/928 

Background

Feature Goal: Support OpenShift installation in AWS Shared VPC scenario where AWS infrastructure resources (at least the Private Hosted Zone) belong to an account separate from the cluster installation target account.

The ingress operator is responsible for creating DNS records in AWS Route53 for cluster ingress. Prior to the implementation of this epic, the ingress operator doesn't have the capability to add DNS records into an existing Route 53 hosted zone in the shared VPC.

Epic Goal

  • Add support to the ingress operator for creating DNS records in preexisting Route53 private hosted zones for Shared VPC clusters

Non-Goals

  • Ingress operator support for day-2 operations (i.e. changes to the AWS IAM Role value after installation)  
  • E2E testing (will be handled by the Installer Team) 

Design

As described in the WIP PR https://github.com/openshift/cluster-ingress-operator/pull/928, the ingress operator will consume a new API field that contains the IAM Role ARN for configuring DNS records in the private hosted zone. If this field is present, then the ingress operator will use this account to create all private hosted zone records. The API fields will be described in the Enhancement PR.
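As a rough, non-authoritative sketch of how the operator might see this on the cluster DNS config (the field name and location here are assumptions; the authoritative shape is in the API and enhancement PRs linked above):

```yaml
# Rough sketch only; the actual field is defined in openshift/api#1460.
apiVersion: config.openshift.io/v1
kind: DNS
metadata:
  name: cluster
spec:
  platform:
    type: AWS
    aws:
      privateZoneIAMRole: arn:aws:iam::111111111111:role/shared-vpc-route53-role  # assumed field name
```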

The ingress operator code will accomplish this by defining a new provider implementation that wraps two other DNS providers, using one of them to publish records to the public zone and the other to publish records to the private zone.

External DNS Operator Impact

See NE-1299

AWS Load Balancer Operator (ALBO) Impact

See NE-1299

Why is this important?

  • Without this ingress operator support, OpenShift users are unable to create DNS records in a preexisting Route53 private hosted zone, which means OpenShift users can't share the Route53 component within a Shared VPC
  • Shared VPCs are considered an AWS best practice

Scenarios

  1. ...

Acceptance Criteria

  • Unit tests must be written and automatically run in CI (E2E tests will be handled by the Installer Team)
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Ingress Operator creates DNS Records in preexisting Route53 private hosted zones for shared VPC Clusters
  • Network Edge Team has reviewed all of the related enhancements and code changes for Route53 in Shared VPC Clusters

Dependencies (internal and external)

  1. The Installer Team is adding the new API fields required for enabling sharing Route53 within Shared VPCs in https://issues.redhat.com/browse/CORS-2613
  2. Testing this epic requires having access to two AWS accounts

Previous Work (Optional):

  1. Significant discussion was done in this thread: https://redhat-internal.slack.com/archives/C68TNFWA2/p1681997102492889?thread_ts=1681837202.378159&cid=C68TNFWA2
  2. Slack channel #tmp-xcmbu-114

Open questions:

  1.  

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

 

Feature Overview (aka. Goal Summary)  

During oc login with a token, pasting the token on the command line with the oc login --token command is insecure. The token is logged in bash history, and appears in a "ps" command when run precisely at the time the oc login command runs. Moreover, the token gets logged and is searchable by any sysadmin.

Customers/Users would like either the "--web" option, or a command that prompts for a token. There should be no way to pass a secret on the command line with the --token option.

For environments where no web browser is available, a "--ask-token" option should be provided that prompts for a token instead of passing it on the command line.

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

 

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

 

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

 

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  Initial completion during Refinement status.

 

Interoperability Considerations

Which other projects and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

Epic Goal*

During oc login with a token, pasting the token on the command line with the oc login --token command is insecure. The token is logged in bash history, and appears in a "ps" command when run precisely at the time the oc login command runs. Moreover, the token gets logged and is searchable by any sysadmin.

Customers/Users would like either the "--web" option, or a command that prompts for a token. There should be no way to pass a secret on the command line with the --token option.

For environments where no web browser is available, a "--ask-token" option should be provided that prompts for a token instead of passing it on the command line.

 
Why is this important? (mandatory)

Pasting the token on the command line with the oc login --token command is insecure

 
Scenarios (mandatory) 

Customers/Users would like the "--web" option. There should be no way to pass a secret on the command line with the --token option.

For environments where no web browser is available, a "--ask-token" option should be provided that prompts for a token instead of passing it on the command line.

 

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

 

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be "Release Pending" 

In order for the OAuth2 Authorization Code Grant Flow to work in oc browser login, we need a new OAuthClient that can obtain tokens through PKCE (https://datatracker.ietf.org/doc/html/rfc7636), as the existing clients do not have this capability. The new client will be called openshift-cli-client and will have the loopback addresses as valid Redirect URIs.
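An illustrative (non-authoritative) sketch of such a client follows; the redirect URIs and grant method shown are assumptions:

```yaml
# Sketch of the proposed openshift-cli-client OAuthClient; redirectURIs and
# grantMethod values are assumptions for illustration.
apiVersion: oauth.openshift.io/v1
kind: OAuthClient
metadata:
  name: openshift-cli-client
grantMethod: auto
redirectURIs:
- "http://127.0.0.1/callback"
- "http://[::1]/callback"
```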

In order for the OAuth2 Authorization Code Grant Flow to work in oc browser login, the OSIN server must ignore any port used in the Redirect URIs of the flow when the URIs are the loopback addresses. This has already been added to OSIN; we need to update the oauth-server to use the latest version of OSIN in order to make use of this capability.

In order to secure token usage during oc login, we need to add the capability to oc to log in using the OAuth2 Authorization Code Grant Flow through a browser. This will be possible by providing a command line option to oc:

oc login --web

 

Review the OVN Interconnect proposal and figure out the work that needs to be done in ovn-kubernetes to be able to move to this new OVN architecture.

Phase-2 of this project in continuation of what was delivered in the earlier release. 

Why is this important?

OVN IC will be the model used in Hypershift. 

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • ...

Why is this important?

See https://docs.google.com/presentation/d/17wipFv5wNjn1KfFZBUaVHN3mAKVkMgGWgQYcvss2yQQ/edit#slide=id.g547716335e_0_220 

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

For interconnect upgrades - i.e. when moving from OCP 4.13 to OCP 4.14 where IC is enabled - we do a 2-phase rollout of ovnkube-master and ovnkube-node pods in the openshift-ovn-kubernetes namespace. This is to ensure we have minimum disruption, since major architectural components are being brought from the control plane down to the data plane.

Since it's a two-phase rollout, with each phase taking approximately 10 minutes, we effectively double the time it takes for the OVNK component to upgrade, thereby increasing the timeout thresholds needed on AWS.

See https://redhat-internal.slack.com/archives/C050MC61LVA/p1689768779938889 for some more details.

See sample runs:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws-modern/1679589472833900544

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws-modern/1679589451010936832

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws-modern/1678480739743567872

I have noticed this happening once on GCP:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1680563737225859072

This has not happened on Azure, which has a 95-minute allowance. So this card tracks the work to increase the timers on AWS/GCP. This was brought up in the TRT team sync that happened yesterday (July 19th 2023), and Scott Dodson has agreed to approve this under the condition that we bring the timers back down to the current values in release 4.15.

SDN team is confident the time will drop back to normal for future upgrades going from 4.14 -> 4.15 and so on. This will be tracked via https://issues.redhat.com/browse/OTA-999 

In the non-IC world we have a centralised DB, so running a trace is easy. In the IC world, we'd need the local DBs from each node to run a pod2pod trace fully; otherwise we can only run half traces with one side's DB.

Goal of this card:

  • Open a PR against the `oc` repo to gather all DBs (minimum requirement)

Users want to create an EFA instance MachineSet in the same AWS placement group to get the best network performance within that placement group.

The scope of this Epic is only to support placement groups; customers will create them.
The customer ask is that placement groups don't need to be created by the OpenShift Container Platform.
OpenShift Container Platform only needs to be able to consume them and assign machines out of a MachineSet to a specific placement group.

Users want to create an EFA instance MachineSet in the same AWS placement group to get the best network performance within that placement group.

Note: This Epic was previously connected to https://issues.redhat.com/browse/OCPPLAN-8106 and has been updated to OCPBU-327.

Scope

The scope of this Epic is only to support placement groups; customers will create them.
The customer ask is that placement groups don't need to be created by the OpenShift Container Platform.
OpenShift Container Platform only needs to be able to consume them and assign machines out of a MachineSet to a specific placement group.

Background

In CAPI, the AWS provider supports the user supplying the name of a pre-existing placement group, which will then be used to create the instances.

https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/4273

We need to add the same field to our API and then pass the information through in the same way, to allow users to leverage placement groups.
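A hedged sketch of what the MachineSet providerSpec could look like once the field is backported; the placementGroupName field name mirrors the upstream CAPA change and is an assumption here, and the other values are placeholders:

```yaml
# Sketch only: an AWS providerSpec referencing a customer-created placement group.
providerSpec:
  value:
    apiVersion: machine.openshift.io/v1beta1
    kind: AWSMachineProviderConfig
    instanceType: c5n.9xlarge
    placementGroupName: my-existing-placement-group  # assumed field name; pre-existing group
```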

Steps

  • Review the upstream code linked above
  • Backport the feature
  • Drop old code for placement group controller that is currently disabled

Stakeholders

  • Cluster Infra

Definition of Done

  • Users may provide a pre-existing placement group name and have their instances created within that placement group
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

This epic contains all the OLM related stories for OCP release-4.14

Epic Goal

  • Track all the stories under a single epic

The console operator should build up a set of the cluster nodes' OS types, which it should supply to the console, so it renders only operators that can be installed on the cluster.

This will be needed when we support different OS types on the cluster.

We need to scan through the compute nodes and build a set of supported OSes from those. Each node on the cluster has a label for its operating system, e.g. kubernetes.io/os=linux.

 

AC:

  1. Implement logic in the console repo
    1. Add additional flag
    2. populate the supported OS types into SERVER_FLAGS
    3. update the filtering logic in the operator hub

The console operator should build up a set of the cluster nodes' OS types, which it should supply to the console, so it renders only operators that can be installed on the cluster.

This will be needed when we support different OS types on the cluster.

We need to scan through the compute nodes and build a set of supported OSes from those. Each node on the cluster has a label for its operating system, e.g. kubernetes.io/os=linux.

 

AC:

  1. Implement logic in the console-operator that will scan through all the nodes and build a set of all the OS types that the cluster nodes run on, and pass it to the console-config.yaml. This set of OS types will then be used by the console frontend.
  2. Add unit and e2e test cases in the console-operator repository.

Goal: OperatorHub/OLM users get a more intuitive UX around discovering and selecting Operator versions to install.

Problem statement: Today it's not possible to install an older version of an Operator unless the user knows the exact CSV semantic version. This is not, however, exposed through any API. `packageserver` as of today only shows the latest version per channel.

Why is this important: There are many reasons why a user would want to choose not to install the latest version - whether it's lack of testing or known problems. It should be easy for a user to discover what versions of an Operator OLM has in its catalogs and update graphs, and to expose this information in a consumable way to the user.

Acceptance Criteria:

  • Users can choose from a list of "available versions" of an Operator based on the "selected channel" on the 'OperatorHub' page in the console.
  • Users can see/examine Operator metadata (e.g. descriptions, version, capability level, links, etc) per selected channel/version to confirm the exact version they are going to install on the OperatorHub page.
  • The selected channel/version info will be carried over from the 'OperatorHub' page to 'Install Operator' page in the console.
  • Note that "installing an older version" means "no automatic update"; hence, when users select a non-latest Operator version, this implies the "Update" field would be changed to "Manual".
  • Operator details sidebar data will update based on the selected channel. `createdAt` `containerImage` and `capability level`

Out of scope:

  • provide a version selector for updates in the case of existing installed operators

 

Related info

UX designs: http://openshift.github.io/openshift-origin-design/designs/administrator/olm/select-install-operator-version/
linked OLM jira: https://issues.redhat.com/browse/OPRUN-1399
where you can see the downstream PR: https://github.com/openshift/operator-framework-olm/pull/437/files
specifically: https://github.com/awgreene/operator-framework-olm/blob/f430b2fdea8bedd177550c95ec[…]r/pkg/package-server/apis/operators/v1/packagemanifest_types.go i.e., you can get a list of available versions in PackageChannel stanza from the packagemanifest API
You can reach out to OLM lead Alex Greene for any question regarding this too, thanks

 

 

1. Proposed title of this feature request

    Add a scroll bar for the resource list in the Uninstall Operator pop-up window
2. What is the nature and description of the request?

   To make it easy for the user to check the list of all resources
3. Why does the customer need this? (List the business requirements here)

   For customers, one operator may have multiple resources; it would be easy for them to check them all in the Uninstall Operator pop-up window with a scroll bar
4. List any affected packages or components.

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Key Objective
Providing our customers with a single simplified User Experience (Hybrid Cloud Console) that is extensible, can run locally or in the cloud, and is capable of managing the fleet as well as deep diving into a single cluster.
Why customers want this?

  1. Single interface to accomplish their tasks
  2. Consistent UX and patterns
  3. Easily accessible: One URL, one set of credentials

Why we want this?

  • Shared code -  improve the velocity of both teams and most importantly ensure consistency of the experience at the code level
  • Pre-built PF4 components
  • Accessibility & i18n
  • Remove barriers for enabling ACM

Phase 2 Goal: Productization of the united Console 

  1. Enable user to quickly change context from fleet view to single cluster view
    1. Add Cluster selector with “All Cluster” Option. “All Cluster” = ACM
    2. Shared SSO across the fleet
    3. Hub OCP Console can connect to remote clusters API
    4. When ACM Installed the user starts from the fleet overview aka “All Clusters”
  2. Share UX between views
    1. ACM Search —> resource list across fleet -> resource details that are consistent with single cluster details view
    2. Add Cluster List to OCP —> Create Cluster

We need a way to show metrics for workloads running on spoke clusters. This depends on ACM-876, which lets the console discover the monitoring endpoints.

  • Console operator must discover the external URLs for monitoring
  • Console operator must pass the URLs and CA files as part of the cluster config to the console backend
  • Console backend must set up proxies for each endpoint (as it does for the API server endpoints)
  • Console frontend must include the cluster in metrics requests

Open Issues:

We will depend on ACM to create a route on each spoke cluster for the prometheus tenancy service, which is required for metrics for normal users.

 

The OpenShift console backend should proxy managed cluster monitoring requests through the MCE cluster proxy addon to prometheus services on the managed cluster. This depends on https://issues.redhat.com/browse/ACM-1188

 

BU Priority Overview

Initiative: Improve etcd disaster recovery experience (part1)

Goals

The current etcd backup and recovery process is described in our docs https://docs.openshift.com/container-platform/4.12/backup_and_restore/control_plane_backup_and_restore/backing-up-etcd.html

The current process leaves it up to the cluster-admin to figure out a way to take consistent backups following the documented procedure.

This feature is part of a progressive delivery to improve the cluster-admin experience for backup and restore of etcd clusters to a healthy state.

Scope of this feature:

  • etcd quorum loss (2-node failure) on a 3-node OCP control plane
  • etcd degradation (1-node failure) on a 3-node OCP control plane

Execution Plans

  • Improve etcd disaster recovery e2e test coverage
  • Design automated backup API. Initial target is local destination
  • Should provide a way (e.g. a script or tool) for the cluster-admin to validate that backup files remain valid over time (e.g. account for disk failures corrupting the backup)
  • Should document updated manual steps to restore from a local backup. These steps should be part of the e2e test coverage.
  • Should document manual steps to copy backup files to a destination outside the cluster (e.g. an ssh copy a cluster-admin can run in a CronJob)
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

For testing the automated backups feature we will require an e2e test that validates the backups by ensuring the restore procedure works for a quorum loss disaster recovery scenario.

See the following doc for more background:
https://docs.google.com/document/d/1NkdOwo53mkNBCktV5tkUnbM4vi7bG4fO5rwMR0wGSw8/edit?usp=sharing

This story targets milestones 2, 3 and 4 of the restore test to ensure that the test has the ability to perform a backup and then restore from that backup in a disaster recovery scenario.

While the automated backups API is still in progress, the test will rely on the existing backup script to trigger a backup. Later on when we have a functional backup API behind a feature gate, the test can switch over to using that API to trigger backups.

We're starting with a basic crash-looping member restore first. The quorum loss scenario will be done in ETCD-423.

Given that we have a controller that processes one-time etcd backup requests via the "operator.openshift.io/v1alpha1 EtcdBackup" CR, we need another controller that processes the "config.openshift.io/v1alpha1 Backup" CR so we can have periodic backups according to the schedule in the CR spec.

See https://github.com/openshift/api/pull/1482 for the APIs

The workflow for this controller should roughly be:

  • Watches the `config.openshift.io/v1alpha1 Backup` CR as created by an admin
  • Creates a CronJob for the specified schedule and timezone that would in turn create `operator.openshift.io/v1alpha1 EtcdBackup` CRs at the desired schedule
  • Updates the CronJob for any changes in the schedule or timezone

Along with this controller we would also need to provide the workload, or Go command, for the pod that is created periodically by the CronJob. This command, e.g. "create-etcdbackup-cr", effectively creates a new `operator.openshift.io/v1alpha1 EtcdBackup` CR via the following workflow:

  • Read the Backup CR to get the pvcName (and anything else) required to populate an `EtcdBackup` CR
  • Create the `operator.openshift.io/v1alpha1 EtcdBackup` CR

Lastly, to fulfill the retention policy (None, number of backups saved, or total size of backups), we can employ the following workflow:

  • Add another command, e.g. "prune-backups", that runs prior to the "create-etcdbackup-cr" command and deletes existing backups per the retention policy.
  • This command runs before the command that creates the EtcdBackup CR, e.g. via an init container on the CronJob execution pod.
  • This requires the backup controller to populate the CronJob spec with the PVC name from the Backup spec so that the PV can be mounted on the execution pod for pruning the backups in the init container.

See the parent story for more context.
As the first part of this story we need a controller with the following workflow:

  • Watches the `config.openshift.io/v1alpha1 Backup` CR as created by an admin
  • Creates a CronJob for the specified schedule and timezone that would ultimately create `operator.openshift.io/v1alpha1 EtcdBackup` CRs at the desired schedule
  • Updates the CronJob for any changes in the schedule or timezone

Since we also want to preserve a history of successful and failed backup attempts for the periodic config, the CronJob should use the Job history limits to retain successful and failed Jobs.
https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#jobs-history-limits

To begin with, we can set these to reasonable defaults of 5 successful and 10 failed jobs.

 

Lastly, to fulfill the retention policy (None, number of backups saved, or total size of backups), we can employ the following workflow (see the sketch after this list):

  • Add another command, e.g. "prune-backups", that runs prior to the "create-etcdbackup-cr" command and deletes existing backups per the retention policy.
  • The retention policy type can either be read from the `config.openshift.io/v1alpha1 Backup` CR
    • Or, more simply, the backup controller can set the retention policy argument in the CronJob template spec
  • This command runs before the command that creates the EtcdBackup CR, e.g. via an init container on the CronJob execution pod.
  • This requires the backup controller to populate the CronJob spec with the PVC name from the Backup spec so that the PV can be mounted on the execution pod for pruning the backups in the init container.
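
The sketch below is a rough illustration (not the actual operator code) of the CronJob this controller could build from a `config.openshift.io/v1alpha1 Backup` CR: schedule and time zone from the CR spec, the history limits suggested above, a "prune-backups" init container with the backup PVC mounted, and a "create-etcdbackup-cr" main container. The image name and the `--pvc-name` flag are illustrative placeholders.

```go
// Sketch of the CronJob the periodic-backup controller could create.
package backupcontroller

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func ptr[T any](v T) *T { return &v }

// newBackupCronJob builds the CronJob for a given schedule, time zone and PVC.
func newBackupCronJob(schedule, timeZone, pvcName, retentionArg string) *batchv1.CronJob {
	pvcVolume := corev1.Volume{
		Name: "etcd-backups",
		VolumeSource: corev1.VolumeSource{
			PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{ClaimName: pvcName},
		},
	}
	mount := corev1.VolumeMount{Name: "etcd-backups", MountPath: "/var/backup/etcd"}

	return &batchv1.CronJob{
		ObjectMeta: metav1.ObjectMeta{Name: "etcd-periodic-backup"},
		Spec: batchv1.CronJobSpec{
			Schedule:                   schedule,
			TimeZone:                   ptr(timeZone),
			SuccessfulJobsHistoryLimit: ptr(int32(5)),  // defaults suggested above
			FailedJobsHistoryLimit:     ptr(int32(10)), // defaults suggested above
			JobTemplate: batchv1.JobTemplateSpec{
				Spec: batchv1.JobSpec{
					Template: corev1.PodTemplateSpec{
						Spec: corev1.PodSpec{
							RestartPolicy: corev1.RestartPolicyNever,
							Volumes:       []corev1.Volume{pvcVolume},
							InitContainers: []corev1.Container{{
								Name:         "prune-backups",
								Image:        "cluster-etcd-operator-image", // placeholder
								Command:      []string{"prune-backups", retentionArg},
								VolumeMounts: []corev1.VolumeMount{mount},
							}},
							Containers: []corev1.Container{{
								Name:    "create-etcdbackup-cr",
								Image:   "cluster-etcd-operator-image", // placeholder
								Command: []string{"create-etcdbackup-cr", "--pvc-name=" + pvcName}, // flag name illustrative
							}},
						},
					},
				},
			},
		},
	}
}
```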

For testing the automated backups feature we will require an e2e test that validates the backups by ensuring the restore procedure works for a quorum loss disaster recovery scenario.

See the following doc for more background:
https://docs.google.com/document/d/1NkdOwo53mkNBCktV5tkUnbM4vi7bG4fO5rwMR0wGSw8/edit?usp=sharing

This story targets the first milestone of the restore test: ensuring we have a platform-agnostic way to SSH into all masters in a test cluster so that we can perform the necessary backup, restore and validation workflows.

The suggested approach is to create a static pod that can perform those SSH checks and actions from within the cluster, but other alternatives can also be explored as part of this story.

To fulfill one-time backup requests there needs to be a new controller that reconciles an EtcdBackup CustomResource (CR) object and executes and saves a one-time backup of the etcd cluster.
 
Similar to the upgradebackupcontroller, this controller would be triggered to create a backup pod/job which would save the backup to the PersistentVolume specified by the spec of the EtcdBackup CR object.

The controller would also need to honor the retention policy specified by the EtcdBackup spec and update the status accordingly.

See the following enhancement and API PRs for more details and potential updates to the API and workflow for the one time backup:
https://github.com/openshift/enhancements/pull/1370
https://github.com/openshift/api/pull/1482
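
As a rough sketch of the workload side of this (not the actual controller, and with node selection and host mounts simplified away), the Job created for a one-time backup could mount the PVC named in the EtcdBackup spec and run the documented cluster-backup.sh script; the image name is a placeholder:

```go
// Sketch of the one-off backup Job the EtcdBackup controller could create.
package etcdbackup

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newOneTimeBackupJob builds a Job that writes a backup to the given PVC.
func newOneTimeBackupJob(backupName, pvcName string) *batchv1.Job {
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "etcd-backup-" + backupName},
		Spec: batchv1.JobSpec{
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					NodeSelector:  map[string]string{"node-role.kubernetes.io/master": ""},
					Containers: []corev1.Container{{
						Name:  "etcd-backup",
						Image: "cluster-etcd-operator-image", // placeholder
						// cluster-backup.sh is the script used by the documented manual procedure.
						Command: []string{"/usr/local/bin/cluster-backup.sh", "/var/backup/etcd"},
						VolumeMounts: []corev1.VolumeMount{{
							Name:      "backup-dest",
							MountPath: "/var/backup/etcd",
						}},
					}},
					Volumes: []corev1.Volume{{
						Name: "backup-dest",
						VolumeSource: corev1.VolumeSource{
							PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{ClaimName: pvcName},
						},
					}},
				},
			},
		},
	}
}
```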

We should add some basic backup e2e tests into our operator:

  • one-off backups can be run via the API
  • periodic backups can be run (also multiple times in succession)
    • retention should work

The e2e workflow should already be TechPreview-enabled.

 

Feature Overview

This feature aims to enhance and clarify the functionality of the Hypershift CLI. It was initially developed as a developer tool, but as its purpose evolved, a mix of supported and unsupported features was included. This has caused confusion for users who attempt to use unsupported functionality. The goal is to clearly define the boundaries of what is possible and what is supported by the product.

Goals

Users should be able to effectively and efficiently use the Hypershift CLI with a clear understanding of what features are supported and what are not. This should reduce confusion and complications when utilizing the tool.

Requirements (aka. Acceptance Criteria):

Clear differentiation between supported and unsupported functionalities within the Hypershift CLI.
Improved documentation outlining the supported CLI options.
Consistency between the Hypershift CLI and the quickstart guide on the UI.
Security, reliability, performance, maintainability, scalability, and usability must not be compromised while implementing these changes.

Use Cases (Optional):

A developer uses the hypershift install command and only supported features are executed.
A user attempts to create a cluster using hypershift cluster create, and the command defaults to a compatible release image.
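
One possible (hypothetical) way to draw that boundary in a cobra-based CLI such as this one is to keep developer-only flags registered but hide them from the productized help output; the helper below is purely illustrative and not the CLI's actual implementation:

```go
// Hypothetical helper: hide dev-only flags from the supported CLI surface.
package cli

import "github.com/spf13/cobra"

// hideDevOnlyFlags hides flags that are not part of the supported surface so
// they do not appear in --help for the productized binary.
func hideDevOnlyFlags(cmd *cobra.Command, devOnly ...string) {
	for _, name := range devOnly {
		if f := cmd.Flags().Lookup(name); f != nil {
			f.Hidden = true
		}
	}
}
```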

Questions to Answer (Optional):

What is the most efficient method for differentiating supported and unsupported features within the Hypershift CLI?
What changes need to be made to the documentation to clearly outline supported CLI options?

Out of Scope

Changing the fundamental functionality of the Hypershift CLI.
Adding additional features beyond the scope of addressing the current issues.

Background

The Hypershift CLI started as a developer tool but evolved to include a mix of supported and unsupported features. This has led to confusion among users and potential complications when using the tool. This feature aims to clearly define what is and isn't supported by the product.

Customer Considerations

Customers should be educated about the changes to the Hypershift CLI and its intended use. Clear communication about supported and unsupported features will help them utilize the tool effectively.

Documentation Considerations

Documentation should be updated to clearly outline supported CLI options. This will be a crucial part of user education and should be easy to understand and follow.

Interoperability Considerations

This feature may impact the usage of Hypershift CLI across other projects and versions. A clear understanding of these impacts and planning for necessary interoperability test scenarios should be factored in during development.

Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a user of the HCP CLI, I want to be able to set some platform-agnostic default flags when creating a HostedCluster:

  • additional-trust-bundle
  • annotations
  • arch
  • auto-repair
  • base-domain
  • cluster-cidr
  • control-plane-availability-policy
  • etcd-storage-class
  • fips
  • generate-ssh
  • image-content-sources
  • infra-availability-policy
  • infra-id
  • infra-json
  • name
  • namespace
  • node-drain-timeout
  • node-selector
  • node-upgrade-type
  • network-type
  • release-stream
  • render
  • service-cidr
  • ssh-key
  • timeout
  • wait

so that I can set default values for these flags for my particular use cases.
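
For illustration, a minimal cobra sketch of registering a subset of these flags on a shared create-cluster command is shown below. The options struct, defaults and help text are assumptions for the sketch, not the CLI's actual code.

```go
// Sketch: binding a few of the platform-agnostic flags listed above.
package clustercli

import (
	"time"

	"github.com/spf13/cobra"
)

type coreOptions struct {
	Name             string
	Namespace        string
	BaseDomain       string
	ClusterCIDR      string
	ServiceCIDR      string
	NodeDrainTimeout time.Duration
	FIPS             bool
	Render           bool
}

// bindCoreFlags registers a subset of the platform-agnostic flags on a command.
func bindCoreFlags(cmd *cobra.Command, o *coreOptions) {
	flags := cmd.Flags()
	flags.StringVar(&o.Name, "name", o.Name, "Name of the hosted cluster")
	flags.StringVar(&o.Namespace, "namespace", o.Namespace, "Namespace for the hosted cluster resources")
	flags.StringVar(&o.BaseDomain, "base-domain", o.BaseDomain, "Ingress base domain for the cluster")
	flags.StringVar(&o.ClusterCIDR, "cluster-cidr", o.ClusterCIDR, "CIDR of the cluster network")
	flags.StringVar(&o.ServiceCIDR, "service-cidr", o.ServiceCIDR, "CIDR of the service network")
	flags.DurationVar(&o.NodeDrainTimeout, "node-drain-timeout", o.NodeDrainTimeout, "Maximum time to wait for node drain")
	flags.BoolVar(&o.FIPS, "fips", o.FIPS, "Enable FIPS mode for the cluster")
	flags.BoolVar(&o.Render, "render", o.Render, "Render resources instead of applying them")
}
```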

Acceptance Criteria:

Description of criteria:

  • Aforementioned flags are included in the HCP CLI general create cluster command.
  • Aforementioned flags are included in test plans & testing.

Out of Scope:

The flags listed in the HyperShift Create Cluster CLI that don't seem platform-agnostic:

  • BaseDomainPrefix - only in AWS
  • ExternalDNSDomain - only in AWS

These flags are also out of scope:

  • control-plane-operator-image - for devs (see Alberto's comment below)

Engineering Details:

  • N/A

This requires/does not require a design proposal.
This requires/does not require a feature gate.

As a HyperShift user I want to:

  • Have a convenient command that destroys an Agent cluster I deployed

Definition of done:

  • hypershift destroy cluster agent exists and destroys an agent hosted cluster
  • QE test plan that uses the destroy cluster agent command

As a self-managed HyperShift user I want to have a CLI tool that allows me to:

  • Create the necessary HyperShift API custom resources for hosted cluster creation

Definition of done:

  • hypershift create cluster aws exists and has only the relevant needed flags for what we support in AWS
  • Unit tests
  • cluster creation test plan in QE

As a self-managed HyperShift user I want to have a CLI tool that allows me to:

  • Create the necessary HyperShift API custom resources for hosted cluster creation

Definition of done:

  • hypershift create cluster kubevirt exists and has only the relevant needed flags for what we support in Kubevirt
  • Unit tests
  • cluster creation test plan in QE (ECODEPQE pipeline)

As a HyperShift user I want to:

  • Have a convenient command that destroys an AWS cluster I deployed

Definition of done:

  • hypershift destroy cluster aws exists and destroys an AWS hosted cluster
  • QE test plan that uses the destroy cluster aws command

As a self-managed HyperShift user I want to have a CLI tool that allows me to:

  • Create the necessary HyperShift API custom resources for hosted cluster Nodepool creation

Definition of done:

  • hypershift create nodepool aws exists and has only the relevant needed flags for what we support in AWS
  • Unit tests
  • cluster creation with nodepool creation test plan in QE

As a self-managed HyperShift user I want to have a CLI tool that allows me to:

  • Create the necessary HyperShift API custom resources for hosted cluster Nodepool creation

Definition of done:

  • hypershift create nodepool agent exists and has only the relevant needed flags for what we support bare metal with the cluster api agent provider
  • Unit tests
  • cluster creation with nodepool creation test plan in QE (ECODEPQE pipeline)

As a HyperShift user I want to:

  • Have a convenient command that destroys a kubevirt cluster I deployed

Definition of done:

  • hypershift destroy cluster kubevirt exists and destroys a Kubevirt hosted cluster
  • QE test plan that uses the destroy cluster kubevirt command

As a self-managed HyperShift user I want to have a CLI tool that allows me to:

  • Create the necessary HyperShift API custom resources for hosted cluster creation

Definition of done:

  • hypershift create cluster agent exists and has only the relevant needed flags for what we support bare metal with the cluster api agent provider
  • Unit tests
  • cluster creation test plan in QE (ECODEPQE pipeline)

As a self-managed HyperShift user I want to have a CLI tool that allows me to:

  • Create the necessary HyperShift API custom resources for hosted cluster Nodepool creation

Definition of done:

  • hypershift create nodepool kubevirt exists and has only the relevant needed flags for what we support in kubevirt
  • Unit tests
  • cluster creation with nodepool creation test plan in QE (ECODEPQE pipeline)

As a software developer and user of HyperShift CLI, I would like a prototype of how the Makefile can be modified to build different versions of the HyperShift CLI, i.e., dev version vs productized version.

As a HyperShift user I want to:

  • Have a convenient command that generates the kubeconfig file to access the hosted cluster I just deployed

Definition of done:

  • hypershift kubeconfig create exists and generates a kubeconfig file that is valid to access the deployed hosted cluster
  • QE test plan that uses the kubeconfig generation

< High-Level description of the feature ie: Executive Summary >

Goals

< Who benefits from this feature, and how? What is the difference between today's current state and a world with this feature? >

Requirements

Requirements Notes IS MVP
     
    • (Optional) Use Cases

< What are we making, for who, and why/what problem are we solving?>

Out of scope

<Defines what is not included in this story>

Dependencies

< Link or at least explain any known dependencies. >

Background, and strategic fit

< What does the person writing code, testing, documenting need to know? >

Assumptions

< Are there assumptions being made regarding prerequisites and dependencies?>

< Are there assumptions about hardware, software or people resources?>

Customer Considerations

< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>

Documentation Considerations

< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >

What does success look like?

< Does this feature have doc impact? Possible values are: New Content, Updates to existing content, Release Note, or No Doc Impact?>

QE Contact

< Are there assumptions being made regarding prerequisites and dependencies?>

< Are there assumptions about hardware, software or people resources?>

Impact

< If the feature is ordered with other work, state the impact of this feature on the other work>

Related Architecture/Technical Documents

<links>

Done Checklist

  • Acceptance criteria are met
  • Non-functional properties of the Feature have been validated (such as performance, resource, UX, security or privacy aspects)
  • User Journey automation is delivered
  • Support and SRE teams are provided with enough skills to support the feature in production environment

What's the problem

Currently the pipeline builder in the dev console directly queries the Tekton Hub API to search for tasks. As the upstream community and Red Hat are moving to Artifact Hub, we need to query the Artifact Hub API to search for tasks.

Acceptance criteria

  1. Update the pipeline builder code so that if the API to retrieve tasks is not available, there will be no errors in the UI.
  2. Perform a spike to estimate the amount of work it will take to have the pipeline builder use the artifact hub API to retrieve tasks, rather than using the tekton hub API.

Description

Hitting the Artifacthub.io search endpoint sometimes fails due to a CORS error, and the version API endpoint always fails due to a CORS error. We therefore need a proxy to hit the Artifacthub.io endpoint to get the data.

Acceptance Criteria

  1. Create a proxy to hit the Artifacthub.io endpoint.

Additional Details:

Search endpoint: https://artifacthub.io/docs/api/#/Packages/searchPackages

eg.: https://artifacthub.io/api/v1/packages/search?offset=0&limit=20&facets=false&ts_query_web=git&kind=7&deprecated=false&sort=relevance

Version endpoint: https://artifacthub.io/docs/api/#/Packages/getTektonTaskVersionDetails

eg: https://artifacthub.io/api/v1/packages/tekton-task/tekton-catalog-tasks/git-clone/0.9.0
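
A minimal sketch of such a proxy, assuming a console-style Go backend: requests to a local path (here the hypothetical /api/artifacthub/ prefix) are forwarded to https://artifacthub.io so the browser never calls Artifact Hub directly and CORS errors are avoided. The path prefix and handler wiring are assumptions.

```go
// Sketch: forward /api/artifacthub/* to the corresponding path on artifacthub.io.
package artifacthubproxy

import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

// NewHandler returns an http.Handler that proxies requests to artifacthub.io.
func NewHandler() http.Handler {
	target, _ := url.Parse("https://artifacthub.io") // constant URL, parse cannot fail
	proxy := httputil.NewSingleHostReverseProxy(target)

	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Map /api/artifacthub/api/v1/... -> /api/v1/... on artifacthub.io.
		r.URL.Path = strings.TrimPrefix(r.URL.Path, "/api/artifacthub")
		r.Host = target.Host
		proxy.ServeHTTP(w, r)
	})
}
```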

 

Feature Overview (aka. Goal Summary):

 

This feature will allow an x86 control plane to operate with compute nodes of type Arm in a HyperShift environment.

 

Goals (aka. expected user outcomes):

 

Enable an x86 control plane to operate with an Arm data-plane in a HyperShift environment.

 

Requirements (aka. Acceptance Criteria):

 

  • The feature must allow an x86 control plane and an Arm data-plane to be used together in a HyperShift environment.
  • The feature must provide documentation on how to set up and use the x86 control plane with an Arm data-plane in a HyperShift environment.
  • The feature must be tested and verified to work reliably and securely in a production environment.

 

Customer Considerations:

 

Customers who require a mix of x86 control plane and Arm data-plane for their HyperShift environment will benefit from this feature.

 

Documentation Considerations:

 

  • Documentation should include clear instructions on how to set up and use the x86 control plane with an Arm data-plane in a HyperShift environment.
  • Documentation will live on docs.openshift.com

 

Interoperability Considerations:

 

This feature should not impact other OpenShift layered products and versions in the portfolio.

Goal

Numerous partners are asking for ways to pre-image servers in some central location before shipping them to an edge site where they can be configured as an OpenShift cluster: OpenShift-based Appliance.

A number of these cases are a good fit for a solution based on writing an image equivalent to the agent ISO, but without the cluster configuration, to disk at the central location and then configuring and running the installation when the servers reach their final location. (Notably, some others are not a good fit, and will require OpenShift to be fully installed, using the Agent-based installer or another, at the central location.)

While each partner will require a different image, usually incorporating some of their own software to drive the process as well, some basic building blocks of the image pipeline will be widely shared across partners.

Extended documentation

OpenShift-based Appliance

Building Blocks for Agent-based Installer Partner Solutions

Interactive Workflow work (OCPBU-132)

This work must avoid conflict with the requirements for any future interactive workflow (see Interactive Agent Installer), and build towards it where the requirements coincide. This includes a graphical user interface (for future Assisted Installer consistency).

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Allow the user to use the openshift-installer to generate a configuration ISO that they can attach to a server running the unconfigured agent ISO from AGENT-558. This would act as alternative to the GUI, effectively leaving the interactive flow and rejoining the automation flow by doing an automatic installation using the configuration contained on the ISO.

Why is this important?

  • Helps standardise implementations of the automation flow where an agent ISO image is pre-installed on a physical disk.

Scenarios

  1. The user purchases hardware with a pre-installed unconfigured agent image. They use openshift-installer to generate a config ISO from an install config and attach this ISO as virtual media to a group of servers to cause them to install OpenShift and form a cluster.
  2. The user has a pool of servers that share the same boot mechanism (e.g. PXE). Each server is booted from a common interactive agent image, and automation can install any subset of them as a cluster by attaching the same configuration ISO to each.
  3. A cloud user could boot a group of VMs using a publicly-available unconfigured agent image (e.g. an AMI), and install them as a cluster by attaching a configuration ISO to them.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. AGENT-556 - we'll need to block startup of services until configuration is provided
  2. AGENT-558 - this won't be useful without an unconfigured image to use it with
  3. AGENT-560 - enables AGENT-556 to block in an image generated with AGENT-558

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Implement a systemd service in the unconfigured agent ISO (AGENT-558) that watches for disks to be mounted, then searches them for agent installer configuration. If such configuration is found, then copy it to the relevant places in the running system.

The rendezvousIP must be copied last, as the presence of this is what will trigger the services to start (AGENT-556).

To the extent possible, the service should be agnostic as to the method by which the config disk was mounted (e.g. virtual media, USB stick, floppy disk, &c.). It may be possible to get systemd to trigger on volume mount, avoiding the need to poll anything.

The configuration drive must contain:

  • rendezvousIP config file
  • ClusterDeployment manifest
  • AgentPullSecret manifest
  • AgentClusterInstall manifest
  • TLS certs for admin kubeconfig
  • password hash for kubeadmin console password
  • ClusterImageSet manifest (for version verification)

it may optionally contain:

  • NMStateConfig
  • extra manifests
  • hostnames
  • hostconfig (roles, root device hints)

The ClusterImageSet manifest must match the one already present in the image for the config to be accepted.

Add a new installer subcommand, openshift-install agent create config-image.

This should create a small ISO (i.e. not a CoreOS boot image) containing just the configuration files from the automation flow:

  • rendezvousIP config file
  • ClusterDeployment manifest
  • AgentPullSecret manifest
  • AgentClusterInstall manifest
  • TLS certs for admin kubeconfig
  • password hash for kubeadmin console password
  • NMStateConfig
  • extra manifests
  • hostnames
  • hostconfig (roles, root device hints)
  • ClusterImageSet manifest (for version verification)

The contents of the disk could be in any format, but should be optimised to make it simple for the service in AGENT-562 to read.

Support pd-balanced disk types for GCP deployments

OpenShift installer and Machine API should support creation and management of computing resources with disk type "pd-balanced"

Why does the customer need this?

  • pd-balanced disks are SSDs with performance comparable to pd-ssd but at a lower price

Epic Goal

  • Support pd-balanced disk types for GCP deployments

Why is this important?

  • Customers will be able to reduce costs on GCP while using `pd-balanced` disk types with a comparable performance to `pd-ssd` ones.

Scenarios

  1. Enable `pd-balanced` disk types when deploying a cluster in GCP. Right now only `pd-ssd` and `pd-standard` are supported.

Overview:

  • Enable support for pd-balanced disk types during cluster deployment in Google Cloud Platform (GCP) for the OpenShift Installer.
  • Currently, only pd-ssd and pd-standard disk types are supported.
  • `pd-balanced` disks on GCP will offer cost reduction and comparable performance to `pd-ssd` disks, providing increased flexibility and performance for deployments.
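
As a purely hypothetical illustration of the change to the accepted values (not the installer's actual validation code), a helper like the following captures the set of GCP disk types once `pd-balanced` is allowed alongside `pd-ssd` and `pd-standard`:

```go
// Hypothetical validation sketch for the accepted GCP disk types.
package gcpvalidation

import "fmt"

var supportedDiskTypes = map[string]bool{
	"pd-ssd":      true,
	"pd-standard": true,
	"pd-balanced": true, // newly accepted type
}

// validateDiskType returns an error if the requested disk type is not supported.
func validateDiskType(diskType string) error {
	if diskType == "" {
		return nil // empty means the installer default is used
	}
	if !supportedDiskTypes[diskType] {
		return fmt.Errorf("unsupported GCP disk type %q: must be one of pd-ssd, pd-standard or pd-balanced", diskType)
	}
	return nil
}
```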

Acceptance Criteria:

  • The Openshift Installer should be updated to include pd-balanced as a valid disk type option in the installer configuration process.
  • When pd-balanced disk type is selected during cluster deployment, the installer should handle the configuration of the disks accordingly.
  • CI (Continuous Integration) must be running successfully with tests automated.
  • Release Technical Enablement details and documents should be provided.

Done Checklist:

  • CI is running, tests are automated, and merged.
  • Release Enablement Presentation: [link to Feature Enablement Presentation].
  • Upstream code and tests merged: [link to meaningful PR or GitHub Issue].
  • Upstream documentation merged: [link to meaningful PR or GitHub Issue].
  • Downstream build attached to advisory: [link to errata].
  • Test plans in Polarion: [link or reference to Polarion].
  • Automated tests merged: [link or reference to automated tests].
  • Downstream documentation merged: [link to meaningful PR].

Dependencies:

  • Google Cloud Platform Account
  • Access to GCP ‘Installer’ Project
  • Any required permissions, authentication, access controls or CLI needed to provision pd-balanced disk types should be properly configured.

Testing:

  • Develop and conduct test cases and scenarios to verify the proper functioning of pd-balanced disk type implementation.
  • Address any bugs or issues identified during testing.

Documentation:

  • Update documentation to reflect the support for pd-balanced disk types in GCP deployments.

Success Metrics:

  • Successful deployment of Openshift clusters using the pd-balanced disk type in GCP.
  • Minimal or no disruption to existing functionality and deployment options.

Feature Overview

  • Enable user custom RHCOS images location for Installer IPI provisioned OpenShift clusters on Google Cloud and Azure

Goals

  • The Installer should accept custom locations for RHCOS images while deploying OpenShift on Google Cloud and Azure, as we already support for AWS via `platform.aws.amiID`, for control plane and compute nodes.
  • As a user, I want to be able to specify a custom RHCOS image location to be used for control plane and compute nodes while deploying OpenShift on Google Cloud and Azure so that I can be compliant with my company's security policies.

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP requirement gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.
Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

Background, and strategic fit

Many enterprises have strict security policies requiring that all software be pulled from a trusted or private source. In these scenarios the RHCOS image used to bootstrap the cluster usually comes from shared public locations that some companies don't accept as a trusted source.

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

 

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Simplify ARO's workflow by allowing Azure marketplace images to be specified in the `install-config.yaml` for all nodes (compute, control plane, and bootstrap).

Why is this important?

  • ARO is a first party Azure service and has a number of requirements/restrictions. These requirements include the following: it must not request anything from outside of Azure and it must consume RHCOS VM images from a trusted source (marketplace).
  • At the same time upstream OCP does the following:
    1. It uses quay.io to get container images.
    2. Uses a random blob as a RHCOS VM image such as this. This VHD blob is then uploaded by the Installer to an Image Gallery in the user’s Storage Account where two boot images are created: a HyperV gen1 and a HyperV gen2. See here.
      To meet the requirements ARO team currently does the following as part of the release process:
    1. Mirror container images from quay.io to Azure Container Registry to avoid leaving Azure boundaries.
    2. Copy VM image from the blob in someone else's Azure subscription into the blob on the subscription ARO team manages and then publish a VM image on Azure Marketplace (publisher: azureopenshift, offer: aro4. See az vm image list --publisher azureopenshift --all). ARO does not bill for these images.
  • ARO has to carry their own changes on top of the Installer code to allow them to specify their own images for the cluster deployment.

Scenarios

  1. ...

Acceptance Criteria

  • Custom RHCOS images can be specified in the install-config for compute, controlPlane and defaultMachinePlatform and they are used for the installation instead of the default RHCOS VHD.

Out of scope

  • A VHD blob will still be uploaded to the user's Storage Account even though it won't be used during installation. That cannot be changed for now.

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description of problem:

ARO needs to copy RHCOS image blobs to their own Azure Marketplace offering since, as a first party Azure service, they must not request anything from outside of Azure and must consume RHCOS VM images from a trusted source (marketplace).
To meet the requirements ARO team currently does the following as part of the release process:

 1. Mirror container images from quay.io to Azure Container Registry to avoid leaving Azure boundaries.
 2. Copy the VM image from the blob in someone else's Azure subscription into the blob on the subscription the ARO team manages, and then publish a VM image on Azure Marketplace (publisher: azureopenshift, offer: aro4. See az vm image list --publisher azureopenshift --all). We do not bill for these images.

The usage of Marketplace images in the installer was already implemented as part of CORS-1823. This single line [1] needs to be refactored to enable ARO from the installer code perspective: on ARO we don't need to set type to AzureImageTypeMarketplaceWithPlan.

However, in OCPPLAN-7556 and related CORS-1823 it was mentioned that using Marketplace images is out of scope for nodes other than compute. For ARO we need to be able to use marketplace images for all nodes.

[1] https://github.com/openshift/installer/blob/f912534f12491721e3874e2bf64f7fa8d44aa7f5/pkg/asset/machines/azure/machines.go#L107

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1. Set RHCOS image from Azure Marketplace in the installconfig
2. Deploy a cluster
3.

Actual results:

Only compute nodes use the Marketplace image.

Expected results:

All nodes created by the Installer use RHCOS image coming from Azure Marketplace.

Additional info:

 

 

Epic Goal

  • As a customer, I need to make sure that the RHCOS image I leverage is coming from a trusted source. 

Why is this important?

  • Customers who have very restrictive security policies imposed by their InfoSec teams need to be able to manually specify a custom location for the RHCOS image to use for the cluster nodes.

Scenarios

  1. As a customer, I want to specify a custom location for the RHCOS image to be used for the cluster Nodes

Acceptance Criteria

A user is able to specify a custom location in the Installer manifest for the RHCOS image to be used for bootstrap and cluster nodes. This is similar to the approach we already support for AWS with the compute.platform.aws.amiID option.

Previous Work (Optional):

https://issues.redhat.com/browse/CORS-1103

 

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

 

 

 

 

 

 

User Story:

Some background on the Licenses field:

https://github.com/openshift/installer/pull/3808#issuecomment-663153787

https://github.com/openshift/installer/pull/4696

So we do not want to allow licenses to be specified when pre-built images are specified (current behaviour); it's up to customers to create a custom image with licenses embedded and supply that to the Installer. Since we don't need to specify licenses for RHCOS images anymore, the Licenses field is useless and should be deprecated.

Acceptance Criteria:

Description of criteria:

  • License field deprecated
  • Any dev docs mentioning Licenses is updated.

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

User Story:

As a user, I want to be able to:

  • Specify an RHCOS image coming from a custom source in the install config to override the installer's internal choice of boot image

so that I can achieve

  • a custom location in the install config for the RHCOS image to use for the Cluster Nodes

Acceptance Criteria:

A user is able to specify a custom location in the Installer manifest for the RHCOS image to be used for bootstrap and cluster nodes. This is similar to the approach we already support for AWS with the compute.platform.aws.amiID option.

(optional) Out of Scope:

 

Engineering Details:

  •  

Epic Goal

  • Enable the migration from a storage intree driver to a CSI based driver with minimal impact to the end user, applications and cluster
  • These migrations would include, but are not limited to:
    • CSI driver for Azure (file and disk)
    • CSI driver for VMware vSphere

Why is this important?

  • OpenShift needs to maintain its ability to enable PVCs and PVs of the main storage types
  • CSI Migration is getting close to GA; we need to have the feature fully tested and enabled in OpenShift
  • Upstream in-tree drivers are being deprecated to make way for the CSI drivers prior to in-tree driver removal

Scenarios

  1. User-initiated move from in-tree to CSI driver
  2. Upgrade-initiated move from in-tree to CSI driver
  3. Upgrade from EUS to EUS

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal*

Kubernetes upstream has chosen to allow users to opt-out from CSI volume migration in Kubernetes 1.26 (1.27 PR, 1.26 backport). It is still GA there, but allows opt-out due to non-trivial risk with late CSI driver availability.

We want a similar capability in OCP - a cluster admin should be able to opt-in to CSI migration on vSphere in 4.13. Once they opt-in, they can't opt-out (at least in this epic).

Why is this important? (mandatory)

See an internal OCP doc on whether and how we should allow a similar opt-in/opt-out in OCP.

 
Scenarios (mandatory) 

Upgrade

  1. Admin upgrades 4.12 -> 4.13 as usual.
  2. Storage CR has CSI migration disabled (or nil); the in-tree volume plugin handles in-tree PVs.
  3. At the same time, the external CCM runs; however, because kubelet runs with --cloud-provider=vsphere, it does not take over the kubelet's cloud-provider job.
  4. Admin can opt in to CSI migration by editing the Storage CR. That enables the OPENSHIFT_DO_VSPHERE_MIGRATION env. var. everywhere + runs kubelet with --cloud-provider=external.
    1. If we have time, it should not be hard to opt out: just remove the env. var. + update the kubelet cmdline. Storage / in-tree volume plugin will handle in-tree PVs again; not sure about implications on external CCM.
  5. Once opted in, it’s not possible to opt out.
  6. Both with opt-in and without it, the cluster is Upgradeable=true. Admin can upgrade to 4.14, where CSI migration will be forced.

 

New install

  1. Admin installs a new 4.13 vSphere cluster, with UPI, IPI, Assisted Installer, or Agent-based Installer.
  2. During installation, the Storage CR is created with CSI migration enabled.
  3. (We want to have it enabled for a new cluster to enable external CCM and have zonal support. This avoids new clusters having in-tree as the default and then having to go through migration later.)
  4. The resulting cluster has the OPENSHIFT_DO_VSPHERE_MIGRATION env. var. set + kubelet with --cloud-provider=external + topology support.
  5. Admin cannot opt out after installation; we expect that they use CSI volumes for everything.
    1. If the admin really wants, they can opt out before installation by adding a Storage install manifest with CSI migration disabled.

 

EUS to EUS (4.12 -> 4.14)

  • Will have CSI migration enabled once in 4.14
  • During the upgrade, a cluster will have 4.13 masters with CSI migration disabled (see regular upgrade to 4.13 above) + 4.12 kubelets.
  • Once the masters are 4.14, CSI migration is force-enabled there; still, the 4.14 KCM and the in-tree volume plugin in it will handle in-tree volume attachments required by kubelets that are still on 4.12 (that’s what kcm --external-cloud-volume-plugin=vsphere does).
  • Once both masters + kubelets are 4.14, CSI migration is force enabled everywhere, in-tree volume plugin + cloud provider in KCM is still enabled by --external-cloud-volume-plugin, but it’s not used.
  • Keep in-tree storage class by default
  • A CSI storage class is already available since 4.10
  • Recommend to switch default to CSI
  • Can’t opt out from migration

Dependencies (internal and external) (mandatory)
  • We need a new FeatureSet in openshift/api that disables CSIMigrationvSphere feature gate.
  • We need kube-apiserver-operator, kube-controller-manager-operator, kube-scheduler-operator, and MCO to reconfigure their operands to use the in-tree vSphere cloud provider when they see the CSIMigrationvSphere FeatureGate disabled.
  • We need cloud controller manager operator to disable its operand when it sees CSIMigrationvSphere FeatureGate disabled.

Contributing Teams(and contacts) (mandatory) 

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

When CSIMigrationvSphere is disabled, cluster-storage-operator must re-create in-tree StorageClass.

vmware-vsphere-csi-driver-operator's StorageClass must not be marked as the default there (IMO we already have code for that).

This also means we need to fix the Disable SC e2e test to ignore StorageClasses for the in-tree driver. Otherwise we will reintroduce OCPBUGS-7623.
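
An illustrative sketch of that test adjustment (the helper name is an assumption): skip StorageClasses provisioned by the in-tree vSphere plugin so the re-created in-tree class does not trip the Disable SC check.

```go
// Sketch: identify StorageClasses that belong to the in-tree vSphere plugin.
package e2e

import storagev1 "k8s.io/api/storage/v1"

// isInTreeVSphereClass reports whether a StorageClass is provisioned by the
// in-tree vSphere volume plugin rather than the CSI driver.
func isInTreeVSphereClass(sc *storagev1.StorageClass) bool {
	return sc.Provisioner == "kubernetes.io/vsphere-volume"
}
```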

Feature Overview

  • Customers want to create and manage OpenShift clusters using managed identities for Azure resources for authentication.

Goals

  • A customer using ARO wants to spin up an OpenShift cluster with "az aro create" without needing additional input, i.e. without the need for an AD account or service principal credentials, and the identity used is never visible to the customer and cannot appear in the cluster.
  • As an administrator, I want to deploy OpenShift 4 and run Operators on Azure using access controls (IAM roles) with temporary, limited privilege credentials.

Requirements

  • Azure managed identities must work for installation with all install methods including IPI and UPI, work with upgrades, and day-to-day cluster lifecycle operations.
  • Support HyperShift and non-HyperShift clusters.
  • Support use of Operators with Azure managed identities.
  • Support in all Azure regions where Azure managed identity is available. Note: federated credentials are associated with Azure Managed Identity, and federated credentials are not available in all Azure regions.

More details at ARO managed identity scope and impact.

 

This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP requirement gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.

Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

Questions to answer…

  • ...

Out of Scope

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

References

Epic Goal

  • CIRO can consume azure workload identity tokens
  • CIRO's Azure credential request uses new API field for requesting permissions

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

This effort is dependent on the completion of work for CCO-187, and effort in dependent modules is planned to be worked on by the CCO team unless individual repo owners can help. Operators owners/teams will be expected to review merge requests and complete appropriate QE effort for an openshift release.

  • azure-sdk-for-go module dependency updated to support workload identity federation.
  • Mount the OIDC token in the operator pod. This needs to go in the deployment. See example from addition to the cluster-image-registry-operator here
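
As a sketch of that deployment change (assumptions flagged in comments, not the operator's actual code): project a bound service account token into the operator pod and point AZURE_FEDERATED_TOKEN_FILE at it. The mount path and audience shown are common choices for Azure workload identity, not necessarily what the operator uses.

```go
// Sketch: bound service account token projection for workload identity.
package deployment

import corev1 "k8s.io/api/core/v1"

func ptr[T any](v T) *T { return &v }

// oidcTokenProjection returns the volume, mount and env var needed for the
// operator container to authenticate with a federated service account token.
func oidcTokenProjection() (corev1.Volume, corev1.VolumeMount, corev1.EnvVar) {
	const tokenPath = "/var/run/secrets/openshift/serviceaccount" // assumed path

	volume := corev1.Volume{
		Name: "bound-sa-token",
		VolumeSource: corev1.VolumeSource{
			Projected: &corev1.ProjectedVolumeSource{
				Sources: []corev1.VolumeProjection{{
					ServiceAccountToken: &corev1.ServiceAccountTokenProjection{
						Audience:          "openshift", // assumed audience
						ExpirationSeconds: ptr(int64(3600)),
						Path:              "token",
					},
				}},
			},
		},
	}
	mount := corev1.VolumeMount{Name: "bound-sa-token", MountPath: tokenPath, ReadOnly: true}
	env := corev1.EnvVar{Name: "AZURE_FEDERATED_TOKEN_FILE", Value: tokenPath + "/token"}
	return volume, mount, env
}
```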

 

ACCEPTANCE CRITERIA

  • Upstream distribution/distribution uses azure identity sdk 1.3.0
  • openshift/docker-distribution uses the latest upstream distribution/distribution (after the above has merged)
  • Green CI
  • Every storage driver passes regression tests

OPEN QUESTIONS

  • Can DefaultAzureCredential be relied on to transparently use workload identities? (In this case the operator would need to export the environment variables that DefaultAzureCredential expects for workload identities.)
    • I have tested manually exporting the required env vars and DefaultAzureCredential correctly detects and attempts to authenticate using federated workload identity, so it works as expected.
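
A small sketch of that observation, assuming the standard azidentity package: when AZURE_CLIENT_ID, AZURE_TENANT_ID and AZURE_FEDERATED_TOKEN_FILE are exported, DefaultAzureCredential resolves the workload identity credential without any change in the calling code.

```go
// Sketch: DefaultAzureCredential picks up federated workload identity from env vars.
package azureauth

import (
	"context"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore/policy"
	"github.com/Azure/azure-sdk-for-go/sdk/azidentity"
)

// newToken obtains an access token for the given scope using whichever
// credential DefaultAzureCredential resolves (federated workload identity when
// the env vars above are set).
func newToken(ctx context.Context, scope string) (string, error) {
	cred, err := azidentity.NewDefaultAzureCredential(nil)
	if err != nil {
		return "", err
	}
	tok, err := cred.GetToken(ctx, policy.TokenRequestOptions{Scopes: []string{scope}})
	if err != nil {
		return "", err
	}
	return tok.Token, nil
}
```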

This effort is dependent on the completion of work for CCO-187, and effort in dependent modules is planned to be worked on by the CCO team unless individual repo owners can help. Operators owners/teams will be expected to review merge requests and complete appropriate QE effort for an openshift release.

  • azure-sdk-for-go module dependency updated to support workload identity federation.
  • Mount the OIDC token in the operator pod. This needs to go in the deployment. See example from addition to the cluster-image-registry-operator here

 

ACCEPTANCE CRITERIA

  • image-registry uses latest openshift/docker-distribution
  • CIRO can detect when the creds it gets from CCO are for federated workload identity (the credentials secret will contain an "azure_federated_token_file" key)
  • when using federated workload identity, CIRO adds the "AZURE_FEDERATED_TOKEN_FILE" env var to the image-registry deployment
  • when using federated workload identity, CIRO does not add the "REGISTRY_STORAGE_AZURE_ACCOUNTKEY" env var to the image-registry deployment
  • the image-registry operates normally when using federated workload identity
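
Not CIRO's actual code, but a sketch of the acceptance criteria above, assuming the CCO-minted secret data is available as a map and the account key comes from the existing (unchanged) path: when azure_federated_token_file is present, only AZURE_FEDERATED_TOKEN_FILE is injected into the image-registry deployment and the account-key env var is omitted.

```go
// Sketch: choose image-registry env vars based on the credentials secret.
package registryenv

import corev1 "k8s.io/api/core/v1"

// registryAzureEnv derives the Azure-related env vars for the image-registry
// deployment from the cloud-credentials secret data.
func registryAzureEnv(secretData map[string][]byte, accountKey string) []corev1.EnvVar {
	if tokenFile, ok := secretData["azure_federated_token_file"]; ok {
		// Federated workload identity: point the registry at the token file
		// and omit the storage account key entirely.
		return []corev1.EnvVar{{Name: "AZURE_FEDERATED_TOKEN_FILE", Value: string(tokenFile)}}
	}
	// Non-federated credentials: keep the existing account-key behaviour.
	return []corev1.EnvVar{{Name: "REGISTRY_STORAGE_AZURE_ACCOUNTKEY", Value: accountKey}}
}
```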

This effort is dependent on the completion of work for CCO-187, and effort in dependent modules is planned to be worked on by the CCO team unless individual repo owners can help. Operators owners/teams will be expected to review merge requests and complete appropriate QE effort for an openshift release.

  • azure-sdk-for-go module dependency updated to support workload identity federation.
  • Mount the OIDC token in the operator pod. This needs to go in the deployment. See example from addition to the cluster-image-registry-operator here

 

ACCEPTANCE CRITERIA

  • CIRO should retrieve the "azure_resourcegroup" from the cluster Infrastructure object instead of the CCO created secret (this key will not be present when workload identity is in use)
  • CIRO's CredentialsRequest specifies the service account names (see the: cluster-storage-operator for an example)
  • CIRO is able to create storage accounts and containers when configured with azure workload identity.

Epic Overview

  • Enable customers to create and manage OpenShift clusters using managed identities for Azure resources for authentication.
  • A customer using ARO wants to spin up an OpenShift cluster with "az aro create" without needing additional input, i.e. without the need for an AD account or service principal credentials, and the identity used is never visible to the customer and cannot appear in the cluster.

Epic Goal

  • A customer creates an OpenShift cluster ("az aro create") using Azure managed identity.
  • Azure managed identities must work for installation with all install methods including IPI and UPI, work with upgrades, and day-to-day cluster lifecycle operations.
  • After Azure failed to implement workable golang API changes after the deprecation of their old API, we removed mint mode and work entirely in passthrough mode. Azure has plans to implement pod/workload identity similar to how it has been implemented in AWS and GCP, and when this feature is available, we should implement permissions similar to AWS/GCP.
  • This work cannot start until Azure have implemented this feature - as such, this Epic is a placeholder to track the effort when available.

Why is this important?

  • Microsoft and the customer would prefer that we use Managed Identities vs. Service Principal (which requires putting the Service Principal and principal password in clear text within the azure.conf file).

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

 

 

This effort is dependent on the completion of work for CCO-187, and effort in dependent modules is planned to be worked on by the CCO team unless individual repo owners can help. Operators owners/teams will be expected to review merge requests and complete appropriate QE effort for an openshift release.

  • azure-sdk-for-go module dependency updated to support workload identity federation.
  • Mount the OIDC token in the operator pod. This needs to go in the deployment. See example from addition to the cluster-image-registry-operator here

Create a config secret in the openshift-cloud-credential-operator namespace which contains the AZURE_TENANT_ID to be used for configuring the Azure AD pod identity webhook deployment.

This effort is dependent on the completion of work for CCO-187, and effort in dependent modules is planned to be worked on by the CCO team unless individual repo owners can help. Operators owners/teams will be expected to review merge requests and complete appropriate QE effort for an openshift release.

  • azure-sdk-for-go module dependency updated to support workload identity federation.
  • Mount the OIDC token in the operator pod. This needs to go in the deployment. See example from addition to the cluster-image-registry-operator here


These docs should cover:

  • A general overview of the feature, what changes are made to Azure credentials secrets and how to install a new cluster.
  • A usage guide for the `ccoctl azure` commands used to create/manage the infrastructure required for Azure workload identity.

See existing documentation for:


Epic Goal

  • Enable the OpenShift Installer to authenticate using authentication methods supported by both the Azure SDK for Go and the Terraform Azure provider
  • Future-proofing to enable Terraform support for workload identity authentication when it is enabled upstream

Why is this important?

  • This ties in to the larger OpenShift goal of: as an infrastructure owner, I want to deploy OpenShift on Azure using Azure Managed Identities (vs. using Azure Service Principal) for authentication and authorization.
  • Customers want support for using Azure managed identities in lieu of an Azure service principal. The OpenShift documentation currently directs users to an Azure Service Principal - "Azure offers the ability to create service accounts, which access, manage, or create components within Azure. The service account grants API access to specific services". However, Microsoft and the customer would prefer that we use User Managed Identities to avoid putting the Service Principal and its password in clear text within the azure.conf file.
  • See https://docs.microsoft.com/en-us/azure/active-directory/develop/workload-identity-federation for additional information.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions:

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a cluster admin I want to be able to:

  • use the managed identity from the installer host VM (running in Azure)

so that I can

  • install a cluster without copying credentials to the installer host

Acceptance Criteria:

Description of criteria:

  • The installer (Azure SDK) and Terraform authenticate using the identity from the host VM (not a client secret in the file ~/.azure/servicePrincipal.json)
  • The cluster credential is handled appropriately (presumably we force manual mode)

Engineering Details:

Epic Goal

  • Build a list of the specific permissions required to run OpenShift on Azure - components grant roles, but we need more granularity.
  • Determine and document the Azure roles and required permissions for Azure managed identity.

Why is this important?

  • Many of our customers have security policies in their organizations that restrict credentials to minimal permissions, which conflicts with the documented list of permissions needed for OpenShift. Customers need to know the explicit list of permissions minimally needed for deploying and running OpenShift, and what they are used for, so they can request the right permissions. Without this information, it can/will block adoption of OpenShift 4 in many cases.

Scenarios

  1. ...

Acceptance Criteria

  • Document explicit list of required credential permissions for installing (Day 1) OpenShift on Azure using the IPI and UPI deployment workflows and what each of the permissions are used for.
  • Document explicit list of required role and credential permissions for the operation (Day 2) of an OpenShift cluster on Azure and what each of the permissions are used for
  • Verify minimum list of permissions for Azure with IPI and UPI installation workflows
  • (Day 2) operations of OpenShift on Azure - MUST complete successfully with automated tests
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. Installer [both UPI & IPI Workflows]
  2. Control Plane
    • Kube Controller Manager
  3. Compute [Managed Identity]
  4. Cloud API enabled components
    • Cloud Credential Operator
    • Machine API
    • Internal Registry
    • Ingress
  5. ?
  6.  

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

 

 

User Story

As a cluster admin, I want the CCM and Node manager to utilize credentials generated by CCO so that the permissions granted to the identity can be scoped with least privilege on clusters utilizing Azure AD Workload Identity.

Background

The Cloud Controller Manager Operator creates a CredentialsRequest as part of CVO manifests which describes credentials that should be created for the CCM and Node manager to utilize. CCM and the Node Manager do not use the credentials created as a product of the CredentialsRequest in existing "passthrough" based Azure clusters or within Azure AD Workload Identity based Azure clusters. CCM and the Node Manager instead use a system-assigned identity which is attached to the Azure cluster VMs.

The system-assigned identity attached to the VMs is granted the "Contributor" role within the cluster's Azure resource group. In order to use the system-assigned identity, a pod must have sufficient privilege to use the host network to contact the Azure instance metadata service (IMDS). 

For Azure AD Workload Identity based clusters, administrators must process the CredentialsRequests extracted from the release image which includes the CredentialsRequest from CCCMO manifests. This CredentialsRequest processing results in the creation of a user-assigned managed identity which is not utilized by the cluster. Additionally, the permissions granted to the identity are currently scoped broadly to grant the "Contributor" role within the cluster's Azure resource group. If the CCM and Node Manager were to utilize the identity then we could scope the permissions granted to the identity to be more granular. It may be confusing to administrators to need to create this unused user-assigned managed identity with broad permissions access.

Steps

  • Modify the CCM and Node manager deployments to use the CCCMO's Azure credentials injector as an init-container to merge the provided CCO credentials secret into the /etc/kube/cloud.conf file used to configure cloud-provider-azure within CCM and the Node Manager (see the sketch after this list). An example of the init-container can be found within the azure-file-csi-driver-operator.
  • Validate that the provided credentials are used by CCM and the Node Manager and that they continue to operate normally.
  • Scope permissions specified in the CCCMO CredentialsRequest to only those permissions needed for operation rather than "Contributor" within the Azure resource group.
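A minimal sketch of what that init-container wiring could look like in the CCM deployment's pod template; the injector binary path, image reference, secret name, and volume names below are assumptions for illustration (the azure-file-csi-driver-operator example should be treated as the authoritative reference):

# Fragment of the CCM Deployment pod template (names are illustrative assumptions).
initContainers:
- name: azure-inject-credentials
  image: <cccmo-image>                          # placeholder, not a real image reference
  command:
  - /bin/azure-config-credentials-injector      # assumed path to the CCCMO credentials injector
  args:
  - --cloud-config-file-path=/etc/cloud-config/cloud.conf
  - --output-file-path=/etc/merged-cloud-config/cloud.conf
  env:
  - name: AZURE_CLIENT_ID
    valueFrom:
      secretKeyRef:
        name: azure-cloud-credentials           # secret produced by CCO from the CredentialsRequest (assumed name)
        key: azure_client_id
  volumeMounts:
  - name: cloud-config
    mountPath: /etc/cloud-config
  - name: merged-cloud-config
    mountPath: /etc/merged-cloud-config
containers:
- name: cloud-controller-manager
  # ...existing container spec, now reading the merged config from /etc/kube/cloud.conf...
  volumeMounts:
  - name: merged-cloud-config
    mountPath: /etc/kube
volumes:
- name: cloud-config
  configMap:
    name: cloud-conf                            # assumed name of the cloud config ConfigMap
- name: merged-cloud-config
  emptyDir: {}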

Stakeholders

  • <Who is interested in this/where did they request this>

Definition of Done

  • CCM and Node Manager use credentials provided by CCO rather than the system-assigned identity attached to the VMs.
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • e2e tests validate that the CCM and Node manager operate normally with the credentials provided by CCO.

User Story

As a [user|developer|<other>] I want [some goal] so that [some reason]

<Describes high level purpose and goal for this story. Answers the questions: Who is impacted, what is it and why do we need it?>

Background

<Describes the context or background related to this story>

Steps

  • <Add steps to complete this card if appropriate>

Stakeholders

  • <Who is interested in this/where did they request this>

Definition of Done

  • <Add items that need to be completed for this card>
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Add a new field (DataPermissions) to the Azure CredentialsRequest CR and plumb it through to the data actions of the custom role assigned to the generated user-assigned managed identity.

Add actuator code to satisfy the permissions specified in the 'Permissions' API field. The implementation should create a new custom role with the specified permissions and assign it to the generated user-assigned managed identity along with the predefined roles enumerated in CredReq.RoleBindings. The role we create for the CredentialsRequest should be discoverable so that it can be idempotently updated on re-invocation of ccoctl.
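A minimal sketch of how these fields might look on an Azure CredentialsRequest, assuming the field names follow the wording above (permissions for control-plane actions, dataPermissions for data actions); the exact schema is defined in the cloud-credential-operator API and may differ:

apiVersion: cloudcredential.openshift.io/v1
kind: CredentialsRequest
metadata:
  name: example-azure-component          # illustrative name
  namespace: openshift-cloud-credential-operator
spec:
  secretRef:
    name: example-azure-credentials
    namespace: example-namespace
  providerSpec:
    apiVersion: cloudcredential.openshift.io/v1
    kind: AzureProviderSpec
    roleBindings:
    - role: Contributor                  # predefined role from CredReq.RoleBindings
    permissions:                         # control-plane actions for the generated custom role (assumed field)
    - Microsoft.Compute/virtualMachines/read
    dataPermissions:                     # data actions for the generated custom role (assumed field)
    - Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read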

Questions to answer based on lessons learned from custom roles in GCP, assuming that we will create one custom role per identity:

  • Does Azure have soft/hard role deletion? i.e., are custom roles retained for some period following deletion and, if so, do deleted roles count towards quota?
  • What is the default quota limitation for custom roles in Azure?
  • Does it make sense to create a custom role for each identity created based on quota limitations?
    • If it doesn't make sense, how can the roles be condensed to satisfy the quota limitations?

User Story

As a [user|developer|<other>] I want [some goal] so that [some reason]

<Describes high level purpose and goal for this story. Answers the questions: Who is impacted, what is it and why do we need it?>

Background

<Describes the context or background related to this story>

Steps

  • <Add steps to complete this card if appropriate>

Stakeholders

  • <Who is interested in this/where did they request this>

Definition of Done

  • <Add items that need to be completed for this card>
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Feature Overview

RHEL CoreOS should be updated to RHEL 9.2 sources to take advantage of newer features, hardware support, and performance improvements.

 

Requirements

  • RHCOS builds based on RHEL 9.x sources, starting with RHEL 9.2 in OCP 4.13.

 

  • CI - MUST be running successfully with test automation. This is a requirement for ALL features. (isMvp: YES)
  • Release Technical Enablement - Provide necessary release enablement details and documents. (isMvp: YES)

(Optional) Use Cases

  • 9.2 Preview via Layering - no longer necessary, assuming we stay the course of going all in on 9.2

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

Epic Goal

  • The Kernel API was updated for RHEL 9, so the old approach of setting the `sched_domain` in `/sys/kernel` is no longer available. Instead, cgroups have to be worked with directly.
  • Both CRI-O and PAO need to be updated to set the cpuset of containers and other processes correctly, as well as to set the correct value for sched_load_balance.

Why is this important?

  • CPU load balancing is a vital piece of real-time execution for processes that need exclusive access to a CPU. Without this, CPU load balancing won't work on RHEL 9 with OpenShift 4.13.

Scenarios

  1. As a developer on OpenShift, I expect my pods to run with exclusive CPUs if I set the PAO configuration correctly

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions:

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Part of setting up CPU load balancing on RHEL 9 involves disabling sched_load_balance on cgroups that contain a cpuset that should be exclusive. The PAO may need to be responsible for this piece.

This is the Epic to track the work to add RHCOS 9 in OCP 4.13 and to make OCP use it by default.

 

CURRENT STATUS: Landed in 4.14 and 4.13

 

Testing with layering

 

Another option, given an existing (e.g. 4.12) cluster, is to use layering. First, get a digested pull spec for the current build:

$ skopeo inspect --format "{{.Name}}@{{.Digest}}" -n docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev:4.13-9.2
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b4cc3995d5fc11e3b22140d8f2f91f78834e86a210325cbf0525a62725f8e099

Create a MachineConfig that looks like this:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: worker-override
spec:
  osImageURL: <digested pull spec>

If you want to also override the control plane, create a similar one for the master role.
 
We don't yet have auto-generated release images. However, if you want one, you can ask cluster bot to e.g. "launch https://github.com/openshift/machine-config-operator/pull/3485" with the options you want (e.g. "azure" etc.), or just "build https://github.com/openshift/machine-config-operator/pull/3485" to get a release image.

STATUS:  Code is merged for 4.13 and is believed to largely solve the problem.

 


 

Description of problem:

Upgrades from OpenShift 4.12 to 4.13 will also upgrade the underlying RHCOS from 8.6 to 9.2. As part of that, the names of the network interfaces may change. For example, `eno1` may be renamed to `eno1np0`. If a host is using NetworkManager configuration files that rely on those names, then the host will fail to connect to the network when it boots after the upgrade. For example, if the host had static IP addresses assigned, it will instead boot using IP addresses assigned via DHCP.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always.

Steps to Reproduce:

1. Select hardware (or VMs) that will have different network interface names in RHCOS 8 and RHCOS 9, for example `eno1` in RHCOS 8 and `eno1np0` in RHCOS 9.

2. Install a 4.12 cluster with static network configuration using the `interface-name` field of NetworkManager interface configuration files to match the configuration to the network interface.

3. Upgrade the cluster to 4.13.

Actual results:

The NetworkManager configuration files are ignored because they no longer match the NIC names. Instead, the NICs get new IP addresses from DHCP.

Expected results:

The NetworkManager configuration files are updated as part of the upgrade to use the new NIC names.

Additional info:

Note this is a hypothetical scenario. We have detected this potential problem in a slightly different scenario where we install a 4.13 cluster with the assisted installer. During the discovery phase we use RHCOS 8 and we generate the NetworkManager configuration files. Then we reboot into RHCOS 9, and the configuration files are ignored due to the change in the NIC names. See MGMT-13970 for more details.

BU Priority Overview

Enable installation and lifecycle support of OpenShift 4 on Oracle Cloud Infrastructure (OCI) with VMs

Goals

  • Enable installation of OpenShift 4 on Oracle Cloud Infrastructure (OCI) with VMs using a platform-agnostic approach with the Assisted Installer.
  • OpenShift 4 on OCI (with VMs) can be updated, resulting in a cluster and applications that are in a healthy state when the update is completed.
  • Telemetry reports back on clusters using OpenShift 4 on OCI for connected OpenShift clusters (e.g. platform=none using Oracle CSI).

State of the Business

Currently, we don't yet support OpenShift 4 on Oracle Cloud Infrastructure (OCI), and we know from initial attempts that installing OpenShift on OCI requires the use of a qcow image (the OpenStack qcow seems to work fine) and involves networking and routing changes, storage issues, potential MTU and registry issues, etc.

Execution Plans

TBD based on customer demand.

 

Why is this important

  • OCI is starting to gain momentum.
  • In the Middle East (e.g. Saudi Arabia), only OCI and Alibaba Cloud are approved hyperscalers.

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP requirement gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.

 

  • CI - MUST be running successfully with test automation. This is a requirement for ALL features. (isMvp: YES)
  • Release Technical Enablement - Provide necessary release enablement details and documents. (isMvp: YES)

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

RFEs:

  • RFE-3635 - Supporting OpenShift on Oracle Cloud Infrastructure (OCI) & Oracle Private Cloud Appliance (PCA)

Questions to answer…

  • ...

Out of Scope

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

 

Other

 

 

 

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description of the problem:

I've created a cluster with platform type 'baremetal' and discovered hosts. Then, when I try to change to the Nutanix platform, the backend returns an error.

How reproducible:

100% 

Steps to reproduce:

1. Create cluster without platform integration

2. Discover 3 hosts

3. Try to change platform to 'Nutanix'

Actual results:

API returns an error.

Expected results:
We should be able to change the platform type; this change should be agnostic to the discovered hosts.

The external platform will be available behind the TechPreviewNoUpgrade feature set; automatically enable this flag in the installer config when the oci platform is selected.

There are 2 options to detect if the hosts are running on OCI:

1/ On OCI, the machine will have the following chassis-asset-tag:

# dmidecode --string chassis-asset-tag
OracleCloud.com

In the agent, we can override hostInventory.SystemVendor.Manufacturer when chassis-asset-tag="OracleCloud.com".

2/  Read instance metadata: curl -v -H "Authorization: Bearer Oracle"  http://169.254.169.254/opc/v2/instance

This will allow auto-detection of the platform from the provider in assisted-service, and validation that hosts are running in OCI when installing a cluster with platform=oci.

Description of the problem:
The features API tells us that EXTERNAL_PLATFORM_OCI is supported for version 4.14 and the s390x CPU architecture, but the attempt to create the cluster fails with "Can't set oci platform on s390x architecture".
 

 

Steps to reproduce:

1. Register cluster with OCI platform and z architecture

 

Currently the API call "GET /v2/clusters/{cluster_id}/supported-platforms" returns the hosts' supported platforms regardless of the other cluster parameters.

We currently rely on a hack to deploy a cluster on external platform: https://github.com/openshift/assisted-service/pull/5312

The goal of this ticket is to move the definition of the external platform into the install-config once the openshift installer is released with support for the external platform: https://github.com/openshift/installer/pull/7217
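A minimal sketch of what the relevant install-config stanza could look like once that installer support is available; the exact field layout (platform.external.platformName and the top-level featureSet) is an assumption based on the installer PR referenced above and the TechPreviewNoUpgrade note earlier in this section:

apiVersion: v1
baseDomain: example.com
metadata:
  name: oci-cluster                  # illustrative name
featureSet: TechPreviewNoUpgrade     # assumed to be required while the external platform is TechPreview
platform:
  external:
    platformName: oci                # tells in-cluster components which partner platform this is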

Description of the problem:

Currently, the infrastructure object is created as follows:

 # oc get infrastructure/cluster -oyaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2023-06-19T13:49:07Z"
  generation: 1
  name: cluster
  resourceVersion: "553"
  uid: 240dc176-566e-4471-b9db-fb25c676ba33
spec:
  cloudConfig:
    name: ""
  platformSpec:
    type: None
status:
  apiServerInternalURI: https://api-int.test-infra-cluster-97ef21c5.assisted-ci.oci-rhelcert.edge-sro.rhecoeng.com:6443
  apiServerURL: https://api.test-infra-cluster-97ef21c5.assisted-ci.oci-rhelcert.edge-sro.rhecoeng.com:6443
  controlPlaneTopology: HighlyAvailable
  cpuPartitioning: None
  etcdDiscoveryDomain: ""
  infrastructureName: test-infra-cluster-97-w6b42
  infrastructureTopology: HighlyAvailable
  platform: None
  platformStatus:
    type: None

Instead, it should be similar to:

# oc get infrastructure/cluster -oyaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2023-06-19T13:49:07Z"
  generation: 1
  name: cluster
  resourceVersion: "553"
  uid: 240dc176-566e-4471-b9db-fb25c676ba33
spec:
  cloudConfig:
    name: ""
  platformSpec:
    type: External
    external:
      platformName: oci
status:
  apiServerInternalURI: https://api-int.test-infra-cluster-97ef21c5.assisted-ci.oci-rhelcert.edge-sro.rhecoeng.com:6443
  apiServerURL: https://api.test-infra-cluster-97ef21c5.assisted-ci.oci-rhelcert.edge-sro.rhecoeng.com:6443
  controlPlaneTopology: HighlyAvailable
  cpuPartitioning: None
  etcdDiscoveryDomain: ""
  infrastructureName: test-infra-cluster-97-w6b42
  infrastructureTopology: HighlyAvailable
  platform: External
  platformStatus:
    type: External
    external:
      cloudControllerManager:
        state: External

How reproducible:

 

Steps to reproduce:

1.

2.

3.

Actual results:

 

Expected results:

The taint here: https://github.com/openshift/assisted-installer/pull/629/files#diff-1046cc2d18cf5f82336bbad36a2d28540606e1c6aaa0b5073c545301ef60ffd4R593

should only be removed when the platform is nutanix or vsphere, because the credentials for these platforms are passed after cluster installation.

In contrast, with Oracle Cloud the instance gets its credentials through the instance metadata, so it should be able to label the nodes from the beginning of the installation without any user intervention.

In order to install the Oracle CCM driver, we need the ability to set the platform to "external" in the install-config.

The platform needs to be added here: https://github.com/openshift/assisted-service/blob/3496d1d2e185343c6a3b1175c810fdfd148229b2/internal/installcfg/installcfg.go#L8

Slack thread: https://redhat-internal.slack.com/archives/CUPJTHQ5P/p1678801176091619

The goal of this ticket is to check whether, besides setting the external platform, the Assisted Installer can also install the CCM, and to document it.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Create a new platform type, working name "External", that will signify when a cluster is deployed on a partner infrastructure where core cluster components have been replaced by the partner. “External” is different from our current platform types in that it will signal that the infrastructure is specifically not “None” or any of the known providers (eg AWS, GCP, etc). This will allow infrastructure partners to clearly designate when their OpenShift deployments contain components that replace the core Red Hat components.

This work will require updates to the core OpenShift API repository to add the new platform type, and then a distribution of this change to all components that use the platform type information. For components that partners might replace, per-component action will need to be taken, with the project team's guidance, to ensure that the component properly handles the "External" platform. These changes will look slightly different for each component.

To integrate these changes more easily into OpenShift, it is possible to take a multi-phase approach which could be spread over a release boundary (eg phase 1 is done in 4.X, phase 2 is done in 4.X+1).

Phase 1

  • Write platform “External” enhancement.
  • Evaluate changes to cluster capability annotations to ensure coverage for all replaceable components.
  • Meet with component teams to plan specific changes that will allow for supplement or replacement under platform "External".

Phase 2

  • Update OpenShift API with new platform and ensure all components have updated dependencies.
  • Update capabilities API to include coverage for all replaceable components.
  • Ensure all Red Hat operators tolerate the "External" platform and treat it the same as "None" platform.

Phase 3

  • Update components based on identified changes from phase 1
    • Update Machine API operator to run core controllers in platform "External" mode.

Why is this important?

  • As partners begin to supplement OpenShift's core functionality with their own platform specific components, having a way to recognize clusters that are in this state helps Red Hat created components to know when they should expect their functionality to be replaced or supplemented. Adding a new platform type is a significant data point that will allow Red Hat components to understand the cluster configuration and make any specific adjustments to their operation while a partner's component may be performing a similar duty.
  • The new platform type also helps with support to give a clear signal that a cluster has modifications to its core components that might require additional interaction with the partner instead of Red Hat. When combined with the cluster capabilities configuration, the platform "External" can be used to positively identify when a cluster is being supplemented by a partner, and which components are being supplemented or replaced.

Scenarios

  1. A partner wishes to replace the Machine controller with a custom version that they have written for their infrastructure. Setting the platform to "External" and advertising the Machine API capability gives a clear signal to the Red Hat created Machine API components that they should start the infrastructure generic controllers but not start a Machine controller.
  2. A partner wishes to add their own Cloud Controller Manager (CCM) written for their infrastructure. Setting the platform to "External" and advertising the CCM capability gives a clear signal to the Red Hat created CCM operator that the cluster should be configured for an external CCM that will be managed outside the operator. Although the Red Hat operator will not provide this functionality, it will configure the cluster to expect a CCM.

Acceptance Criteria

Phase 1

  • Partners can read "External" platform enhancement and plan for their platform integrations.
  • Teams can view jira cards for component changes and capability updates and plan their work as appropriate.

Phase 2

  • Components running in cluster can detect the “External” platform through the Infrastructure config API
  • Components running in cluster react to “External” platform as if it is “None” platform
  • Partners can disable any of the platform specific components through the capabilities API

Phase 3

  • Components running in cluster react to the “External” platform based on their function.
    • for example, the Machine API Operator needs to run a set of controllers that are platform agnostic when running in platform “External” mode.
    • the specific component reactions are difficult to predict currently, this criteria could change based on the output of phase 1.

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. Identifying OpenShift Components for Install Flexibility

Open questions:

  1. Phase 1 requires talking with several component teams, the specific action that will be needed will depend on the needs of the specific component. At the least the components need to treat platform "External" as "None", but there could be more changes depending on the component (eg Machine API Operator running non-platform specific controllers).

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal

  • As defined in the parent feature (OCPBU-5), this epic is about adding the new "External" platform type and ensuring that the OpenShift operators which react to platform types treat the "External" platform as if it were a "None" platform.
  • Add an end-to-end test to exercise the "External" platform type

Why is this important?

  • This work lays the foundation for partners and users to customize OpenShift installations that might replace infrastructure level components.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Background

As described in the external platform enhancement, the machine-api-operator should be modified to react to the external platform type in the same manner as platform none.

Steps

  • add an extra clause to the platform switch that will group "External" with "None"

Stakeholders

  • openshift eng

Definition of Done

  • MAO behaves as if platform None when External is selected
  • Docs
  • developer docs for MAO should be updated
  • Testing

Background

As described in the external platform enhancement, the cluster-cloud-controller-manager-operator should be modified to react to the external platform type in the same manner as platform none.

Steps

  • add an extra clause to the platform switch that will group "External" with "None"

Stakeholders

  • openshift eng

Definition of Done

  • CCCMO behaves as if platform None when External is selected
  • Docs
  • developer docs for CCCMO should be updated
  • Testing

This feature is the placeholder for all epics related to technical debt associated with the Console team.

Outcome Overview

Once all Features and/or Initiatives in this Outcome are complete, what tangible, incremental, and (ideally) measurable movement will be made toward the company's Strategic Goal(s)?

 

Success Criteria

What is the success criteria for this strategic outcome?  Avoid listing Features or Initiatives and instead describe "what must be true" for the outcome to be considered delivered.

 

 

Expected Results (what, how, when)

What incremental impact do you expect to create toward the company's Strategic Goals by delivering this outcome? (possible examples: unblocking sales, shifts in product metrics, etc.; provide links to metrics that will be used post-completion for review & pivot decisions). For each expected result, list what you will measure and when you will measure it (e.g. provide links to existing information or metrics that will be used post-completion for review, and specify when you will review the measurement, such as 60 days after the work is complete).

 

 

Post Completion Review – Actual Results

After completing the work (as determined by the "when" in Expected Results above), list the actual results observed / measured during Post Completion review(s).

 

Feature Overview

Create an Azure cloud-specific spec.resourceTags entry in the infrastructure CRD. This should create and update tags (or labels in Azure) on any OpenShift cloud resource that we create and manage. The behaviour should also tag existing resources that do not yet have the tags, and once the tags in the infrastructure CRD are changed, all the resources should be updated accordingly.

Tag deletes continue to be out of scope, as the customer can still have custom tags applied to the resources that we do not want to delete.

Due to the ongoing in-tree/out-of-tree split on the cloud and CSI providers, this should not apply to clusters with in-tree providers (!= "external").

Once we are confident that all components are updated, we should introduce an end-to-end test that makes sure we never create resources that are untagged.
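As a rough illustration, the sketch below assumes the install-config field layout used for the 4.13 TechPreview (platform.azure.userTags), with the tags then expected to surface in the Infrastructure CR under status.platformStatus.azure.resourceTags; field names should be confirmed against the 4.14 API:

# install-config.yaml fragment (field names assumed from the 4.13 TechPreview)
platform:
  azure:
    region: centralus
    userTags:
      environment: production       # applied to cluster-created Azure resources
      cost-center: "1234"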

 
Goals

  • Functionality on Azure Tech Preview
  • inclusion in the cluster backups
  • flexibility of changing tags during cluster lifetime, without recreating the whole cluster

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP requirement gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.

  • CI - MUST be running successfully with test automation. This is a requirement for ALL features. (isMvp: YES)
  • Release Technical Enablement - Provide necessary release enablement details and documents. (isMvp: YES)

List any affected packages or components.

  • Installer
  • Cluster Infrastructure
  • Storage
  • Node
  • NetworkEdge
  • Internal Registry
  • CCO

This is a continuation of the CORS-2249 / CFE-671 work, where support for Azure tags was delivered as TechPreview in 4.13; the goal is to make it GA in 4.14. This involves removing any reference to TechPreview in code and docs and incorporating any feedback received from users.

Remove the code references that mark Azure Tags as TechPreview in the list below:

  • installer/data/data/install.openshift.io_installconfigs.yaml (PR#6820)
  • installer/pkg/explain/printer_test.go (PR#6820)
  • installer/pkg/types/azure/platform.go (PR#6820)
  • installer/pkg/types/validation/installconfig.go (PR#6820)

The Control Plane MachineSet enables OCP clusters to scale Control plane machines. This epic is about making the Control Plane MachineSet controller work with OpenStack.

Goal

  • The control plane nodes can be scaled up and down, lost and recovered.

Why is this important?

  • The procedure to recover from a failed control plane node and to add new nodes is lengthy. In order to increase scale flexibility, a simpler mechanism needs to be supported.

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. https://docs.openshift.com/container-platform/4.12/machine_management/control_plane_machine_management/cpmso-about.html


The FailureDomain API that was introduced in 4.13 was TechPreview and is now replaced by an API in openshift/api; it no longer lives in the installer.

 

Therefore, we want to remove any unsupported API from the installer so that we can later add the supported API in order to add support for CPMS on OpenStack.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Create a warning-severity alert to notify the admin that packet loss is occurring due to failed OVS vswitchd lookups. This may occur if vswitchd is CPU constrained and there are also numerous lookups.

Use the metric ovs_vswitchd_netlink_overflow, which shows netlink messages dropped by the vswitchd daemon due to buffer overflow in userspace.

For the kernel equivalent, use the metric ovs_vswitchd_dp_flows_lookup_lost. Both metrics usually have the same value but may differ if vswitchd restarts.

Both of these metrics should be aggregated into a single alert that fires if the value has increased recently.
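A minimal sketch of what such an aggregated alert could look like as a PrometheusRule; the alert name, lookback window, and threshold below are assumptions for illustration, while the metric names come from the description above:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ovs-packet-drop-alerts        # illustrative name
  namespace: openshift-monitoring
spec:
  groups:
  - name: ovs.vswitchd.drops
    rules:
    - alert: OVSVSwitchdPacketDrops   # hypothetical alert name
      # Fire when either the userspace or kernel drop counter increased in the last 15 minutes.
      expr: |
        (sum(increase(ovs_vswitchd_netlink_overflow[15m])) by (instance)
          + sum(increase(ovs_vswitchd_dp_flows_lookup_lost[15m])) by (instance)) > 0
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: OVS vswitchd is dropping packets due to failed lookups.
        description: Netlink messages or datapath flow lookups are being lost, which can indicate that ovs-vswitchd is CPU constrained.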

 

DoD: QE test case, code merged to CNO, metrics document updated ( https://docs.google.com/document/d/1lItYV0tTt5-ivX77izb1KuzN9S8-7YgO9ndlhATaVUg/edit )

< High-Level description of the feature ie: Executive Summary >

Goals

< Who benefits from this feature, and how? What is the difference between today’s current state and a world with this feature? >

Requirements

Requirement Notes isMvp?

(Optional) Use Cases

< What are we making, for who, and why/what problem are we solving?>

Out of scope

<Defines what is not included in this story>

Dependencies

< Link or at least explain any known dependencies. >

Background, and strategic fit

< What does the person writing code, testing, documenting need to know? >

Assumptions

< Are there assumptions being made regarding prerequisites and dependencies?>

< Are there assumptions about hardware, software or people resources?>

Customer Considerations

< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>

Documentation Considerations

< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >

What does success look like?

< Does this feature have doc impact? Possible values are: New Content, Updates to existing content, Release Note, or No Doc Impact?>

QE Contact

< Are there assumptions being made regarding prerequisites and dependencies?>

< Are there assumptions about hardware, software or people resources?>

Impact

< If the feature is ordered with other work, state the impact of this feature on the other work>

Related Architecture/Technical Documents

<links>

Done Checklist

  • Acceptance criteria are met
  • Non-functional properties of the Feature have been validated (such as performance, resource, UX, security or privacy aspects)
  • User Journey automation is delivered
  • Support and SRE teams are provided with enough skills to support the feature in production environment

Problem:

There's no way in the UI for the cluster admin to

  • change the default timeout period for the Web Terminal for all users
  • select an image from an image repository to be used as the default image for the Web Terminal for all users

Goal:

Expose, through the UI, the ability for cluster admins to provide customization for all Web Terminal users that is currently available via wtoctl.

Why is it important?

Acceptance criteria:

  1. Cluster admin should be able to change the default timeout period for all new instances of the Web Terminal (it won't change settings)
  2. Cluster admin should be able to provide a new image as the default image for all new instances of the Web Terminal (it won't change settings)

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Questions:

  • Where will this information be shared?
  • What CLI is used to accomplish this today? Get link to docs

Description

Allow cluster admin to provide default image and/or timeout period for all cluster users

Acceptance Criteria

    <