Reinforcement Learning

  Type          Learning paradigm
  Basis         Markov Decision Process
  Goal          Maximize cumulative reward
  Key Concepts  Agent, Environment, Policy, Reward
  AM Uses       Process optimization, adaptive control

Reinforcement Learning (RL) is a machine learning paradigm where an agent learns to make decisions by interacting with an environment. The agent takes actions, receives feedback in the form of rewards or penalties, and learns a policy that maximizes cumulative reward over time.

Unlike supervised learning, which requires labeled examples, RL learns from trial and error. This makes it well suited to additive manufacturing applications where optimal parameters must be discovered through experimentation, such as real-time process control and parameter optimization.

Contents
  1. Core Concept
  2. Key Components
  3. Algorithms
  4. Applications in AM
  5. Challenges

Core Concept

                    ┌─────────────────────────────────┐
                    │         ENVIRONMENT             │
                    │   (3D Printer / Simulation)     │
                    └───────────┬─────────────────────┘
                                │
              State (sₜ)        │        Reward (rₜ)
         (temp, speed,...)     │     (part quality)
                    ┌──────────┴──────────┐
                    │                     │
                    ▼                     │
              ┌───────────┐               │
              │   AGENT   │◀──────────────┘
              │  (Policy) │
              └─────┬─────┘
                    │
                    ▼ Action (aₜ)
              (adjust temp, speed)
            
The RL loop: At each timestep, the agent observes the current state, takes an action according to its policy, receives a reward, and observes the new state. The goal is to learn a policy that maximizes the total reward over an episode.
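
The loop above can be written as a short program. The sketch below is illustrative only: it assumes a hypothetical environment object following the Gymnasium-style reset()/step() interface and a placeholder policy function.

  # Minimal agent-environment interaction loop (sketch).
  # `env` is assumed to follow the Gymnasium reset()/step() interface;
  # `policy` is any function mapping a state to an action.
  def run_episode(env, policy):
      state, _ = env.reset()                            # observe initial state s_0
      total_reward, done = 0.0, False
      while not done:
          action = policy(state)                        # a_t = pi(s_t)
          state, reward, terminated, truncated, _ = env.step(action)
          total_reward += reward                        # accumulate r_t
          done = terminated or truncated
      return total_reward                               # episode return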

Key Components

Component      Definition                             AM Example
Agent          The learner/decision-maker             Control algorithm for printer
Environment    Everything the agent interacts with    3D printer + material + sensors
State (s)      Current situation observed by agent    Temperature, layer height, defect count
Action (a)     Decision made by agent                 Increase speed, lower temperature
Reward (r)     Feedback signal (scalar)               +1 for good layer, -10 for defect
Policy (π)     Strategy mapping states to actions     π(s) = "if defect detected, reduce speed"

Algorithms

Q-Learning

Learns a Q-function Q(s,a) that estimates the expected cumulative reward for taking action a in state s and acting greedily thereafter. Each update moves Q(s,a) toward the Bellman target r + γ·max_a' Q(s', a'). Works for discrete state/action spaces.
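
As a concrete illustration, a minimal tabular Q-learning update might look like the sketch below; the action names and hyperparameters are hypothetical placeholders, not recommended settings.

  # Tabular Q-learning (sketch) for a discretized state/action space.
  from collections import defaultdict
  import random

  Q = defaultdict(float)                  # Q[(state, action)] -> estimated return
  alpha, gamma, epsilon = 0.1, 0.99, 0.1  # learning rate, discount, exploration
  actions = ["raise_temp", "lower_temp", "hold"]   # illustrative action set

  def choose_action(state):
      # epsilon-greedy exploration
      if random.random() < epsilon:
          return random.choice(actions)
      return max(actions, key=lambda a: Q[(state, a)])

  def q_update(state, action, reward, next_state):
      # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
      best_next = max(Q[(next_state, a)] for a in actions)
      Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])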

Deep Q-Network (DQN)

Uses a neural network to approximate the Q-function, enabling RL on high-dimensional states (like images). Introduced experience replay and target networks for stability.
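
A minimal sketch of those two stabilizing ingredients, using PyTorch with illustrative network sizes and hyperparameters (the 4-dimensional state and 3 discrete actions are placeholders):

  # DQN core pieces (sketch): experience replay + target network for the TD target.
  # Buffer entries are assumed to be (state, action, reward, next_state, done) tuples.
  import random
  from collections import deque
  import torch
  import torch.nn as nn

  buffer = deque(maxlen=100_000)                        # experience replay memory
  q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 3))
  target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 3))
  target_net.load_state_dict(q_net.state_dict())        # refreshed periodically
  optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
  gamma = 0.99

  def train_step(batch_size=32):
      batch = random.sample(list(buffer), batch_size)
      states, acts, rewards, next_states, dones = zip(*batch)
      s  = torch.tensor(states, dtype=torch.float32)
      a  = torch.tensor(acts).unsqueeze(1)
      r  = torch.tensor(rewards, dtype=torch.float32)
      s2 = torch.tensor(next_states, dtype=torch.float32)
      d  = torch.tensor(dones, dtype=torch.float32)
      q_sa = q_net(s).gather(1, a).squeeze(1)           # Q(s,a) for the taken actions
      with torch.no_grad():                             # target network held fixed
          target = r + gamma * (1 - d) * target_net(s2).max(dim=1).values
      loss = nn.functional.mse_loss(q_sa, target)
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()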

Policy Gradient (REINFORCE)

Directly optimizes the policy parameters using gradient ascent on expected reward. Works with continuous action spaces. Often combined with a value function baseline.
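
The sketch below shows the core REINFORCE update with a simple mean-return baseline, again using PyTorch; log_probs are assumed to be the log-probabilities of the actions sampled during one episode.

  # REINFORCE update (sketch): gradient ascent on expected return.
  import torch

  def reinforce_update(log_probs, rewards, optimizer, gamma=0.99):
      # Discounted return G_t for each timestep of the episode.
      returns, g = [], 0.0
      for r in reversed(rewards):
          g = r + gamma * g
          returns.insert(0, g)
      returns = torch.tensor(returns)
      returns = returns - returns.mean()                # mean baseline lowers variance
      # Minimizing -sum_t log pi(a_t|s_t) * G_t ascends the expected return.
      loss = -(torch.stack(log_probs) * returns).sum()
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()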

Actor-Critic (A2C, PPO)

Combines policy gradient (actor) with value function (critic). PPO (Proximal Policy Optimization) is widely used for its stability and performance. Good for robotics and control.
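
The defining piece of PPO is its clipped surrogate objective; a minimal sketch, with inputs such as the critic's advantage estimates assumed to be precomputed, is:

  # PPO clipped surrogate loss (sketch). new_log_probs come from the current
  # actor, old_log_probs from the policy that collected the data, and
  # advantages from the critic (e.g. via GAE); all inputs are 1-D tensors.
  import torch

  def ppo_policy_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
      ratio = torch.exp(new_log_probs - old_log_probs)              # pi_new / pi_old
      unclipped = ratio * advantages
      clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
      return -torch.min(unclipped, clipped).mean()                  # pessimistic bound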

Applications in Additive Manufacturing

Adaptive Process Control:
RL agents can learn to adjust print parameters (temperature, speed, flow rate) in real time based on sensor feedback. When defects are detected, the agent learns corrective actions that minimize scrap while maintaining throughput.

Parameter Optimization:
Instead of exhaustive grid search or trial-and-error, RL explores the parameter space efficiently. The agent learns which parameter combinations produce the best parts with minimal experiments, which is crucial when materials are expensive.

Topology Optimization (SOgym):
The DTU TopOpt group developed SOgym, an RL environment for topology optimization. An RL agent learns to place material iteratively, potentially discovering solutions faster than traditional SIMP methods.

Typical RL Setup for AM
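
There is no single standard setup, but a common framing is to wrap the printer (or its digital twin) as an RL environment. The sketch below uses a Gymnasium-style interface; the state variables, action set, process model, and reward weights are all hypothetical placeholders, not values from any real machine or study.

  # Illustrative Gymnasium-style environment for AM process control (sketch).
  import numpy as np
  import gymnasium as gym
  from gymnasium import spaces

  class PrintProcessEnv(gym.Env):
      def __init__(self):
          # State: [nozzle temperature (C), print speed (mm/s), defect score 0-1]
          self.observation_space = spaces.Box(
              low=np.array([180.0, 10.0, 0.0], dtype=np.float32),
              high=np.array([260.0, 120.0, 1.0], dtype=np.float32))
          # Actions: 0 = lower temperature, 1 = hold, 2 = raise temperature
          self.action_space = spaces.Discrete(3)

      def reset(self, seed=None, options=None):
          super().reset(seed=seed)
          self.state = np.array([220.0, 60.0, 0.0], dtype=np.float32)
          return self.state, {}

      def step(self, action):
          temp, speed, defect = self.state
          temp += (action - 1) * 5.0                    # apply the chosen adjustment
          # Toy process model: defects grow as temperature drifts from 220 C.
          defect = min(1.0, abs(temp - 220.0) / 40.0)
          reward = 1.0 - defect - (10.0 if defect > 0.8 else 0.0)   # quality minus penalty
          self.state = np.array([temp, speed, defect], dtype=np.float32)
          terminated = bool(defect > 0.95)
          return self.state, float(reward), terminated, False, {}

Any of the algorithms above (for example tabular Q-learning or an off-the-shelf PPO implementation) could then be trained against such an environment, ideally in simulation before touching a physical machine.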

Challenges

Sample Efficiency

RL often requires millions of environment interactions to learn a good policy. In AM, each physical "experiment" takes hours and consumes material. Common mitigations include training in simulation (digital twins), transfer learning, and model-based RL.

Sim-to-Real Gap

Policies learned in simulation may not transfer perfectly to real printers. Domain randomization and careful simulator calibration help bridge this gap.
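
As a small illustration of domain randomization, the simulator's nuisance parameters can be perturbed at the start of every episode; the parameter names and ranges below are hypothetical.

  # Domain randomization (sketch): vary simulator parameters each episode so the
  # policy does not overfit to one idealized model. Attribute names are made up.
  import random

  def randomize_simulator(sim):
      sim.ambient_temp = random.uniform(18.0, 30.0)        # room temperature (C)
      sim.material_viscosity *= random.uniform(0.9, 1.1)   # batch-to-batch variation
      sim.sensor_noise_std = random.uniform(0.0, 0.05)     # sensor noise level
      return sim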

Reward Design

Defining the right reward function is critical. A reward that is too sparse (e.g., a single score only at the end of the print) makes learning slow, while a densely shaped reward (feedback on every action) can be exploited, leading the agent to optimize the proxy signal rather than actual part quality.
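
A sketch of the trade-off, with placeholder quality metrics and weights:

  # Sparse vs. shaped reward (sketch). Metrics and weights are illustrative.
  def sparse_reward(done, part_passed_inspection):
      # Signal only at the end of the print: informative but slow to learn from.
      return (1.0 if part_passed_inspection else -1.0) if done else 0.0

  def shaped_reward(layer_defect_score, speed, done, part_passed_inspection):
      # Per-layer feedback speeds up learning, but poorly chosen weights can be
      # exploited, e.g. the agent may print very slowly to dodge defect penalties
      # if the speed term is weighted too low.
      r = -layer_defect_score + 0.01 * speed
      if done:
          r += 5.0 if part_passed_inspection else -5.0
      return r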
