Reinforcement Learning
Reinforcement Learning (RL) is a machine learning paradigm where an agent learns to make decisions by interacting with an environment. The agent takes actions, receives feedback in the form of rewards or penalties, and learns a policy that maximizes cumulative reward over time.
Unlike supervised learning (which requires labeled examples), RL learns from trial and error. This makes it well suited to additive manufacturing applications where optimal parameters must be discovered through experimentation, such as real-time process control and parameter optimization.
Core Concept
        ┌─────────────────────────────────┐
        │           ENVIRONMENT           │
        │    (3D Printer / Simulation)   │
        └───────────┬─────────────┬───────┘
                    │             │
        State (sₜ)  │             │  Reward (rₜ)
    (temp, speed,…) │             │  (part quality)
                    ▼             │
              ┌───────────┐       │
              │   AGENT   │◀──────┘
              │ (Policy)  │
              └─────┬─────┘
                    │
                    ▼  Action (aₜ)
                       (adjust temp, speed)
Key Components
| Component | Definition | AM Example |
|---|---|---|
| Agent | The learner/decision-maker | Control algorithm for printer |
| Environment | Everything the agent interacts with | 3D printer + material + sensors |
| State (s) | Current situation observed by agent | Temperature, layer height, defect count |
| Action (a) | Decision made by agent | Increase speed, lower temperature |
| Reward (r) | Feedback signal (scalar) | +1 for good layer, -10 for defect |
| Policy (π) | Strategy mapping states to actions | π(s) = "if defect detected, reduce speed" |
Algorithms
Q-Learning
Learns a Q-function Q(s,a) that estimates the expected cumulative reward for taking action a in state s and acting optimally afterward. Updated with a temporal-difference step toward the Bellman target. Best suited to small, discrete state/action spaces.
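A minimal tabular sketch of this update; the state/action space sizes and the hyperparameters below are illustrative placeholders, not values from any particular AM process:

```python
import numpy as np

n_states, n_actions = 10, 3             # illustrative sizes
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # learning rate, discount, exploration

def q_update(s, a, r, s_next):
    """One TD step toward the Bellman target r + gamma * max_a' Q(s', a')."""
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

def choose_action(s):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(Q[s].argmax())
```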
Deep Q-Network (DQN)
Uses a neural network to approximate the Q-function, enabling RL on high-dimensional states (like images). Introduced experience replay and target networks for stability.
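A sketch of the core DQN loss in PyTorch, assuming minibatches of (s, a, r, s′, done) tuples sampled from a replay buffer; the network sizes are arbitrary placeholders:

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))       # online network
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # frozen copy
target_net.load_state_dict(q_net.state_dict())                             # periodically re-synced
gamma = 0.99

def dqn_loss(s, a, r, s_next, done):
    """TD error against the target network (a: long tensor, done: float tensor)."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a) for actions taken
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values      # max_a' Q_target(s', a')
        target = r + gamma * (1 - done) * q_next
    return nn.functional.mse_loss(q_sa, target)
```

Freezing the target network between syncs keeps the regression target from chasing the network being trained, which is the main stabilizing trick.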
Policy Gradient (REINFORCE)
Directly optimizes the policy parameters by gradient ascent on the expected return. Naturally handles continuous action spaces. Often combined with a value-function baseline to reduce gradient variance.
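A compact sketch of the REINFORCE loss over one episode; the optional baseline argument stands in for any value-function estimate:

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    """G_t = r_t + gamma * G_{t+1}, computed backwards over one episode."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return torch.tensor(list(reversed(out)))

def reinforce_loss(log_probs, returns, baseline=None):
    """Maximize E[log pi(a_t|s_t) * G_t]; minimizing the negative does the same."""
    advantages = returns - baseline if baseline is not None else returns
    return -(log_probs * advantages.detach()).sum()
```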
Actor-Critic (A2C, PPO)
Combines policy gradient (actor) with value function (critic). PPO (Proximal Policy Optimization) is widely used for its stability and performance. Good for robotics and control.
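A sketch of PPO's clipped surrogate objective, the piece that distinguishes it from vanilla policy gradient; log-probabilities and advantages are assumed precomputed:

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """PPO clipped surrogate: keep the policy ratio within [1 - eps, 1 + eps]."""
    ratio = torch.exp(new_log_probs - old_log_probs)   # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()       # pessimistic bound on improvement
```

Taking the minimum of the clipped and unclipped terms removes the incentive to push the policy far from the one that collected the data, which is where PPO's stability comes from.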
Applications in Additive Manufacturing
RL agents can learn to adjust print parameters (temperature, speed, flow rate) in real-time based on sensor feedback. When defects are detected, the agent learns corrective actions that minimize scrap while maintaining throughput.
Instead of exhaustive grid search or ad-hoc manual tuning, RL explores the parameter space adaptively. The agent learns which parameter combinations produce the best parts in a minimal number of experiments, which is crucial when materials are expensive.
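As a toy illustration, parameter selection can be framed as a multi-armed bandit, where each candidate parameter set is an arm and the reward is a scalar quality score; the parameter values and the `quality` signal below are hypothetical:

```python
import numpy as np

param_sets = [(200, 40), (210, 50), (220, 60)]   # hypothetical (temp °C, speed mm/s) arms
counts = np.zeros(len(param_sets))
values = np.zeros(len(param_sets))               # running mean quality per arm

def select_arm(epsilon=0.1):
    """Epsilon-greedy over parameter sets: mostly exploit the best-so-far arm."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(param_sets))
    return int(values.argmax())

def update(arm, quality):
    """Incremental mean update after measuring the printed part's quality."""
    counts[arm] += 1
    values[arm] += (quality - values[arm]) / counts[arm]
```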
The DTU TopOpt group developed SOgym, an RL environment for topology optimization. An RL agent learns to place material iteratively, potentially discovering solutions faster than traditional SIMP methods.
Typical RL Setup for AM
- State: Current layer, temperature readings, previous defect history, material properties
- Actions: Discrete (increase/decrease speed) or continuous (set speed to X mm/s)
- Reward: Part quality metrics, +bonus for completion, -penalty for defects/time
- Episode: One complete print job
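A minimal skeleton of such an environment in the Gymnasium API; the observation fields, toy dynamics, defect model, and reward constants are illustrative assumptions, not a real printer interface:

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class PrintEnv(gym.Env):
    """Toy AM control environment: one episode = one print job."""

    def __init__(self, n_layers=100):
        super().__init__()
        self.n_layers = n_layers
        # Observation: [fraction printed, print speed, defect count]
        self.observation_space = spaces.Box(0.0, np.inf, shape=(3,), dtype=np.float32)
        # Actions: 0 = decrease speed, 1 = hold, 2 = increase speed
        self.action_space = spaces.Discrete(3)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.layer, self.speed, self.defects = 0, 50.0, 0
        return self._obs(), {}

    def step(self, action):
        self.speed += (action - 1) * 5.0                       # adjust speed in mm/s
        # Placeholder defect model: faster printing raises defect risk.
        p_defect = 0.02 + 0.002 * max(0.0, self.speed - 50.0)
        defect = self.np_random.random() < p_defect
        self.defects += defect
        self.layer += 1
        reward = -10.0 if defect else 1.0                      # per-layer quality signal
        terminated = self.layer >= self.n_layers
        if terminated:
            reward += 50.0                                     # completion bonus
        return self._obs(), reward, terminated, False, {}

    def _obs(self):
        return np.array([self.layer / self.n_layers, self.speed, self.defects],
                        dtype=np.float32)
```

Any discrete-action algorithm above (tabular Q-learning, DQN, PPO) can then interact with this loop through `env.reset()` and `env.step(action)`.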
Challenges
Sample Efficiency
RL often requires millions of environment interactions to learn. In AM, each "experiment" takes hours and costs material. Mitigations: training in simulation (digital twins), transfer learning, model-based RL.
Sim-to-Real Gap
Policies learned in simulation may not transfer perfectly to real printers. Domain randomization and careful simulator calibration help bridge this gap.
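Domain randomization can be as simple as resampling nuisance simulator parameters at every episode reset, so the policy cannot overfit to one simulator configuration; the parameters and ranges below are illustrative guesses, not calibrated values:

```python
import numpy as np

rng = np.random.default_rng()

def randomized_sim_params():
    """Resample physics parameters per episode (ranges are illustrative)."""
    return {
        "ambient_temp": rng.uniform(18.0, 30.0),      # °C
        "material_viscosity": rng.uniform(0.8, 1.2),  # relative to nominal
        "sensor_noise_std": rng.uniform(0.0, 0.5),    # added to observations
    }
```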
Reward Design
Defining the right reward function is critical. Too sparse (reward only at the end of a print) makes learning slow. Too dense (a reward for every action) invites reward hacking, where the agent optimizes the proxy signal rather than actual part quality.
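A sketch of a shaped per-step reward along these lines; the weights are assumptions that would need tuning per process:

```python
def shaped_reward(layer_ok, defect, print_done, layer_time):
    """Illustrative per-layer reward balancing quality, throughput, completion."""
    r = 1.0 if layer_ok else 0.0   # small positive signal for a clean layer
    if defect:
        r -= 10.0                  # strong penalty for a detected defect
    r -= 0.01 * layer_time         # mild time penalty keeps throughput in view
    if print_done:
        r += 50.0                  # bonus for finishing the job
    return r
```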
See Also
- Machine Learning — Overview of ML concepts
- CNN — Often used with RL for visual input
- Process Optimization — RL for AM control
References
- Sutton, R.S. & Barto, A.G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. (Available free online.)
- Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529-533.
- Schulman, J., et al. (2017). Proximal policy optimization algorithms. arXiv:1707.06347.
- Dittmer, S., et al. (2023). SOgym: A reinforcement learning environment for topology optimization. TU Denmark.