10.2.3 Modern Classification Architectures
Learning Objectives
- Understand what problems several generations of mainstream image classification architectures are trying to solve
- Understand why residual connections changed deep network training
- Understand why efficiency-focused architectures matter
- Build basic judgment for architecture selection
First, Build a Map
If you have just finished the section on data augmentation, this is the natural next step:
- The previous section focused on “how to feed the same image to the model in a more stable way”
- This section starts to solve “how the model backbone itself should be designed to be stronger, more stable, and more efficient”
So this section is not about memorizing architecture names. Image classification has two halves, and this section fills in the second:
- How data is prepared
- How the network is built in a reasonable way
For beginners, the best way to understand modern classification architectures is not to “look at a list of names,” but to first see what problems the evolution of architectures is answering:
```mermaid
flowchart LR
    A["VGG: make networks deeper first"] --> B["ResNet: make deep networks trainable"]
    B --> C["EfficientNet: start caring about efficiency"]
    C --> D["ConvNeXt: reorganize the convolutional path"]
```
So the questions this section really addresses are:
- Why image classification networks keep evolving
- What kinds of bottlenecks different architectures are trying to fix
A Better Analogy for Beginners
You can understand the evolution of classification architectures as:
- A factory assembly line being upgraded again and again
Each upgrade is not about making the machine look fancier, but about answering very practical questions:
- Can the line be made longer?
- Will the machines become unstable as they run?
- Can we produce more with the same electricity bill?
Why Do Image Classification Architectures Keep Evolving?
Because “deeper” does not automatically mean “better”
In early networks, once they became deeper, they often ran into:
- Hard-to-propagate gradients
- Optimization difficulties
- Unstable training
So later evolution is essentially answering two questions
- How can deep networks be trained better?
- How can performance and efficiency be balanced?
An Analogy
Architecture evolution is like continually upgrading an assembly line:
- Not to make the machine more ornate
- But to keep it working reliably at larger production scales
When Learning This Section for the First Time, What Should You Focus On?
What matters most is not a model’s release year or leaderboard rank, but this one idea:
The essence of architecture evolution is solving three problems: how to train deeper models, how to make them stronger at lower cost, and how to keep training stable.
Once that idea is clear, whenever you see a new architecture, you’ll naturally ask:
- What bottleneck is it mainly addressing?
- Is it solving depth, stability, or efficiency?
What Did Different Generations of Architectures Focus On?
VGG: First Make “Going Deeper” Work
Features:
- Regular structure
- Small 3×3 convolution kernels used throughout
- Deeper networks
Its significance is:
- It showed that a deeper network built from one simple, repeated pattern can be significantly more accurate
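One way to see why VGG leans on stacked small convolutions is to count weights. The sketch below compares stacks of 3×3 convolutions with a single larger kernel that covers the same receptive field. It assumes stride 1, equal input and output channel counts, and no bias terms; the helper names and the channel count of 64 are illustrative choices, not something this section specifies.

```python
def conv_weights(kernel_size, channels):
    """Weights in one conv layer with `channels` inputs and outputs (no bias)."""
    return kernel_size * kernel_size * channels * channels


def stacked_receptive_field(num_layers, kernel_size=3):
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


channels = 64  # illustrative channel count (an assumption)

for num_layers in (2, 3):
    field = stacked_receptive_field(num_layers)
    stacked = num_layers * conv_weights(3, channels)
    single = conv_weights(field, channels)
    print(f"{num_layers} x 3x3 -> receptive field {field}x{field}: "
          f"{stacked} weights vs {single} for one {field}x{field} layer")
```

Under these assumptions the stacked version needs fewer weights for the same receptive field, and it also places a nonlinearity between the layers, so it is deeper as well as cheaper.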
ResNet: Make Very Deep Networks Actually Trainable
The core intuition of residual connections is:
- Each layer does not need to learn a completely new transformation
- Instead, it learns an increment on top of the existing representation
This greatly improves training stability in deep networks.
Why Did ResNet Become One of the Most Important Turning Points in Image Classification?
Because it systematically addressed a very important tension:
- Networks want to become deeper
- But deeper networks are hard to train
The significance of ResNet is not just “the score got better.” It brought together two things that had been pulling against each other:
- Deeper networks
- Trainability
EfficientNet: Start Taking Compute Efficiency Seriously
It does not only ask, “Can it be stronger?” It also asks:
- How can we get better value with the same budget?
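EfficientNet's answer to this question is usually described as compound scaling: depth, width, and input resolution grow together under a single coefficient. The sketch below is a rough illustration only. It assumes the base multipliers reported in the EfficientNet paper (1.2 for depth, 1.1 for width, 1.15 for resolution), and the function name and the cost estimate (compute roughly proportional to depth × width² × resolution²) are simplifications introduced here, not any library's implementation.

```python
# Base multipliers as reported in the EfficientNet paper (assumed here).
ALPHA = 1.2   # depth multiplier base
BETA = 1.1    # width multiplier base
GAMMA = 1.15  # resolution multiplier base


def compound_scale(phi):
    """Return (depth, width, resolution, approx_flops) multipliers for coefficient phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # Convolution cost grows roughly with depth * width^2 * resolution^2.
    approx_flops = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, approx_flops


for phi in range(4):
    depth, width, resolution, flops = compound_scale(phi)
    print(f"phi={phi}: depth x{depth:.2f}, width x{width:.2f}, "
          f"resolution x{resolution:.2f}, approx FLOPs x{flops:.2f}")
```

With these values each step of `phi` multiplies the compute estimate by about 1.92, so the coefficient works like a budget dial: you decide how much more compute you can afford and the three dimensions grow in balance.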
ConvNeXt: Re-examining the Convolutional Family
After Transformer-based vision models rose to prominence, the convolutional path was also re-examined and modernized.
This shows:
- Architecture evolution is not a one-way replacement
A More Beginner-Friendly Architecture Comparison Table
| Architecture | Core idea | Intuition it builds |
|---|---|---|
| VGG | Deep, regular, easy to understand | Why deeper networks can be stronger |
| ResNet | Residual connections | How deep networks train more stably |
| EfficientNet | Performance and efficiency together | Why accuracy is not the only thing that matters |
| ConvNeXt | Convolutions can still be modernized | Architecture is not a simple new-vs-old binary |
Where Do Beginners Most Easily Go Wrong When Learning Architecture Evolution?
The easiest mistake is to reduce it to:
- A bunch of model names
- A bunch of layer counts
- A bunch of leaderboard conclusions
But a more valuable way to learn is:
- First ask about the motivation
- Then ask about the structural changes
- Finally ask whether it is worth using as a baseline in a project
If You Put Them Into “Project Selection,” How Should You Understand Them?
A more practical way to remember them is:
- VGG: more like a classic teaching starting point, good for building a sense of “depth and structure”
- ResNet: the most reliable engineering baseline; in many projects, people still reach for it first
- EfficientNet: especially valuable when you start caring about “more value for the same resources”
- ConvNeXt: more suitable when you want to understand how “the convolutional family can still be modernized”
A Practical Architecture Selection Table for Beginners
| Your goal | Safer first choice |
|---|---|
| Your first image classification project | ResNet |
| You want to understand why deep networks can be trained stably | ResNet |
| You want both performance and efficiency | EfficientNet |
| You want to study the evolution of visual architectures | VGG -> ResNet -> ConvNeXt |
This table is useful for beginners because it turns “model names” back into “when should I think of this first?”

Build Intuition with a Minimal Residual Example First
```python
def block_without_residual(x):
    # A plain block: the output is only the transformed value.
    transformed = x * 0.6 + 0.2
    return transformed


def block_with_residual(x):
    # A residual block: the transformation is added to the unchanged input.
    transformed = x * 0.6 + 0.2
    return x + transformed


x = 1.0
print("without residual:", block_without_residual(x))
print("with residual   :", block_with_residual(x))
```
Expected output:
```
without residual: 0.8
with residual   : 1.8
```
What Is This Example Trying to Show?
You can first understand residual connections as:
- Not completely replacing old information
- But adding a new correction on top of the old information
This is very important for training deep networks.
Why Does This Relate to “Deeper but More Stable”?
Because when the network is very deep, fully rewriting representations is harder to learn than “gradually refining” them layer by layer.
When Seeing Residual Connections for the First Time, the Most Important Thing to Remember Is Not the Formula, but “Keeping the Original Path”
You can first think of a residual block as:
- The new branch learning corrections
- The original path preserving existing information
This makes it easier to understand why it helps deep networks:
- Information does not need to be forcibly rewritten at every layer
- Gradients can also flow back more easily
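The point about gradients can be made concrete with a toy calculation. The sketch below assumes each block is a scalar linear map with a small hypothetical weight `w`; real networks use matrices and nonlinearities, so this only illustrates the trend. Without a skip path, the derivative of the output with respect to the input is a product of `w` factors. With a skip path, each factor becomes `1 + w`.

```python
def gradient_through_plain_stack(w, num_blocks):
    """d(output)/d(input) for num_blocks blocks of y = w * x."""
    grad = 1.0
    for _ in range(num_blocks):
        grad *= w          # each block multiplies the gradient by w
    return grad


def gradient_through_residual_stack(w, num_blocks):
    """d(output)/d(input) for num_blocks blocks of y = x + w * x."""
    grad = 1.0
    for _ in range(num_blocks):
        grad *= 1.0 + w    # the identity path contributes the constant 1
    return grad


w = 0.05  # hypothetical small weight, chosen only for illustration
for num_blocks in (5, 20, 50):
    plain = gradient_through_plain_stack(w, num_blocks)
    residual = gradient_through_residual_stack(w, num_blocks)
    print(f"{num_blocks:2d} blocks: plain={plain:.3e}  residual={residual:.3e}")
```

In this toy setting the plain gradient collapses toward zero as depth grows, while the residual gradient never drops below 1 for a non-negative `w`. The residual gradient can also grow, which is one reason practical residual networks pair skip connections with normalization and careful initialization.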
What Should Beginners Remember First When Learning This Section?
The most important things to remember are:
- Architecture evolution is not “new models endlessly replacing old ones”
- Many improvements are about training stability and efficiency
- ResNet matters not only because it is strong, but because it made “deeper networks can still be trained” a reality
How Should You Choose Modern Classification Architectures?
If You Are a Beginner Building a Baseline
Prioritize:
- Classic strong baselines such as ResNet
If You Are Very Sensitive to Resources
You should pay more attention to:
- EfficientNet
- Lighter-weight convolutional architectures
If You Are Doing Research or Strong Performance Comparisons
Only then is it worth systematically comparing different families.
When Doing Your First Image Classification Project, How Can You Make a Safer Choice?
A stable sequence is usually:
- Use ResNet first to establish a strong baseline
- If resources are truly limited, then look at efficiency-focused approaches such as EfficientNet
- If you really want a deeper comparison, then study more families
This is usually better than chasing the trendiest architecture from the start.
A Practical Selection Table for Beginners
| Your scenario | Safer first choice |
|---|---|
| Your first image classification project | ResNet |
| Device resources are limited | EfficientNet or a lightweight convolutional network |
| You want to understand the evolution of convolutional systems | VGG -> ResNet -> ConvNeXt |
| You want to do a more serious architecture comparison | Systematically compare different families |
The Safest Default Sequence When Putting Architectures into a Project for the First Time
A safer sequence is usually:
- Use ResNet to establish a strong baseline
- Check whether the data and training pipeline are already stable
- If resources are truly limited, switch to an efficiency-focused architecture
- Finally, do comparisons across architecture families
This makes it easier to see where the gains actually come from than chasing the “most advanced backbone” right away.
The Most Common Misconceptions
Misconception 1: Memorizing names without remembering the problem
What matters more is knowing:
- What bottleneck is it actually solving?
Misconception 2: The newest architecture is definitely the best fit for the current project
In reality, a mature strong baseline is often more reliable.
Misconception 3: A deeper network is always stronger
Without a good optimization structure, depth easily becomes a burden.
A Very Useful Reflection Question
After learning any architecture, ask yourself:
- What problem is it mainly trying to solve?
- Is it changing “representational power,” or “training stability / efficiency”?
- If I put it into a real project, why would I choose it?
If you can answer these three questions clearly, then this lesson is no longer just a list of model names.
If You Turn This Into a Project or Notes, What Is Most Worth Showing?
What is usually most worth showing is not:
- A string of model leaderboards
But rather:
- Why you chose ResNet first as a baseline
- Why you considered switching to a more efficiency-focused architecture
- Whether each architecture is solving depth, stability, or efficiency
That way, others can more easily see:
- You understand the logic of architecture selection
- Not just the model names
Evidence to Keep
Keep this page’s proof of learning as a small evidence card:
- Dataset Split
  - train/test images, class names, and class balance
- Prediction
  - label, confidence, and at least one misclassified image
- Metric
  - accuracy, F1, confusion matrix, and class-level errors
- Failure Check
  - augmentation changes label meaning, class imbalance, leakage, or overfitting
- Expected Output
  - model result table and saved error examples
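The metric and prediction entries on this card can be computed with very little code. The sketch below is a minimal pure-Python illustration; the function names and the six labels are hypothetical and made up only to exercise the calculations, so in a real project the lists would come from your own test split and model predictions.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (true_label, predicted_label) pairs."""
    return Counter(zip(y_true, y_pred))


def accuracy(y_true, y_pred):
    """Fraction of samples whose prediction matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_for_class(y_true, y_pred, label):
    """F1 score for one class, treating it as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels, made up purely to exercise the functions.
y_true = ["cat", "cat", "dog", "dog", "bird", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "bird", "cat"]

print("accuracy:", round(accuracy(y_true, y_pred), 3))
for label in sorted(set(y_true)):
    print(f"F1[{label}]:", round(f1_for_class(y_true, y_pred, label), 3))
print("confusion:", dict(confusion_counts(y_true, y_pred)))
# Indices of misclassified samples, useful for saving error examples.
print("errors at:", [i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p])
```

Keeping the misclassified indices next to the aggregate numbers makes it easy to pull up the actual images later, which is usually where class-level errors become understandable.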
Summary
The most important thing in this section is to build an architecture evolution perspective:
The development of image classification architectures is essentially a continuous effort to solve three problems: how to train deeper models, how to make them stronger at lower cost, and how to keep training stable.
As long as you keep this perspective, when you see a new model later, you will not be left with only its name.
What Should You Take Away from This Section?
- Behind architecture names are bottlenecks and design motivations
- ResNet is the architecture you most need to truly understand when first learning visual classification architectures
- In projects, a stable baseline is often more important than blindly chasing the newest model
If we compress it into one sentence, it would be:
The most important thing about modern classification architectures is not “which is newer,” but which one more clearly solves the two real-world problems of training deep networks and improving efficiency.
Exercises
- Explain in your own words: why do residual connections make deep networks easier to train?
- Why is EfficientNet more like a “budget optimization” idea?
- If you were building a resource-constrained mobile classifier, which type of architecture would you prioritize?
- Think about it: why should architecture selection not rely only on leaderboards?
Solution approach and explanation
- Residual connections make it easier for a block to learn an identity mapping and give gradients a shorter path backward, so very deep networks become easier to optimize.
- EfficientNet is a budget optimization idea because it scales depth, width, and input resolution together instead of increasing only one dimension blindly.
- For a mobile classifier, prioritize lightweight pretrained models such as MobileNet-style or EfficientNet-style architectures, then verify latency and memory on the target device.
- Leaderboards do not capture your dataset size, latency target, hardware, preprocessing cost, maintainability, or failure modes. Architecture choice should be tested against the actual project constraints.
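The advice to verify latency against the actual project constraints can be sketched as a small timing harness. This is a minimal illustration under stated assumptions: `measure_latency_ms` and `dummy_predict` are hypothetical names, and the dummy function only stands in for a real model's forward pass. Numbers measured on a development machine do not transfer to a phone or embedded board, so the same measurement would need to run on the target device.

```python
import statistics
import time


def measure_latency_ms(predict, sample, warmup=3, runs=20):
    """Median wall-clock time of predict(sample) in milliseconds."""
    for _ in range(warmup):
        predict(sample)  # warm-up calls are not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def dummy_predict(pixels):
    """Hypothetical stand-in for a real model's forward pass."""
    return sum(pixels) / len(pixels)


sample = [0.5] * 10_000  # placeholder input, not a real image
latency = measure_latency_ms(dummy_predict, sample)
print(f"median latency: {latency:.3f} ms")
```

The median is used instead of the mean so that one slow run caused by unrelated system activity does not dominate the result.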