CS 5187

City University of Hong Kong

Department of Computer Science

Vision and Image

Semester A, 2025/26

This course introduces algorithms in computer vision and image processing, giving students the basic knowledge needed to explain how computers can understand the visual world. The course describes visual understanding from the perspective of low-level image processing, mid-level statistical inference, and high-level vision recognition. The topics include feature extraction, image segmentation, object recognition, motion analysis, and scene understanding, along with real-world applications to which vision algorithms have been successfully applied.

Course Prerequisites

CS3334 Data Structures or CS4335 Design and Analysis of Algorithms, or equivalent

Textbook

There is no required textbook for the course. The following books serve as references for the course.

Richard Szeliski, "Computer Vision: Algorithms and Applications," 2nd ed., Springer, 2022.
Rafael C. Gonzalez and Richard E. Woods, "Digital Image Processing," 3rd ed., Prentice Hall, August 2007. ISBN: 013168728X.
Rafael C. Gonzalez and Richard E. Woods, "Digital Image Processing," 4th ed., Pearson, 2018. ISBN-13: 9780133356724.
George Siogkas, "Visual Media Processing Using Matlab Beginner's Guide," Packt Publishing, 2013. ISBN-10: 1849697205 | ISBN-13: 978-1849697200.
Oge Marques, "Practical Image and Video Processing Using MATLAB," Wiley, New York, NY, 2011. ISBN-10: 0470048158 | ISBN-13: 978-0470048153.
Rafael C. Gonzalez, Richard E. Woods, and S. L. Eddins, "Digital Image Processing Using MATLAB," Prentice Hall, 2004. ISBN: 0130085197.
Anil K. Jain, "Fundamentals of Digital Image Processing," Prentice Hall, Englewood Cliffs, NJ, 1989.
Y. Wang, J. Ostermann, and Y.-Q. Zhang, "Video Processing and Communications," 1st ed., Prentice Hall, 2002. ISBN: 0130175471.
D. Taubman and M. Marcellin, "JPEG2000: Image Compression Fundamentals, Standards, and Practice," Kluwer, 2001. ISBN: 079237519X.
David A. Forsyth and Jean Ponce, "Computer Vision: A Modern Approach," 1st ed., Prentice Hall, August 14, 2002. ISBN: 0130851981.
Richard Hartley and Andrew Zisserman, "Multiple View Geometry in Computer Vision," 2nd ed., Cambridge University Press, March 25, 2004 (paperback, 672 pages). ISBN: 0521540518.
Yi Ma, Stefano Soatto, Jana Kosecka, and S. Shankar Sastry, "An Invitation to 3-D Vision," Springer-Verlag, November 14, 2003 (hardcover, 526 pages). ISBN: 0387008934.
A. Ardeshir Goshtasby, "2-D and 3-D Image Registration," Wiley, April 2005. [ebook on NetLibrary]
John W. Woods, "Multidimensional Signal, Image, and Video Processing and Coding," Academic Press, March 13, 2006. ISBN-10: 0120885166 | ISBN-13: 978-0120885169.
Linda G. Shapiro and George C. Stockman, "Computer Vision," Prentice Hall, Upper Saddle River, NJ, 2001. ISBN: 0-13-030796-3.
Emanuele Trucco and Alessandro Verri, "Introductory Techniques for 3-D Computer Vision," Prentice Hall, Upper Saddle River, NJ, 1998. ISBN: 0-13-261108-2.
Iain E. G. Richardson, "H.264 and MPEG-4 Video Compression," John Wiley & Sons, September 2003. ISBN: 0-470-84837-5.
M. E. Al-Mualla, C. N. Canagarajah, and D. R. Bull, "Video Coding for Mobile Communications: Efficiency, Complexity and Resilience," Academic Press (Elsevier Science), 2002. ISBN: 0120530791.
A. Gersho and R. Gray, "Vector Quantization and Signal Compression," Kluwer Academic Publishers, Boston, 1992.

Instructor:

Dr. Dapeng Wu
Office: Y6321, AC-1 Building
Email: dapengwu@cityu.edu.hk

TA:

1) Siyuan Guo
Email: siyuanguo7-c@my.cityu.edu.hk

2) Hong Huang
Email: hohuang-c@my.cityu.edu.hk

3) Yongcan Luo
Email: yongcaluo2-c@my.cityu.edu.hk

4) Hongming Piao
Email: hpiao6-c@my.cityu.edu.hk

5) Tianli Shi
Email: tianlishi2-c@my.cityu.edu.hk

6) Zixuan Tang
Email: zixuatang6-c@my.cityu.edu.hk

7) Ye Tao
Email: yetao34-c@my.cityu.edu.hk

8) Hao Wang
Email: hwang728-c@my.cityu.edu.hk

9) Shuguang Wang
Email: sgwang6-c@my.cityu.edu.hk

10) Yun Wang
Email: ywang3875-c@my.cityu.edu.hk

11) Renwei Yang
Email: renweyang2-c@my.cityu.edu.hk

12) Jiaxun Ye
Email: jiaxunye-c@my.cityu.edu.hk

13) Jiahao Zheng
Email: jhzheng4-c@my.cityu.edu.hk

Course website: https://www.cs.cityu.edu.hk/~dapengwu/courses/CS5187f25

Meeting Time for Lectures

Friday, 7 pm - 8:50 pm

Meeting Time for Tutorials

Friday, 9 pm - 9:50 pm

Meeting Room for Lectures and Tutorials

Room 3505, AC-2 Building

Course Policies

During lectures, cell phones should be set to silent mode.
Late submissions of homework solutions and project reports are not accepted unless the instructor has granted permission in advance.

Grading:

Component	Weight	Due Date
Homework assignments	30%	To be announced
Project	20%	4 pm, Dec. 5
Final exam	50%	Dec. 8-20
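As an illustration of how the weights above combine, the overall course score can be sketched as a weighted average of the three components. This is a minimal sketch under stated assumptions: each component is scored on a 0-100 scale and combined linearly; the function name `course_score`, the dictionary keys, and the sample scores are hypothetical and do not describe the instructor's actual grading procedure.

```python
# Hypothetical sketch: weights copied from the grading table (30% / 20% / 50%).
# Assumes every component score is on a 0-100 scale and is combined linearly.
WEIGHTS = {"homework": 0.30, "project": 0.20, "final_exam": 0.50}


def course_score(scores: dict[str, float]) -> float:
    """Return the weighted course score (0-100) from component scores (0-100)."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    # Weighted sum: 0.30 * homework + 0.20 * project + 0.50 * final_exam
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Illustrative, made-up scores (not real data).
print(course_score({"homework": 90, "project": 80, "final_exam": 70}))
```

With these made-up scores the weighted sum is 0.30 × 90 + 0.20 × 80 + 0.50 × 70 = 27 + 16 + 35 = 78. Because the final exam carries half of the weight, a change of one point on the exam moves the overall score by half a point.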

Class Project:

The class project is done individually. Each student writes a report documenting his/her research, critical comparison and analysis, and new ideas. For details about the project, please read here.

Suggested topics for projects are listed here.

Related courses in other schools:

George Mason University, Computer Vision

Johns Hopkins University, Image Compression and Packet Video

Polytechnic University, Video Processing

Purdue University, Digital Video Systems

Stanford University, Digital Video Processing

University of California, Berkeley, Multimedia Signal Processing, Communications and Networking

University of Maryland, College Park, Digital Image Processing

University of Maryland, College Park, Multimedia Communication & Information Security: A Signal Processing Perspective

Useful links

Anaconda: Anaconda is the leading open data science platform powered by Python.
Theano: Theano is a Python library that lets you define, optimize, and evaluate mathematical expressions, especially ones with multi-dimensional arrays (numpy.ndarray).
TensorFlow: TensorFlow is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API.
Keras: Keras is a minimalist, highly modular neural networks library, written in Python and capable of running on top of either TensorFlow or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research.
PyTorch: PyTorch is a deep learning framework for fast, flexible experimentation.
A curated list of resources dedicated to recurrent neural networks
Source code in Python for handwritten digit recognition, using deep neural networks: [another link]
Source code in PyTorch for handwritten digit recognition, using deep neural networks
Source code in Python for TF-mRNN: a TensorFlow library for image captioning
Source code in Python for the following work on image captioning:
- Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan, Show and Tell: A Neural Image Caption Generator, CVPR 2015
  - Implementation
Image captioning:
- Zhe Gan et al., Semantic Compositional Networks for Visual Captioning, CVPR 2017
  - Implementation: source code in Python (Theano)
- Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering: source code (Caffe) and source code (PyTorch)
Microsoft COCO datasets
Visual Question Answering:
- Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering: source code (Caffe) and VQA source code (PyTorch)
Semantic Propositional Image Caption Evaluation (SPICE)
- Source code in Java to calculate SPICE
Region-based Convolutional Neural Networks (R-CNN)
- References:
  - Ren, Shaoqing, Kaiming He, Ross Girshick, and Jian Sun. "Faster R-CNN: Towards real-time object detection with region proposal networks." In Advances in neural information processing systems, pp. 91-99. 2015. [pdf]
  - Dai, Jifeng, Yi Li, Kaiming He, and Jian Sun. "R-FCN: Object detection via region-based fully convolutional networks." In Advances in neural information processing systems, pp. 379-387. 2016. [pdf] [source code]
  - Huang, Jonathan, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer et al. "Speed/accuracy trade-offs for modern convolutional object detectors." arXiv preprint arXiv:1611.10012 (2016). [pdf] (E.g., for Inception V3, extract features from the “Mixed 6e” layer whose stride size is 16 pixels. Feature maps are cropped and resized to 17x17.)
- Source code:
  - A Faster Pytorch Implementation of Faster R-CNN (PyTorch)
  - Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering: source code (Caffe)
Source code in Python for end-to-end training of LSTM
- Implementation
Bidirectional Encoder Representations from Transformers (BERT)
- Implementation in TensorFlow
- Implementation in PyTorch
Source code in Python for sequence-to-sequence learning (language translation, chatbot)
- TensorFlow seq2seq library
- Implementation 1 in TensorFlow with separable encoder and decoder
- Implementation 2 in Keras
AI City Challenge
Visual Storytelling Dataset (VIST)
- Visual storytelling algorithms:
  - No Metrics Are Perfect: Adversarial REward Learning for Visual Storytelling: source code (TensorFlow)
Visual Genome is a dataset, a knowledge base, an ongoing effort to connect structured image concepts to language.
MPII Movie & Description dataset for automatic video description, video summary, video storytelling
Bidirectional recurrent neural networks (B-RNN):
- Graves, Alex, Navdeep Jaitly, and Abdel-rahman Mohamed. "Hybrid speech recognition with deep bidirectional LSTM." IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2013. [pdf]
Deep reinforcement learning
- UCL Course on reinforcement learning: [ppt] [video]
- References:
  - Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. "Playing Atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
  - Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves et al. "Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529-533. [source code]
  - How to Study Reinforcement Learning
- Source code:
  - Implementation of Reinforcement Learning Algorithms. Python, OpenAI Gym, TensorFlow. Exercises and solutions to accompany Sutton's book and David Silver's course. [link]
Generative Adversarial Network (GAN)
- References:
  - Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. "Generative adversarial nets." In Advances in neural information processing systems, pp. 2672-2680. 2014.
  - Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional generative adversarial networks." arXiv preprint arXiv:1511.06434 (2015).
  - Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein GAN." arXiv preprint arXiv:1701.07875 (2017).
- Types of GAN
- Source code:
  - A TensorFlow Implementation of "Deep Convolutional Generative Adversarial Networks": Python code
  - Collection of generative models (e.g., GAN and VAE) in PyTorch and TensorFlow: Python code
Sequential Generative Adversarial Network (GAN)
- References:
  - Yu, Lantao, Weinan Zhang, Jun Wang, and Yong Yu. "SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient." In AAAI, pp. 2852-2858. 2017.
  - Mogren, Olof. "C-RNN-GAN: Continuous recurrent neural networks with adversarial training." arXiv preprint arXiv:1611.09904 (2016).
  - Im, Daniel Jiwoong, Chris Dongjoo Kim, Hui Jiang, and Roland Memisevic. "Generating images with recurrent adversarial networks." arXiv preprint arXiv:1602.05110 (2016).
  - Press, Ofir, Amir Bar, Ben Bogin, Jonathan Berant, and Lior Wolf. "Language Generation with Recurrent Generative Adversarial Networks without Pre-training." arXiv preprint arXiv:1706.01399 (2017).
- Source code:
  - Implementation of C-RNN-GAN
  - TensorFlow implementation of GAN modeling for sequential data
Subjective evaluation for content aware video processing techniques
Cancer imaging archive: TCIA data are organized as "collections"; typically these are patient cohorts related by a common disease (e.g., lung cancer), image modality or type (MRI, CT, digital histopathology, etc.), or research focus.
MATLAB Tutorial
MATLAB Central
Matlab Primer, Matlab Manuals, Image Processing Toolbox
Matlab implementation of image/video compression algorithms
Introduction to Matrix Algebra (free book by Autar K Kaw, Professor, University of South Florida).
Matrix Reference Manual
HIPR2: WWW-based Image Processing Teaching Materials with Java
LIDAR
Learning by simulations
OpenCV
OpenGL
To record video with screen capture, download the following free program: http://www.nchsoftware.com/capture/index.html
SD and HD video sequences for evaluating the coding performance of video codecs: http://media.xiph.org/video/derf/
WebRTC: WebRTC is a free, open-source project that provides web browsers with Real-Time Communications (RTC) capabilities via simple JavaScript APIs.

The Missing Semester of Your CS Education

Standards:

H.264 tutorial
H.263
An MPEG-4 overview can be found at http://www.chiariglione.org/mpeg/standards/mpeg-4/mpeg-4.htm
JPEG XR
KTA (a precursor to H.265/HEVC)

ATSC (Advanced Television Systems Committee) & HDTV (High Definition Television):

ATSC.org

HDTV

SMPTE.org

MPEG (Moving Picture Experts Group):

MPEG.org

MPEG standards committee

MPEG TV

MP3

MPEG Audio Layer-3

Software:

Video codec
Virtual Dub: VirtualDub is a video capture/processing utility for 32-bit Windows platforms (95/98/ME/NT4/2000/XP), licensed under the GNU General Public License (GPL).
XnView: an efficient multimedia viewer, browser, and converter.
ImageJ: Read and write GIF, JPEG, and ASCII. Read BMP, DICOM, and FITS. [Open Source, Public Domain]
Open source for image processing tasks: http://octave.sourceforge.net/doc/image.html
Photosynth: you can access gigabytes of photos in seconds, view a scene from nearly any angle, find similar photos with a single click, and zoom in to make the smallest detail as big as your monitor.
- Refer to: http://labs.live.com/photosynth/
- A demo video: http://www.ted.com/index.php/talks/view/id/129
Video filtering and compression, by the Video Group, Moscow State University
MSU Lossless Video Codec, by the Video Group, Moscow State University

HSI color model

Compression link: http://cchen1.et.ntust.edu.tw/compression/compression.htm

JOURNALS

Elsevier

Computer Vision and Image Understanding
Digital Signal Processing: A Review Journal
Graphical Models and Image Processing
Journal of Visual Communication and Image Representation
Real-Time Imaging
Computers & Graphics
Data & Knowledge Engineering
Image and Vision Computing
Pattern Recognition
Pattern Recognition Letters
Signal Processing
Signal Processing: Image Communication

IEEE

IEEE Transactions on Circuits and Systems for Video Technology
IEEE Transactions on Multimedia
IEEE Transactions on Image Processing
IEEE Transactions on Medical Imaging
IEEE Transactions on PAMI

Kluwer

SPIE

Journal of Electronic Imaging

Digital Video and Multimedia Standards Pages

Digital TV and DVD

Overview of the AVI format

Signal Processing Information Base (SPIB)

Computer Vision

Computer Vision Homepage at CMU
Annotated Computer Vision Bibliography from USC IRIS
CVonline: The Evolving, Distributed, Non-Proprietary, On-Line Compendium of Computer Vision
3-D for Everyone
Red-blue glasses or anaglyph for 3D viewing: http://www.best3dglasses.com/anaglyph.html
Shutter glasses for 3D viewing: http://www.stereo3d.com/shutter.htm
3D cameras: http://www.ptgrey.com/index.asp
3D photos at http://www.jessemazer.com/3Dphotos.html
3D video sequences can be downloaded at: http://research.microsoft.com/vision/InteractiveVisualMediaGroup/3DVideoDownload/

Public Domain Image Databases

CMU Database

Patent licensing

As with MPEG-2 Parts 1 and 2 and MPEG-4 Part 2 amongst others, the vendors of H.264/AVC products and services are expected to pay patent licensing royalties for the patented technology that their products use. The primary source of licenses for patents applying to this standard is a private organization known as MPEG-LA, LLC (which is not affiliated in any way with the MPEG standardization organization, but which also administers patent pools for MPEG-2 Part 1 Systems, MPEG-2 Part 2 Video, MPEG-4 Part 2 Video, and other technologies).

To search patents, visit the free patent search site www.FreePatentsOnline.com.

Free books

Introduction to Matrix Algebra (free book by Autar K Kaw, Professor, University of South Florida).
Mathematics for Machine Learning
Top 13 (free) must-read machine learning books for beginners
100+ free machine learning books

Related courses in other institutions:

Stanford University CS221: Artificial Intelligence: Principles and Techniques: [video]
Stanford University CS224n: Natural Language Processing with Deep Learning: [video]
Stanford University CS229: Machine Learning: notes and videos can be found on this website
Stanford University CS230 Deep Learning: [ppt] [video]
Stanford University CS231n: Convolutional Neural Networks for Visual Recognition: [ppt] [video]
UCL Course on reinforcement learning: [ppt] [video]
RWTH Aachen University Implementation of Heuristic Algorithms for Board Games
Mila - Quebec AI Institute Introduction to Causal Inference
UC Berkeley Foundations of Deep Reinforcement Learning
DeepMind Reinforcement Learning Lecture Series
