City University of Hong Kong
Department of Computer Science
Vision and Image
Semester A, 2025/26
This is a 3-credit course.
This course introduces algorithms in computer vision and image processing to
give students the basic knowledge needed to explain how computers can
understand the visual world. The course describes visual understanding from the
perspective of low-level image processing, mid-level statistical inference, and
high-level visual recognition. Topics include feature extraction, image
segmentation, object recognition, motion analysis, and scene understanding, along
with real-world applications to which vision algorithms have been successfully
applied.
Course Prerequisites
- CS3334 Data Structures, or CS4335 Design and Analysis of Algorithms, or
equivalent
Textbook
There is no required textbook for the course. The following books serve as
references for the course.
- Richard Szeliski, "Computer Vision:
Algorithms and Applications," 2nd ed., Springer, 2022.
- Rafael C. Gonzalez, Richard E. Woods, "Digital Image Processing," 3rd Edition, Prentice
Hall; ISBN: 013168728X; August 2007.
- Rafael C. Gonzalez, Richard E. Woods, "Digital Image Processing," 4th Edition, Pearson;
ISBN-13: 9780133356724; 2018.
- George Siogkas, "Visual Media Processing Using Matlab Beginner's Guide,"
Packt Publishing, 2013. ISBN-10: 1849697205 | ISBN-13: 978-1849697200
- Oge Marques, “Practical Image and
Video Processing Using MATLAB,” Wiley, New York,
NY, 2011. ISBN-10: 0470048158 | ISBN-13: 978-0470048153
- Rafael C. Gonzalez, Richard E. Woods, and S. L. Eddins, "Digital
Image Processing Using MATLAB," Prentice Hall, 2004. ISBN
0130085197.
- Anil K. Jain, "Fundamentals of Digital Image
Processing," Englewood Cliffs, NJ: Prentice Hall, 1989.
- Y. Wang, J. Ostermann, and Y.-Q. Zhang, "Video
Processing and Communications," 1st ed., Prentice Hall, 2002.
ISBN: 0130175471.
- D. Taubman and M. Marcellin, "JPEG2000: Image Compression Fundamentals,
Standards, and Practice," Kluwer, 2001. ISBN: 079237519X.
- David A. Forsyth, Jean Ponce, "Computer
Vision: A Modern Approach," Prentice Hall; 1st edition (August 14, 2002),
ISBN: 0130851981.
- Richard Hartley, Andrew Zisserman, "Multiple
View Geometry in Computer Vision," Paperback: 672 pages; Publisher:
Cambridge University Press; 2nd edition (March 25, 2004) ISBN: 0521540518
- Yi Ma, Stefano Soatto, Jana Kosecka, S.
Shankar Sastry, "An Invitation to 3-D
Vision," Hardcover: 526 pages ; Publisher: Springer-Verlag; (November 14,
2003) ISBN: 0387008934
- A. Ardeshir Goshtasby, "2-D
and 3-D Image Registration," Wiley Press, April 2005. [ebook on
NetLibrary]
- John W. Woods, "Multidimensional Signal, Image, and Video Processing and
Coding," Academic Press; (March 13, 2006), ISBN-10: 0120885166, ISBN-13:
978-0120885169.
- Linda G. Shapiro and George C. Stockman, "Computer Vision," Prentice-Hall,
Inc., Upper Saddle River, New Jersey, 2001 (ISBN 0-13-030796-3).
- Emanuele Trucco and Alessandro Verri, "Introductory Techniques for 3-D
Computer Vision," Prentice-Hall, Inc., Upper Saddle River, New Jersey, 1998
(ISBN 0-13-261108-2).
- Iain E G Richardson, "H.264 and MPEG-4 Video Compression," John Wiley &
Sons, September 2003, ISBN 0-470-84837-5
- M. E. Al-Mualla,
C. N. Canagarajah and D. R. Bull, “Video
Coding for Mobile Communications: Efficiency, Complexity and Resilience”,
Elsevier Science, Academic Press, 2002. ISBN: 0120530791
- A. Gersho and R. Gray, "Vector Quantization and Signal Compression,"
Boston: Kluwer Academic Publishers, 1992.
Instructor:
Dr. Dapeng Wu
Office: Y6321, AC-1 Building
Email: dapengwu@cityu.edu.hk
TA:
1) Siyuan Guo
Email: siyuanguo7-c@my.cityu.edu.hk
2) Hong Huang
Email: hohuang-c@my.cityu.edu.hk
3) Yongcan Luo
Email: yongcaluo2-c@my.cityu.edu.hk
4) Hongming Piao
Email: hpiao6-c@my.cityu.edu.hk
5) Tianli Shi
Email: tianlishi2-c@my.cityu.edu.hk
6) Zixuan Tang
Email: zixuatang6-c@my.cityu.edu.hk
7) Ye Tao
Email: yetao34-c@my.cityu.edu.hk
8) Hao Wang
Email: hwang728-c@my.cityu.edu.hk
9) Shuguang Wang
Email: sgwang6-c@my.cityu.edu.hk
10) Yun Wang
Email: ywang3875-c@my.cityu.edu.hk
11) Renwei Yang
Email: renweyang2-c@my.cityu.edu.hk
12) Jiaxun Ye
Email: jiaxunye-c@my.cityu.edu.hk
13) Jiahao Zheng
Email: jhzheng4-c@my.cityu.edu.hk
Course website: https://www.cs.cityu.edu.hk/~dapengwu/courses/CS5187f25
Meeting Time for Lectures
Friday, 7 pm - 8:50 pm
Meeting Time for Tutorials
Friday, 9 pm - 9:50 pm
Meeting Room for Lectures and Tutorials
Room 3505, AC-2 Building
Course Policies
- During lectures, cell phones should be in silent mode.
- Late submissions of homework solutions and project reports are not accepted
unless the instructor grants permission in advance.

Grading:
| Grades | Percentage | Due Dates |
| Homework assignments | 30% | To be announced |
| Project | 20% | 4pm, Dec. 5 |
| Final exam | 50% | Dec. 8-20 |
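As a rough illustration of how the percentages above combine, the sketch below computes an overall score as a weighted sum. Only the three weights come from the grading table; the assumption that each component is scored on a 0-100 scale, the function name `overall_score`, and the sample scores are hypothetical and are not part of the official grading procedure.

```python
# Weights taken from the grading table; everything else is illustrative.
WEIGHTS = {"homework": 0.30, "project": 0.20, "final_exam": 0.50}


def overall_score(scores):
    """Return the weighted sum of component scores.

    Assumes (hypothetically) that each component is on a 0-100 scale.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical student: 80 on homework, 90 on the project, 70 on the final exam.
example = overall_score({"homework": 80, "project": 90, "final_exam": 70})
print(round(example, 2))
```

Because the final exam carries half of the weight, a 10-point change on the exam moves the overall score by 5 points, compared with 3 points for homework and 2 points for the project.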
Class Project:
The class project will be done individually. Each student is expected to write
a report documenting their research, critical comparison and analysis, and new
ideas.
For details about
the project, please read here.
Suggested topics for projects are listed here.


Related courses in other schools:
- George Mason University, Computer Vision
- Johns Hopkins University, Image Compression and Packet Video
- Polytechnic University, Video Processing
- Purdue University, Digital Video Systems
- Stanford University, Digital Video Processing
- University of California, Berkeley, Multimedia Signal Processing,
Communications and Networking
- University of Maryland, College Park, Digital Image Processing
- University of Maryland, College Park, Multimedia Communication & Information
Security: A Signal Processing Perspective
Useful links
- Anaconda: Anaconda is the
leading open data science platform powered by Python.
- Theano: Theano is a Python library that lets you define, optimize, and
evaluate mathematical expressions, especially ones with multi-dimensional
arrays (numpy.ndarray).
- TensorFlow: TensorFlow
is an open source software library for numerical computation using data flow
graphs. Nodes in the graph represent mathematical operations, while the graph
edges represent the multidimensional data arrays (tensors) communicated
between them. The flexible architecture allows you to deploy computation to
one or more CPUs or GPUs in a desktop, server, or mobile device with a single
API.
- Keras: Keras is a minimalist,
highly modular neural networks library, written in Python and capable of
running on top of either TensorFlow or Theano. It was developed with a focus
on enabling fast experimentation. Being able to go from idea to result with
the least possible delay is key to doing good research.
- PyTorch: PyTorch is a deep learning
framework for fast, flexible experimentation.
- A curated list of resources dedicated to
recurrent neural networks
- Source code in Python for handwritten digit recognition, using deep neural
networks: [another link]
- Source code in PyTorch for handwritten digit recognition, using deep neural
networks
- Source code in Python for
TF-mRNN: a TensorFlow library for image captioning
- Source code in Python for the following work on image captioning:
- Image captioning:
- Microsoft COCO datasets
- Visual Question Answering:
- Semantic Propositional Image Caption Evaluation (SPICE)
- Region-based Convolutional Neural Networks (R-CNN)
- References:
- Ren, Shaoqing, Kaiming He, Ross Girshick, and Jian Sun. "Faster R-CNN:
Towards real-time object detection with region proposal networks." In
Advances in neural information processing systems, pp. 91-99. 2015. [pdf]
- Dai, Jifeng, Yi Li, Kaiming He, and Jian Sun. "R-FCN: Object detection
via region-based fully convolutional networks." In Advances in neural
information processing systems, pp. 379-387. 2016. [pdf]
[source code]
- Huang, Jonathan, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop
Korattikara, Alireza Fathi, Ian Fischer et al. "Speed/accuracy trade-offs
for modern convolutional object detectors." arXiv preprint
arXiv:1611.10012 (2016). [pdf]
(E.g., for Inception V3, extract features from the “Mixed 6e” layer whose
stride size is 16 pixels. Feature maps are cropped and resized to 17x17.)
- Source codes:
- Source code in Python for end-to-end training of LSTM
- Bidirectional Encoder Representations from Transformers (BERT)
- Source code in Python for sequence-to-sequence learning (language
translation, chatbot)
- AI City Challenge
- Visual Storytelling Dataset (VIST)
- Visual storytelling algorithms:
- No Metrics Are Perfect: Adversarial REward Learning for Visual
Storytelling: source codes (TensorFlow)
- Visual Genome is a dataset, a
knowledge base, an ongoing effort to connect structured image concepts to
language.
- MPII Movie & Description dataset for automatic video description, video
summary, video storytelling
- Bidirectional recurrent neural networks (B-RNN):
- Graves, Alex, Navdeep Jaitly, and Abdel-rahman Mohamed. "Hybrid speech
recognition with deep bidirectional LSTM." IEEE Workshop on Automatic Speech
Recognition and Understanding (ASRU), 2013. [pdf]
- Deep reinforcement learning
- UCL Course on reinforcement learning: [ppt]
[video]
- References:
- Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis
Antonoglou, Daan Wierstra, and Martin Riedmiller. "Playing
atari with deep reinforcement learning." arXiv preprint
arXiv:1312.5602 (2013).
- Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel
Veness, Marc G. Bellemare, Alex Graves et al. "Human-level
control through deep reinforcement learning." Nature 518, no.
7540 (2015): 529-533. [source
code]
- How to Study Reinforcement Learning
- Source codes:
- Implementation of Reinforcement Learning Algorithms. Python, OpenAI Gym,
TensorFlow. Exercises and Solutions to accompany Sutton's Book and David
Silver's course. [link]
- Generative Adversarial Network (GAN)
- References:
- Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley,
Sherjil Ozair, Aaron Courville, and Yoshua Bengio. "Generative
adversarial nets." In Advances in neural information processing
systems, pp. 2672-2680. 2014.
- Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised
representation learning with deep convolutional generative adversarial
networks." arXiv preprint arXiv:1511.06434 (2015).
- Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein
GAN." arXiv preprint arXiv:1701.07875 (2017).
- Types of GAN
- Vanilla GAN
- Conditional GAN
- InfoGAN
- Wasserstein GAN
- Mode Regularized GAN
- Coupled GAN
- Auxiliary Classifier GAN
- Least Squares GAN
- Boundary Seeking GAN
- Energy Based GAN
- f-GAN
- Generative Adversarial
Parallelization
- DiscoGAN
- Adversarial Feature
Learning & Adversarially
Learned Inference
- Boundary Equilibrium GAN
- Improved Training for
Wasserstein GAN
- DualGAN
- MAGAN: Margin Adaptation
for GAN
- Softmax GAN
- Source codes:
- A Tensorflow
Implementation of "Deep Convolutional Generative Adversarial Networks":
python code
- Collection
of generative models, e.g. GAN, VAE in Pytorch and Tensorflow:
python code
- Sequential Generative Adversarial Network (GAN)
- References:
- Yu, Lantao, Weinan Zhang, Jun Wang, and Yong Yu. "SeqGAN:
Sequence Generative Adversarial Nets with Policy Gradient." In AAAI,
pp. 2852-2858. 2017.
- Mogren, Olof. "C-RNN-GAN:
Continuous recurrent neural networks with adversarial training." arXiv
preprint arXiv:1611.09904 (2016).
- Im, Daniel Jiwoong, Chris Dongjoo Kim, Hui Jiang, and Roland Memisevic.
"Generating images with
recurrent adversarial networks." arXiv preprint arXiv:1602.05110
(2016).
- Press, Ofir, Amir Bar, Ben Bogin, Jonathan Berant, and Lior Wolf. "Language
Generation with Recurrent Generative Adversarial Networks without
Pre-training." arXiv preprint arXiv:1706.01399 (2017).
- Source codes:
- Subjective
evaluation for content aware video processing techniques
- Cancer imaging
archive: TCIA data are organized as “collections”; typically these are
patient cohorts related by a common disease (e.g. lung cancer), image modality
or type (MRI, CT, digital histopathology, etc) or research focus.
- MATLAB Tutorial
- MATLAB Central
- Matlab Primer, Matlab Manuals, Image Processing Toolbox
- Matlab implementation of image/video compression algorithms
- Introduction to Matrix Algebra (free book by Autar K Kaw, Professor,
University of South Florida).
- Matrix Reference Manual
- HIPR2: a WWW-based Image
Processing Teaching Materials with J
- LIDAR
- Learning by simulations
- OpenCV
- OpenGL
- Download the following
free (open source)
program to record video with screen capture:
http://www.nchsoftware.com/capture/index.html?gclid=CNadwsW6-6wCFSVjTAodbjzTSg
- SD and HD video sequences for
evaluating coding performance of video codec:
http://media.xiph.org/video/derf/
- WebRTC: WebRTC is
a free, open-source project that enables web browsers with Real-Time
Communications (RTC) capabilities via simple JavaScript APIs.
The Missing Semester of Your CS Education
Standards:
ATSC (Advanced Television Systems Committee) & HDTV (High Definition
Television):
MPEG (Moving Picture Experts Group):
Software:
- Video codec
- Virtual Dub: VirtualDub
is a video capture/processing utility for 32-bit Windows platforms
(95/98/ME/NT4/2000/XP), licensed under the GNU General Public License (GPL).
- XnView: an efficient multimedia viewer, browser and converter.
- ImageJ: Read and write GIF,
JPEG, and ASCII. Read BMP, DICOM, and FITS. [Open Source, Public Domain]
- Open source for image processing tasks:
http://octave.sourceforge.net/doc/image.html
- Photosynth: you can access
gigabytes of photos in seconds, view a scene from nearly any angle, find
similar photos with a single click, and zoom in to make the smallest detail as
big as your monitor.
- Video filtering and compression,
by the Video Group, Moscow State University
- MSU Lossless
Video Codec, by the Video Group, Moscow State University
HSI color model
Compression link:
http://cchen1.et.ntust.edu.tw/compression/compression.htm
JOURNALS
Elsevier
- Computer Vision and Image Understanding
- Digital Signal Processing: A Review Journal
- Graphical Models and Image Processing
- Journal of Visual Communication and Image Representation
- Real-Time Imaging
- Computers & Graphics
- Data & Knowledge Engineering
- Image and Vision Computing
- Pattern Recognition
- Pattern Recognition Letters
- Signal Processing
- Signal Processing: Image Communication
IEEE
- IEEE Transactions on Circuits and Systems for Video Technology
- IEEE Transactions on Multimedia
- IEEE Transactions on Image Processing
- IEEE Transactions on Medical Imaging
- IEEE Transactions on PAMI
Kluwer
SPIE
Digital Video and Multimedia Standards Pages
Digital TV and DVD
Overview of the AVI format
Computer Vision
Public Domain Image Databases
CMU Database
Patent licensing
As with MPEG-2 Parts 1 and 2 and MPEG-4 Part 2, among others, vendors of
H.264/AVC products and services are expected to pay patent licensing royalties
for the patented technology that their products use. The primary source of
licenses for patents applying to this standard is a private organization known
as MPEG-LA, LLC (which is not affiliated in any way with the MPEG
standardization organization, but which also administers patent pools for
MPEG-2 Part 1 Systems, MPEG-2 Part 2 Video, MPEG-4 Part 2 Video, and other
technologies).
To search patents, visit the free patent search site
www.FreePatentsOnline.com.
Free books
