This page documents the weekly progress of our 12-770 course project on acoustic anomaly detection for small fan and HVAC-like systems. Our work combines hardware setup, real-world audio acquisition, signal analysis, baseline model development, and interactive demo deployment.
At the current stage, the project has progressed through three major phases:
The long-term goal of this project is to build a practical anomaly detection pipeline that can distinguish normal and abnormal fan operating conditions from sound recordings, and later adapt the system more closely to our own physical setup through fine-tuning.
During the first week, the main focus was on building a functional sensing pipeline using the ESP32 and the INMP441 I2S microphone. The setup involved wiring the I2S interface (WS, SCK, SD), configuring the ESP32 I2S driver, and streaming acquired data through serial communication.
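One host-side detail worth noting is how the raw I2S words are interpreted: the INMP441 emits 24-bit samples left-justified in 32-bit frames, so the upper 16 bits carry the usable signal. The helper below is a minimal sketch of this conversion (the function name and the little-endian framing assumption are ours, not from the actual firmware):

```python
import numpy as np

def i2s_frames_to_int16(raw: bytes) -> np.ndarray:
    """Convert raw 32-bit I2S frames from an INMP441 to 16-bit PCM.

    Assumes little-endian 32-bit words with the 24-bit sample
    left-justified, so the top 16 bits hold the significant part.
    """
    frames = np.frombuffer(raw, dtype="<i4")   # little-endian int32 words
    return (frames >> 16).astype(np.int16)     # keep the top 16 bits
```

Doing this conversion on the PC keeps the ESP32 side simple: it only forwards raw frames over serial.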
Several problems were identified during the initial setup phase:
The following possible causes were tested during debugging:
- WS, SCK, and SD pin connections
- L/R pin connection, leading to channel selection issues

To isolate the problem, the following steps were taken:
- L/R pin was fixed rather than left floating

By the end of Week 1, a stable data acquisition pipeline had been established, and the microphone was confirmed to output valid non-zero signals.
In the second week, the project moved from hardware debugging to signal quality improvement, preliminary signal validation, and the setup of the baseline anomaly detection model. The goal was not only to confirm that the sensing system could capture meaningful differences across operating conditions, but also to begin constructing the machine-learning pipeline that will later support anomaly detection.
A major issue involving distorted or difficult-to-interpret output was addressed.
Three scenarios were recorded for comparison:
All experiments were conducted under real-world conditions with persistent environmental noise from laptop operation.
The collected data were analyzed in Python using:
Results are shown below:

Even in the presence of persistent background noise, the system was able to distinguish among different operating conditions. This suggests that amplitude-based features are both robust and sensitive to mechanical state changes.
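As an illustration of the amplitude-based features mentioned above, a windowed RMS computation is one simple, robust choice (a sketch under our own naming; the exact feature used in the analysis scripts may differ):

```python
import numpy as np

def rms_per_window(signal: np.ndarray, win: int = 1024) -> np.ndarray:
    """Root-mean-square amplitude over non-overlapping windows.

    Returns one RMS value per complete window; trailing samples
    that do not fill a window are dropped.
    """
    n = len(signal) // win
    frames = signal[: n * win].reshape(n, win).astype(np.float64)
    return np.sqrt(np.mean(frames ** 2, axis=1))
```

Comparing the RMS trajectories of the three recorded scenarios makes the amplitude differences between operating conditions directly visible.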
In parallel with signal analysis, we also established the baseline model training pipeline for acoustic anomaly detection. The objective of this stage was to convert raw audio into structured spectral features and train a reconstruction-based model that can later be used to distinguish normal and abnormal operating sounds.
The input .wav files are first loaded as floating-point waveforms and resampled to 16,000 Hz. This standardizes the sampling rate while preserving the dominant frequency content relevant to fan anomaly detection.
Next, Short-Time Fourier Transform (STFT) is applied with:
- n_fft = 1024
- hop_length = 512

This transforms the waveform into a time-frequency representation, allowing the model to capture how spectral content evolves over time.
The linear-frequency spectrum is then mapped to 64 mel bands, producing a compact mel-spectrogram representation. After that, the mel-spectrogram is converted to log-mel energy, which compresses the dynamic range and improves numerical stability for learning.
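The full front end described above (framing, STFT, mel mapping, log compression) can be sketched in plain NumPy as follows. This is an illustrative reimplementation under the stated parameters (16 kHz, n_fft=1024, hop 512, 64 mel bands), not the project's actual preprocessing code, which may use a library such as librosa:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=16000, n_fft=1024, n_mels=64):
    """Triangular mel filters mapping |STFT|^2 bins to n_mels bands."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:  # rising edge of the triangle
            fb[i, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:  # falling edge
            fb[i, c:r] = (r - np.arange(c, r)) / (r - c)
    return fb

def log_mel(wave, sr=16000, n_fft=1024, hop=512, n_mels=64):
    """Waveform -> (n_frames, n_mels) log-mel energies."""
    n_frames = 1 + (len(wave) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log(mel + 1e-10)  # small offset for numerical stability
```

The log at the end is what compresses the dynamic range, as noted above.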
Instead of using one spectral frame at a time, the model stacks 5 consecutive frames together. Since each frame contains 64 mel coefficients, this results in a 320-dimensional input vector:
64 mel coefficients × 5 frames = 320 dimensions

This step allows the model to capture short-term temporal context rather than relying only on a single instantaneous spectral snapshot.
For each audio clip, frame stacking is applied in a sliding-window manner across the full signal. As a result, one audio file is converted into a sequence of 320-dimensional vectors.
This design increases the effective number of training samples and allows file-level anomaly scores to be computed by aggregating reconstruction errors across many local feature vectors.
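The sliding-window frame stacking described above amounts to the following (a minimal sketch; the function name is ours):

```python
import numpy as np

def stack_frames(logmel: np.ndarray, context: int = 5) -> np.ndarray:
    """Stack `context` consecutive frames in a sliding window.

    Input:  (n_frames, n_mels) log-mel matrix.
    Output: (n_frames - context + 1, n_mels * context) vectors,
            i.e. 320-dimensional for 64 mel bands and 5 frames.
    """
    n_frames, _ = logmel.shape
    n_vec = n_frames - context + 1
    return np.stack([logmel[i : i + context].ravel() for i in range(n_vec)])
```

A 10-second clip at these settings yields a few hundred such vectors, which is what makes file-level score aggregation meaningful.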
The baseline anomaly detector is implemented as a fully connected autoencoder.
The hidden layers use ReLU activation functions to provide nonlinear representation power, while the output layer remains linear so that the reconstructed output can match the continuous log-mel feature space.
The 8-dimensional bottleneck layer forces the network to retain only the most salient structure of normal acoustic patterns. As a result, abnormal sounds are expected to be reconstructed less accurately.
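The forward pass of such an autoencoder can be sketched as below. Only the 320-dimensional input and the 8-dimensional bottleneck are fixed by the description above; the intermediate widths (128, 64) are assumptions for illustration, and a real implementation would use a deep-learning framework with trained weights rather than random ones:

```python
import numpy as np

# Assumed layer widths; only the 320-d input and 8-d bottleneck are given.
DIMS = [320, 128, 64, 8, 64, 128, 320]

def init_params(rng, dims=DIMS):
    """Random (untrained) weights and biases for each dense layer."""
    return [(rng.standard_normal((a, b)) * np.sqrt(2.0 / a), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def forward(x, params):
    """Fully connected autoencoder: ReLU hidden layers, linear output."""
    h = x
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:       # output layer stays linear
            h = np.maximum(h, 0.0)
    return h
```

The linear output layer matters because log-mel features take arbitrary real values, which a ReLU output could not reproduce.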
The model is trained using mean squared error (MSE) between the input vector and the reconstructed output:
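In the standard formulation, with $\mathbf{x}$ the stacked input vector, $\hat{\mathbf{x}}$ its reconstruction, and $D = 320$ the feature dimension:

```latex
\mathrm{MSE}(\mathbf{x}, \hat{\mathbf{x}}) \;=\; \frac{1}{D} \sum_{i=1}^{D} \left( x_i - \hat{x}_i \right)^2
```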
This makes MSE suitable both as the training loss and as the anomaly score during inference.
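At inference time, the per-vector MSE values from one clip are aggregated into a single file-level score, which is then compared to a threshold. A minimal sketch (function names and mean aggregation are our illustrative choices):

```python
import numpy as np

def clip_anomaly_score(vectors: np.ndarray, reconstructed: np.ndarray) -> float:
    """File-level score: mean per-vector MSE across all stacked
    feature vectors of one clip (both arrays shaped (n_vectors, 320))."""
    per_vec = np.mean((vectors - reconstructed) ** 2, axis=1)
    return float(per_vec.mean())

def decide(score: float, threshold: float) -> str:
    """Threshold-based decision used during inference."""
    return "anomalous" if score > threshold else "normal"
```

Because the model is trained only on normal sounds, higher reconstruction error is read as evidence of an abnormal operating state.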
By the end of Week 2, the project had achieved progress on both the sensing and modeling sides:
In the third week, the project progressed from offline analysis and baseline model setup to interactive demonstration and real-world recording validation. The main focus was on building a working demo interface, testing the model with externally recorded fan sounds, and extending the system from offline uploaded-audio analysis to browser-based live microphone inference.
During this stage, we built an interactive demo for anomaly detection. The demo was designed to support both uploaded audio analysis and live microphone analysis using the same preprocessing and reconstruction-based inference pipeline.
The updated demo now supports:
In live mode, the browser continuously streams microphone audio to the backend. Instead of analyzing each tiny chunk independently, the system maintains a rolling buffer containing the most recent 10 seconds of audio. After at least 2 seconds of signal have been accumulated, the backend begins inference and updates the prediction every 1 second. At each update, the buffered waveform is converted into stacked log-mel features and passed through the four trained models to generate anomaly scores and threshold-based decisions.
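The buffering policy described above (10-second rolling window, inference only after 2 seconds have accumulated) can be sketched as follows; class and method names are ours, and the 16 kHz rate is assumed from the preprocessing stage:

```python
import numpy as np

SR = 16000          # assumed sampling rate, matching the preprocessing stage
MAX_SEC = 10        # rolling buffer keeps at most the latest 10 s
MIN_SEC = 2         # inference starts once 2 s have accumulated

class RollingBuffer:
    """Maintain the most recent MAX_SEC seconds of streamed audio."""

    def __init__(self) -> None:
        self.buf = np.zeros(0, dtype=np.float32)

    def push(self, chunk: np.ndarray) -> None:
        """Append a new chunk and drop the oldest samples beyond MAX_SEC."""
        self.buf = np.concatenate([self.buf, chunk.astype(np.float32)])
        self.buf = self.buf[-MAX_SEC * SR:]

    def ready(self) -> bool:
        """True once enough audio has accumulated to run inference."""
        return len(self.buf) >= MIN_SEC * SR
```

A periodic task (here, every 1 second) then checks `ready()` and, if so, runs the stacked log-mel features through the trained models.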
This step was important because it connected the modeling pipeline to a user-facing interface and made the anomaly detection results easier to test and interpret in practice.
To evaluate the system under more realistic conditions, we conducted a new set of recording experiments using a smartphone microphone.
The smartphone was positioned:
Three voltage conditions were tested:
For each voltage level, two groups of experiments were conducted:
The goal of this experiment was to determine whether the system could capture meaningful acoustic differences not only across different operating voltages, but also between normal and disturbed airflow conditions.
This experiment also served as an important intermediate step between controlled baseline data and future fine-tuning on real fan recordings.
From the smartphone recordings, the acoustic difference between the two operating groups was clearly perceptible.
Key observations included:
These results suggest that the fan sound contains meaningful acoustic signatures related both to operating intensity and to physical disturbance conditions.
In addition to offline recording tests, we further developed and tested a real-time demo workflow based on browser microphone streaming.
In this setup:
This design provided a much more practical form of online anomaly monitoring than the earlier file-only workflow. Rather than waiting for a recording to finish before processing, the system could continuously evaluate the most recent segment of incoming sound and update the results interactively.
This experiment was an important step toward live anomaly detection, since it moved the system closer to practical deployment rather than relying only on pre-recorded .wav files.
By the end of Week 3, the project had achieved the following progress:
Across the first three weeks, the project has achieved the following milestones:
The next phase of the project will focus on improving the connection between real-world recordings and the anomaly detection model. Planned directions include:
Through this assignment, our team reached a strong consensus on our system architecture and experimental validation strategies. Here is what each team member learned during this phase:
Through deploying our Gradio-based interactive diagnostic demo, I gained a deep understanding of setting scientific decision boundaries. By analyzing validation score distributions, I configured precise MSE thresholds (e.g., 6.9, 6.0, 5.3, 7.0) for four machine IDs. A major learning point was balancing response speed and diagnostic stability during live microphone integration. I implemented a 10-second rolling buffer that accumulates 2 seconds of audio before updating predictions every second. Mastering this tradeoff between “time window length” and “interpretability” in real-time engineering was my core takeaway, laying the groundwork for future model adaptation in real physical environments.
My primary focus was building and training the anomaly detection models using the MIMII dataset to accurately classify the operational status of fan motors. I gained hands-on experience by implementing and comparing two distinct architectures: a baseline Dense Autoencoder (DenseAE) and a Residual Autoencoder (ResidualAE). Throughout the optimization process, I learned the critical importance of robust training strategies. I experimented with various techniques to improve model generalization, including injecting synthetic data noise, modifying the underlying network structures, and systematically tuning hyperparameters. These iterative refinements significantly deepened my understanding of autoencoder behaviors in acoustic anomaly detection.
My key learning revolved around the complexities of real-time acoustic data acquisition. Initially building an ESP32 + INMP441 system, I spent significant time debugging issues like discontinuous signals and serial bandwidth bottlenecks at high sampling rates, eventually pivoting to SD card local storage. To ensure our demo progressed smoothly, I engineered a stable PC-based USB microphone pipeline featuring real-time recording and waveform/spectrogram visualization. Resolving buffer and shape mismatch errors taught me how to effectively manage audio data structures. Ultimately, transitioning from offline audio to window-sliced real-time streaming was a crucial structural agreement our team reached this week.