MarketAlert – Real-Time Market & Crypto News, Analysis & Alerts

The analysis of landscape design and plant selection under deep learning – Scientific Reports

Last updated: August 23, 2025 10:15 pm
Published: 6 months ago

Image recognition technique and traditional image segmentation algorithm

The application of digital image processing techniques extends across various domains, including transportation, industry, biology, and other fields. Specifically within landscape design, these technologies are harnessed for the manipulation of natural landscape photographs. First, landscape photos are transferred to the computer and converted into a form that programs can process. Subsequently, this information is input into the landscape element recognition system to realize the automatic recognition of landscape elements. The core part of this process is the image recognition system: a system based on computer vision and machine learning technologies, used to identify and classify objects, items, or scenes within images. The composition of the image recognition system is shown in Fig. 1.

In Fig. 1, typically, an image recognition system consists of four main components: image pre-processing, image feature extraction, scene element recognition, and output results. Image pre-processing is the first step in an image recognition system aimed at converting the raw image into a form suitable for analysis. Pre-processing tasks include noise removal, resizing and adjusting the image’s resolution, contrast enhancement, color normalization, and more. The enhancement of image feature extraction and recognition accuracy in subsequent stages is facilitated through pre-processing. Subsequently, during the image feature extraction phase, the system isolates valuable features from the pre-processed image. These features can include information such as edges, corners, textures, color histograms, shapes, and more. The goal of feature extraction is to convert image information into numerical or vector form, making it understandable and processable by machine learning algorithms. Scene element recognition is the core part of an image recognition system and an area where DL technologies shine. In this stage, machine learning algorithms use the extracted features from the image to identify objects, items, or scenes within the image. This may cover multiple categories, such as animals, traffic signs, buildings, natural landscapes, and more. DL models such as CNN and Recurrent Neural Network models are frequently employed for proficient image classification and recognition tasks. The Output Result is the final step, where the recognition results are presented to users or other applications. This is typically presented in the form of text or visualization, indicating what objects or scenes have been identified in the image. The output results can be integrated with other information for automated decision-making, system monitoring, or enhanced user experiences. 
An image recognition system goes through these four steps, transforming visual information into understandable and useful data, enabling automated image recognition and classification, and can excel in the field of landscape design and plant selection.
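The four stages above can be sketched end to end. The following is a toy illustration only: the function names, the two-element feature vector, and the nearest-prototype classifier are illustrative assumptions, not the system described in the paper.

```python
import numpy as np

def preprocess(img):
    """Stage 1: normalize intensities to [0, 1]."""
    img = img.astype(np.float64)
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo) if hi > lo else np.zeros_like(img)

def extract_features(img):
    """Stage 2: a toy feature vector -- mean intensity and mean edge strength."""
    gy, gx = np.gradient(img)
    return np.array([img.mean(), np.hypot(gx, gy).mean()])

def recognize(features, prototypes):
    """Stage 3: nearest-prototype classification of the scene element."""
    dists = {label: np.linalg.norm(features - p) for label, p in prototypes.items()}
    return min(dists, key=dists.get)

def run_pipeline(img, prototypes):
    """Stage 4: return the recognized label as the output result."""
    return recognize(extract_features(preprocess(img)), prototypes)
```

A real system would replace stages 2 and 3 with a learned model (e.g. a CNN), but the data flow is the same.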

Traditional landscape image segmentation algorithms are based on low-level image features such as brightness, color, and texture. These methods aim to partition images into different regions or objects. Common traditional segmentation algorithms include threshold segmentation, region growing, region splitting and merging, and edge detection. Table 1 illustrates the comparison between these algorithms.

In Table 1, these traditional segmentation algorithms typically rely on features such as brightness, color, and texture of the image for segmentation. However, they may be sensitive to noise and changes in lighting conditions and often require manual selection of parameters or seed points. With the development of DL technologies, modern segmentation methods, especially CNN and semantic segmentation models, have made significant advancements in image segmentation tasks. This is because they can automatically learn complex image features, thereby improving segmentation accuracy and robustness.
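As an example of the threshold-segmentation family, Otsu's method picks a global cut that maximizes between-class variance of the intensity histogram. This is a minimal numpy sketch (the bin count and the synthetic two-cluster image are illustrative assumptions):

```python
import numpy as np

def otsu_threshold(img, nbins=256):
    """Threshold segmentation (Otsu): choose the histogram cut that
    maximizes between-class variance of foreground vs. background."""
    hist, edges = np.histogram(img.ravel(), bins=nbins, range=(0.0, 1.0))
    p = hist / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    best_t, best_var = 0.0, -1.0
    for k in range(1, nbins):
        w0, w1 = p[:k].sum(), p[k:].sum()      # class probabilities
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (p[:k] * centers[:k]).sum() / w0  # class means
        mu1 = (p[k:] * centers[k:]).sum() / w1
        var = w0 * w1 * (mu0 - mu1) ** 2        # between-class variance
        if var > best_var:
            best_var, best_t = var, edges[k]
    return best_t

# Synthetic "dark ground / bright sky" intensities
img = np.concatenate([np.full(50, 0.2), np.full(50, 0.8)])
t = otsu_threshold(img)
mask = img > t      # binary segmentation
```

Note how the method depends entirely on brightness and requires no training, which is exactly why it breaks down under noise and lighting changes, as discussed above.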

CNN is a class of artificial neural network models used for image processing, pattern recognition, and DL tasks. CNN draws inspiration from the human visual system and has achieved significant breakthroughs in the field of image processing. CNNs have demonstrated outstanding performance in various applications, including image classification, object detection, semantic segmentation, and facial recognition. In a CNN, neurons share weights across their receptive fields, which makes it feasible to process large images. This weight sharing effectively reduces the number of parameters to be learned during training and improves the algorithm’s performance. The structure of CNN is shown in Fig. 2.

In Fig. 2, the CNN represents a hierarchical architecture comprising multiple convolutional, pooling, and fully connected layers, mirroring the cognitive processes involved in visual information processing within the human brain. Firstly, the convolutional layer, serving as the foundational element of the CNN, employs convolutional kernels to filter the input image, extracting pertinent features. This operation is instrumental in capturing local information and spatial relationships, facilitating the effective learning of features for visual tasks. Secondly, the pooling layer diminishes the spatial dimensions of the image, concurrently mitigating computational complexity and enhancing resilience to image features. Common pooling operations, such as max pooling and average pooling, sample the output of convolutional layers to distil more significant features. Lastly, the fully connected layer maps the features extracted by convolutional and pooling layers to the output layer, culminating in the final classification or recognition of the image. Each neuron in the fully connected layer establishes connections with all neurons in the preceding layer, forming a comprehensive perception structure. The amalgamation and stacking of these layers within CNN construct its deep architecture, empowering the model to make substantial strides in comprehending intricate image features and abstract representations. Convolutional layers execute filtering operations through defined convolutional kernels — small windows or filters sliding over the input image, transforming it into a series of feature maps. This process involves element-wise multiplication of the convolutional kernel with the corresponding region of the input image, followed by accumulation to yield the output of the convolution operation. 
A pivotal characteristic of this operation is weight sharing, where the weight parameters of the convolutional kernel remain consistent throughout its traversal of the entire input image. This weight-sharing mechanism minimizes network parameters, enhances model efficiency, and imparts translational invariance to the network. The convolutional layer employs diverse convolutional kernels to learn various features of the image, including edges, textures, and shapes. These acquired features undergo progressive abstraction as the network’s depth increases, culminating in higher-level representations that afford the network an abstract understanding of the input image. Following the convolutional layer, an activation function, such as the Rectified Linear Unit (ReLU), is commonly integrated to introduce non-linearity and amplify the expressive capacity of the network. Furthermore, pooling layers are consistently interleaved with convolutional layers, facilitating the downsampling of feature maps. This process serves to systematically diminish data dimensions, enhance computational efficiency, and retain crucial features in the network. The working process of the convolution layer is shown in Eq. (1).

In Eq. (1), $x_b^l = f\big(\sum_{a \in M_b} x_a^{l-1} * k_{ab}^l + \beta_b^l\big)$, where $l$ is the index of the convolution layer, $f(\cdot)$ is a nonlinear activation function, $x_b^l$ is the $b$-th feature map of convolution layer $l$, $k_{ab}^l$ is the convolution kernel, $M_b$ is the set of input feature maps, and $\beta_b^l$ is the bias term. The pooling layer functions by specifying the size of a pooling window and systematically traversing it over the feature map. This window, a fixed-size region, moves incrementally across the feature map, conducting an aggregation operation on the values within each window. Commonly employed pooling operations include Max Pooling and Average Pooling. In Max Pooling, values within the pooling window are substituted with the maximum value, while Average Pooling replaces them with the average value within the window. The primary objective of pooling is to retain crucial information while diminishing the dimensions of the feature map, thereby enhancing computational efficiency. The pooling layer’s operations enable the network to reduce data size without compromising features. This reduction is instrumental in minimizing the number of network parameters, mitigating computational complexity, preventing overfitting, and bolstering the model’s robustness. Moreover, the pooling layer contributes to imparting a degree of invariance to translation and small deformations within the network. In the CNN structure, convolutional layers and pooling layers conventionally alternate. The iterative stacking of these layers facilitates the network’s progressive acquisition of increasingly abstract feature representations. Its working process is shown in Eq. (2).

In Eq. (2), $x_b^i = \mathrm{down}(x_b^{i-1})$, where $x_b^{i-1}$ is the $b$-th feature map entering pooling layer $i$, $x_b^i$ is the value after sampling, and $\mathrm{down}(\cdot)$ denotes the subsampling function. Each neuron within the fully connected layer establishes connections with all neurons in the preceding layer, creating a globally connected structure that allows the network to consider all features in the input image and amalgamate them into a reduced-dimensional output. Each connection is assigned a weight parameter throughout this connectivity process, signifying its significance. These weight parameters are iteratively learned during the network’s training using optimization algorithms such as backpropagation and gradient descent. This implies the automatic adjustment of connection weights to better align with the training data. Furthermore, the output of each neuron is computed by multiplying the outputs of neurons in the preceding layer by the weights of their connections and adding a bias; an activation function is then applied to introduce non-linearity, thereby enhancing the network’s capacity to learn intricate features and relationships. The ultimate output of the fully connected layer undergoes processing through the Softmax function. This transformation converts neuron outputs into a distribution representing various categories’ probabilities. Consequently, the network is equipped to perform multi-class classification on input images, furnishing the probability associated with each category. The principal role of the fully connected layer is to integrate features extracted by convolutional and pooling layers, culminating in a high-level abstract representation of the input image. Subsequently, it conducts classification based on the learned weight parameters. This pivotal process facilitates the network’s ability to discern and categorize complex patterns, contributing to its overall effectiveness in image classification tasks.
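The pooling subsampling of Eq. (2) and the fully connected Softmax head can be sketched together (the 4×4 feature map and identity weight matrix are illustrative placeholders, not trained parameters):

```python
import numpy as np

def max_pool(x, size=2):
    """down(*) in Eq. (2): non-overlapping max pooling halves each
    spatial dimension while keeping the strongest responses."""
    h, w = x.shape
    x = x[:h - h % size, :w - w % size]        # crop to a multiple of size
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

def softmax(z):
    z = z - z.max()                            # subtract max for stability
    e = np.exp(z)
    return e / e.sum()

def fully_connected(features, W, b):
    """Global connectivity: every output neuron sees every input feature."""
    return softmax(W @ features + b)

fmap = np.array([[1., 5., 2., 0.],
                 [3., 4., 1., 1.],
                 [0., 2., 6., 3.],
                 [1., 1., 2., 2.]])
pooled = max_pool(fmap)    # each 2x2 block collapses to its maximum
```

Feeding `pooled.ravel()` through `fully_connected` then yields a probability distribution over classes, mirroring the conv → pool → FC → Softmax flow described above.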
In applications such as landscape design and plant selection, which necessitate the preservation of spatial structures and detailed information, CNN models employ convolution and pooling operations in image processing. However, this approach often leads to the loss of information regarding the original image size. The fundamental innovation of the FCN model lies in replacing the fully connected layers of traditional CNN models with a fully convolutional structure. Within the FCN framework, the concluding layers of the network employ transposed convolution operations. This strategic choice serves to restore the spatial dimensions of the feature maps to the size of the original input image. This design ensures that the network retains crucial spatial information in the output, thereby better aligning with the demands for detailed image representation. Furthermore, FCN introduces the concept of skip connections, integrating features from intermediate layers with upsampled features. This approach proves instrumental in capturing features at different hierarchical levels more effectively, thereby enhancing the model’s perceptual capabilities. The FCN’s innovative design addresses the limitations associated with traditional CNN models, particularly in tasks requiring nuanced spatial information and detailed image analysis. The structure of FCN is illustrated in Fig. 3.

In Fig. 3, the FCN image semantic segmentation model comprises three primary steps: image pre-processing, feature extraction and fusion, and classification. This discussion focuses particularly on the feature extraction and fusion section, where the FCN model employs specific structures to achieve this process. The feature extraction and fusion section is intricately composed of three main components: the feature extraction network, upsampling layers, and skip connections. Within the FCN model, the core responsibility of the feature extraction network is to discern essential features from the input image. This stage typically incorporates a sequence of convolutional layers meticulously designed to capture semantic features inherent in the image. Each convolutional layer transforms the original image into feature maps, where each map corresponds to specific semantic features such as edges, textures, and shapes. Upsampling layers play a crucial role in enlarging low-resolution feature maps to match the size of the input image through upsampling operations. Simultaneously, skip connections facilitate the fusion of feature information from different levels, allowing for the amalgamation of high-level and low-level features. This concurrent capture of local and global semantic information contributes significantly to enhancing the performance of the segmentation model, ensuring a more comprehensive understanding of the input image’s intricate features.

A fundamental distinction between the FCN and traditional CNN lies in FCN’s unique capacity to operate without a fixed input image size, directly classifying individual pixels within the image. Notably, FCN diverges from conventional CNN models by featuring a final layer that is a convolutional layer rather than a fully connected layer, resulting in a segmented image as the output. In this context, the convolutional layer assumes the role of the fully connected layer, a process referred to as convolutionalization. Subsequent to the convolution and pooling layers, the downsized image undergoes a progressive restoration to its original size by incorporating upsampling layers, specifically transposed convolution. The conclusive prediction for each pixel in the output image is determined by selecting the maximum value, denoting probability, at that position across all class score maps. The FCN model is trained via supervised learning, utilizing the ReLU activation function. To forestall overfitting, a Dropout layer is integrated into the model, zeroing certain neuron outputs, thereby enhancing the model’s generalization and robustness. This comprehensive process encapsulates the core concept and workflow of the FCN image semantic segmentation model, showcasing its unique adaptability to varying input sizes and its ability to classify individual pixels precisely.

The variants of FCN include FCN-8s, FCN-16s, and FCN-32s, aimed at improving the performance and accuracy of the segmentation model. The numbers 32s, 16s, and 8s represent the upsampling strides, meaning the feature maps are enlarged by a factor of 32, 16, and 8, respectively. A lower stride typically implies higher-resolution segmentation results but also requires more computational resources. FCN-32s is one of the earliest variants of FCN, utilizing a fully convolutional structure and upsampling to enlarge low-resolution feature maps to the size of the input image. It uses skip connections to fuse feature information from different levels, making the segmentation results smoother. FCN-16s builds upon FCN-32s by introducing more feature fusion layers to enhance segmentation quality. It merges feature maps from both shallow and deep levels through skip connections, allowing the model to capture semantic information from different levels. FCN-8s is an extension of FCN-16s, introducing additional feature fusion layers and upsampling operations. It fuses feature information from more levels through skip connections to improve segmentation accuracy. These models have achieved significant success in image semantic segmentation tasks, especially in fields such as autonomous driving, medical image analysis, remote sensing image analysis, and real-time segmentation. However, their application in the field of landscape design is relatively limited. Therefore, this paper applies three variants of FCN to landscape design to create outdoor environments that are more innovative, sustainable, and aesthetically valuable.
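The transposed-convolution upsampling used by all three FCN variants can be sketched in numpy. This is a minimal illustration: in a real FCN the kernel weights are learned, whereas here a fixed 2× bilinear kernel and stride are assumed.

```python
import numpy as np

def transpose_conv2d(x, k, stride=2):
    """Transposed convolution: each input value scatters a scaled copy of
    the kernel into the output, enlarging a coarse map toward input size."""
    h, w = x.shape
    kh, kw = k.shape
    out = np.zeros((stride * (h - 1) + kh, stride * (w - 1) + kw))
    for i in range(h):
        for j in range(w):
            out[i*stride:i*stride+kh, j*stride:j*stride+kw] += x[i, j] * k
    return out

coarse = np.array([[1., 2.],
                   [3., 4.]])                   # a 2x2 coarse score map
bilinear_k = np.array([[0.25, 0.5, 0.25],
                       [0.5,  1.0, 0.5 ],
                       [0.25, 0.5, 0.25]])      # 2x bilinear upsampling kernel
fine = transpose_conv2d(coarse, bilinear_k, stride=2)
```

In FCN-16s and FCN-8s this upsampled map would then be summed with a same-sized feature map from a shallower layer (the skip connection) before the next upsampling stage.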

Building upon traditional CNN and FCN models, this paper introduces a multi-scale feature fusion module. This module effectively integrates multi-scale information by combining convolutional feature maps from different layers. Specifically, through operations such as weighted averaging and max pooling, information is consolidated between low-level and high-level feature maps, thereby enhancing the model’s ability to recognize and segment various elements in complex landscape images. For instance, when processing water landscape elements, low-level feature maps focus more on texture and edge information, while high-level feature maps are better at capturing semantic details. By fusing both, the model can more accurately identify and segment water features. In the encoder section of the FCN model, a channel attention mechanism based on SE blocks is introduced. The SE block adaptively recalibrates the weights of each channel, enabling the model to focus on more important feature channels. Additionally, a spatial attention mechanism is incorporated in the decoder section, allowing the model to focus on spatial features in important areas of the image. This enhanced attention mechanism significantly improves the model’s focus on complex landscape elements, particularly in fine-grained segmentation tasks. To further improve the accuracy of landscape element recognition, this paper combines object detection technology (Faster R-CNN) with the FCN image segmentation model. First, Faster R-CNN is used to detect key landscape elements in the image by generating candidate bounding boxes to mark the locations of these elements. Then, these candidate regions are input into the FCN model for fine-grained image segmentation. This method effectively reduces the interference of background noise while enhancing the model’s accuracy in capturing the boundaries and details of landscape elements, showing particularly high accuracy in PLIR.
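The SE-style channel attention described above can be sketched as follows. This is a schematic, not the paper's exact module: the random feature maps, random bottleneck weights, and reduction ratio of 2 are all placeholder assumptions.

```python
import numpy as np

def se_block(feature_maps, w1, w2):
    """Squeeze-and-Excitation channel attention: squeeze each channel to a
    scalar by global average pooling, excite through a small two-layer
    bottleneck, then rescale every channel by its learned gate."""
    squeezed = feature_maps.mean(axis=(1, 2))        # squeeze: (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)          # ReLU bottleneck
    scale = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))     # sigmoid gates in (0, 1)
    return feature_maps * scale[:, None, None]       # recalibrate channels

rng = np.random.default_rng(0)
fmaps = rng.random((4, 8, 8))        # 4 channels of an 8x8 feature map
w1 = rng.standard_normal((2, 4))     # reduction: 4 channels -> 2
w2 = rng.standard_normal((4, 2))     # expansion: 2 -> 4 gates
recalibrated = se_block(fmaps, w1, w2)
```

Channels whose gate is near 1 pass through almost unchanged while others are suppressed, which is how the model learns to "focus on more important feature channels."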

In landscape design, plant selection is an important step, and the prerequisite for successful plant selection is the accurate identification of plants. Traditional plant recognition methods require tedious pre-processing steps, followed by feature extraction for classification. However, the extracted features often lack consistent quality standards, leading to classification inaccuracies. In contrast, CNN can directly take raw plant images as input without the need for cumbersome pre-processing. However, CNN requires a large number of training samples to perform well, while plant leaf databases are typically small-sample data. Therefore, the experiment chooses to use TL methods for classification. TL is a machine learning approach that leverages existing knowledge to solve problems in different but related domains, with the goal of effectively transferring knowledge to relevant domains. The applied TL-based network models are AlexNet, VGG-16, and Inception V3 models, as shown in Fig. 4.

In Fig. 4, AlexNet consists of 5 convolutional layers and three fully connected layers. Max-pooling layers are placed between the convolutional layers to reduce the size of feature maps. It uses the ReLU activation function and applies Dropout in the first two fully connected layers to prevent overfitting. The final fully connected layer uses softmax for classification. VGG-16, on the other hand, comprises 13 convolutional layers and three fully connected layers. Similar to AlexNet, it employs max-pooling layers between convolutional layers. VGG-16 is characterized by its simple hierarchical structure, where all convolutional and fully connected layers use the same-sized convolutional kernels (typically 3 × 3 kernels). It employs the ReLU activation function and enhances the network’s non-linearity by stacking multiple convolutional layers, thereby improving image classification performance. Inception-V3 consists of multiple Inception modules, each containing different-sized convolutional kernels, such as 1 × 1, 3 × 3, and 5 × 5 kernels, as well as pooling layers. It uses Dropout to prevent overfitting. The design of Inception modules allows the network to have greater width and depth, effectively capturing features at various scales and levels, resulting in outstanding performance in image classification and recognition tasks. All three of these models have achieved significant success in the field of computer vision. TL can leverage the pre-trained model weights of these models and adapt them to specific tasks through fine-tuning, thereby improving the model’s performance on new tasks. These models have been widely applied in various image processing tasks, including image classification, object detection, and image segmentation. In this paper, these three TL-based models are applied to PLIR to accurately select the desired plants for landscape design through improved PLIR.

This paper proposes a domain-adaptive TL strategy, specifically designed for PLIR tasks. In addition to the traditional fine-tuning operations, the strategy involves multi-level fine-tuning of the visual features of plant leaves. Specifically, during the TL process, different network layers are fine-tuned for the image features of different plant categories, further enhancing the model’s performance in plant classification tasks. Compared to traditional TL methods, this strategy effectively improves the model’s accuracy and robustness in new domains, such as PLIR in landscape design. To improve the model’s accuracy in recognizing rare landscape elements, a region-weighted loss function is introduced. In this loss function, different weights are assigned to landscape elements based on their rarity. For rare landscape elements (such as certain plant species), higher weights are given, ensuring that the model focuses more on recognizing these rare elements during training. This approach helps the model effectively address the class imbalance problem in landscape element recognition, thereby enhancing the overall recognition accuracy.
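The region-weighted loss can be sketched as per-class weights applied to pixel-wise cross-entropy. The exact weighting scheme of the paper is not given, so the two-class example and the rarity weights below are illustrative assumptions.

```python
import numpy as np

def region_weighted_ce(probs, labels, class_weights):
    """Weighted pixel-wise cross-entropy: each pixel's loss is scaled by
    the rarity weight of its ground-truth class, so errors on rare
    landscape elements contribute more to the objective."""
    h, w = labels.shape
    rows, cols = np.indices((h, w))
    p_true = probs[labels, rows, cols]      # predicted prob of the true class
    px_weight = class_weights[labels]       # per-pixel rarity weight
    return (px_weight * -np.log(p_true)).sum() / px_weight.sum()

# Hypothetical numbers: class 0 is common (weight 1), class 1 rare (weight 5).
probs = np.stack([np.full((2, 2), 0.9),     # model's prob map for class 0
                  np.full((2, 2), 0.1)])    # model's prob map for class 1
labels = np.array([[0, 0], [0, 1]])         # one rare-class pixel
weights = np.array([1.0, 5.0])
loss = region_weighted_ce(probs, labels, weights)
```

With uniform weights the single badly predicted rare pixel is averaged away; with rarity weights it dominates the loss, pushing training toward the rare class.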

In the experiment, all samples in the plant leaf image dataset are randomly horizontally and vertically flipped, resulting in a dataset expanded by three times its original size. Subsequently, the expanded dataset is divided into training and testing sets proportionately. Then, the pre-trained parameters of the AlexNet, VGG-16, and Inception-V3 models on ImageNet are transferred to the plant leaf image dataset, with only the last fully connected layer being replaced.
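The threefold flip expansion described above can be sketched deterministically (the source applies the flips randomly; emitting all three versions per sample yields the same 3× dataset size):

```python
import numpy as np

def augment_flips(images):
    """Triple the dataset: keep the original and add a horizontal and a
    vertical flip of every sample."""
    out = []
    for img in images:
        out.append(img)
        out.append(np.fliplr(img))   # horizontal flip
        out.append(np.flipud(img))   # vertical flip
    return out

# Two toy "leaf images" stand in for the plant leaf dataset.
leaves = [np.arange(9, dtype=float).reshape(3, 3), np.eye(3)]
augmented = augment_flips(leaves)
```

The expanded list would then be split into training and testing sets before transferring the ImageNet-pretrained weights.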

Landscape elements are divided into six categories: sky, water, mountains, animals and plants, buildings, and roads. The FCN’s segmentation results are compared with the ground truth, the model’s accuracy is analyzed, and the pixel error in the landscape image is counted. Then, the result of the feature classification in the landscape image is judged. The overall precision of the model is calculated, as shown in Eq. (3).

In Eq. (3), the overall precision is $P = \frac{\sum_a p_{aa}}{\sum_a \sum_b p_{ab}}$, where $p_{aa}$ refers to the number of pixels belonging to class $a$ and judged correctly, $\sum_b p_{ab}$ is the total number of pixels belonging to class-$a$ elements, and $p_{ab}$ is the number of pixels belonging to class $a$ but judged to be class-$b$ elements. The accuracy of class-$a$ landscape elements is the proportion of pixels belonging to class $a$ whose predictions are correct, as shown in Eq. (4): $Q_a = \frac{p_{aa}}{\sum_b p_{ab}}$.

The calculation method of average accuracy Q is shown in Eq. (5).

In Eq. (5), the average accuracy is $Q = \frac{1}{K}\sum_{a=1}^{K} Q_a$, where $K$ is the total number of landscape element categories and $1 \le a \le K$. The Intersection over Union (IoU) measures, for each category, the overlap between the predicted pixels and the ground-truth pixels as the ratio of their intersection to their union, as shown in Eq. (6): $\mathrm{IoU}_a = \frac{p_{aa}}{\sum_b p_{ab} + \sum_b p_{ba} - p_{aa}}$.

The performance of the TL-based PLIR model is evaluated by the accuracy of the test set, as shown in Eq. (7).

In Eq. (7), the test-set accuracy is $\mathrm{Accuracy} = \frac{M}{N}$, where $M$ is the number of correctly recognized plant leaf images in the test set and $N$ is the total number of tested plant leaf images.
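All of the segmentation metrics in Eqs. (3)–(6) can be computed from a single pixel confusion matrix. A minimal sketch, assuming `conf[a, b]` counts pixels of true class `a` predicted as class `b` (the 2×2 matrix is a toy example, not experimental data):

```python
import numpy as np

def segmentation_metrics(conf):
    """Compute overall precision (Eq. 3), per-class accuracy (Eq. 4),
    average accuracy (Eq. 5), and mean IoU (Eq. 6) from a confusion
    matrix where conf[a, b] = pixels of true class a predicted as b."""
    diag = np.diag(conf).astype(float)            # correctly judged pixels p_aa
    overall = diag.sum() / conf.sum()             # Eq. (3)
    per_class = diag / conf.sum(axis=1)           # Eq. (4)
    avg_acc = per_class.mean()                    # Eq. (5)
    union = conf.sum(axis=1) + conf.sum(axis=0) - diag
    mean_iou = (diag / union).mean()              # Eq. (6)
    return overall, per_class, avg_acc, mean_iou

conf = np.array([[8, 2],
                 [1, 9]])
overall, per_class, avg_acc, mean_iou = segmentation_metrics(conf)
```

The test-set accuracy of Eq. (7) is simply `M / N` over whole images rather than pixels, so it needs no confusion matrix.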

Read more on Nature

This news is powered by Nature.
