For this paper, we used three datasets: Deep Grasping, ImageNet, and HandCam, each described in more detail below. These datasets were adapted or created to test grasp learning for a prosthetic hand whose embedded camera sees an object in its field of view. More specifically, each dataset consists of close-range images of objects that can be picked up by a human hand.
Annotations: Each image was hand-annotated with one of five grasps: power, tool, key, pinch, or three-jaw chuck. The annotations are provided in JSON format as follows:
{
    "[image_name_0]": {
        "grip": "[grip_type]",
        "comment": "[comment_text]"
    },
    "[image_name_1]": {
        "grip": "[grip_type]",
        "comment": "[comment_text]"
    },
    ...
}
[image_name]   - Name of the annotated image
[grip_type]    - One of five values: "3 jaw chuck", "key", "pinch", "power", or "tool"
[comment_text] - Space for any comments made during annotation
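As an illustration of how these annotation files can be consumed, below is a minimal sketch that loads the JSON and builds a mapping from image name to grasp label. The file name "annotations.json" is a placeholder for this sketch, not part of the released data.

import json

# "annotations.json" is a placeholder file name for this sketch; substitute the
# annotation file that ships with the dataset you are using.
with open("annotations.json", "r") as f:
    annotations = json.load(f)

VALID_GRIPS = {"3 jaw chuck", "key", "pinch", "power", "tool"}

# Build a mapping from image name to grasp label, checking that every label is
# one of the five expected grip strings.
image_to_grip = {}
for image_name, fields in annotations.items():
    grip = fields["grip"]
    if grip not in VALID_GRIPS:
        raise ValueError(f"Unexpected grip '{grip}' for image '{image_name}'")
    image_to_grip[image_name] = grip

print(f"Loaded {len(image_to_grip)} annotated images")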
Deep Grasping: This dataset was published by Ian Lenz, Honglak Lee, and Ashutosh Saxena of the Robot Learning Lab at Cornell as part of their paper "Deep Learning for Detecting Robotic Grasps". More information is available, and the dataset can be downloaded, at http://pr.cs.cornell.edu/deepgrasping/. In case the data is changed or removed as their work evolves, we include here the data and annotations that we used for our experiments.
Below are some sample images from this dataset. All images are taken from roughly the same perspective, with the object on a white background, and have a resolution of 640 x 480 pixels.
Below is a table of the bias in the dataset based on our annotations; more specifically, it is the percentage of images labeled with each grasp. As can be seen, the grasps are not equally represented, which motivated us to collect additional data. A short sketch showing how these percentages can be computed from the annotation file follows the table.
     | Key   | Pinch  | Power  | 3 Jaw Chuck | Tool
Bias | 0.0 % | 21.8 % | 47.0 % | 28.0 %      | 3.2 %
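The percentages above can be reproduced directly from the annotation file. Below is a small standalone sketch that counts the labels and prints the per-grasp percentages; the file name "annotations.json" is the same placeholder used in the loading sketch above.

import json
from collections import Counter

# "annotations.json" is the same placeholder file name used in the loading
# sketch above.
with open("annotations.json", "r") as f:
    annotations = json.load(f)

# Count how many images carry each grasp label and report the percentages.
counts = Counter(fields["grip"] for fields in annotations.values())
total = sum(counts.values())

for grip in ("key", "pinch", "power", "3 jaw chuck", "tool"):
    print(f"{grip:12s} {100.0 * counts.get(grip, 0) / total:5.1f} %")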
ImageNet: ImageNet is a large, popular dataset in the computer vision community, originally developed for object recognition and detection. It consists of over 14 million images with annotated objects and bounding boxes. More information about ImageNet can be found at http://image-net.org/index. Since we want to classify images into one of five grasps, we downloaded images for 25 common graspable object categories: Ball, Basket, Blowdryer, Bowl, Calculator, Camera, Can, Cup, Deodorant, Flashlight, Glassware, Keys, Lotion, Medicine, Miscellaneous, Mugs, Paper, Pen, Remote, Scissors, Shears, Shoes, Stapler, Tongs, and Utensils. The images were curated with a preference for close-up views of real objects (avoiding computer-generated images), which resulted in a final count of 5,180 images. These images were then annotated with one of the five grasps.
Below are some sample images from our curated portion of the ImageNet dataset. As can be seen, the images are taken from a variety of viewpoints and under a variety of lighting conditions, and their resolution varies.
Below is a table of the bias in the dataset based on our annotations; more specifically, it is the percentage of images labeled with each grasp. As can be seen, the grasps are not equally represented; however, the distribution is closer to uniform than in the Deep Grasping annotations.
     | Key    | Pinch  | Power  | 3 Jaw Chuck | Tool
Bias | 11.8 % | 10.6 % | 47.5 % | 19.2 %      | 10.9 %
HandCam: The HandCam dataset was created to test how well our algorithm works for grasp selection from the camera in our prosthetic hand. We trained our classifiers on the ImageNet dataset above and used the HandCam dataset only for testing. For each of the five grasps, ten objects were chosen and photographed from five different perspectives, giving 50 images per grasp and 250 images in total, with an equal representation of each grasp.
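As a rough illustration of this train-on-ImageNet, test-on-HandCam protocol, the sketch below computes per-grasp and overall accuracy on the HandCam annotations. The file name "handcam_annotations.json", the "handcam_images/" directory, and the predict_grasp stand-in are assumptions for the example only; in practice the trained classifier would replace the stand-in.

import json
from collections import Counter

# Placeholder names: "handcam_annotations.json", the "handcam_images/"
# directory, and predict_grasp below are assumptions for this sketch, not part
# of the released data or code.
with open("handcam_annotations.json", "r") as f:
    annotations = json.load(f)

def predict_grasp(image_path):
    # Stand-in for the classifier trained on the ImageNet annotations; always
    # predicting "power" gives a simple majority-class baseline so the sketch
    # runs end to end.
    return "power"

correct = Counter()
total = Counter()
for image_name, fields in annotations.items():
    true_grip = fields["grip"]
    total[true_grip] += 1
    if predict_grasp("handcam_images/" + image_name) == true_grip:
        correct[true_grip] += 1

for grip in sorted(total):
    print(f"{grip:12s} {100.0 * correct[grip] / total[grip]:5.1f} %")
print(f"{'overall':12s} {100.0 * sum(correct.values()) / sum(total.values()):5.1f} %")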
Below are some sample images from our newly created HandCam dataset. The images were taken with the camera on the hand, a Point Grey Firefly MV (USB 2.0) with an output resolution of 640 x 480 pixels.
Below is a table of the bias in the dataset based on our annotations; more specifically, it is the percentage of images labeled with each grasp. As can be seen, the representation is uniform by design.
     | Key    | Pinch  | Power  | 3 Jaw Chuck | Tool
Bias | 20.0 % | 20.0 % | 20.0 % | 20.0 %      | 20.0 %